
CH-DK-2 incident

Major outage CH-DK-2 Compute
2025-11-05 11:26 CET · 4 hours, 48 minutes

Updates

Post-mortem

Summary

On November 5th 2025, we experienced an outage that affected a significant number of compute instances in CH-DK-2.

The incident originated from a power feed interruption at the facility hosting our infrastructure. While power redundancy protected most systems, a subset of equipment did not behave as designed when the feed failed.

Despite this, our APIs and all other services remained available throughout the event; the impact was limited to the compute service in CH-DK-2.

What Happened

During a scheduled maintenance window at the CH-DK-2 facility, an unexpected interruption occurred on one of the two independent power feeds supplying several of our racks. This interruption lasted long enough for certain systems relying on that feed to shut down.

Equipment with correctly functioning redundant power configurations continued operating normally. However, a portion of our hardware encountered issues because power imbalances, overload conditions, or incorrect cabling prevented the remaining feed from sustaining it.

The facility’s operations team restored stable power, after which our systems were progressively recovered.

Due to confidentiality obligations, we cannot share additional information regarding the facility or the underlying power incident.

Why and How Were We Affected by This Event?

Our infrastructure is designed to withstand the loss of a power feed by relying on fully redundant power supplies, connected to two independent feeds. However, during this incident, some equipment did not tolerate the feed loss as expected and went offline. The main impacts were:

A compute rack hosting numerous hypervisor nodes:
When power failed on one feed, the entire load shifted to the remaining feed, causing an overload and a fuse trip. We later determined that this was due to improper power balancing between phases. A physical onsite intervention was required to restore the service to its nominal state (see the illustrative sketch after this list).

One of our core edge routers:
Although these devices are deployed in an N+1 redundancy configuration, we discovered that this specific router had been incorrectly cabled, resulting in a non-redundant power setup. When its primary feed failed, it went offline. Traffic was automatically rerouted, so the impact on internet connectivity was brief and minimal.
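
To make these two failure modes concrete, here is a minimal, purely illustrative sketch of the kind of audit they call for; the device names, breaker rating, and margin are hypothetical and do not describe our actual tooling. For a dual-feed rack to survive the loss of one feed, each device's two power supplies must be cabled to different feeds, and the total rack draw must fit on a single feed with margin to spare.

    # Illustrative sketch only: hypothetical data model and thresholds,
    # not Exoscale's actual audit tooling.
    from __future__ import annotations
    from dataclasses import dataclass

    @dataclass
    class Device:
        name: str
        feeds: tuple      # feed each of the two PSUs is cabled to, e.g. ("A", "B")
        draw_amps: float  # typical total draw of the device

    BREAKER_AMPS = 32.0   # assumed per-feed breaker rating
    HEADROOM = 0.8        # keep single-feed load below 80% of that rating

    def audit_rack(devices: list) -> list:
        findings = []
        # Cabling check: both PSUs on the same feed means losing that one
        # feed takes the device down (the edge-router failure mode).
        for d in devices:
            if d.feeds[0] == d.feeds[1]:
                findings.append(f"{d.name}: both PSUs on feed {d.feeds[0]} (non-redundant)")
        # Headroom check: after a feed loss, the surviving feed carries the
        # whole rack, so the total draw must stay within one breaker's
        # budget (the hypervisor-rack failure mode).
        total = sum(d.draw_amps for d in devices)
        budget = BREAKER_AMPS * HEADROOM
        if total > budget:
            findings.append(
                f"rack draw {total:.1f} A exceeds single-feed budget {budget:.1f} A"
            )
        return findings

    # Example: cabling is correct, but a feed loss would trip the breaker.
    rack = [
        Device("hv-01", ("A", "B"), 9.0),
        Device("hv-02", ("A", "B"), 9.0),
        Device("hv-03", ("A", "B"), 9.5),
    ]
    for finding in audit_rack(rack):
        print(finding)

In this incident, the hypervisor rack failed the equivalent of the second check, while the edge router failed the first.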

Following these issues, we dispatched personnel to the CH-DK-2 datacenter to assist with mitigation and to complete the recovery process.

What We Learned

Redundancy doesn’t remove the need for regular cabling and power checks. These inspections are essential to ensure systems behave as expected when a failure occurs.

Clear operational expectations and documentation with our datacenter partner are essential to maintain alignment on redundancy guarantees.

What We’ve Changed

We fixed the impacted equipment and audited all racks in CH-DK-2 to verify proper power balancing. We have initiated the same checks across every zone and expect them to be completed in the coming days. We will be performing these checks on a recurring schedule.
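
As a rough illustration of what "proper power balancing" means in this context (a hypothetical sketch under assumed numbers, not our actual checks): on a three-phase feed, load should be spread evenly across the phases so that no single phase, and therefore no single fuse, carries a disproportionate share when a feed has to sustain the full rack.

    # Hypothetical per-phase balance check; the fuse rating and tolerance
    # are illustrative assumptions, not Exoscale's actual values.
    PHASE_FUSE_AMPS = 16.0      # assumed per-phase fuse rating
    IMBALANCE_TOLERANCE = 0.15  # allowed deviation above the mean phase load

    def check_phase_balance(phase_loads: dict) -> list:
        findings = []
        mean_load = sum(phase_loads.values()) / len(phase_loads)
        for phase, amps in phase_loads.items():
            if amps > PHASE_FUSE_AMPS:
                findings.append(f"phase {phase}: {amps:.1f} A exceeds the fuse rating")
            elif mean_load and (amps - mean_load) / mean_load > IMBALANCE_TOLERANCE:
                findings.append(
                    f"phase {phase}: {amps:.1f} A is more than 15% above the mean of {mean_load:.1f} A"
                )
        return findings

    # Example: phase L2 carries far more than its share of the rack load.
    for finding in check_phase_balance({"L1": 6.0, "L2": 14.5, "L3": 5.5}):
        print(finding)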

We are updating our QA procedures for power distribution and cabling for any rack entering production.

In parallel, we’re working with our datacenter partner to clarify how their redundancy model operates in practice and to ensure our expectations are aligned, especially during maintenance events.

Moving Forward

We recognize the impact this event had on your operations.

This event highlighted areas where our redundancy validation and operational safeguards can be improved, and we have already taken concrete steps to address them. We are also working with our datacenter partner to ensure better alignment and reliability moving forward.

We sincerely apologize for the disruption and thank you for your trust and patience. If you have any questions or concerns, our support team is here to help.

The Exoscale team

November 14, 2025 · 13:57 CET
Resolved

Incident has been resolved.

Only the compute service was impacted, with the loss of multiple hypervisors due to an electrical incident.

We will provide a post-mortem once the root cause has been established by our datacenter partner.

November 5, 2025 · 16:06 CET
Monitoring

All services are nominal. We are monitoring the situation.

November 5, 2025 · 14:27 CET
Update

We are watching for any isolated instance start failures.

All services are confirmed to be nominal.

November 5, 2025 · 13:36 CET
Update

Most affected instances have been restored.

November 5, 2025 · 13:22 CET
Update

We are close to completing the restoration of the affected instances.

November 5, 2025 · 13:13 CET
Update

We are still working on restoring the affected instances. Equipment recovery is taking more time than expected.

November 5, 2025 · 12:50 CET
Update

We are still working on bringing all the affected services and customer instances back up.

November 5, 2025 · 12:40 CET
Investigating

Connectivity has been restored to nominal N+1 redundancy.

November 5, 2025 · 12:31 CET
Investigating

Some impacted instances have been restarted. We are still working to restore all services.

November 5, 2025 · 12:25 CET
Investigating

We are bringing back up all the affected customer instances. ETA ~10 min

November 5, 2025 · 12:01 CET
Update

Affected equipment is currently booting. External internet connectivity is working with reduced redundancy.

November 5, 2025 · 11:57 CET
Investigating

Power appears to be coming back to the affected equipment. We are working on recovering the affected services.

November 5, 2025 · 11:50 CET
Investigating

We confirm that we lost core routers and hypervisors. We are continuing to assess the impact.

November 5, 2025 · 11:44 CET
Investigating

The datacenter is experiencing an electrical incident. We are currently clarifying the situation with our datacenter provider. The impact is unknown at this stage, but part of our infrastructure is down. We are assessing the situation.

November 5, 2025 · 11:41 CET
Escalate

Raising to major outage

November 5, 2025 · 11:33 CET
Issue

We are investigating

November 5, 2025 · 11:26 CET
