[Compute][CH-DK-2] Increased Error rate in Compute services
Updates
Post-mortem
Summary
Between January 2, 2026 05:23 CET and January 4, 2026 01:00 CET, we experienced repeated local storage hangs on a subset of hypervisors in CH-DK-2. As we worked to recover affected workloads, additional hosts of the same hardware generation began showing the same symptoms, resulting in a cascading operational event across part of the fleet.
Impact
Compute: Instances hosted on the impacted hypervisors became unavailable while their host was hung or being recovered. Some customers experienced extended downtime depending on how quickly their instances could be recovered or migrated.
NLB service: A subset of load balancers experienced transient packet loss during recovery actions and migrations.
What Happened
At 05:23 CET on January 2, one hypervisor in CH-DK-2 experienced a hardware-related hang of its local storage array, causing the host to stall while waiting on storage I/O. Our hypervisors use local, per-host storage (enterprise SSD-backed) for instance root volumes, and each host's storage is independent of every other host's.
To restore service, our SRE team force-rebooted the affected host. While we were responding to this incident, additional hypervisors began exhibiting the same storage-hang behavior: first a second host, then a third, and later more, turning this into a broader incident across hosts in that cluster.
Our response focused on stabilizing the fleet and restoring customer workloads:
- Rebooting impacted hosts to recover from the I/O hang.
- Updating SSD firmware to the latest available version on affected systems.
- Migrating instances away from impacted hosts to healthy capacity (online where possible, offline when required).
This took longer than expected because a large number of instances needed to be moved, and in some cases hosts re-entered the failure state during evacuations, requiring additional resets before migrations could complete.
In parallel, the team investigated logs and telemetry to identify a clear failure signature. We did not find a single definitive error pattern, but we observed that the issue appeared limited to one specific hardware generation using the same SSD model.
Given the recurrence risk, we made the decision to remove all hosts matching that generation / disk model from production in CH-DK-2. That evacuation was completed on January 4 at 01:00 CET, which ended the customer impact in that location.
We identified similar hosts running in CH-GVA-2 and AT-VIE-1. To reduce risk, we proactively removed those hosts from production as well, completing that work on January 5 at 02:00 CET, without customer impact.
These hosts are part of an older fleet generation that has been in service for several years. While the evidence strongly suggests a hardware/firmware level issue, the exact root cause is still under investigation. We are conducting deeper analysis and working with our storage vendor to reproduce the behavior and determine the underlying trigger.
Affected customers were notified by email when an instance became unavailable. In this specific failure mode, a limitation in our notification system meant we could not reliably send a follow-up message when instances returned to service. That gap is included in our action items below.
What We Learned
Independent local storage doesn’t eliminate correlated risk. Even though each hypervisor has its own local SSDs, a shared hardware generation + disk model can create a common failure domain.
Reboots restored service but weren’t a durable mitigation. Hosts could recover temporarily after a reset, but some re-hung under load, especially during large-scale evacuations.
Our monitoring didn’t provide enough early warning. We could see the symptom (I/O stall), but we lacked sufficient low-level signals to quickly isolate the trigger and confidently predict which hosts would fail next.
Customer communication must cover “back to normal,” not only “down” events. The inability to notify customers when instances were restored increased uncertainty during recovery.
What We’ve Changed
The following changes have been implemented:
- We removed the affected hardware generation / disk model from production in CH-DK-2, and proactively did the same in CH-GVA-2 and AT-VIE-1.
- We added clearer alerting for storage I/O stalls and hung queues, and improved dashboards so on-call SREs can quickly identify whether multiple hosts share the same risk profile (a minimal detection sketch is shown below).
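For illustration, the check below is a minimal sketch of the kind of signal such alerting can key on, assuming a Linux hypervisor where /proc/diskstats is readable; the sampling window and device handling are simplified examples and do not reflect our exact production rules.

```python
# Minimal sketch of an I/O-stall check (illustrative, not our production alerting).
# Assumes a Linux host where /proc/diskstats is readable; the 60 s window is arbitrary.
import time


def read_diskstats():
    """Return {device: (reads_completed, writes_completed, ios_in_flight)}."""
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            # field layout: major, minor, name, reads completed, ..., writes
            # completed (index 7), ..., I/Os currently in progress (index 11)
            stats[fields[2]] = (int(fields[3]), int(fields[7]), int(fields[11]))
    return stats


def find_stalled_devices(window_seconds=60):
    """Flag devices that still have I/O in flight but completed none during the window."""
    before = read_diskstats()
    time.sleep(window_seconds)
    after = read_diskstats()

    stalled = []
    for dev, (reads, writes, inflight) in after.items():
        prev_reads, prev_writes, _ = before.get(dev, (reads, writes, 0))
        if inflight > 0 and reads == prev_reads and writes == prev_writes:
            stalled.append(dev)
    return stalled


if __name__ == "__main__":
    for dev in find_stalled_devices():
        print(f"possible I/O stall on /dev/{dev}")
```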
Moving Forward
Over the coming days and weeks, we will work on the following action items:
- Complete the RCA and work with the SSD vendor to obtain a definitive explanation, including reproduction steps where possible.
- Expand preventive screening by adding automated checks to map hosts to hardware generation + disk model + firmware, so we can quickly identify and drain similar risk groups across the fleet (a sketch of such a check is shown below).
- Ensure incident messaging covers both the outage and the resolution, including per-instance “restored” notifications where possible.
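For illustration, the following is a minimal sketch of a per-host screening check of this kind, assuming smartctl and dmidecode are available on the host; the drive path and field handling are simplified examples rather than our actual inventory tooling.

```python
# Minimal sketch of a per-host risk-profile inventory (illustrative only).
# Assumes smartctl and dmidecode are installed and the script runs with root
# privileges; /dev/sda is a placeholder for the host's local root-volume disk.
import json
import subprocess


def host_risk_profile(device="/dev/sda"):
    """Collect hardware generation, disk model and disk firmware for one host."""
    generation = subprocess.run(
        ["dmidecode", "-s", "system-product-name"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    smart = subprocess.run(
        ["smartctl", "-i", device],
        capture_output=True, text=True, check=True,
    ).stdout

    info = {}
    for line in smart.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            info[key.strip()] = value.strip()

    return {
        "hardware_generation": generation,
        # SATA drives report "Device Model"; NVMe drives report "Model Number".
        "disk_model": info.get("Device Model") or info.get("Model Number", "unknown"),
        "disk_firmware": info.get("Firmware Version", "unknown"),
    }


if __name__ == "__main__":
    # The resulting JSON can be shipped to inventory tooling so hosts sharing a
    # risky generation / model / firmware combination can be drained early.
    print(json.dumps(host_risk_profile(), indent=2))
```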
We sincerely apologize for the disruption and thank you for your trust and patience. If you have any questions or concerns, our support team is here to help.
The Exoscale team
The situation is stable.
We will close the incident, but the additional monitoring will remain in place.
All instances have been removed from the affected hardware.
We have put additional monitoring in place to ensure we resolve any issues as quickly as possible.
The original faulty hosts have been quarantined and instances have been moved out.
We have identified a few more which we are handling now.
We are nearing the end of the mitigation effort.
Most instances have been migrated out of the affected hardware; a few larger ones remain.
We are still moving all workloads out of the affected hardware, which is a complex and time-consuming process.
We sincerely apologize for any inconvenience.
We are still focusing on moving all workloads out of the affected hardware.
We are still hard at work to move all workloads out of the affected hardware.
We have mitigated the impact on the NLB service, but the underlying compute issue is still present.
Our investigations are still ongoing.
The issue is strongly correlated with a certain class of hypervisors.
They have been isolated, which should reduce the impact.
The investigation continues.
Investigating Increased Error rate in Compute services in CH-DK-2