Connectivity issues
Updates
Post-mortem
Summary
On November 11th, 2025, numerous instances in CH-DK-2 were affected by intermittent but nonetheless significant packet loss.
The issue was triggered during planned maintenance on a Top-of-Rack (ToR) switch. While the rack was operating with reduced redundancy (N), the switch unexpectedly began dropping packets.
What Happened
We were performing a routine software upgrade on a ToR switch serving a rack of hypervisors. As part of the maintenance procedure, the rack temporarily operated at redundancy level N instead of N+1, a state we have used safely many times before.
Some time after the maintenance started, we began receiving alerts and customer reports about transient connectivity drops. Because the symptoms were intermittent and our alerting lacked the granularity to surface packet loss at the rack level, it took us longer than we would have liked to correlate the impact with the ongoing maintenance.
Once we confirmed the link between the maintenance and the packet loss, we promptly restored redundancy by bringing the switch back into service as the upgrade completed. Packet loss stopped as soon as redundancy returned to N+1.
We then investigated the switch to understand why it had been discarding packets. Unfortunately, no clear root cause could be determined retrospectively. Traffic levels, physical links, and device health metrics all remained within normal operating ranges during the maintenance window.
What We Learned
Our current observability did not provide enough granularity to quickly pinpoint localized packet loss. We need better visibility into per-rack and per-path behavior, especially during maintenance operations.
Even procedures that have been safe and repeated many times can still surface unexpected issues. Reduced-redundancy modes deserve extra caution.
What We’ve Changed
We updated our related maintenance procedures: devices will now be placed into maintenance mode and remain in that state for longer before any disruptive action begins. This gives us more time to observe and, if needed, revert instantly.
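Conceptually, the revised procedure behaves like the minimal sketch below. The hook names and timings are illustrative placeholders, not our actual automation:

```python
import time

# Placeholder timings; real values depend on the device and the type of change.
OBSERVATION_WINDOW_S = 15 * 60
CHECK_INTERVAL_S = 30

def run_with_dwell(enter_maintenance, exit_maintenance, disruptive_step, loss_detected) -> bool:
    """Drain the device, observe for a while, then either proceed or revert instantly.

    All four arguments are callables standing in for real automation hooks.
    """
    enter_maintenance()                      # traffic drains away from the device
    deadline = time.monotonic() + OBSERVATION_WINDOW_S
    while time.monotonic() < deadline:
        if loss_detected():                  # fed by the targeted observability described below
            exit_maintenance()               # instant revert: no disruptive action has happened yet
            return False
        time.sleep(CHECK_INTERVAL_S)
    disruptive_step()                        # e.g. the software upgrade itself
    exit_maintenance()
    return True
```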
We are rolling out additional, targeted observability that will be available during ToR maintenance and give us the visibility needed to support these operations.
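As a rough illustration of the kind of signal this provides, the sketch below measures per-rack packet loss using the system ping utility; the rack names and target addresses are hypothetical, and our production tooling differs:

```python
#!/usr/bin/env python3
"""Illustrative per-rack packet-loss probe (sketch only, not production tooling)."""
import re
import subprocess

# Hypothetical mapping of racks to a few probe targets (hypervisor management IPs).
RACK_TARGETS = {
    "rack-42": ["192.0.2.10", "192.0.2.11"],
    "rack-43": ["192.0.2.20", "192.0.2.21"],
}

def ping_loss(host: str, count: int = 20) -> float:
    """Return the packet-loss percentage reported by the system ping utility."""
    out = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True, text=True,
    ).stdout
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(match.group(1)) if match else 100.0

if __name__ == "__main__":
    for rack, hosts in RACK_TARGETS.items():
        worst = max(ping_loss(h) for h in hosts)
        flag = "ALERT" if worst > 1.0 else "ok"
        print(f"{rack}: worst loss {worst:.1f}% [{flag}]")
```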
Moving Forward
We are continuing to analyze how to further tighten our maintenance safety nets and improve fault detection within our network. Our goal is to ensure that even during reduced redundancy windows, any unexpected behavior is detected quickly and mitigated before it has customer impact.
We sincerely apologize for the disruption and thank you for your trust and patience. If you have any questions or concerns, our support team is available to help.
The Exoscale team
The issue has been resolved.
We’ve confirmed that the incident was caused by routine maintenance on our top-of-rack (ToR) switch. During the upgrade, some traffic was intermittently routed through the affected ToR, leading to occasional packet loss in our infrastructure.
The connectivity issues were mainly related to a top-of-rack (ToR) switch misbehaving. The situation should now be resolved for most, if not all, customers.
We are monitoring the situation.
A top-of-rack (ToR) switch is being reconfigured, which should improve the situation.
We are investigating the traffic flowing from our edge routers to our internal infrastructure.
IPv6 connectivity is fine; it seems that only IPv4 traffic is affected.
We still haven’t found the root cause; the investigation is ongoing.
From a monitoring perspective, we are not seeing any packet loss.
We are going to drain our transit provider in ZRH1 (Cogent).
We are escalating the severity of the incident since multiple customers seem to be impacted. We are not seeing an emerging pattern.
We are actively investigating the situation.
We are currently suffering from occasional packet loss. Investigations are under way.