DE-MUC-1 Network performance degradation
Updates
Post-mortem
Summary
Between 23:14 CET on February 12 and 02:24 CET on February 13, 2026, we experienced significant network degradation in our Munich availability zone DE-MUC-1.
While no services became fully unavailable, customers experienced heavy packet loss, high latency, and intermittent connectivity. The root cause was a physical cabling change that triggered a network loop. This resulted in severe congestion that impacted both customer traffic and the availability of our monitoring systems.
What Happened
In Munich, we are currently preparing a new datacenter facility to replace our current site. To facilitate the upcoming migration, both sites are interconnected with high speed networking to operate as a single logical environment.
At 23:14 CET, a physical cross-connect was patched between two network devices in the new facility as part of site preparation. While the work itself was necessary, it was not expected at that time and had not been communicated to the Network Operations Center (NOC) prior to execution.
The new connection created a circular path between the switches. In this zone's architecture, the redundant path was not immediately blocked, so a forwarding loop formed: frames flooded onto the loop circulated endlessly, generating a large volume of internal traffic that competed with legitimate customer traffic.
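To illustrate why this is so damaging, here is a toy Python model (a deliberate simplification, not our actual zone topology): a switch floods a broadcast frame out of every port except the one it arrived on, so looped paths re-inject every frame indefinitely, and three or more looped paths multiply the copies on each hop.

```python
# Toy model of flooding over looped paths; not our actual topology.
# A frame entering the loop is re-flooded out of the remaining
# (parallel_links - 1) looped ports on every hop.

def flood(frames_in_flight: int, parallel_links: int) -> int:
    """One forwarding round across the looped links."""
    return frames_in_flight * (parallel_links - 1)

frames = 1  # a single broadcast frame enters the loop
for hop in range(1, 11):
    frames = flood(frames, parallel_links=3)  # assume three looped paths
    print(f"hop {hop}: {frames} copies in flight")

# With exactly two links the count stays constant, but frames still
# circulate at line rate and are copied to every edge port on each
# pass, which is enough to saturate the fabric until the loop breaks.
```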
The incident was prolonged because the network congestion was severe enough to disrupt our internal monitoring. As the loop consumed available bandwidth, our monitoring systems struggled to collect and display real-time data from the affected devices.
With visibility degraded, our on-call SREs could not immediately correlate the performance drop with the physical changes happening in the datacenter. They initially investigated external factors, such as upstream provider instability.
To regain visibility, our SREs bypassed the congested production network by switching to our emergency Out-of-Band (OOB) management network. This separate access path allowed us to retrieve the logs needed to confirm the loop and identify the specific ports involved.
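For illustration, here is a minimal sketch of what that OOB diagnosis step can look like, assuming Netmiko and Cisco-style switches; the management IPs, credentials, and commands below are placeholders, not our actual inventory or tooling:

```python
# Hypothetical OOB diagnosis sketch using Netmiko; hosts, credentials,
# and platform are illustrative assumptions.
from netmiko import ConnectHandler

OOB_DEVICES = [
    {"device_type": "cisco_ios", "host": "198.51.100.11",  # OOB mgmt IP
     "username": "noc", "password": "***"},
    {"device_type": "cisco_ios", "host": "198.51.100.12",
     "username": "noc", "password": "***"},
]

for dev in OOB_DEVICES:
    with ConnectHandler(**dev) as conn:
        # Ports with abnormally high input rates point at the loop...
        print(conn.send_command("show interfaces counters"))
        # ...and MAC addresses flapping between ports confirm it.
        print(conn.send_command("show mac address-table"))
```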
Once the source was confirmed via the OOB network, we administratively shut down the interconnect ports linking the two datacenter sites. This action broke the loop, and network metrics returned to normal baseline by 02:24 CET.
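The remediation itself can be as small as an administrative shutdown of the looped interfaces. A sketch under the same assumptions, with hypothetical device and interface names:

```python
# Hypothetical remediation sketch: shut the interconnect ports over
# the OOB path. Device and interface names are assumptions.
from netmiko import ConnectHandler

device = {"device_type": "cisco_ios", "host": "198.51.100.11",
          "username": "noc", "password": "***"}
looped_ports = ["TenGigabitEthernet1/0/49", "TenGigabitEthernet1/0/50"]

with ConnectHandler(**device) as conn:
    for port in looped_ports:
        # 'shutdown' under each looped interface breaks the loop;
        # queues drain and metrics return to baseline shortly after.
        conn.send_config_set([f"interface {port}", "shutdown"])
    conn.save_config()
```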
What We Learned
The primary factor extending the incident was the lack of correlation between physical actions and logical symptoms. When cabling changes are not advertised to the NOC, troubleshooting relies on guesswork rather than causality.
Extreme network load doesn’t just slow down customer traffic; it can also starve the monitoring tools required to diagnose the issue. When the pipes are full, diagnostic and telemetry data may not get through, creating a blind spot during the incident.
The impact radius was too wide. A single cabling error during a setup phase should not be able to degrade performance across the entire active availability zone.
What We’ve Changed
We have updated our datacenter operations protocol. To ensure better correlation between physical activities and network performance, we now enforce a strict internal notification policy.
Any physical datacenter activity must be announced to the relevant platform teams upon start and upon completion.
This ensures that our on-call engineers are always aware of active work zones and can immediately link a performance anomaly to ongoing physical maintenance.
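As an illustration of how such notifications can be automated rather than left to memory, here is a small sketch; the webhook endpoint and payload schema are hypothetical, not our internal API:

```python
# Hypothetical notification helper; endpoint and schema are invented
# for illustration only.
from datetime import datetime, timezone
import requests

NOC_WEBHOOK = "https://noc.example.internal/hooks/dc-activity"  # placeholder

def notify(phase: str, site: str, description: str) -> None:
    """Post a 'started' or 'completed' notice for physical DC work."""
    resp = requests.post(NOC_WEBHOOK, json={
        "phase": phase,              # "started" | "completed"
        "site": site,                # e.g. "DE-MUC-1 (new facility)"
        "description": description,  # what is being patched or moved
        "at": datetime.now(timezone.utc).isoformat(),
    }, timeout=5)
    resp.raise_for_status()

notify("started", "DE-MUC-1 (new facility)",
       "Patching cross-connect between core switches")
# ... physical work happens here ...
notify("completed", "DE-MUC-1 (new facility)",
       "Cross-connect patched and verified")
```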
Moving Forward
We are currently evaluating mechanisms to better contain network loops. We are exploring options such as automated port isolation and traffic limiting to ensure that a future loop is quarantined to a single switch rather than affecting the broader zone.
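As one illustrative sketch of what such containment could look like (a simplified watchdog with stubbed telemetry and control calls, not a committed design): a port whose broadcast rate stays above a threshold for several consecutive polls is disabled locally, quarantining the loop to a single switch.

```python
# Simplified containment watchdog; thresholds and helpers are
# illustrative stubs, not a committed design.
import time

BROADCAST_PPS_LIMIT = 50_000  # assumed threshold; real values are per-platform
SUSTAINED_POLLS = 3           # require consecutive breaches before acting

def get_broadcast_pps(port: str) -> int:
    """Stub: would read broadcast counters via SNMP/gNMI on real hardware."""
    return 0

def disable_port(port: str) -> None:
    """Stub: would err-disable the port, containing the loop locally."""
    print(f"err-disabling {port}")

def watch(ports: list[str], polls: int = 10) -> None:
    strikes = dict.fromkeys(ports, 0)
    for _ in range(polls):
        for port in ports:
            pps = get_broadcast_pps(port)
            strikes[port] = strikes[port] + 1 if pps > BROADCAST_PPS_LIMIT else 0
            if strikes[port] >= SUSTAINED_POLLS:
                disable_port(port)  # loop contained to this switch
                strikes[port] = 0
        time.sleep(2)

watch(["Te1/0/49", "Te1/0/50"], polls=3)
```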
While our Out-of-Band network was key to resolving this incident, we are reviewing our procedures to ensure our teams can switch to this emergency path faster and more easily to reach network equipment during periods of high congestion.
We sincerely apologize for the disruption and thank you for your trust and patience. If you have any questions or concerns, our support team is here to help.
The Exoscale team
The situation is back to normal and connectivity has fully recovered.
We’re closing the incident.
The situation should now improve and connectivity in the zone should recover.
We are still looking into the issue.
Mitigations have been introduced; we are monitoring the situation.
We are noticing significant packet loss. We are still investigating to identify the root cause.
The issue is still ongoing and we are working to identify the root cause.
We’re still investigating the issue.
Connectivity in the zone is impacted.
We are still trying to locate the issue.
Investigation is ongoing.
Network performance in DE-MUC-1 is degraded, which is affecting multiple internal services.
Increasing the severity: this is now a major outage.