Exoscale
Platform Status

[API] Increased Error Rates and Latencies

Major outage · Global Portal · CH-GVA-2 API
2025-10-07 10:35 CEST · 1 hour, 8 minutes

Updates

Post-mortem

Summary

On October 7th, 2025, we experienced an internal outage that affected our APIs, portal, control planes, and related services.

The incident was caused by an unexpected interaction between our service discovery and firewall orchestration systems, following an emergency maintenance operation.

What Happened

At Exoscale, firewall management is automated. When workloads join or leave our infrastructure, their firewall rules are automatically created, updated, or removed. This system relies on our service discovery layer, which continuously maintains the list of active nodes in our infrastructure using a peer-to-peer gossip protocol.

Following an emergency reload of one of our redundant internal network gateways, a brief spike in service discovery traffic occurred.

For a few seconds, the discovery service temporarily returned an empty list of nodes for a given topology. Our firewall orchestrator interpreted this as if all nodes had left the topology and, by design, proceeded to remove the corresponding firewall rules across those nodes.
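For illustration, here is a minimal, hypothetical sketch of a reconciler that trusts a single discovery snapshot. The names and structure are assumptions made for this example and do not reflect our actual orchestrator code.

    // Hypothetical sketch of the failure mode: a reconciler that trusts a
    // single service-discovery snapshot. Illustrative only.
    package main

    import "fmt"

    // reconcileNaive removes firewall rules for every node that is absent
    // from the latest discovery snapshot. If the snapshot is transiently
    // empty, it concludes that *all* nodes have left and deletes every rule.
    func reconcileNaive(currentRules map[string]bool, discovered []string) {
        seen := make(map[string]bool)
        for _, node := range discovered {
            seen[node] = true
        }
        for node := range currentRules {
            if !seen[node] {
                fmt.Printf("removing firewall rules for %s\n", node)
                delete(currentRules, node)
            }
        }
    }

    func main() {
        rules := map[string]bool{"node-a": true, "node-b": true, "node-c": true}
        // A brief gossip hiccup returns an empty snapshot...
        reconcileNaive(rules, []string{})
        // ...and every rule is gone, severing the overlay network.
        fmt.Printf("rules left: %d\n", len(rules))
    }

In this simplified form, a single empty snapshot is indistinguishable from every node genuinely leaving the topology.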

This automated removal immediately interrupted the communication between nodes, including the overlay network that connects our internal systems. As a result, our APIs, orchestration tools, and internal support systems became temporarily unavailable.

Although the service discovery layer recovered automatically within seconds, the firewall orchestrator could not self-heal because it relied on the now-disrupted overlay network.

Our SREs intervened manually to restore the core firewall rules, allowing the orchestrator to reconnect and re-apply the correct configuration.

What We Learned

The orchestrator’s behavior was consistent with its design, but the design itself did not anticipate short-lived, invalid data from the discovery system. This highlighted how automation can amplify transient conditions into larger-scale failures when safety checks are missing.

What We’ve Changed

Following the incident, we introduced several safeguards and design improvements:

  • The orchestrator no longer removes firewall rules based on transient or empty discovery data. Rule deletions now require specific and strict conditions to be met (a simplified sketch follows this list).

  • The firewall rules that maintain our overlay network are now static and cannot be removed automatically.

  • We are expanding our internal simulations to better test how automated systems react to short network or discovery disruptions.
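As a simplified illustration of the first two safeguards, the sketch below shows the kind of guarded reconciliation this implies: empty snapshots are ignored, a node's rules are deleted only after several consecutive consistent observations, and protected (static) rules are never removed automatically. The threshold and names are illustrative assumptions, not our exact implementation.

    // Hypothetical sketch of a guarded reconciler. Thresholds and names are
    // illustrative assumptions, not Exoscale's actual implementation.
    package main

    import "fmt"

    const requiredObservations = 3 // assumed threshold to confirm a departure

    type guardedReconciler struct {
        protected    map[string]bool // e.g. overlay-network rules, never removable
        missingCount map[string]int  // consecutive snapshots a node was absent
    }

    func (r *guardedReconciler) reconcile(currentRules map[string]bool, discovered []string) {
        // Safeguard: never act on an empty (likely transient) snapshot.
        if len(discovered) == 0 {
            fmt.Println("empty discovery snapshot, skipping reconciliation")
            return
        }
        seen := make(map[string]bool)
        for _, node := range discovered {
            seen[node] = true
        }
        for node := range currentRules {
            switch {
            case r.protected[node]:
                // Safeguard: static rules are never removed automatically.
                continue
            case seen[node]:
                r.missingCount[node] = 0
            default:
                // Safeguard: delete only after repeated, consistent absence.
                r.missingCount[node]++
                if r.missingCount[node] >= requiredObservations {
                    fmt.Printf("removing firewall rules for %s\n", node)
                    delete(currentRules, node)
                }
            }
        }
    }

    func main() {
        r := &guardedReconciler{
            protected:    map[string]bool{"overlay-gw": true},
            missingCount: map[string]int{},
        }
        rules := map[string]bool{"overlay-gw": true, "node-a": true, "node-b": true}
        r.reconcile(rules, []string{})         // ignored: empty snapshot
        r.reconcile(rules, []string{"node-a"}) // node-b missing once, not deleted
        fmt.Printf("rules left after transient blip: %d\n", len(rules))
    }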

Moving Forward

We recognize the impact that this event had on your operations as well as our own.

We’ve taken this incident as an opportunity to strengthen our automation logic, ensuring that even brief transient states cannot propagate into widespread issues.

We sincerely apologize for the disruption and appreciate your trust and patience. If you have any questions or concerns, our support team remains available to help.

October 14, 2025 · 23:12 CEST
Resolved

The incident has been resolved. We are collecting the information required to determine the exact root cause and will share an update once we have more details.

October 7, 2025 · 11:43 CEST
Monitoring

All services are back up. We are monitoring the situation.

October 7, 2025 · 11:28 CEST
Update

The API and portal are back in service. We are still working on bringing back all the impacted services.

October 7, 2025 · 11:24 CEST
Update

Impacted services are recovering.

October 7, 2025 · 11:19 CEST
Update

We are rolling out a mitigation.

October 7, 2025 · 11:13 CEST
Update

The issue has been identified; we are working on a mitigation.

October 7, 2025 · 11:04 CEST
Investigating

We are still investigating the root cause and the impact. So far, the impact appears to be localized to CH-GVA-2.

Please note that our support ticket system is currently down as well. Delayed answers to tickets may be expected.

October 7, 2025 · 10:54 CEST
Escalated

The API in CH-GVA-2 is impacted as well. We are raising this to a major incident.

October 7, 2025 · 10:43 CEST
Investigating

Our portal is currently down. We are investigating the root cause and other potentially impacted services.

October 7, 2025 · 10:41 CEST
Issue

We are investigating increased error rates and latencies on the API. We’ll post an update as soon as we have more information.

October 7, 2025 · 10:35 CEST
