[API] Increased Error Rates and Latencies
Updates
Post-mortem
Summary
On October 7th, 2025, we experienced an internal outage that affected our APIs, portal, control planes, and related services.
The incident was caused by an unexpected interaction between our service discovery and firewall orchestration systems, following an emergency maintenance operation.
What Happened
At Exoscale, firewall management is automated. When workloads join or leave our infrastructure, their firewall rules are automatically created, updated, or removed. This system relies on our service discovery layer, which continuously maintains the list of active nodes in our infrastructure using a peer-to-peer gossip protocol.
Following an emergency reload of one of our redundant internal network gateways, a brief spike in service discovery traffic occurred.
For a few seconds, the discovery service temporarily returned an empty list of nodes for a given topology. Our firewall orchestrator interpreted this as if all nodes had left the topology and, by design, proceeded to remove the corresponding firewall rules across those nodes.
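To illustrate the failure mode, here is a minimal, hypothetical sketch of a naive reconciliation loop of this kind. The names and data are illustrative assumptions, not our actual code: when the discovery snapshot momentarily comes back empty, every applied rule looks stale and gets removed.

```python
# Minimal, hypothetical simulation of a naive firewall reconciliation loop.
# All names and data are illustrative; this is not Exoscale's actual code.

# Firewall rules currently applied, keyed by node name.
applied_rules = {
    "node-a": "allow overlay traffic from node-a",
    "node-b": "allow overlay traffic from node-b",
    "node-c": "allow overlay traffic from node-c",
}

def reconcile(active_nodes: set[str]) -> None:
    """Remove rules for any node that discovery no longer reports as active."""
    for node in list(applied_rules):
        if node not in active_nodes:
            print(f"removing rule for {node}")
            del applied_rules[node]

# Normal operation: discovery reports all nodes, nothing is removed.
reconcile({"node-a", "node-b", "node-c"})

# Transient glitch: discovery briefly returns an empty node list.
# The loop treats every rule as stale and removes them all, including the
# rules that keep the overlay network (and the orchestrator) reachable.
reconcile(set())
print(applied_rules)  # {} -- all rules gone after a few seconds of bad data
```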
This automated removal immediately interrupted the communication between nodes, including the overlay network that connects our internal systems. As a result, our APIs, orchestration tools, and internal support systems became temporarily unavailable.
Although the service discovery layer recovered automatically within seconds, the firewall orchestrator could not self-heal because it relied on the now-disrupted overlay network.
Our SREs intervened manually to restore the core firewall rules, allowing the orchestrator to reconnect and re-apply the correct configuration.
What We Learned
The orchestrator’s behavior was consistent with its design, but the design itself did not anticipate short-lived, invalid data from the discovery system. This highlighted how automation can amplify transient conditions into larger-scale failures when safety checks are missing.
What We’ve Changed
Following the incident, we introduced several safeguards and design improvements:
- The orchestrator no longer removes firewall rules based on transient or empty discovery data. Rule deletions now require specific and strict conditions to be met, as sketched below.
- The firewall rules that maintain our overlay network are now static and cannot be removed automatically.
- We are expanding our internal simulations to better test how automated systems react to short network or discovery disruptions.
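A minimal sketch of what such guard conditions could look like, assuming a hypothetical `safe_to_delete` check with illustrative rule names and thresholds; none of these reflect our production configuration:

```python
# Hypothetical sketch of the added safeguards; names and thresholds are
# illustrative assumptions, not Exoscale's production code.

STATIC_RULES = {"overlay-mesh"}   # rules that must never be removed automatically
REQUIRED_CONSISTENT_POLLS = 3     # a node's absence must persist this many polls

def safe_to_delete(rule: str, node: str, recent_node_lists: list[set[str]]) -> bool:
    """Return True only when strict conditions for deleting a rule are met."""
    # Overlay rules are static and never candidates for automatic removal.
    if rule in STATIC_RULES:
        return False
    # Never act on a window that contains an empty discovery snapshot.
    if any(len(nodes) == 0 for nodes in recent_node_lists):
        return False
    # Not enough history yet to trust the absence.
    if len(recent_node_lists) < REQUIRED_CONSISTENT_POLLS:
        return False
    # The node must be absent from several consecutive polls, so a
    # seconds-long glitch cannot trigger a deletion.
    return all(node not in nodes for nodes in recent_node_lists[-REQUIRED_CONSISTENT_POLLS:])

# A node missing from only the most recent poll is not eligible for deletion.
history = [{"node-a", "node-b"}, {"node-a", "node-b"}, {"node-a"}]
print(safe_to_delete("rule-node-b", "node-b", history))  # False

# Only a persistent, non-empty absence allows removal.
history = [{"node-a"}, {"node-a"}, {"node-a"}]
print(safe_to_delete("rule-node-b", "node-b", history))  # True
```

Requiring an absence to persist across several consecutive, non-empty polls trades a small amount of cleanup latency for protection against exactly the kind of seconds-long glitch described above.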
Moving Forward
We recognize the impact that this event had on both our operations and yours.
We’ve taken this incident as an opportunity to strengthen our automation logic, ensuring that even brief transient states cannot propagate into widespread issues.
We sincerely apologize for the disruption and appreciate your trust and patience. If you have any questions or concerns, our support team remains available to help.
The incident has been resolved. We are collecting the information needed to determine the exact root cause and will share an update once we have more details.
The API and portal are back in service. We are still working on bringing all impacted services back up.
We are still investigating the root cause and the impact. So far, the impact appears to be limited to CH-GVA-2.
Please be aware that our support ticket system is currently down as well. Delayed responses to tickets are to be expected.
The API in CH-GVA-2 is impacted as well. We are raising this to a major incident.
Our portal is currently down. We are investigating the root cause and other potentially impacted services.
We are investigating increased error rates and latencies on the API. We'll post an update as soon as we have more information.