Exoscale status

Summary

At 23:31 CEST, our DDoS mitigation provider started announcing our own CH-GVA-2 subnets back to us over BGP, most likely because of a human error. This caused a massive, transient internet traffic loop that impacted the whole CH-GVA-2 zone.

Technical summary

Unfortunately, it took our SRE team quite some time to identify the root cause because of the nature of the issue. The BGP announcement from our DDoS mitigation provider was partial and more specific (/32). Although we have multiple links across two edge routers, the announcement was made over links terminating on a single edge router only. This router installed the routes alongside the legitimate ones using ECMP [1], which resulted in a partial internet traffic loop: traffic that ECMP sent to the bad routes looped, while the remaining traffic took the legitimate routes and reached its destination. Based on our current data, we estimate that 50% to 70% of the internet traffic in CH-GVA-2 was impacted by this outage.

In simplified terms, ECMP load balancing hashes each flow, so traffic is balanced by source and destination IP address. This is why some traffic reached its destination, seemingly at random, while other traffic did not.
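
To illustrate the hashing behaviour described above, here is a minimal Python sketch. It is an assumption-laden model, not the algorithm our routers run: the function name `pick_next_hop`, the use of SHA-256, the path labels and the documentation-range IP addresses are all hypothetical. Real routers use vendor-specific hash functions and often include ports and protocol in the hash.

```python
import hashlib
import ipaddress


def pick_next_hop(src_ip, dst_ip, next_hops):
    """Pick one of several equal-cost next hops for a flow.

    The flow is identified here by source and destination IP only
    (a simplification). The same flow always maps to the same next hop.
    """
    key = ipaddress.ip_address(src_ip).packed + ipaddress.ip_address(dst_ip).packed
    digest = hashlib.sha256(key).digest()
    # Reduce the hash to an index into the list of equal-cost paths.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical labels: one legitimate path and one path that loops.
PATHS = ["legitimate", "looping"]

if __name__ == "__main__":
    destination = "203.0.113.10"
    for host in range(1, 9):
        source = f"198.51.100.{host}"
        print(source, "->", pick_next_hop(source, destination, PATHS))
```

In this model, a given source and destination pair either always works or always loops, which matches the pattern of some connections succeeding while others failed for the whole duration of the incident.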

We do have filtering in place to avoid learning our own routes from another provider, a situation that usually results from a human error. It appears that this specific announcement bypassed our filters. Further investigation is ongoing to understand how and why this could happen. We are also waiting for feedback from our provider.
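
As a rough illustration of what such a filter is meant to do, the Python sketch below rejects any announcement that falls inside one of our own prefixes, including more specific ones such as a /32. This is a simplified model under stated assumptions, not our actual router policy: `OWN_PREFIXES`, `accept_announcement` and the documentation-range prefixes are hypothetical, and real filters are expressed as import policies on the routers. It says nothing about why the filter was bypassed in this incident, which is still under investigation.

```python
import ipaddress

# Hypothetical example prefix (documentation range), not a real Exoscale subnet.
OWN_PREFIXES = [ipaddress.ip_network("203.0.113.0/24")]


def accept_announcement(prefix, own_prefixes=OWN_PREFIXES):
    """Return False if the announced prefix is equal to, or more specific
    than, one of our own prefixes; return True otherwise."""
    announced = ipaddress.ip_network(prefix)
    return not any(
        # Only compare prefixes of the same IP version.
        announced.version == own.version and announced.subnet_of(own)
        for own in own_prefixes
    )


if __name__ == "__main__":
    for candidate in ["203.0.113.7/32", "203.0.113.0/24", "198.51.100.0/24"]:
        verdict = "accept" if accept_announcement(candidate) else "reject"
        print(candidate, verdict)
```

The key point is that the check is containment, not exact match: a filter that only compared against the exact /24 would let a more specific /32 through.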

In addition, there was a delay in posting this incident on our status page. A configuration issue had created an unexpected dependency on a network in CH-GVA-2, which was also impacted by the outage.

We are sorry for the inconvenience caused by this outage. Once the investigation is complete, proper filtering will be set up to ensure that such an issue does not happen again.

[1] https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing