Exoscale status

The Exoscale network relies on BGP [1] for its internal and external connectivity.

On 15 March around noon, we introduced new network equipment into our core network topology. This operation was part of a capacity addition task. Such operations are common and scripted.

As soon as the new network equipment was added to the topology, the control plane of our whole core network fabrics became overloaded. This load spike was quickly identified as being caused by massive, continuous BGP updates, which resulted in BGP sessions flapping across the whole zone. The most noticeable impact was sporadic packet loss and timeouts. Initially only our core network fabrics were impacted, but due to the nature of the BGP mesh, the issue quickly spread to our edge core internet routers.
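To give a sense of why instability spreads quickly in a BGP mesh, the sketch below counts sessions under two simplified assumptions: a full mesh in which every router peers with every other, and a design in which clients peer only with a small set of fully meshed route reflectors. The function names and router counts are hypothetical and do not describe the actual Exoscale topology.

```python
def full_mesh_sessions(routers: int) -> int:
    """Sessions needed when every router peers with every other router."""
    return routers * (routers - 1) // 2


def route_reflector_sessions(clients: int, reflectors: int) -> int:
    """Sessions needed when each client peers with every reflector only,
    and the reflectors are fully meshed among themselves."""
    return clients * reflectors + full_mesh_sessions(reflectors)


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    for n in (10, 50, 100):
        print(n, "routers in a full mesh:", full_mesh_sessions(n), "sessions")
    print("98 clients, 2 reflectors:", route_reflector_sessions(98, 2), "sessions")
```

In a full mesh the session count grows quadratically with the number of routers, so each additional device adds many paths along which updates can propagate.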

While we couldn't find anything wrong with the setup and configuration of the new devices, we decided to roll back both the physical and configuration changes after a few minutes. Unfortunately, the rollback did not return the situation to nominal, and our core network remained affected by the BGP update storm. At this point we started to identify where these BGP updates were coming from. This took a significant amount of time, as we attempted to isolate parts of the network and reload both the core network fabrics and the edge routers.
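As an illustration of what locating the source of an update storm can involve, the following minimal sketch ranks peers by the number of BGP updates seen within a time window. It assumes update events have already been collected as (timestamp, peer, prefix) records; the `UpdateEvent` type, the `noisiest_peers` function and the sample data are all hypothetical and are not the tooling used during this incident.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class UpdateEvent(NamedTuple):
    """One observed BGP update (hypothetical record format)."""
    timestamp: float  # seconds since an arbitrary reference point
    peer: str         # identifier of the peer that sent the update
    prefix: str       # prefix announced or withdrawn


def noisiest_peers(
    events: Iterable[UpdateEvent],
    window_start: float,
    window_end: float,
    top: int = 3,
) -> List[Tuple[str, int]]:
    """Return the peers with the most updates in [window_start, window_end)."""
    counts = Counter(
        event.peer
        for event in events
        if window_start <= event.timestamp < window_end
    )
    return counts.most_common(top)


if __name__ == "__main__":
    # Made-up sample data: peer "rr-1" sends far more updates than the others.
    sample = [UpdateEvent(float(t), "rr-1", "192.0.2.0/24") for t in range(50)]
    sample += [UpdateEvent(float(t), "edge-1", "198.51.100.0/24") for t in range(5)]
    sample += [UpdateEvent(100.0, "edge-2", "203.0.113.0/24")]
    print(noisiest_peers(sample, 0.0, 60.0))
```

A per-peer count like this only narrows down where updates arrive from; in a meshed or reflected topology the noisiest peer may simply be relaying updates that originate elsewhere.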

The issue has been mitigated by introducing several configuration changes to our edge routers and route reflector [2] topology. At this stage the root cause remains unknown, and further investigation will be conducted to pinpoint it. Planned network maintenance is to be expected in the coming days or weeks to prevent a similar issue from happening again.

