Exoscale
Platform Status

[Network] Increased internet network latencies and packet loss

Minor incident · CH-DK-2 · API, Compute, Database service (DBaaS), Managed Kubernetes (SKS), Network Load Balancer (NLB), Object Storage (SOS)
2023-09-10 20:31 CET · 19 hours, 5 minutes

Updates

Resolved

All systems are back to a nominal state. The failed router hardware (a line card) has been replaced.

Post Mortem

Today we experienced an internet connectivity blackout in CH-DK-2 of approximately 17 minutes. This blackout is related to the initial hardware failure we experienced over the course of last night.

An unfortunate human mistake during the hardware replacement led to a total loss of internet connectivity in the zone.

During the replacement operation, a healthy line card was mistakenly removed from one of our edge routers. This line card was carrying all the redundant backup connectivity for the zone. Everything was plugged back in as soon as the technician realised the mistake. Unfortunately, re-inserting a line card into the router triggers mandatory automatic hardware setup and testing steps before the network ports can be brought back online. This process took several minutes to complete before connectivity was restored. While the line card was initialising, we took the opportunity to replace the failed one in the other edge router, so that connectivity would recover as soon as either line card came back up.

Despite all the measures taken, this mistake led to a catastrophic connectivity loss. We are going to review our operational procedures and introduce safety checks in order to prevent a similar scenario from happening again.

We are deeply sorry for the inconvenience this outage has caused.

Should you have any questions, feel free to get in touch with our support.

The Exoscale Team

September 11, 2023 · 15:33 CET
De-escalate

Situation is back to nominal. We are continuing to monitor.

September 11, 2023 · 12:55 CET
Investigating

Mitigation applied, traffic is starting to recover

September 11, 2023 · 12:48 CET
Update

Root cause has been identified, mitigation in progress

September 11, 2023 · 12:47 CET
Escalate

We are investigating a massive connectivity issue in the zone.

September 11, 2023 · 12:38 CET
De-escalate

Incident is being reduced to minor level.

September 10, 2023 · 21:38 CET
Update

We expect to receive the part during the day on 11 September. Until then, our redundancy level will be N.

September 10, 2023 · 21:31 CET
Update

The crash is related to a hardware issue. We are working to get the required spare part on site. Internet connectivity is fully available.

September 10, 2023 · 21:07 CET
Monitoring

One of our core internet edge routers experienced a crash. Impacted connectivity has automatically failed over to alternate available paths.

September 10, 2023 · 20:35 CET
Issue

We are investigating increased internet network latencies and packet loss. We’ll post an update as soon as we have more information.

September 10, 2023 · 20:31 CET
