Exoscale
Platform Status

Network Connectivity Issues in SOF1

Major outage · BG-SOF-1 · API · Block Storage · Compute · Managed Database Service (DBaaS) · Managed Kubernetes (SKS) · Managed Private Networks · Network: Internet Transit Connectivity · Network Load Balancer (NLB) · Object Storage (SOS)
2026-05-04 12:16 CEST · 51 minutes

Updates

Post-mortem

Summary

On May 4th, 2026, all services in our BG-SOF-1 zone were unavailable for approximately 50 minutes. The outage was triggered by a routine configuration change applied to two core switching devices, the hardware through which all zone traffic passes. The change was part of preparatory work for a planned network upgrade.

The change itself was straightforward and required no special maintenance window. What caused the outage was a combination of factors: a behaviour in our legacy configuration tooling that silently replaced the full list of active network interfaces instead of adding to them; a configuration mistake that went undetected because that tooling does not expose the full resulting state before a change is applied; and an observation window between the two devices that was shorter than what our monitoring system needs to surface an issue.

This failure mode is specific to the older tooling and network infrastructure currently in place in BG-SOF-1. On the modern configuration system already running in our other zones, the full resulting state of any change is visible before it is applied, making this class of error significantly harder to miss. No data was lost during the incident.

What Happened

The BG-SOF-1 zone is undergoing a network revamp. Its edge routers, connecting our network to the internet, are being replaced with a newer generation. The core switching layer is planned next, as part of the zone overhaul. A configuration change was prepared to connect the new edge hardware to the two existing core switching devices. This is a routine type of change, performed regularly across the platform.

The change was reviewed and passed pre-deployment validation. What neither the validation output nor the reviewers could see was that our legacy configuration management tooling would rewrite the full list of active interfaces on each device rather than append to it, silently dropping the existing interfaces carrying live traffic. The legacy tooling does not produce a complete before-and-after view of what will be applied; it only confirms which configuration resources will be touched.

The change was applied to the first device, and a pause was observed before proceeding to the second, as is expected practice when modifying redundant components. However, our monitoring system requires more time than was allowed to detect and surface a connectivity issue. The first device appeared healthy, no alert had fired, and the change was applied to the second device. At that point, with both devices misconfigured, the zone lost connectivity entirely.

After attempts to revert the change and reboot the switches failed, our network team identified the root cause by inspecting the devices directly via our emergency management network and manually restored the correct configuration. Traffic began recovering at 10:49 AM UTC and the zone was fully back up by approximately 10:57 AM UTC.

Timeline (UTC)

10:02 AM Change applied to the first device. Pause observed. The device appears healthy, no alerts.
10:04 AM Change applied to the second device. Zone-wide connectivity is immediately lost.
10:06 AM On-call team identifies outage. Alarm raised.
10:08 AM Full incident response activated.
10:15 AM Emergency access established via emergency management network.
10:34 AM Configuration rollback applied. Device restarts attempted without full recovery.
10:46 AM Incomplete device configuration identified as root cause.
10:49 AM Manual correction applied to first device. Traffic begins recovering.
10:51 AM Same correction applied to second device.
10:57 AM Zone fully recovered.

What We Learned

Three factors combined to turn a routine change into a zone-wide outage. Our legacy configuration management tooling did not show the full resulting configuration, so the mistake in the change was not caught during review or validation. The observation window between the two devices was too short for our monitoring to fire an alert. And once the first device showed no visible problem, there was no signal to pause before updating the second.

Better tooling would have made the configuration mistake visible before deployment. A longer observation window would have given monitoring the time to catch it after the first device. Either would have been enough to prevent the outage. We did not have either in place.

What We’ve Changed

The required observation window between changes to redundant devices was extended and formalised, aligned with the time our monitoring system needs to reliably detect an issue.

Overhaul of the BG-SOF-1 zone edge and core networking to a more modern infrastructure is being prioritized.

Moving Forward

This outage had three contributing factors: a configuration mistake, a tooling gap that prevented it from being caught, and an observation window that was too short to catch it at runtime.

We are addressing all three. The validation improvements and the extended observation window are changes we are applying now. We are prioritizing the migration of BG-SOF-1 zone networking to align with the more modern infrastructure already running in our other zones. Once completed, all configuration changes will require an explicit, fully visible diff to be approved before anything is applied, eliminating the class of silent error that made this outage possible.

We sincerely apologize to all customers who experienced disruption during this incident. We understand how important the reliability of your infrastructure is, and we take full responsibility for falling short of the standard you expect from us. The changes described above reflect our commitment to ensuring this does not happen again.

May 5, 2026 · 23:07 CEST
Resolved

The issue has been resolved.

We’ll add a post-mortem as soon as the exact root cause and the course of events are established.

May 4, 2026 · 13:06 CEST
Monitoring

Connectivity is back to nominal. We are continuing to monitor the situation.

May 4, 2026 · 12:59 CEST
Monitoring

We are monitoring the connectivity recovery.

May 4, 2026 · 12:52 CEST
Update

Mitigation has been applied. Connectivity is recovering.

May 4, 2026 · 12:50 CEST
Update

The core network is affected. We applied a set of mitigations which unfortunately did not improve or restore connectivity. We are still working to restore connectivity.

May 4, 2026 · 12:38 CEST
Update

The issue has been identified. We are working on mitigation options.

May 4, 2026 · 12:27 CEST
Investigating

We are experiencing a major connectivity incident within the zone. We are investigating the issue.

May 4, 2026 · 12:18 CEST
Issue

We are currently investigating a network issue affecting the SOF1 zone.

May 4, 2026 · 12:16 CEST
