Increased network latency
Updates
Please find below the post-mortem for this incident.
Summary (CET time)
Date & Time: Monday, February 3rd. The outage started at around 01:07 and was fully mitigated at around 05:20.
Impact: Private Networks were partially unavailable in all zones, impacting all services relying on these networks.
Affected Services: Compute instances whenever Private Networks were used, Managed Private Networks, SOS, Block Storage, SKS control planes, all APIs, Portal
Resolution Overview: The incident was mitigated by restoring the routing daemons backing the Private Networks routing infrastructure.
Incident timeline (CET time)
Detection & Initial Response
Monday, February 3rd, 01:08 – The on-call Site Reliability Engineer (SRE) was paged due to transient alerts from multiple zones. The investigation began within minutes.
01:25 – The situation continued to degrade without a clear pattern emerging from our observability systems.
01:30 – The on-call engineer attempted an internal escalation, but due to the overwhelming volume of alerts, the escalation was mistakenly acknowledged without action.
Escalation & Investigation
01:49 – The on-call engineer identified the failed escalation and retriggered it. The investigation remained ongoing.
02:07 – Due to a timeout in the escalation process, the issue was automatically escalated to the next response layer.
02:08 – The initial incident status was broadcast.
02:13 – The escalation succeeded, and additional responders joined. More team members joined at 02:20.
02:25 – The investigation continued without a clear root cause. Meanwhile, internal observability and command-and-control systems were heavily impacted or non-operational, significantly slowing down troubleshooting efforts.
03:30 – The investigation narrowed down to the private network layer and infrastructure.
Resolution & Recovery
04:30 – The root cause was identified, and mitigation was prepared.
04:37 – Mitigation rollout began in CH-GVA-2 zone.
04:40 – Initial service recovery observed in CH-GVA-2 zone.
04:41 – Mitigation preparations began for all remaining zones.
04:45 – 04:52 – Mitigation was rolled out across all other zones.
04:55 – Rollout completed in CH-GVA-2, with private network and impacted services restoring connectivity.
05:05 – Mitigation completed across all zones.
05:10 – All zones continued recovering.
05:30 – Most services fully restored across all zones.
05:53 – Incident officially closed.
Root Cause
The issue was triggered by a scheduled cleanup job designed to remove unused components from our hosts. The job runs monthly, within a randomized 6-hour window across the fleet.
Unexpectedly, the job removed the active routing daemon from our hypervisors. This daemon is essential for advertising network routes for private networks on each hypervisor.
The daemon was incorrectly flagged for removal because of a race condition introduced during its initial installation. On many hypervisors, the daemon had unexpectedly been pulled in as a dependency of an external component. When that external component was removed earlier this month, the daemon was left as a dangling dependency, causing it to be incorrectly flagged as removable.
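For illustration, here is a minimal Python sketch of the dependency bookkeeping failure described above, assuming a cleanup pass that removes any component which was not explicitly installed and is no longer depended upon by anything else. The component names and the cleanup logic are hypothetical simplifications for this example, not our actual tooling.

    # Illustrative model only: hypothetical component names, simplified logic.
    # Each entry records whether the component was explicitly installed and
    # which other components it depends on.
    installed = {
        "external-component": (True, {"pn-routing-daemon"}),  # pulled the daemon in
        "pn-routing-daemon": (False, set()),                   # recorded as a dependency only
        "hypervisor-agent": (True, set()),
    }

    def removable(components: dict) -> set:
        """Components that are neither explicitly installed nor still
        depended upon by any remaining component."""
        still_needed = set()
        for _, deps in components.values():
            still_needed |= deps
        return {
            name for name, (explicit, _) in components.items()
            if not explicit and name not in still_needed
        }

    # Earlier this month: the external component is removed.
    del installed["external-component"]

    # Monthly cleanup run: the routing daemon is now a dangling dependency
    # and gets flagged for removal, even though the hypervisor still needs it.
    print(removable(installed))  # {'pn-routing-daemon'}

    # Marking the daemon as explicitly installed prevents this in the model.
    installed["pn-routing-daemon"] = (True, set())
    print(removable(installed))  # set()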
Corrective actions
The routing daemon has been restored on the impacted hypervisors. The unexpected dependency has been fixed. The scheduled cleanup job has been disabled across our whole infrastructure.
Lessons Learned & Improvements
This outage surfaced several areas for improvement and key learnings, detailed below:
- The time from detection to mitigation was longer than acceptable, due to several key factors:
  - A significant amount of time was lost due to internal escalation and sub-optimal response time. We will review both tooling and processes to ensure a faster and more reliable escalation workflow.
  - The outage disrupted our observability and command-and-control systems, severely limiting visibility into the issue. While we have cross-zone observability as a mitigation, it was ineffective due to the global nature of this incident. We are actively working on improving observability resilience to handle widespread infrastructure failures.
  - The randomized per-host 6-hour window of the purge job caused failures to occur gradually but quickly. This lack of an immediate, uniform failure pattern made it harder to detect the root cause early.
  - Despite internal escalation, overnight outages are inherently more challenging due to the limited number of available engineers. Key personnel were engaged quickly, but incident response capacity was lower than during business hours.
  - Initial checks on private network core routing appeared normal, but a reduced number of active peering hosts was overlooked, delaying identification of the private network as the failure point.
- The on-call SRE was overwhelmed by the volume of alerts and incident handling, leading to a delay in status broadcasting. We will review our incident communication process to ensure timely updates.
- Despite efforts to design zones as independent failure domains, this outage propagated globally, contradicting the expectation that multi-zone customers would remain protected. Long-term efforts will focus on strengthening failure containment within a single zone.
- While the scheduled purge job has served its purpose for years, it was too risky in its current form and was scheduled during a non-optimal time window. Future cleanup operations will follow a more controlled and coordinated approach.
- Our testing process did not detect the unintended dependency created when the external component was removed. We will refine our testing and validation procedures to better identify such dependencies in the future.
We sincerely apologize for the disruption caused by this incident. We understand the critical role our services play in your operations, and we deeply regret the challenges this may have created for you and your teams.
Our priority is always to provide a stable and reliable platform, and we recognize that this incident fell short of that commitment. We are taking this matter seriously and are fully committed to strengthening our systems to prevent future occurrences.
We truly appreciate your patience and trust. If you have any concerns or need further assistance, please don’t hesitate to reach out to our team.
The incident has been resolved.
The outage affected Private Network connectivity. All zones have been impacted, with impact levels ranging from partial to almost complete depending on the zone.
Most of our services were also severely affected during this outage as a side effect, including: SOS, SKS, APIs, Block Storage, Portal
Services with possible impact, still to be confirmed: DBaaS
We will provide a post-mortem in the coming days, as soon as all information has been collected and the root cause identified.
We have applied the mitigation across our infrastructure and we are still monitoring the situation.
A fix has been implemented; network connectivity and all services are now recovering.
We have found the root cause of the issue and are actively applying a fix.
We are still investigating a connectivity issue affecting multiple zones.
We are facing an outage of our Private Network service.
Most other products are facing either increased error rates or unavailability.