Exoscale
Platform Status

Increased error rates on APIs

Major outage · CH-GVA-2 · API · Block Storage · Compute · Managed Database service (DBaaS) · Managed Private Networks · Managed Kubernetes (SKS) · Network Internet Transit Connectivity · Network Load Balancer (NLB) · Object Storage (SOS)
2025-10-09 13:22 CEST · 1 hour, 29 minutes

Updates

Post-mortem

Summary

On October 9th, 13th, and 14th 2025, we experienced slow queries in our database engine (on the 9th, this included a full lock of the database engine), leading to multiple outages of the API orchestrating our Compute instances. During these outages it was impossible to create, delete, or update any workload.

Impact

October 9th: 1 hour 13 minutes of outage
October 13th: 39 minutes
October 14th: 1 hour

During each window, our Compute API was impacted: creations, deletions, and updates were blocked. The rest of our services were not affected.

What happened

October 9th:

During a routine rollout of a new version of the Compute orchestrator, we noticed that it could no longer connect to its database engine.
We found that many SQL queries were blocked and not being processed at all (see the sketch below).
The decision was taken to restart the database process in order to bring the service back as quickly as possible.
The combination of high load on the machine and the stuck queries left the process blocked in a state where it could not be killed.
We were forced to reboot the whole machine in order to bring the database back. In the meantime, we attempted to switch over from our leader database to a replica, but unfortunately the failover mechanism was unable to perform the switch while the leader was down.
Following the reboot of the instance, the database came back up and started to accept connections from our orchestrator.
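
The post-mortem does not name the database engine, but purely as an illustration, a check along the following lines can surface which sessions are blocked and by which other sessions. This is a minimal sketch assuming a PostgreSQL-compatible engine accessed through psycopg2; the connection string is hypothetical, not our actual configuration.

    # Sketch: list sessions that are blocked by another session (assumes PostgreSQL).
    import psycopg2

    # Hypothetical connection string, for illustration only.
    conn = psycopg2.connect("host=db.internal dbname=orchestrator user=ops")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT pid,
                   pg_blocking_pids(pid) AS blocked_by,
                   state,
                   wait_event_type,
                   now() - query_start AS running_for,
                   left(query, 120)    AS query
            FROM pg_stat_activity
            WHERE cardinality(pg_blocking_pids(pid)) > 0
            ORDER BY running_for DESC;
        """)
        for row in cur.fetchall():
            print(row)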

October 13th:

Our monitoring system raised alerts about issues with the Compute API.
The issue was the same as during the incident on the 9th: some queries were stuck in the database engine.
This time, killing these queries was sufficient to bring the API back.
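
To illustrate what killing stuck queries typically involves, here is a minimal sketch. Again, the database engine is not named in this post-mortem, so this assumes a PostgreSQL-compatible one accessed through psycopg2; the threshold and connection details are illustrative assumptions, not the values used during the incident.

    # Sketch: cancel backends whose query has been running longer than a threshold.
    # Assumes PostgreSQL; threshold and connection details are illustrative only.
    import psycopg2

    THRESHOLD = "10 minutes"  # hypothetical cut-off, not the value used in the incident

    conn = psycopg2.connect("host=db.internal dbname=orchestrator user=ops")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT pid, now() - query_start AS running_for, left(query, 80) AS query
            FROM pg_stat_activity
            WHERE state <> 'idle'
              AND now() - query_start > %s::interval
              AND pid <> pg_backend_pid();
            """,
            (THRESHOLD,),
        )
        for pid, running_for, query in cur.fetchall():
            print(f"cancelling pid={pid} (running for {running_for}): {query}")
            # pg_cancel_backend() asks the query to stop; if that is not enough,
            # pg_terminate_backend() drops the whole session.
            cur.execute("SELECT pg_cancel_backend(%s);", (pid,))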

October 14th:

Our monitoring system detected issues with the Compute API similar to those seen on the 13th. The incident was resolved in the same way, by killing the stuck queries.

What we learned

During our post-incident investigation, we identified several suboptimal queries that, while previously performing adequately, began to impact database performance at our current scale.
We also observed that our failover mechanism does not fully cover all failure scenarios and can be improved to handle degraded states more reliably.

What we’ve changed

We deployed a new version of our Compute orchestrator which fixed the suboptimal queries.

We also deployed adjustments to our database connections to fine-tune several timeouts and avoid keeping connections blocked for too long. With shorter timeouts, we expect to limit the impact of long-running queries on the database.
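
As an illustration of this kind of change, session timeouts can be bounded when the connection is opened. This is a minimal sketch assuming a PostgreSQL-compatible engine and psycopg2; the parameter values and connection details are assumptions, not our actual settings.

    # Sketch: open a database connection with bounded timeouts so a stuck query
    # or lock cannot hold the connection indefinitely (assumes PostgreSQL + psycopg2;
    # values and connection details are illustrative).
    import psycopg2

    conn = psycopg2.connect(
        "host=db.internal dbname=orchestrator user=ops",
        connect_timeout=5,  # give up on establishing the connection after 5 seconds
        options=(
            "-c statement_timeout=5000 "                    # abort statements running longer than 5s
            "-c lock_timeout=2000 "                         # abort if a lock cannot be acquired within 2s
            "-c idle_in_transaction_session_timeout=30000"  # drop sessions idle in a transaction for 30s
        ),
    )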

We are bringing more visibility into our database engine so that we can detect these issues more proactively. By logging more information about database engine health, we will be able to detect, and ultimately prevent, these long or slow queries.
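
As one possible form of that additional visibility, a periodic check can report queries that have been running beyond a threshold, so they appear in monitoring before they accumulate. A minimal sketch, again assuming a PostgreSQL-compatible engine; the threshold, polling interval, and connection details are hypothetical.

    # Sketch: periodically report queries running longer than a threshold
    # (assumes PostgreSQL; threshold, interval, and connection details are hypothetical).
    import time
    import psycopg2

    SLOW = "30 seconds"  # hypothetical threshold

    def report_slow_queries(conn):
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT pid, state, now() - query_start AS running_for, left(query, 120)
                FROM pg_stat_activity
                WHERE state <> 'idle'
                  AND now() - query_start > %s::interval;
                """,
                (SLOW,),
            )
            for pid, state, running_for, query in cur.fetchall():
                print(f"slow query: pid={pid} state={state} running for {running_for}: {query}")

    conn = psycopg2.connect("host=db.internal dbname=orchestrator user=ops")
    while True:
        report_slow_queries(conn)
        conn.rollback()  # end the read-only transaction between polls
        time.sleep(60)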
We are also conducting additional tests on our failover mechanism to cover edge-case scenarios and to ensure we can perform the switch in any critical situation.

October 28, 2025 · 17:28 CET
Resolved

The issue has been fully resolved.

October 9, 2025 · 14:00 CEST
Monitoring

We have identified the root cause and applied a mitigation.

October 9, 2025 · 13:48 CEST
Investigating

We are still investigating the issue. Additional responders have been called in to help with the diagnosis.

October 9, 2025 · 13:42 CEST
Issue

All API mutation operations are experiencing issues.

Services that are already running should not be affected.

October 9, 2025 · 13:22 CEST
