On 28th August 2020, the DE-FRA-1 SOS service started experiencing a significant increase in response latency and error rate.
The issue was the result of an operational mistake that led the service to partially run out of capacity.
When an object is deleted from the object storage service, the delete operation flags the object and its data blob as deleted but does not physically remove the data synchronously. This avoids adding excessive load and latency when handling massive delete operations. A lower-priority job is then scheduled to perform the data cleanup in the background.
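As a rough illustration, the flag-then-clean-up behaviour described above could be sketched as follows. This is a toy in-memory model, not the service's actual implementation; the names ObjectStore, collect and budget are hypothetical and chosen only for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Toy in-memory store: soft delete plus deferred physical cleanup."""
    blobs: dict = field(default_factory=dict)        # key -> bytes physically stored
    deleted: set = field(default_factory=set)        # keys flagged as deleted
    gc_queue: deque = field(default_factory=deque)   # keys awaiting physical removal

    def put(self, key: str, data: bytes) -> None:
        # Re-writing a key clears any pending deletion flag for it.
        self.deleted.discard(key)
        self.blobs[key] = data

    def delete(self, key: str) -> None:
        # Fast path: only flag the object; its bytes stay in place for now.
        if key in self.blobs and key not in self.deleted:
            self.deleted.add(key)
            self.gc_queue.append(key)

    def get(self, key: str) -> bytes:
        # Flagged objects are invisible to readers even though data remains.
        if key in self.deleted or key not in self.blobs:
            raise KeyError(key)
        return self.blobs[key]

    def collect(self, budget: int) -> int:
        """Background job: physically remove at most `budget` flagged objects."""
        removed = 0
        while self.gc_queue and removed < budget:
            key = self.gc_queue.popleft()
            if key in self.deleted:  # skip keys re-written since being flagged
                self.deleted.discard(key)
                del self.blobs[key]
                removed += 1
        return removed

    def physical_bytes(self) -> int:
        # Space actually occupied, including flagged-but-uncollected objects.
        return sum(len(b) for b in self.blobs.values())


store = ObjectStore()
store.put("a", b"x" * 100)
store.put("b", b"y" * 50)
store.delete("a")                           # returns immediately
bytes_before_gc = store.physical_bytes()    # "a" still occupies space
store.collect(budget=10)                    # background cleanup runs later
bytes_after_gc = store.physical_bytes()     # space for "a" is now reclaimed
```

The point of the sketch is that the space is only returned when collect runs, not when delete is called.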
Over the past weeks we received a very large number of data deletion requests. These requests were processed accordingly. However, the garbage collector that cleans up deleted objects quickly became overloaded and fell far behind: it could not keep up with the queued deletes or with the rate at which objects were being created and deleted. Our system ended up in a situation where free capacity was exhausted by objects that were flagged for deletion but not yet processed by the garbage collector. At that point we started to see an increase in errors for PUT requests.
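The arithmetic behind this can be sketched with a small simulation. All numbers below are made up for illustration and are not measurements from the incident; simulate_backlog and its parameters are hypothetical names.

```python
def simulate_backlog(capacity, live, write_rate, delete_rate, gc_rate, days):
    """Return the first day on which free space is exhausted, or None.

    Units are arbitrary (for example TB and TB/day). `live` is readable data,
    `pending` is data flagged as deleted but not yet reclaimed.
    """
    pending = 0
    for day in range(1, days + 1):
        live += write_rate - delete_rate   # deletes move data out of "live"...
        pending += delete_rate             # ...and into the cleanup backlog
        pending -= min(pending, gc_rate)   # the collector reclaims what it can
        if live + pending >= capacity:
            return day
    return None


# Collector slower than the delete rate: the backlog grows by 30 units per day.
day_full = simulate_backlog(capacity=1000, live=700, write_rate=50,
                            delete_rate=50, gc_rate=20, days=30)

# Collector keeping pace with deletes: usage stays flat.
day_full_ok = simulate_backlog(capacity=1000, live=700, write_rate=50,
                               delete_rate=50, gc_rate=50, days=30)
```

With these illustrative values, live data never grows, yet the first scenario fills the backend on day 10 purely because of unreclaimed deletes, while the second never fills up.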
The fix consisted of running the garbage collector in the foreground with high priority on all storage backends. This resulted in high I/O load and heavily degraded service performance for several days until the process completed. The most noticeable impact was between 28th and 31st August 2020.
Stored data was never at risk at any time during the outage.
Where we failed
We failed to monitor the overall progress of the garbage collector job. We wrongly assumed that deleted objects would be removed in time by the process. In addition, from an allocation perspective, the space used by deleted objects was not taken into account and was considered free. Our monitoring made the same mistake and also counted that space as free.
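The accounting gap can be illustrated with a short sketch. The names BackendUsage, naive_free, actual_free and gc_backlog_alert, as well as the 5% threshold, are assumptions made for the example and do not describe the real allocator or monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class BackendUsage:
    capacity: int        # total space on the backend
    live: int            # space used by readable objects
    pending_delete: int  # space flagged as deleted but not yet reclaimed

    def naive_free(self) -> int:
        # Mistaken view: pending-deletion space is treated as already free.
        return self.capacity - self.live

    def actual_free(self) -> int:
        # Corrected view: unreclaimed space still occupies the backend.
        return self.capacity - self.live - self.pending_delete


def can_allocate(usage: BackendUsage, size: int) -> bool:
    # Allocation decisions should be based on space that is really available.
    return size <= usage.actual_free()


def gc_backlog_alert(usage: BackendUsage, max_pending_ratio: float = 0.05) -> bool:
    # Hypothetical alert: fire when the backlog exceeds a share of capacity.
    return usage.pending_delete > max_pending_ratio * usage.capacity


usage = BackendUsage(capacity=1000, live=600, pending_delete=380)
looks_free = usage.naive_free()        # what the mistaken accounting reports
really_free = usage.actual_free()      # what can actually be written
fits = can_allocate(usage, 100)        # a 100-unit write does not fit
alert = gc_backlog_alert(usage)        # backlog is far above the threshold
```

In this example the naive figure reports 400 units free while only 20 are usable, which is the kind of discrepancy that stays invisible unless the backlog itself is tracked.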
We have started to completely rework the garbage collection process and its monitoring to ensure we do not end up in a similar situation again.
We are truly sorry for the inconvenience caused by this outage. We are putting everything in place to ensure a similar issue does not happen again.