On September 2nd, 2018, we suffered a wave of hard drive failures, some of which affected all three replicas of the same partitions. Within Exoscale SOS, objects are split into small blobs of up to 16 MB and distributed among logical 100 GB partitions, each of which is replicated three times. This approach allows better distribution of object contents and better throughput profiles, but it also means that the loss of a single blob corrupts the whole object.
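To illustrate why one missing blob makes an entire object unreadable, here is a minimal sketch of the splitting scheme described above. It is not the SOS implementation: the function names, the use of `None` to mark a lost blob, and the reading of "16 MB" as 16 MiB are assumptions made for illustration only.

```python
from typing import List, Optional

# Assumed maximum blob size (16 MiB); the real limit may be defined differently.
BLOB_SIZE = 16 * 1024 * 1024


def split_into_blobs(data: bytes, blob_size: int = BLOB_SIZE) -> List[bytes]:
    """Cut an object into consecutive blobs of at most blob_size bytes."""
    return [data[i:i + blob_size] for i in range(0, len(data), blob_size)]


def read_object(blobs: List[Optional[bytes]]) -> bytes:
    """Reassemble an object from its blobs; None marks a blob that was lost."""
    if any(blob is None for blob in blobs):
        # Every blob is required: one gap is enough to corrupt the object.
        raise OSError("object corrupted: at least one blob is missing")
    return b"".join(blobs)
```

With a small `blob_size` for demonstration, `split_into_blobs(b"abcdefghij", blob_size=4)` yields three blobs, and replacing any one of them with `None` makes `read_object` fail even though the other blobs are intact.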
This exceptional hard drive failure rate (10%, where we normally plan for 2% to 3%) led to the failure of all three replicas of a subset of partitions. Coupled with an edge case in the replication logic, in which drives incorrectly reported writes as successful, this resulted in the loss of several blobs. As a result, a number of objects across several customer buckets were corrupted and are now inaccessible.
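As a rough illustration of how sensitive triple replication is to the drive failure rate, the sketch below estimates the chance that all three replicas of a partition are lost. It rests on strong assumptions that are ours, not measured facts: each replica sits on a different drive, drives fail independently, and the quoted rates apply to the same time window. Failures during a wave like this one may well be correlated, so the numbers are indicative only.

```python
def all_replicas_lost(p_drive: float, replicas: int = 3) -> float:
    """Probability that every replica fails, assuming independent drive failures."""
    return p_drive ** replicas


usual = all_replicas_lost(0.03)     # upper end of the usual 2% to 3% rate
observed = all_replicas_lost(0.10)  # rate seen during the incident

# Ratio between the two estimates under the independence assumption.
increase = observed / usual
```

Under these assumptions, going from a 3% to a 10% drive failure rate raises the estimated chance of losing all three replicas by a factor of roughly 37.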
We have taken, and will continue to take, a number of steps to rectify the situation, including opening a case with our hard drive vendors to understand the much higher than usual failure rate we have seen in drives recently. We have also tightened and improved our software to better cope with hardware failures, whether transient or not.
Regardless, for over six years our key focus has been to ensure that your data and workloads are safe, and today we have failed to deliver on that promise, for which we are sorry. Going forward, we will make the availability and durability guarantees of our object storage platform clearer and will communicate about the improvements we bring to the platform.
Sincerely, The Exoscale team