On 2022/06/29 at 07:33 UTC, our domain exo.io was suspended by the .io top-level domain (TLD) registry following an abuse report.
The domain was re-enabled on 2022/06/29 at 12:18 UTC and progressively came back online as the caches of internet resolvers refreshed.
The following services were impacted during the outage:
- Object Storage service (SOS) in all zones
- Scalable Kubernetes Service (SKS) control plane APIs in all zones
- Compute snapshots and template registration (relying on SOS)
How could this ever happen?
Before answering this question, it's necessary to shed some light on the setup behind the exo.io domain.
The exo.io DNS zone is hosted on our own Exoscale DNS service. This DNS service isn't operated directly by Exoscale: DNSimple, our long-time partner, operates it for us.
For the exo.io domain, DNSimple both hosts the DNS zone and acts as the domain registrar. This is required for domains that enable DNSSEC. DNSSEC helps improve zone security for services like SOS and SKS. Without DNSimple being the registrar, DNSSEC wouldn't be an option.
DNSimple itself isn't a true registrar. They are a reseller and therefore rely on an upstream registrar. Above this upstream registrar sits the .io TLD registry.
Now that we have the overall picture of the setup, let's get into the issue. On June 10th, the .io TLD registry sent an abuse notification to DNSimple's upstream registrar. Unfortunately, this notification was NEVER forwarded to Exoscale or DNSimple and therefore went unanswered. On June 29th, the .io TLD registry suspended our exo.io domain due to the lack of feedback.
What Exoscale does about abuse
As a cloud provider, abuse is sadly a recurring topic that we are familiar with and deal with on a daily basis.
Exoscale is both proactive and reactive on this matter. We won't go into much detail on the proactive part here, for obvious reasons. However, we can say that every abuse report addressed to us is handled immediately and appropriately.
Ironically, the abuse case behind the report we never received, and which resulted in the domain suspension, had already been resolved before the initial notification was sent to the registrar on 2022/06/10. We had received it through another channel and handled it the day before.
Why did it take so long to get the domain un-suspended?
As with most outages, multiple factors came into play.
The chain of providers involved above DNSimple played a critical role. As soon as we detected the spreading outage, we reached out to DNSimple's priority support. We quickly had them investigating the issue. Following an initial assessment, they issued a support request to their upstream registrar.
The latency along the chain of intermediaries negatively impacted the time to recovery. DNSimple also did not get immediate feedback from the upstream registrar on their support request, which significantly delayed the resolution.
Initially, both we and DNSimple failed to figure out that the cause was a domain suspension. Even though there were some signs pointing to it, the initial investigation focused on a potential TLD-wide issue related to DNSSEC. That being said, we should also keep in mind that we had not been notified of the suspension, and that the only visible symptom was the TLD name servers no longer returning the exo.io name servers in their responses.
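The symptom described above, TLD name servers no longer returning the name servers of the domain, can be reduced to a simple check. The following is a minimal Python sketch and not a description of any production tooling: the `TldAnswer` type, the `delegation_state` function and the server names in the usage example are hypothetical, and it assumes the answers have already been collected by querying each TLD name server directly (for example with a DNS library or `dig`).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TldAnswer:
    """What one TLD name server returned when asked for the domain's NS records."""
    server: str                   # TLD name server queried (placeholder names below)
    rcode: str                    # DNS response code as reported, e.g. "NOERROR"
    delegated_ns: Sequence[str]   # NS names found in the referral, possibly empty


def delegation_state(answers: Sequence[TldAnswer]) -> str:
    """Summarise whether the TLD still delegates the domain.

    Returns "delegated" if every TLD server returned name servers,
    "missing" if none did, and "inconsistent" otherwise.
    """
    if not answers:
        raise ValueError("at least one TLD answer is required")
    present = [bool(answer.delegated_ns) for answer in answers]
    if all(present):
        return "delegated"
    if not any(present):
        return "missing"
    return "inconsistent"


if __name__ == "__main__":
    # Hypothetical data: both TLD servers answer, but without any delegation.
    answers = [
        TldAnswer("tld-ns1.example", "NXDOMAIN", ()),
        TldAnswer("tld-ns2.example", "NXDOMAIN", ()),
    ]
    print(delegation_state(answers))  # prints "missing"
```

A "missing" result from every TLD server points at the registry level (for instance a suspension) rather than at the zone hosting or at DNSSEC signing, which is the distinction that was hard to make during the incident.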
Lessons learnt & improvements
Following such an event, there is obviously a set of actions to deploy to ensure it does not happen again.
First, we are working closely with DNSimple to ensure that future abuse reports reach us without exception. We are also reviewing their registrar setup and looking for any improvements they could implement on their side.
In the short term, we are also going to introduce additional monitoring of our domains to catch any changes made by the registrar earlier.
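One way to picture such monitoring is to compare the current registration data of a domain against a known-good baseline. The sketch below is an illustration under assumptions, not the monitoring actually deployed: `DomainSnapshot` and `detect_changes` are hypothetical names, the name server values are placeholders, and how a snapshot is fetched from the registry or registrar is left out. `serverHold` and `clientHold` are standard EPP status codes indicating that a domain is not published in the DNS.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Standard EPP status codes under which a domain is not activated in the DNS.
HOLD_STATUSES = frozenset({"serverHold", "clientHold"})


@dataclass(frozen=True)
class DomainSnapshot:
    """Registration data for one domain at one point in time (hypothetical shape)."""
    statuses: FrozenSet[str]
    name_servers: FrozenSet[str]


def detect_changes(baseline: DomainSnapshot, current: DomainSnapshot) -> List[str]:
    """Return human-readable alerts for differences between two snapshots."""
    alerts: List[str] = []
    # A hold status is always critical, whether or not it is new.
    for status in sorted(current.statuses & HOLD_STATUSES):
        alerts.append(f"CRITICAL: hold status set: {status}")
    # Any other status that appeared or disappeared is worth a look.
    for status in sorted(current.statuses - baseline.statuses - HOLD_STATUSES):
        alerts.append(f"WARNING: status added: {status}")
    for status in sorted(baseline.statuses - current.statuses):
        alerts.append(f"WARNING: status removed: {status}")
    if current.name_servers != baseline.name_servers:
        added = ", ".join(sorted(current.name_servers - baseline.name_servers)) or "none"
        removed = ", ".join(sorted(baseline.name_servers - current.name_servers)) or "none"
        alerts.append(f"WARNING: name servers changed (added: {added}; removed: {removed})")
    return alerts


if __name__ == "__main__":
    ns = frozenset({"ns1.example.net", "ns2.example.net"})  # placeholder values
    baseline = DomainSnapshot(frozenset({"ok"}), ns)
    current = DomainSnapshot(frozenset({"serverHold"}), ns)
    for alert in detect_changes(baseline, current):
        print(alert)
```

Treating a hold status as critical on every run, rather than only when it first appears, is a deliberate choice in this sketch: an alert that stays active until the hold is lifted is harder to miss.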
As a longer-term effort, we will also look at options to move away from .io domains for services hosting user content, such as exo.io. Nic.io has stricter abuse policies than other, more traditional TLDs. This makes .io domains more prone to this kind of suspension. Unfortunately, the SOS object storage service makes extensive use of URLs, which makes such a move hard to deploy and not viable in the short term. Moving the SKS service is easier and can be expected mid-term.
Such an incident is unacceptable and should never have happened in the first place. Unfortunately, it went beyond our control. We are reviewing every available option to either lower or completely remove the risk of it happening again. We are deeply sorry for the impact and inconvenience it caused.
Should you have any questions, feel free to get in touch with our support.
Loic Lambiel, VP Platform