We understand that the recent service disruption on September 16, 2023, had a significant impact on your business operations, and we want to provide you with a detailed account of what transpired during this incident.
Chronology of Events:
07:33 UTC: An issue surfaced that affected a large portion of customers using Azure’s mission-critical database service.
Immediate Response: Our internal monitoring systems alerted us to the issue, and we immediately began investigating its scope and impact.
Root Cause Identified: After a thorough investigation, we determined that the core issue was an unexpected power disruption in our cloud provider’s underlying network infrastructure. This disruption made certain compute nodes temporarily unavailable, which in turn caused failures and timeouts for SQL Database operations.
Mitigation Steps: To mitigate the initial impact and restore functionality as swiftly as possible, we took the following actions:
Alternative Connectivity: We attempted the vendor-suggested workaround of connecting via a new tunnel, but the operation timed out due to the incident.
Disaster Declaration: We began executing a previously prepared plan to restore database replicas in an alternate region; our backups are geo-replicated to account for exactly this scenario. Given the size of the dataset, the restore would have taken 15-20 hours from that point.
Provider Recovery: The Azure team confirmed restoration of the service. We abandoned the effort to bring the service up in the alternate region and focused on restoring services in the primary location.
21:38 UTC: We achieved complete service recovery, resolving the issue and restoring normal operation for all new sessions.
Impact to Our Service:
Most application clients were able to reconnect automatically upon service restoration (the sketch after this list illustrates the general pattern).
A small subset required users to log in again.
Most drives captured during the downtime were cached on the device and posted to the platform upon service restoration.
A small subset of drives captured during the incident became irrecoverable. Affected customers are advised to use manual entry.
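For context on the automatic reconnection described above, the following is a minimal sketch of a generic retry-with-backoff loop. It is an illustration only: the connect callable, the TransientConnectionError type, and the retry limits are assumptions, not our production client code.

```python
import random
import time


class TransientConnectionError(Exception):
    """Raised when the service is unreachable or a request times out. (Illustrative.)"""


def connect_with_backoff(connect, max_attempts=8, base_delay=1.0, max_delay=60.0):
    """Retry `connect` with exponential backoff and jitter until it succeeds
    or the attempt budget is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except TransientConnectionError:
            if attempt == max_attempts:
                raise  # surface the failure; the app can then ask the user to log in again
            # Cap the delay and add jitter so reconnecting clients do not all
            # hit the service at the same instant when it comes back.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))
```

The jitter is the important detail: it spreads reconnection attempts out so a restored service is not overwhelmed by every client retrying at once.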
Our Commitment to You:
Our choice of infrastructure is our business, not yours, and we take full responsibility for this disruption.
We’re working, via our support team, with individual users who may still be experiencing lingering effects of the issue or who have questions.
We’re pursuing an investigation with our provider to understand why cross-AZ failover didn’t occur as planned.
We’re fixing the reasons drives were not persisted while the service was unavailable and ensuring high durability on-device (see the buffering sketch after this list).
We’re pursuing code improvements that would allow us to buffer incoming data upstream of the database even with the loss of authentication.
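As a rough illustration of the last two commitments, the sketch below shows a write-ahead, on-device buffer: each drive is persisted to local storage before any upload is attempted, and the spool is cleared only after the platform confirms receipt. The spool path, the file format, and the upload callable are hypothetical and are shown only to explain the approach; the same pattern applies to buffering data upstream of the database.

```python
import json
import os
import uuid

# Hypothetical spool location for illustration only.
SPOOL_DIR = "/var/spool/drive-buffer"


def buffer_drive(record: dict) -> str:
    """Write the drive record to local durable storage before any upload attempt."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{uuid.uuid4()}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())  # ensure the record survives a crash or power loss
    os.replace(tmp, path)     # atomic rename: never leave a half-written file
    return path


def flush_spool(upload) -> None:
    """Try to post every buffered drive; keep anything the platform has not accepted."""
    if not os.path.isdir(SPOOL_DIR):
        return
    for name in sorted(os.listdir(SPOOL_DIR)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(SPOOL_DIR, name)
        with open(path) as f:
            record = json.load(f)
        try:
            upload(record)    # may raise while the service is still unavailable
        except Exception:
            return            # stop and retry the whole spool later
        os.remove(path)       # delete only after a confirmed upload
```

The key design choice is ordering: data is committed locally first and deleted only after a confirmed upload, so a service outage delays delivery rather than losing drives.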