Amazon Explains the Cause of Its Northern Virginia Cloud Failure
Amazon Web Services (AWS) has issued a formal apology to customers following Monday’s widespread outage that crippled thousands of websites and online services worldwide. Major platforms such as Snapchat, Reddit, and Lloyds Bank were among the more than 1,000 services affected after a core system failure in AWS’s Northern Virginia (US-EAST-1) region disrupted operations on October 20.
In a detailed statement released Wednesday, Amazon said the outage stemmed from “errors that prevented our internal systems from resolving IP addresses,” effectively breaking the link between web domains and their underlying servers. “We apologise for the impact this event caused our customers,” the company said. “We know how critical our services are to our customers, their applications, end users, and businesses.”
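In practical terms, “resolving IP addresses” is the lookup every client performs before it can connect to a service. A minimal Python sketch of that step (the endpoint name here is only an illustrative example, not a detail from Amazon’s report) shows why a broken DNS record takes a service offline even when the servers behind it are healthy:

```python
import socket

# When the DNS record for a name is missing or empty, the resolver cannot
# map the domain to an IP address, so the client never gets far enough to
# open a connection. The endpoint below is only an illustrative example.
try:
    addresses = socket.getaddrinfo("dynamodb.us-east-1.amazonaws.com", 443)
    print([entry[4][0] for entry in addresses])  # the resolved IP addresses
except socket.gaierror as err:
    # This branch is the client-side symptom of the outage: the name lookup fails.
    print(f"DNS resolution failed: {err}")
```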
Global Impact and Extended Downtime
While most services—including Fortnite and Roblox—were restored within a few hours, several others, including Lloyds Bank and U.S. payment app Venmo, suffered extended downtime well into the afternoon. The outage even reached into unexpected areas: smart bed manufacturer Eight Sleep reported that some of its internet-connected “sleep pods” overheated or got stuck in awkward positions due to loss of connectivity.
The disruption underscored how dependent much of the internet remains on AWS, which dominates global cloud infrastructure alongside Microsoft Azure. Experts say the outage served as a warning about the risks of overreliance on a handful of providers for critical digital operations.
Inside the Failure: “Faulty Automation” and DNS Breakdown
According to Amazon’s post-mortem, the outage originated in the Domain Name System (DNS) records for its primary regional database service. DNS acts as the internet’s address book, translating website names into machine-readable IP addresses. An automated process responsible for keeping those records up to date fell out of sync, creating what Amazon called a “latent race condition”: a dormant software defect whose outcome depends on the timing of competing automated processes, and which surfaced only after an unlikely sequence of events.
This triggered cascading system failures, effectively cutting off access to critical backend services. “The specific technical reason is a faulty automation broke the internal ‘address book’ systems that region relied upon,” explained Dr. Junade Ali, a software engineer and fellow of the Institution of Engineering and Technology. “So they couldn’t find one of the other key systems.”
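Amazon has not published the code behind the failure, so the following is only a toy illustration of what a “latent race condition” looks like: two automated updaters apply plans to a shared record without coordination, and an unlucky interleaving lets a stale plan overwrite a newer one.

```python
import threading
import time
import random

# Toy illustration of a latent race condition, not AWS's actual system:
# two automated updaters write a shared DNS-style record. The bug only
# appears under unlucky timing, which is what makes it "latent".
record = {"host": "service.internal", "ip": None, "version": 0}

def apply_plan(worker, version, ip):
    time.sleep(random.uniform(0.0, 0.2))      # variable processing delay
    if version > record["version"]:           # check that the plan is newer...
        time.sleep(random.uniform(0.0, 0.1))  # ...window where the other worker can write...
        record["ip"] = ip                     # ...then act, non-atomically
        record["version"] = version
        print(f"{worker} applied plan v{version} -> {ip}")

threads = [
    threading.Thread(target=apply_plan, args=("updater-A", 2, "10.0.0.2")),
    threading.Thread(target=apply_plan, args=("updater-B", 1, "10.0.0.1")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Depending on timing, some runs end on the stale v1 address: updater-B
# checked before updater-A wrote, then wrote afterwards.
print("final record:", record)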
Because much of AWS’s internal infrastructure is automated, the problem spread rapidly before engineers could manually intervene. The company said it is now taking steps to prevent similar incidents, including improvements to its fault isolation systems and better testing of automated recovery tools.
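Amazon has not said in detail what those safeguards will look like. One common fault-isolation pattern, sketched below with hypothetical names, is a circuit breaker that pauses an automated workflow after repeated failures so a fault stops spreading until engineers can review it.

```python
# Hypothetical sketch of a fault-isolation safeguard for automation, not a
# description of AWS's internal tooling: after a burst of failures the
# breaker "opens" and refuses further automated changes until reset.
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0
        self.open = False

    def run(self, change):
        if self.open:
            raise RuntimeError("breaker open: automated changes paused for manual review")
        try:
            result = change()
            self.failures = 0          # a healthy run resets the count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open = True       # stop the automation from spreading the fault
            raise

breaker = CircuitBreaker()

def flaky_dns_update():
    raise ConnectionError("endpoint not reachable")

for attempt in range(5):
    try:
        breaker.run(flaky_dns_update)
    except Exception as err:
        print(f"attempt {attempt + 1}: {err}")
```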
Calls for Greater Cloud Resilience
Industry experts emphasized that the outage highlights the need for greater cloud redundancy and diversification. “Companies that had a single point of failure in this Amazon region were the ones taken offline,” Dr. Ali said. He and others have urged businesses to adopt multi-cloud strategies to mitigate such risks in the future.
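At its simplest, that kind of redundancy means having a second place to send traffic when the first stops answering. The sketch below illustrates the idea; the endpoint names are hypothetical placeholders for equivalent deployments with different providers or regions, not real services, so running it as-is exercises the failure path.

```python
import socket

# Illustrative sketch of avoiding a single point of failure, not a drop-in
# multi-cloud framework. The hostnames are hypothetical placeholders and
# will not resolve, which demonstrates the fallback and error handling.
ENDPOINTS = [
    "api.primary-region.example.com",    # e.g. a deployment in US-EAST-1
    "api.secondary-region.example.com",  # e.g. another region or another cloud
]

def resolve_first_healthy(endpoints):
    """Return the first endpoint whose DNS name still resolves."""
    for host in endpoints:
        try:
            socket.getaddrinfo(host, 443)
            return host
        except socket.gaierror:
            continue  # this endpoint is unreachable; try the next one
    raise RuntimeError("all endpoints failed to resolve")

try:
    print("routing traffic to", resolve_first_healthy(ENDPOINTS))
except RuntimeError as err:
    print(err)
```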
AWS concluded its statement by pledging to “do everything we can to learn from this event and improve availability.” Despite the global reach of the outage, the company reaffirmed that customer data remained secure and that no malicious activity was detected.