April 29, 2011

Root Cause for Amazon EC2, RDS Failure

Root cause for Amazon EC2, RDS failure, by Amazon.

After the recent EC2 and RDS failures, which affected many websites running on AWS, Amazon has summarized (at the link above) what caused the failures, how they were fixed, and the steps being taken to prevent similar issues in the future.

The Cause
In Amazon’s US East data center, an incorrect manual update to the network configuration led to traffic being routed to a lower-capacity secondary network (instead of a high-capacity primary network), saturating the secondary network.

This left the affected EBS (Elastic Block Store) nodes without a usable primary or secondary network, effectively isolating them and leaving them unable to mirror their data.

When network connectivity was restored, a large number of EBS nodes tried to mirror their data at the same time. This overwhelmed the underlying resources, so the mirroring operations could not complete and the EBS nodes got ‘stuck’.

Once the EBS nodes were ‘stuck’, the EC2 instances that depended on them started failing and got ‘stuck’ as well. The stuck EBS nodes also caused the RDS instances that relied on them to fail.

Details at Amazon’s post.

In addition to the steps it is taking itself, Amazon recommends that cloud-based websites be designed to be fault-tolerant, and points to best practices for achieving fault tolerance.
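As an illustration of one such practice: the mirroring storm described above, where many nodes retried at once and overwhelmed shared resources, is the kind of situation that retrying with exponential backoff and random jitter is commonly used to soften. The sketch below is a generic Python example of that technique, not Amazon's implementation. The names `backoff_delay` and `retry_with_backoff`, the default values, and the use of `ConnectionError` as the retryable failure are all assumptions made for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds.

    Randomizing the delay ("jitter") spreads out clients that failed at
    the same moment, so they do not all retry at the same moment too.
    """
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def retry_with_backoff(operation, max_attempts=5, base=0.5, cap=30.0,
                       sleep=time.sleep, rng=random):
    """Call operation() until it succeeds or max_attempts is reached.

    ConnectionError is treated as the retryable failure here; that is an
    assumption for this sketch, and real code would pick its own errors.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
```

A caller would wrap the network operation it wants to protect, for example `retry_with_backoff(lambda: fetch_from_storage())`, where `fetch_from_storage` is a hypothetical function standing in for any call that may fail while a dependency is overloaded. The `sleep` parameter is injectable so the waiting behaviour can be tested without actually pausing.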