Amazon Apologizes For Cloud Disaster [Silicon Alley Insider]
Amazon is now explaining exactly what happened to its cloud service and what led to the disaster. In a very long and extremely technical response, Amazon detailed why the collapse occurred, why the service is safe now, and how it is trying to prevent this type of disaster from occurring again.
Basically, the collapse occurred because Elastic Block Store (EBS) volumes in the U.S. East region suddenly became unable to service the read and write operations that are responsible for distributing, storing, and retrieving data. On April 21st the U.S. East region was scheduled to undergo a network change to upgrade its capacity. During this procedure, traffic is shifted off one of the redundant routers in the network. The shift was executed improperly and sent the traffic to a lower-capacity network that couldn't handle the load. This ended up isolating the primary and secondary networks, and several EBS nodes within a cluster lost their connection. These nodes became "stuck," continuously searching for the necessary storage space. This seemed to have a domino-like effect, impacting a number of databases.
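The "stuck" behavior can be illustrated with a toy model. The sketch below is not Amazon's implementation; the function name, its parameters, and the one-slot-per-node rule are all assumptions made for illustration. It only shows why nodes that keep searching for scarce storage space generate ever more requests without making progress.

```python
def simulate_stuck_nodes(stranded_nodes, free_slots, max_rounds):
    """Toy model (hypothetical): each stranded node needs one free storage slot.

    Returns (nodes_still_stuck, total_search_requests).
    """
    stuck = stranded_nodes
    requests_sent = 0
    for _ in range(max_rounds):
        if stuck == 0:
            break
        # Every node that is still stuck searches the cluster again.
        requests_sent += stuck
        # Only as many nodes as there are free slots can succeed this round.
        claimed = min(stuck, free_slots)
        free_slots -= claimed
        stuck -= claimed
    return stuck, requests_sent


if __name__ == "__main__":
    # Plenty of spare space: every node finds a slot on the first search.
    print(simulate_stuck_nodes(stranded_nodes=10, free_slots=20, max_rounds=5))
    # Too little spare space: the remaining nodes keep searching every round.
    print(simulate_stuck_nodes(stranded_nodes=10, free_slots=4, max_rounds=5))
```

In this model, once the free space runs out, the remaining nodes never recover on their own and the search traffic simply keeps growing with each round, which is one way to picture the domino-like effect described above.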
Amazon goes into much greater detail regarding the crash and even greater detail on its efforts to fix the problem and prevent similar crashes in the future. Amazon is auditing its network change process to prevent the triggering mistake from happening again. It is also working to give EBS further protection against possible future failures. Further, Amazon is working on a strategy to help EBS recover fully and more quickly.
Amazon admitted that it had some problems with the way it was communicating with customers, and those procedures will be reviewed as well. Customers in the affected area will even receive a 10-day credit for 100% usage, regardless of whether they were fully impacted by the outage.
Finally, Amazon did apologize for the outage. They committed themselves to learning from the event and improving their services in the future.