Another Amazon Outage Exposes the Cloud's Dark Lining
It was likely another eventful weekend for the engineers in Amazon’s Web services division. On Sunday afternoon, a hardware failure at Amazon’s U.S.-East data center in Northern Virginia led to spiraling problems at a host of well-trafficked online services, including Instagram, Vine, Airbnb, and the popular mobile magazine app Flipboard.
Amazon blamed the outage on glitches with a single networking device—what it called a “‘grey’ partial failure” that resulted in data loss—and said it was conducting a forensic analysis. The entire incident lasted all of 49 minutes, but like many recent cloud service outages, the resulting questions are likely to last considerably longer. Why are so many prominent Web companies overly dependent on a single cloud provider—and a single data center?
Amazon and other cloud providers preach the virtues of geographical redundancy: They say customers should spread out their services among multiple data centers, so if one goes down, another can pick up the slack. Sunday’s outage, like so many other recent cloud service snafus, demonstrates that few cloud customers are properly following this orthodoxy and that true redundancy may be much more complicated than it sounds.
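The failover idea the providers describe can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not how Amazon or any of the affected companies route traffic: the region labels, the `first_healthy` helper, and the simulated health check are all hypothetical.

```python
from typing import Callable, Sequence


def first_healthy(regions: Sequence[str], is_healthy: Callable[[str], bool]) -> str:
    """Return the first region, in preference order, that passes its health check."""
    for region in regions:
        if is_healthy(region):
            return region
    raise RuntimeError("no healthy region available")


# Hypothetical region labels, listed in order of preference.
REGIONS = ["region-a", "region-b", "region-c"]

# Simulate an outage in the preferred region.
down = {"region-a"}

active = first_healthy(REGIONS, lambda r: r not in down)
print(active)  # prints region-b
```

Even in this toy form, the hard parts are hidden inside the health check and in keeping data synchronized across regions, which is where much of the real cost and complexity of redundancy comes from.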
Amazon engineer James Hamilton once examined this issue. “Inside a single facility, there are simply too many ways to shoot one’s own foot,” he wrote on his personal blog. While he’s a redundancy advocate, Hamilton also conceded that “with incredible redundancy, comes incredible cost.” Companies fear that employing backup data centers will drive up their expenses and exacerbate latency—the lag time a customer might experience when using their websites or apps.
Another problem is that it’s currently too difficult for companies that use cloud services to hedge their bets across multiple providers. Ideally a company might want to buy infrastructure services from one firm but use others as backups. As one might imagine, the major cloud providers don’t eagerly support this kind of interoperability. OpenStack, an open-source project that can plug into the services of Amazon, Google, and others, has been created to make this easier. But it’s still early for such projects, and corporate customers have yet to coalesce around any of them.
The biggest surprise from Sunday’s outage may be who was affected. Instagram is now owned by Facebook, which has invested considerable resources to create its own global data network. Vine is owned by Twitter, which has similarly built up hard-won infrastructure expertise of its own. Both likely have some degree of redundancy, but it clearly wasn’t enough. And apparently Facebook and Twitter have yet to bring their recent acquisitions onto their own computing networks. After Sunday, they may seek to hasten that move.