Last Tuesday (2/28/2017), Amazon’s S3 web service was intermittently unavailable. S3 (Simple Storage Service) is one of the many web services hosted on the Amazon Web Services platform, AWS. It’s also the most used service, hosting everything from the image files used by websites both small and humongous, to database files powering some pretty large e-commerce operations, to entire websites and critical data backups for organizations across the globe. In total, it’s estimated that Amazon S3 hosts three to four trillion pieces of data, more commonly referred to as “objects” in S3 land.

The list of major S3 users affected by this outage - which lasted a little over five hours - is extensive and would take up too much space to list here. I’ll touch on some of the more well-known companies and services that were affected further down. What occurred was a cascading set of calamities that, while humorous in a few instances, underscores yet another vulnerability in our global IT infrastructure. But perhaps more importantly, there is a wealth of instructive insights to be gleaned from this incident for us Cybrarians. Think of it as a live demonstration that touches on a wide range of course topics on Cybrary such as Risk Management, Risk Mitigation, Risk Transfer, Cyber Threat Intelligence, and Disaster Recovery Planning, to name just a few.
Search for the cause
So what actually went wrong? Crickets. At this point, Amazon hasn’t come forward with what caused the outage. So far, they’ve only provided a timeline of platform status updates. And this is where things only just begin to get weird and comical.

The outage began at 12:40 ET, but the S3 dashboard indicated all operation centers were green, including the US-EAST-1 center located in Ashburn, VA. Trivia note: the affected operations center is just a short hop around the Capital Beltway from Cybrary HQ. This misleading status would continue for the duration of the outage and was the result of Amazon hosting the “red light” image files indicating an outage on its own S3 platform. Doh!

Initial indications (rumors) are that the S3 outage was due to a software problem on the platform. If history is any guide, that is plausible: a software error was the cause of another extended S3 outage that occurred in July 2008. That particular outage was due to message corruption which Amazon’s error correction protocol then failed to catch. It will be interesting to learn what was behind this latest outage, whenever Amazon gets around to finally making an announcement.
You can't make this stuff up
The irony and comedy don’t stop there, however. Not by a long shot. Third-party outage detection sites such as Down Detector were either still indicating an outage after S3 came back online around 5 pm ET or were themselves unavailable. It seems S3 users, along with many others, hammered these sites for a more accurate indication than what Amazon was telling them. The net effect was an unintentional DDoS attack against these sites, rendering them either glacially slow or completely unresponsive.

Want more? It seems even tech giant Apple was affected by Amazon’s hiccup. Various Apple cloud services are actually hosted on both Amazon AWS and Microsoft’s Azure cloud computing platforms. The iTunes store and many services provided by iCloud were affected by the Amazon outage. Tim Cook probably made a note to himself to begin work on an Apple cloud platform ASAP. And lest we forget, the whole IoT thing took a pie to the face. The Nest platform was rendered useless during the outage, leaving homeowners out in the cold with thermostats they could no longer control and lights they couldn’t turn on. Even remote controls temporarily became bricks. I could go on, but why pile on at this point? If you have a sadistic bent, then here’s a link to a collection of some really funny Twitter posts at Amazon’s expense.
No shortage of advice
The inevitable onslaught of criticism has been hurled in the brief aftermath thus far, not only at Amazon but at all the affected users of Amazon’s cloud service. There seems to be plenty of blame to go around, and smug tech pundits along with armchair CIOs on comment boards have all had something to contribute. Sage advice has been tossed around ad nauseam, ranging from Amazon’s overdue need to further divide up the East region center into separate domains to placing the onus on end users to implement an enterprise cloud “hybrid” solution consisting of both cloud and onsite hosting. If the answer were only that simple.

What’s easy to forget in all this is that a primary motivator for moving an organization’s stuff to the cloud is cost savings and reduced overhead. The ability to scale up and down on demand and only pay for what you use is a pretty big enticement when it comes to cloud computing. The savings in hardware and staff alone are considerable, so flip comments about ponying up for redundant coverage strike me as insulting and generally missing the point. And in Amazon’s defense, their SLA guarantees 99.9% uptime. Extended outages are rare, and S3 availability actually exceeded the SLA with an effective uptime of 100% in 2016. Not too shabby!
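To put a figure like 99.9% in perspective, here is a small back-of-the-envelope sketch in Python. It is only arithmetic, not Amazon’s actual SLA calculation: the function names are my own, and the measurement windows (a 365-day year and a 30-day month) are assumptions chosen for illustration.

```python
# Back-of-the-envelope uptime arithmetic. The function names and the
# measurement windows below are illustrative assumptions, not taken
# from Amazon's SLA text.

HOURS_PER_YEAR = 365 * 24   # assumed 365-day year
HOURS_PER_MONTH = 30 * 24   # assumed 30-day month


def allowed_downtime_hours(uptime_pct: float, period_hours: float) -> float:
    """Hours of downtime permitted in a period at a given uptime percentage."""
    return period_hours * (1 - uptime_pct / 100)


def effective_uptime_pct(downtime_hours: float, period_hours: float) -> float:
    """Uptime percentage actually achieved given the downtime in a period."""
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    yearly = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
    monthly = allowed_downtime_hours(99.9, HOURS_PER_MONTH)
    print(f"99.9% allows about {yearly:.2f} hours of downtime per year")
    print(f"99.9% allows about {monthly * 60:.1f} minutes of downtime per month")
```

Under those assumptions, 99.9% works out to roughly 8.76 hours of permitted downtime per year, or about 43 minutes per 30-day month. How any single outage compares depends on which window the uptime is measured over, which you can check with `effective_uptime_pct`.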
Getting back to basics
When plans go awry and everything we thought we knew crumbles in our hands, it’s often worth going back to basics. This holds true for a lot of disciplines including sports, music, science, and even the challenges of everyday living. In the case of the S3 outage, I’m confident Amazon will figure something out, most likely by breaking out their operations centers over more domains, which has probably been in the works anyway. I’m also willing to bet that their software development teams will perform a rigorous post-mortem and then take remedial steps to ensure something similar does not occur in the future - provided it does prove to have been the result of a software error.

The recourse for end users is perhaps even more challenging, but this is where going back to basics can really make a difference. The courses on Risk Management, including the excellent course on Cyber Threat Intelligence presented by Cybrary’s Dean Pompilio which I’m currently going through, discuss the importance of asset valuation as the initial step in any risk management process. An organization must know the value of its digital assets in order to weigh the cost of protecting them. Sometimes, the risk is either too small or too unlikely to justify investing in the mitigation steps required to protect against it. And I’m willing to bet this kind of assessment is what many of the affected Amazon S3 users are actively engaged in as I type this.
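One common textbook way to make that weighing concrete is the annualized loss expectancy (ALE) calculation: single loss expectancy (asset value times exposure factor) multiplied by the annualized rate of occurrence. The Python sketch below is a minimal illustration of that formula, not a method taken from the courses mentioned above. The class and function names are my own, and every number in the example is hypothetical.

```python
# Minimal sketch of the standard ALE formula:
#   SLE = asset value x exposure factor
#   ALE = SLE x annualized rate of occurrence (ARO)
# All names and figures here are hypothetical, for illustration only.
from dataclasses import dataclass


@dataclass
class Risk:
    asset_value: float      # estimated value of the asset, in dollars
    exposure_factor: float  # fraction of the value lost per incident (0 to 1)
    annual_rate: float      # expected incidents per year (ARO)

    def single_loss_expectancy(self) -> float:
        return self.asset_value * self.exposure_factor

    def annualized_loss_expectancy(self) -> float:
        return self.single_loss_expectancy() * self.annual_rate


def mitigation_worthwhile(risk: Risk, annual_mitigation_cost: float,
                          residual_ale: float = 0.0) -> bool:
    """True if the yearly loss avoided exceeds the yearly cost of mitigating."""
    avoided = risk.annualized_loss_expectancy() - residual_ale
    return avoided > annual_mitigation_cost


if __name__ == "__main__":
    # Hypothetical: a $1M asset, 5% lost per outage, one outage every 5 years.
    outage = Risk(asset_value=1_000_000, exposure_factor=0.05, annual_rate=0.2)
    print(f"SLE: ${outage.single_loss_expectancy():,.0f}")
    print(f"ALE: ${outage.annualized_loss_expectancy():,.0f}")
    print("Redundancy at $25k/year worthwhile?",
          mitigation_worthwhile(outage, annual_mitigation_cost=25_000))
```

With these made-up figures the expected yearly loss is smaller than the yearly cost of redundancy, which is the kind of result that leads an organization to accept a risk instead of mitigating it. Different asset values or outage frequencies can easily flip the answer.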