Lessons from the Oscars and the Amazon S3 Outage

I was deep into an all-day executive strategy session on Tuesday when my phone started buzzing with tweets and frantic messages about an Amazon-induced internet meltdown. Eventually the problem was resolved, and Amazon was forthcoming and posted an explanation. I’m not writing today to bash them. They run a great set of services at an incredible scale, and many customers, corporations and services count on AWS to deliver for their businesses every day.

I am going to point out, though, some of the wonderful benefits and potential pitfalls of Cloud Native Architectures, and this incident is an example of both! Amazon S3 appears to use many of the principles of cloud native and microservices-based architectures. They cleanly split the service into discrete, modular functions. They scale out and provide high availability for each component function independently, delivering a massively scalable, complex-to-implement service that is trusted all over the internet. They use advanced replication schemes for both copies of data and copies of microservices so that no type of failure will take down the entire service. All good and very reliable!

But then human error creeps in… And therein lies the flaw. To quote a renowned human behavior expert, Jerry Seinfeld, “People, they’re the worst!” Orchestration and automation of well-tested procedures are absolutely required for the cloud native world. To be honest, they were always needed, just more so with microservices, because the more moving parts you have, the more opportunities there are for manual errors.

In ancient times (i.e. the 1990s), IT used runbooks for maintenance procedures. A runbook was a document of manual steps to execute in order to replace hardware or perform a software update or other maintenance procedure safely. These runbook recipes needed to be tested, first in isolation and eventually in production, to make sure they were correct. If anything in the system was changed, you had better make sure the runbooks were updated, hence the advent of CMDBs (Configuration Management Databases). Of course, a runbook that wasn’t tested or exercised wasn’t worth the paper it was written on! It was just a theory until proven with live execution. Even with a solid and well-tested runbook, humans performing manual execution steps can make errors. We all make typos, like I did in writing this blog, or we can be distracted and hand the wrong envelope to Warren Beatty. (But I got that selfie with Emma Stone!)

Today’s Cloud Native Applications have much larger scale and many more moving parts, and are far more complex in many ways than their enterprise predecessors. But when designed and built according to microservices architecture principles, they are also simpler to comprehend, and their reliability is easier to ensure. With all the moving parts, though, automation is a necessity! We still need runbooks that are exercised and debugged for production usage, but in a cloud native world, they must be automated. Well-designed automation¹ that is thoroughly tested is what fulfills the promise of Cloud Native Architectures to deliver extremely robust, global-scale services that are not subject to planned or unplanned outages. Not from component failure. And just as importantly, not from human error during maintenance procedures. Nor from distraction on live TV!
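
To make the idea of an automated runbook a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Runbook` and `Step` classes, the `remove_servers_step` helper, and the `min_remaining` limit are my own invented names, not anything Amazon or any particular automation product uses. It shows one possible shape: each step carries a safety check that must pass before anything changes, a verification afterwards, and a dry-run mode so the runbook can be rehearsed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class RunbookError(Exception):
    """Raised when a step fails its safety check or its verification."""


@dataclass
class Step:
    name: str
    precheck: Callable[[], bool]  # must pass before anything is changed
    action: Callable[[], None]    # the change itself
    verify: Callable[[], bool]    # confirms the system is healthy afterwards


@dataclass
class Runbook:
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self, dry_run: bool = True) -> None:
        for step in self.steps:
            if not step.precheck():
                raise RunbookError(f"precheck failed: {step.name}")
            if dry_run:
                # Rehearsal: record what would happen, change nothing.
                self.log.append(f"would run: {step.name}")
                continue
            step.action()
            if not step.verify():
                raise RunbookError(f"verification failed: {step.name}")
            self.log.append(f"done: {step.name}")


def remove_servers_step(pool: List[str], targets: List[str],
                        min_remaining: int) -> Step:
    """Hypothetical step: remove targets from pool, never below min_remaining."""

    def precheck() -> bool:
        unique = len(set(targets)) == len(targets)
        known = all(t in pool for t in targets)
        return unique and known and len(pool) - len(targets) >= min_remaining

    def action() -> None:
        for t in targets:
            pool.remove(t)

    def verify() -> bool:
        return len(pool) >= min_remaining

    return Step(f"remove {len(targets)} server(s)", precheck, action, verify)
```

With a pool of four servers and `min_remaining=3`, a runbook built from `remove_servers_step(pool, ["s4"], min_remaining=3)` can be rehearsed with `run(dry_run=True)` and then executed with `run(dry_run=False)`. A mistyped request to remove two servers at once would be stopped by the precheck before anything is touched, which is exactly the kind of manual slip the automation is there to catch.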


¹ Scott Davis is EVP of Product Engineering and CTO of Embotics, the cloud automation company for IT organizations and service providers that need to improve provisioning or enable self-service capabilities. With a relentless focus on delivering a premier user experience and unmatched customer support, Embotics is the fastest and easiest way to automate provisioning across private/public/hybrid cloud infrastructures. Its flagship product, Embotics vCommander, is used by organizations including Nordstrom, NASA, Fanatics, Informatica and Charter Communications. For more information, visit http://www.embotics.com, and follow Embotics on Twitter and LinkedIn.  
