“Chaos doesn’t cause problems, it reveals them” – Nora Jones, Senior Software Engineer, Netflix
Nora Jones, Senior Software Engineer on the Chaos Team at Netflix, presented the subject of Chaos Engineering and explained how it has matured through a process of experimentation. Now, we are lucky enough to benefit from the efforts at Netflix and other organisations to begin to apply these methods to the complex, often chaotic architectures that are being built.
Netflix are the poster child for availability and advanced AWS usage. For example, one of the metrics they use to measure availability is Stream Starts Per Second (SPS), which counts how many times per second a customer presses play and a stream begins. If I understood correctly, a divergence from the norm indicates a potential issue, with customers unable to stream Netflix content.
What underpins this is the way Netflix regularly test their platforms to see whether changes being introduced will affect services negatively. Using a control group and an experimental group, they purposefully introduce variables that reflect real-world events, such as servers that crash, malfunctioning hard drives and severed network connections. During testing, if the divergence from the norm reaches a certain level, the experiment is shorted (essentially stopped) and the failure marked. This means their customers remain unaffected, and an engineer can work in the background to fix the issue.
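To make the idea concrete, here is a minimal sketch in Python of an experiment that is shorted once the experimental group diverges too far from the control group. It is only an illustration of the principle described in the talk, not Netflix's actual tooling: the function names, the 5% threshold and the sample figures are all assumptions of mine.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class ExperimentResult:
    aborted: bool
    aborted_at: Optional[int]  # index of the sample that triggered the abort
    worst_divergence: float    # largest shortfall seen, as a fraction of control


def relative_divergence(control_sps: float, experiment_sps: float) -> float:
    """Fraction by which the experimental group falls short of the control group."""
    if control_sps <= 0:
        return 0.0  # no baseline traffic, so there is nothing to compare against
    return max(0.0, (control_sps - experiment_sps) / control_sps)


def run_experiment(
    samples: Iterable[Tuple[float, float]],
    abort_threshold: float = 0.05,  # hypothetical: short the experiment at a 5% drop
) -> ExperimentResult:
    """Walk through (control, experiment) stream-start readings and stop early
    as soon as the divergence reaches the threshold."""
    worst = 0.0
    for index, (control_sps, experiment_sps) in enumerate(samples):
        divergence = relative_divergence(control_sps, experiment_sps)
        worst = max(worst, divergence)
        if divergence >= abort_threshold:
            # Short the experiment and mark the failure for an engineer to review.
            return ExperimentResult(True, index, worst)
    return ExperimentResult(False, None, worst)


# Made-up readings: the injected failure starts to bite on the third sample.
readings = [(100.0, 99.0), (100.0, 98.0), (100.0, 90.0), (100.0, 50.0)]
result = run_experiment(readings)
print(result.aborted, result.aborted_at)
```

The important design choice is that the experiment stops at the first reading that crosses the threshold, so the fourth, much worse reading is never reached and the affected group stays small.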
You might be thinking, where are you going with this, Phil? Let me explain.
In the world of Financial Services, many applications are still part of monolithic, highly interdependent architectures. Operating a Chaos Engineering process whilst applications are still built in this way could be very damaging. However, if we look at the challenger banks, for example, many of them use an architectural approach called Microservices to develop, build and deliver their services to happy customers. Decomposing the applications (or in their case, composing them as small components in the first place) allows a multitude of availability benefits to be baked in from the start.
Consider blast radius. When you operate a monolithic application and lose the single server that your monolith relies on, the entire service may no longer be available. The pattern is usually different with Microservices, provided they employ good practices such as fallback procedures to ensure services remain available.
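As a rough illustration of what a fallback procedure can look like, the Python sketch below wraps a call to one service so that a failure degrades the response rather than taking the whole request down. Everything here is hypothetical: the recommendation service, the function names and the default titles are invented for the example and do not describe any particular bank's or Netflix's implementation.

```python
from typing import Callable, List

# Hypothetical static content served when the personalised service is down.
DEFAULT_TITLES = ["Popular title A", "Popular title B"]


def with_fallback(
    primary: Callable[[], List[str]],
    fallback: Callable[[], List[str]],
) -> List[str]:
    """Try the primary service; if it raises, serve the fallback instead."""
    try:
        return primary()
    except Exception:
        # The failure stays contained to this one feature: the caller still
        # gets a usable, if less personalised, answer.
        return fallback()


def failing_recommendations() -> List[str]:
    """Stand-in for a microservice that has crashed or become unreachable."""
    raise ConnectionError("recommendation service unreachable")


def healthy_recommendations() -> List[str]:
    """Stand-in for the same microservice when it is working normally."""
    return ["Personalised title"]


def static_recommendations() -> List[str]:
    """Fallback that needs no network call at all."""
    return list(DEFAULT_TITLES)


print(with_fallback(healthy_recommendations, static_recommendations))
print(with_fallback(failing_recommendations, static_recommendations))
```

In a monolith the equivalent failure would more likely take the whole application with it; here the blast radius is limited to one feature returning generic content.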
Applying Chaos Engineering to the way in which these Microservices are developed, built, tested and deployed helps you provide availability in chaotic architectures.
If you look at the high-level architecture for Amazon.com, or even Netflix, there is no way a human could understand all the interdependencies and track the potential impact of each issue. By implementing procedures to test how a service, or for that matter a chaotic web of interconnected services, reacts to a negative event, businesses are gaining better availability. They also save the thousands of man hours lost to troubleshooting platform issues, which can cost a business millions in lost revenue and, worse still, damage customer sentiment.
In the UK, the older, more established FS institutions have a long way to go. But by beginning the task of breaking down monolithic applications, implementing microservices architectures and using Chaos Engineering, they can vastly reduce risk. This doesn’t just benefit customers in terms of availability; it also covers many other negative scenarios that can arise from the loss of services and functionality during negative events.
“Everything fails all the time” – Werner Vogels, CTO, Amazon.com
This key principle underlies this way of thinking. It forces an organisation to go through the motions of testing a platform against all conceivable scenarios, using minute portions of customer interactions, to provide service assurance.
Another key shift in mindset is that an organisation has to stop despising the concept of experimentation. When you allow your clever systems engineers to create experiments in safe environments to test their hypotheses about how a platform might react to a change or development, you protect the production systems that are far more sensitive to change.
Imagine that, rather than a CAB (Change Advisory Board) reviewing a number of changes that contain only assumptions about how the platform might react when the change is implemented, you have a system whereby you hold the data and proof of what the change will do. For the FSI that massively de-risks change, but this only works once you have reduced the blast radius, decoupled the services, implemented a Microservice architecture and built your Chaos Engineering tests.
We look forward to engaging with our customers as their chosen partners to help them in carrying out all of these tasks. ECS can offer guidance on how to experiment safely and support you on your journey to providing better solutions to your customers. Speak to us about your first experiment today!
Finally, for some more information on Chaos Engineering, Nora Jones provided us with the book that she and her co-writers have published – www.principlesofchaos.org
Keep up to date with the AWS re:Invent 2017 conference by following our Twitter page and LinkedIn page. All this week, our teams will be attending and blogging live from the conference in Las Vegas – check our webpage for keynote takeaways as well as a summary of each conference day.