Chaos Engineering - The Best Place to Experiment

I will preface this post by saying that I was bringing down services in production before Chaos Engineering was an acceptable practice. As our systems grow more complex and distributed, the “fog of development” will certainly appear. The phrase echoes the fog of war, the military term for uncertainty in situational awareness, i.e. the difficulty of attaining precision and certainty. In the fog of development, as we increase agility to move faster, confidently understanding how our complex systems behave becomes more challenging with each change we try to make.

To avoid being mired in analysis paralysis, at some point we have to make a change and learn from it. But as humans we come and go on projects, and the systems we work on are the aggregate of potentially years of decisions made before and after our time. Chaos Engineering is how we build systemic confidence in an end-to-end system that may have been touched by many opinions before ours, i.e. how we cut through the fog. Chaos can take many forms, and a popular one is the black swan.

What is Chaos? – Hello Black Swan

A popular piece of reading for Site Reliability Engineers [SREs] is Nassim Nicholas Taleb’s The Black Swan: The Impact of the Highly Improbable (2007), which introduces the black swan metaphor. Taleb classifies as black swans events such as a sudden natural disaster or, in business at the time of publishing, Google’s astounding success. A black swan event has three characteristics: it is unpredictable, its impact is massive, and when it is over we devise an explanation that makes the black swan seem less random than it was.

When dealing with the fog of development, we are prone to the fallacies of distributed computing, a set of false assumptions catalogued by computer scientist L. Peter Deutsch and colleagues at Sun Microsystems. Some of the top fallacies: the network is reliable, latency is zero, bandwidth is infinite, and there is only one administrator. Distilled down, the fallacies lull us into believing our services will be consistent and available at all times. As we know, systems and services come up and go down all the time, but when we are deep in the minutiae of developing the unknown, we can easily forget this.

Let’s say, for example, we are building features that rely on Amazon S3 for object storage. If we are building features for a service that does complex processing whose final output is writing or updating an object in S3, we as engineers might assume that S3 will simply be there. We test our features up and down and give the S3 portion far less sophisticated coverage. Amazon Web Services had a black swan event of its own in 2017 when S3 suffered a major outage. Something we assumed would be there [even with a lowered performance/write SLA] was not, and the fallacies of distributed computing came back to bite us.
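Defending against that assumption does not take much code. Here is a minimal Python sketch of a defensive S3 write; boto3 is the real AWS SDK, but the bucket, key, and local spool fallback are illustrative assumptions, just one of many fallback strategies a chaos experiment might help you choose between.

```python
import json
import time

import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

s3 = boto3.client("s3")

def write_result(bucket, key, payload, attempts=3):
    """Write processed output to S3, retrying with exponential backoff.

    If S3 stays unreachable, spool the payload locally so the pipeline
    degrades instead of failing outright (illustrative fallback only).
    """
    body = json.dumps(payload).encode("utf-8")
    for attempt in range(attempts):
        try:
            s3.put_object(Bucket=bucket, Key=key, Body=body)
            return True
        except (ClientError, EndpointConnectionError):
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s
    # Fallback: spool to a local file for later replay.
    with open("spool-" + key.replace("/", "_"), "wb") as f:
        f.write(body)
    return False
```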

The S3 outage really helped shine a light on making sure we touch all parts of our stack, even the parts that don’t seem obvious, perhaps because our perception is fogged by the fallacies of distributed computing. Chaos Engineering brings controlled chaos so we can shake these types of events out before they find us.

Estimating Chaos – Hello Chaos Engineering

Chaos Engineering is the science of intentionally injecting failure into systems to gauge their resiliency. An informative creed, the Principles of Chaos Engineering, emphasizes the need to measure a steady-state baseline and compare it against a hypothesis about what will happen as chaos is injected.
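In skeleton form, an experiment is just that comparison. A minimal sketch in Python follows; the health-check URL is hypothetical, and the inject_fault/revert_fault callables stand in for whatever failure you choose to introduce.

```python
import statistics

import requests

HEALTH_URL = "https://myservice.example.com/health"  # hypothetical endpoint

def sample_latency_ms(samples=20):
    """Measure the steady state: median health-check latency."""
    timings = []
    for _ in range(samples):
        r = requests.get(HEALTH_URL, timeout=5)
        r.raise_for_status()
        timings.append(r.elapsed.total_seconds() * 1000)
    return statistics.median(timings)

def run_experiment(inject_fault, revert_fault, tolerance_ms=50):
    """Compare a baseline against the hypothesis that latency stays
    within tolerance while the fault is active."""
    baseline = sample_latency_ms()
    inject_fault()      # e.g. add network latency, kill a node
    try:
        during = sample_latency_ms()
    finally:
        revert_fault()  # always clean up the blast radius
    print(f"baseline={baseline:.0f}ms during={during:.0f}ms")
    return during - baseline <= tolerance_ms
```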

Some cursory Googling, if you are early on your Chaos Engineering journey, will turn up a tool called Chaos Monkey. Created by Netflix in 2011, Chaos Monkey is the tool credited with bringing Chaos Engineering into the mainstream. Chaos Monkey, for example, would randomly terminate running instances [virtual machines launched from Amazon Machine Images] to help test application resiliency in the public cloud. Fast forward to today and there are entire Chaos Engineering platforms, such as Gremlin, that help us package up lots of Chaos Engineering science.
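The core of the idea fits in a few lines. This is a hedged sketch of a Chaos Monkey-style experiment using boto3, not Netflix’s implementation; the opt-in tag name is an assumption, and DryRun stays on so nothing actually dies until you decide it should.

```python
import random

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

def terminate_random_instance(tag_key="chaos-opt-in", dry_run=True):
    """Pick one opted-in, running EC2 instance at random and terminate it."""
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag-key", "Values": [tag_key]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    instances = [i["InstanceId"]
                 for r in resp["Reservations"]
                 for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as e:
        # With DryRun=True, a "DryRunOperation" error means the call
        # would have succeeded; anything else is a real problem.
        if e.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim
```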

I am always a big fan of “awesome lists” on GitHub, and for those learning about Chaos Engineering there is one: Awesome Chaos Engineering. As we continue to move towards highly distributed architectures, the number of moving parts increases. More mature testing methodologies such as load testing are there to stress our systems, but Chaos Engineering shines a light into different areas.

Are Load Tests similar to Chaos Engineering Tests?

Certainly, load can bring on chaos. We commonly design our systems to be elastic in multiple pieces, e.g. spinning up additional application, compute, networking, and persistence nodes to cope with the load. That assumes everything comes up at the same/appropriate time so we can get ahead of the load.

In the computer science world, the Thundering Herd problem is not new, but it manifests more commonly as we move towards distributed architectures. At the machine level, a Thundering Herd occurs when a large number of processes wake up at once to contend for a resource that can handle one and only one of them at a time, making that resource the bottleneck. In a distributed architecture, a Thundering Herd might be a messaging system that can ingest a large number of messages/events at a time while processing/persisting those messages becomes the bottleneck. If you are overrun with messages, hello Thundering Herd.
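One common mitigation is to bound how many messages are in flight at once so a burst cannot stampede the downstream. A minimal asyncio sketch, where persist is a hypothetical stand-in for your real downstream write:

```python
import asyncio
import random

async def persist(message):
    """Hypothetical stand-in for a real database or service write."""
    await asyncio.sleep(0.05)

async def consume(messages, max_in_flight=10):
    """Bound concurrency so a burst of messages cannot overwhelm the
    persistence layer; stragglers wait instead of piling on."""
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(msg):
        async with sem:
            # Jitter spreads out wakeups so consumers don't all hit
            # the downstream at the same instant.
            await asyncio.sleep(random.uniform(0, 0.01))
            await persist(msg)

    await asyncio.gather(*(guarded(m) for m in messages))

# asyncio.run(consume(range(1000)))
```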

A load test would certainly help us prepare for a Thundering Herd as one type of stress, but what if part of the system was not even there, or was late to the game? That is where Chaos Engineering comes in. A very hard item to test for without Chaos Engineering is a cascading failure. Historically more associated with the power grid, a cascading failure is a failure of one part that triggers failures in other parts. In distributed-system land, this is us trying to find single points of failure and making sure our application/infrastructure is robust enough to handle them.
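The classic defense against cascades is to fail fast when a dependency is struggling rather than pile onto it. Below is a minimal circuit-breaker sketch; the thresholds are illustrative, and production libraries offer far more nuance.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency so its trouble does not
    cascade into ours; probe again after a cool-down."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```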

A large part of the investment where Chaos Engineering and Site Reliability Engineering intersect is trying to have more control over how problems/failures manifest themselves to users. The adage “slow is the new down” is very true as user expectations continue to rise. Investments in Chaos Engineering will continue to rise as we as an industry continue to raise the bar on our distributed systems engineering craft.
