Break Things on Purpose: An Introduction to Chaos Engineering
1. What chaos engineering is, and what it isn’t
Chaos engineering is empiricism applied to system reliability. You hypothesize that a system will behave a certain way under stress, you inject the stress, and you compare the observed behavior to the prediction. The point is not to break things. The point is to discover that the system is already broken in a way you weren’t aware of.
Three properties define the practice. First, it’s experimental. A chaos experiment without a hypothesis is just sabotage. What you learn from a controlled experiment is that your model of the system was wrong, and you only learn that if you wrote the model down first.
Second, it targets the system, not the code. A unit test asks whether a function behaves correctly given certain inputs. A chaos experiment asks whether the whole system keeps serving traffic when a dependency goes away, when latency spikes, when an availability zone disappears. Those failure modes don’t live in any single function. They live in the choreography between processes.
Third, it’s not the same as fault tolerance, QA, or disaster recovery, even though it touches all three. Fault tolerance is a property the system aims to have. QA verifies that features work as designed. Disaster recovery is a procedure you execute after something has already gone wrong. Chaos engineering is the practice that checks whether your fault tolerance is real, before reality checks it for you.
The bumper-sticker version is “break things on purpose.” It’s a useful slogan because it shocks people into paying attention, but it’s also misleading. You’re not breaking things. The things are already breakable. You’re just choosing to find out under conditions you control.
2. From Netflix’s outage to the Simian Army
After a database corruption in August 2008 halted DVD shipments for three days, Netflix began moving to AWS, where any individual host can vanish without warning. The cultural mechanism Netflix invented to cope was Chaos Monkey, built internally around 2010 and open-sourced in 2012. Its job description was simple: during business hours, pick a random virtual machine in production and terminate it. Engineers learned very quickly to write services that survived the death of any single host.
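The job description is small enough to sketch. What follows is not Netflix’s implementation; it is a minimal illustration of the selection logic, with a hypothetical 09:00 to 17:00 weekday window and a placeholder `terminate` function standing in for a real cloud API call.

```python
import random
from datetime import datetime, time
from typing import List, Optional

# Hypothetical working-hours window, chosen only for illustration.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on weekdays between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def pick_victim(instances: List[str], now: datetime, rng: random.Random) -> Optional[str]:
    """Return one instance to terminate, or None if nothing should be killed."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


def terminate(instance_id: str) -> None:
    # Placeholder: a real version would call the cloud provider's API here.
    print(f"would terminate {instance_id}")
```

Restricting kills to working hours is the design choice that matters: the failure arrives while the people who can fix the underlying weakness are at their desks.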
Chaos Monkey was the first piece of a larger lineup that Netflix called the Simian Army. The cast targeted progressively wider blast radii:
- Latency Monkey injected artificial delays into RPC calls, simulating service degradation rather than outright failure.
- Conformity Monkey found instances that drifted from internal best practices and shut them down.
- Janitor Monkey cleaned up unused resources, which sounds like cost optimization but doubled as a check that nothing depended on dead infrastructure.
- Chaos Gorilla took down an entire AWS availability zone.
- Chaos Kong took down a whole region.
Read the list as a ladder. Each rung is only safe to climb once the system routinely survives the experiments on every rung below it. If Chaos Monkey is still revealing new failure modes, you have no business running Chaos Gorilla.
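The gating rule can be written down as a small check. This is a sketch under assumptions: the rung names are simplified to the three failure scopes mentioned above, and the threshold of five consecutive clean runs is an arbitrary illustrative number, not a figure from Netflix.

```python
from typing import Dict

# Rungs ordered from narrowest to widest blast radius.
LADDER = ["host", "availability_zone", "region"]


def widest_safe_rung(clean_runs: Dict[str, int], required: int = 5) -> str:
    """Return the widest rung whose lower rungs all have `required` consecutive clean runs."""
    allowed = LADDER[0]
    for lower, higher in zip(LADDER, LADDER[1:]):
        if clean_runs.get(lower, 0) < required:
            break  # a lower rung is still surprising us, so stop climbing
        allowed = higher
    return allowed
```

Note that a strong record on a higher rung does not compensate for a weak one lower down: the loop stops at the first rung that has not earned the next.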
The naming convention sounds whimsical, and the whimsy was deliberate. Selling “we deliberately kill production servers” to a risk-averse executive is easier when the tool is named after a monkey than when it’s called RegionalFailureInjector v2 😄
The Simian Army has since been retired as a standalone project. Chaos Monkey lives on as a separate tool that depends on Spinnaker, Netflix’s continuous delivery platform, and the duties of Janitor Monkey and Conformity Monkey moved into the Spinnaker ecosystem.
The name “Chaos Monkey” outlived the original code because the cultural artifact mattered more than the particular implementation.
3. The principles of chaos engineering
In 2017, a group of practitioners led by Casey Rosenthal and Nora Jones published the Principles of Chaos Engineering at principlesofchaos.org. The document is short and worth reading in full, but it compresses into five rules, and the rules are what separate cargo-culting from real practice.
- Build a hypothesis around steady-state behavior. Before you inject anything, define what “normal” looks like in measurable terms. Requests per second, error rate, p99 latency, business KPI. The hypothesis is then “this steady state will hold even when X happens.”
- Vary real-world events. Inject the failures that actually happen in production. Hardware loss, network partitions, certificate expiry, dependency latency. Made-up failures don’t generalize.
- Run experiments in production. Staging environments lie. They have different data shapes, traffic patterns and scale. The only environment that tells you the truth about your production system is your production system.
- Automate experiments to run continuously. A chaos experiment you run once tells you about that day. Reliability is not a state; it’s a property the system maintains across changes. Automation keeps the property under test as the codebase moves.
- Minimize blast radius. Run the smallest experiment that could disprove the hypothesis. Inject failure into a subset of instances, a fraction of traffic, one region. If the experiment fails, you want it to fail in a way that affects ten users for thirty seconds, not the whole platform.
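For the last rule, one common way to keep an experiment small is to select targets by hashing their identifiers, so that the same small fraction is affected on every run. The sketch below assumes hypothetical unit IDs and experiment names; it illustrates the idea and is not taken from any particular chaos tool.

```python
import hashlib
from typing import List


def in_blast_radius(unit_id: str, experiment: str, fraction: float) -> bool:
    """Deterministically place roughly `fraction` of units inside the experiment."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64) and compare
    # it against the matching share of that range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < fraction * 2**64


def select_targets(unit_ids: List[str], experiment: str, fraction: float) -> List[str]:
    """Return the units that fall inside the experiment's blast radius."""
    return [u for u in unit_ids if in_blast_radius(u, experiment, fraction)]
```

Because the bucket depends only on the experiment name and the unit ID, raising the fraction from 0.01 to 0.05 widens the radius without reshuffling who was already inside it.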
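The first rule is the one most teams skip, and skipping it is what turns chaos engineering into theater. Without a steady-state hypothesis you don’t know whether the experiment told you anything. A template like this is enough: “We believe [steady-state metric] will stay within [threshold] while [fault] is applied to [target] for [duration]. We will abort if [abort condition].”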
If you can’t write that paragraph, you’re not ready to inject the latency. Without a written hypothesis and abort conditions, you have no way to recognize an unsuccessful experiment, and a chaos experiment whose outcome you can’t interpret is worse than no experiment at all.
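A written hypothesis can also be captured as data, which makes the outcome mechanical to interpret. The sketch below is one possible shape, not a standard format: the field names and every threshold in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    fault: str                # what is injected, in plain words
    max_error_rate: float     # steady state: highest acceptable fraction of failed requests
    max_p99_ms: float         # steady state: highest acceptable p99 latency
    abort_error_rate: float   # stop the experiment immediately above this


def evaluate(h: Hypothesis, error_rate: float, p99_ms: float) -> str:
    """Classify one observation as 'abort', 'refuted', or 'holds'."""
    if error_rate > h.abort_error_rate:
        return "abort"
    if error_rate > h.max_error_rate or p99_ms > h.max_p99_ms:
        return "refuted"
    return "holds"


# Illustrative numbers only.
h = Hypothesis(
    fault="300 ms of added latency on one dependency",
    max_error_rate=0.01,
    max_p99_ms=800.0,
    abort_error_rate=0.05,
)
```

A “refuted” result is the valuable one: it means the model of the system was wrong in a way that is now written down.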
4. Where chaos engineering sits in reliability engineering
Chaos engineering is one of several practices that operationalize a shift the broader field has been making for two decades: from “make failure rare” to “make recovery fast.” Mean time between failures (MTBF) was the dominant metric of the pre-cloud era. In a world where you provision and operate everything yourself, you can reasonably treat outages as the exception. In a world built on shared infrastructure, intermittent failure is the substrate. MTBF gives way to mean time to recovery (MTTR). Reliability becomes a property of how quickly the system heals rather than how rarely it breaks.
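The arithmetic behind that shift is the standard steady-state availability formula, MTBF / (MTBF + MTTR). The figures below are invented for illustration: a service that fails about once a month and takes four hours to recover.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(720, 4)          # fails monthly, four hours to recover
rarer_failures = availability(1440, 4)   # failures half as frequent
faster_recovery = availability(720, 2)   # same failure rate, recovery twice as fast
quick_heal = availability(720, 5 / 60)   # same failure rate, five-minute recovery
```

Halving MTTR buys exactly the same availability as doubling MTBF, because only the ratio between the two matters. The difference is practical: on shared infrastructure you rarely control how often things fail, but you do control how fast you recover, and the five-minute case beats both.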
That reframing is the connective tissue between several adjacent disciplines.
Site Reliability Engineering in the Google lineage uses error budgets to make the tradeoff between feature velocity and reliability explicit. If you’ve spent your budget, you stop shipping until reliability recovers. Chaos engineering fits inside SRE as one of the techniques that measures and validates the actual reliability the budget is tracking.
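As a rough worked example of the budget arithmetic, assume an availability SLO measured as downtime over a 30-day window. The function names are illustrative and not drawn from any SRE tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the budget still unspent; negative means overspent."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def may_ship(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """The policy from the text: stop shipping once the budget is spent."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime. A chaos experiment that goes wrong spends from the same budget, which is one more reason to keep the blast radius small.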
Resilience engineering in the Cook, Dekker, and Allspaw lineage takes a different angle. Where SRE is a control system with numbers, resilience engineering is a body of work about how humans and complex systems interact under stress. Richard Cook’s 1998 paper How Complex Systems Fail is the foundational text. Its central claim, that catastrophic failure in a complex system is always the product of multiple small failures aligning, is the intellectual seed that makes chaos engineering plausible. If failure is normal and concentrated incidents are produced by alignments, then perturbing the system continuously is a way to keep those alignments shallow.
Older relatives include failure mode and effects analysis (FMEA), which has been standard in safety-critical engineering since the 1960s, plus disaster recovery drills and the tabletop exercises common in security and emergency response. All of these encode the same insight: practiced response beats theorized response. Chaos engineering is the version of the insight that’s automated, continuous, and runs in production.
What distinguishes chaos engineering from its ancestors is not the philosophy. The philosophy is old. What’s new is that cloud infrastructure made it cheap enough to do continuously. You can terminate an AWS instance with an API call. You couldn’t, in any practical sense, terminate a rack in a colocation facility on a cron schedule.
The economics changed; the idea didn’t.
5. The tools landscape
The current tools landscape splits primarily by the layer at which faults are injected: infrastructure, pod, or connection. Most teams pick based on where their workloads run.
At the infrastructure layer, AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) and Azure Chaos Studio are the obvious choices if you’re already on those clouds. They can stop instances, throttle network throughput, raise CPU pressure, and simulate availability-zone-level disruption with the same IAM model and audit trail you’re already using.
For Kubernetes workloads, the CNCF (Cloud Native Computing Foundation) landscape converges on two projects. Litmus focuses on a workflow-driven experience: declarative experiment specs, a frontend for running game days, a library of pre-built faults. Chaos Mesh leans more toward pod-level and traffic-level injection: kernel-level network delays, IO faults via FUSE (filesystem in userspace), per-container resource pressure. The two overlap, and the choice usually comes down to which UX fits the team’s existing workflow.
At the connection layer, Toxiproxy (built and open-sourced by Shopify) sits as a TCP proxy in front of a dependency and applies “toxics” to the connection: latency, bandwidth limits, slow-close, connection resets. The advantage over infrastructure-level injection is precision. You can test what happens when one specific service starts behaving badly without having to take anything down.
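Toxiproxy is driven over an HTTP API, so adding a toxic is a pair of JSON POST requests. The sketch below only builds those requests and does not send them. It assumes a Toxiproxy server on its default API port (8474), and the field names follow the project’s README as best recalled, so check them against the version you run. The proxy name and addresses in the usage note are hypothetical.

```python
import json
import urllib.request

# Assumption: a Toxiproxy server is reachable on its default API port.
TOXIPROXY_API = "http://localhost:8474"


def _post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the Toxiproxy HTTP API."""
    return urllib.request.Request(
        TOXIPROXY_API + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy(name: str, listen: str, upstream: str) -> urllib.request.Request:
    """Request that puts a proxy on `listen` in front of the dependency at `upstream`."""
    payload = {"name": name, "listen": listen, "upstream": upstream, "enabled": True}
    return _post("/proxies", payload)


def add_latency(proxy: str, latency_ms: int, jitter_ms: int = 0) -> urllib.request.Request:
    """Request that adds a latency toxic to responses flowing back to the client."""
    toxic = {
        "type": "latency",
        "stream": "downstream",
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return _post(f"/proxies/{proxy}/toxics", toxic)
```

With a server running, `urllib.request.urlopen(create_proxy("cache", "127.0.0.1:26379", "127.0.0.1:6379"))` followed by `urllib.request.urlopen(add_latency("cache", 500))` would slow that one dependency by roughly half a second while everything else stays untouched.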
Gremlin is the commercial entrant most people will recognize. It’s the closest thing to a managed Chaos Monkey for the rest of us: agent on every host, web UI, pre-built attack library, role-based access control to limit who can break what. The tradeoff is the usual one: less work to set up, more lock-in once it’s wired into your platform.
Chaos Monkey itself, the tool that started the conversation, now depends on Spinnaker rather than running on its own. The Simian Army repository is archived. That’s not a sign that chaos engineering lost. The practice escaped the tool: instead of one monkey terminating one VM, the same idea is now distributed across cloud-vendor APIs, CNCF projects and a commercial market.
The lesson from the landscape is not that you should pick the most sophisticated tool. The choice should match the failure modes you actually fear. If your service has never lost an AWS region but loses dependencies weekly, Toxiproxy will teach you more than Chaos Kong.
6. Prerequisites for running chaos in production
You should not run chaos experiments in production until your team can detect, contain, and explain failures in production. That’s not a moral position. Without those capabilities, a chaos experiment produces noise rather than learning, and it erodes the team’s appetite for the practice before the practice has had a chance to work.
The minimum kit, roughly in the order you need it:
- Observability good enough to detect a problem within seconds. If you can’t tell from a dashboard that your steady state has broken, you can’t run the experiment safely. “Good enough” here is concrete: request rate, error rate, and tail latency on the affected service, visible without context-switching, with alerts wired into a pager.
- A working rollback or kill switch. Whatever you injected, you need to be able to undo within seconds. For an AWS FIS experiment that’s the stop-experiment button. For a deployment-driven fault, it’s a one-command deploy of the previous revision. If the rollback is “page the on-call and hope,” you are not ready.
- Deploy confidence. If your team is afraid of pushing to production on a normal day, deliberately introducing failure on top of that fear is an act of self-harm. Chaos engineering presupposes that your normal release process is calm.
- A blameless postmortem culture. Every chaos experiment, especially the ones that surprise you, produces a finding. If the org’s reflex on finding a weakness is to look for someone to punish, engineers will stop running experiments. Or worse, they’ll start running easy ones that don’t reveal anything.
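The first two items on the list, detection and a kill switch, can be combined into one small pattern: inject the fault inside a block that always rolls back, and abort the moment a health check fails. The sketch below assumes that `inject`, `rollback` and `healthy` are callables you supply. All three are hypothetical stand-ins for real tooling.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class SteadyStateViolated(Exception):
    """Raised when a health check fails while a fault is active."""


@contextmanager
def injected_fault(inject: Callable[[], None], rollback: Callable[[], None]) -> Iterator[None]:
    """Apply a fault and guarantee the rollback runs, even if the body raises."""
    inject()
    try:
        yield
    finally:
        rollback()


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    healthy: Callable[[], bool],
    checks: int,
) -> bool:
    """Return True if every health check passed while the fault was active."""
    try:
        with injected_fault(inject, rollback):
            for _ in range(checks):
                if not healthy():
                    raise SteadyStateViolated
                # A real runner would sleep between checks.
        return True
    except SteadyStateViolated:
        return False
```

The `finally` clause is the kill switch: whether the experiment finishes, aborts, or crashes inside the block, the rollback still runs.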
Most teams don’t have all four when they first hear about chaos engineering, and that’s fine. The right starting point is not to spin up Chaos Mesh in your production cluster. It’s game days: a scheduled session where engineers sit in a room (or a video call), pick a failure mode, inject it into a non-production environment and watch what happens. Game days expose every gap on the list above in a contained setting where the cost of discovery is half a working day, not a customer-facing incident.
Once game days are routine and the gaps are filled, you can move into production injection at small blast radius. From there, automation. None of this is fast, and that’s a feature. Chaos engineering rewards organizations that have already invested in operability. It punishes the ones that adopt the tool hoping the tool will do the investing for them.
That’s the bite under the introduction: the monkey is the easy part.