Importance of Software Resilience

author: Emre Bener
about: software resilience, reliability engineering
mentions: fault tolerance, high availability, chaos engineering, observability, failover, it disaster recovery

Software resilience is the discipline of building systems that keep working under stress and recover quickly when they don’t. Nothing that runs payment rails or medical records stays up by accident, and the work that keeps it up has a shape distinct from “is the feature done”. This post is the orientation for the rest of the Reliability Engineering topic: what the word means, why it earns its own attention, and how the techniques fit together.

1. What resilience actually means

The definition above packs two related properties into one word: continuing to serve users while something is wrong, and getting back to normal afterwards. A resilient checkout flow handles a flaky payment provider by retrying through a backup, returns a useful error if both fail, and is fully operational the moment the provider comes back. None of that is about the happy path.

Resilience gets confused with three nearby words that mean specific, narrower things.

Reliability is the probability that a system performs its function correctly over a given period. A service with 99.9% reliability over a month is one that did the right thing 99.9% of the time. Reliability is a measurement; resilience is what you do so the measurement comes out well.

Availability is the fraction of time the system is up and answering requests. The “three nines” and “four nines” numbers (99.9%, 99.99%) are availability targets. Availability is one of the outcomes resilience produces, not the same thing as resilience itself. A system can be highly available and still brittle, propped up by a fragile dependency that has not yet failed.
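To make the nines concrete, here is a small sketch that converts an availability target into the downtime it permits. The function name and the 30-day period are my own illustrative choices, not a standard.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Minutes of downtime an availability target allows over the period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (1.0 - availability_pct / 100.0)


for target in (99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% over 30 days allows about {budget:.2f} minutes of downtime")
```

Three nines over 30 days leaves roughly 43 minutes; four nines leaves a little over 4. The number says nothing about how the system behaves when that budget is being spent, which is the gap resilience fills.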

Fault tolerance is the property of continuing to operate when a specific class of failure occurs. A fault-tolerant database survives a node loss; a fault-tolerant request path survives a slow downstream. Fault tolerance is a building block for resilience. Resilience also covers what you do when the failure is one you did not design for, which is most of them.

Shorthand: reliability is the score, availability is the uptime, fault tolerance is the local defense. Resilience is the whole posture.

[Figure: Resilience and three neighboring concepts. Resilience, the whole posture: keep serving users under stress; recover fast when something breaks. Reliability, the score: probability of correct behavior over a period; a measurement, not an action. Availability, the uptime: fraction of time the system answers requests; an outcome resilience produces. Fault tolerance, the local defense: surviving a specific class of failure; a building block resilience uses. Resilience contains all three: it does the work that makes the score good, the uptime real, and the defenses count.]

2. Why steady-state thinking misses resilience

A system that works under normal conditions tells you almost nothing about how it works under abnormal ones, and abnormal conditions are where the cost lands. Latency stays flat right up until a thread pool saturates. Error rate sits at zero right up until a dependency timeout cascades. The interesting behavior lives at the edges of the operating envelope, and steady-state dashboards do not look there.

The cost of getting this wrong is asymmetric. Everyday minutes of correct behavior are invisible to users and worth nothing extra; minutes of incorrect behavior are visible to everyone and worth a great deal of negative attention. A payment processor that succeeds a million times in a row gets no credit for the million. The one transaction that hangs at checkout is the one the customer remembers, tweets about, and uses to decide whether to come back.

The same asymmetry shows up across the business. A single outage during a sales event can erase a quarter of optimization wins, one data-loss incident outlives a decade of clean uptime, and a brittle system burns the people responsible for it long before it burns the business.

Resilience is the work you do because you take that asymmetry seriously: treating the unknown failure mode as more important than the known happy path, across the design, the operations, and the team.

3. The four jobs of a resilient system

Every resilience technique does one of four jobs: preventing a failure, absorbing one when prevention is not enough, detecting one that slipped through, or recovering after the damage is done. The technique list itself is long and gets longer every year, and organizing it by what each technique buys you is the only way to keep track. The rest of this section walks the four in order.

[Figure: The four jobs of a resilient system, along the failure lifecycle (before → during → after). Prevent (stop it happening): redundancy, isolation (bulkheads). Absorb (keep working anyway): retries with backoff, circuit breakers, graceful degradation, load shedding, timeouts. Detect (notice it happened): observability (logs, metrics, traces), anomaly detection. Recover (get back to normal): failover, disaster recovery, runbooks.]

3.1. Prevent

Prevention is about making the failure not happen in the first place. The dominant technique is redundancy: multiple instances of any component whose loss would be visible. Three application replicas behind a load balancer, a database with a synchronous replica in another availability zone, two independent power feeds into the rack. The principle is that single points of failure are the ones that take the system down, so the design removes them.

Isolation is the close cousin. A failure that is allowed to propagate everywhere is a failure that takes everything down with it. Service boundaries, separate connection pools per downstream, and separate thread pools per workload class (the “bulkhead” pattern, borrowed from shipbuilding) all exist to keep one component’s bad day from becoming the whole system’s bad day.
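A bulkhead can be as small as a bounded semaphore per downstream. The sketch below is one possible shape, not a reference implementation; `Bulkhead`, `BulkheadFull`, and the limit are hypothetical names and numbers chosen for illustration.

```python
import threading
from contextlib import contextmanager


class BulkheadFull(Exception):
    """Raised when a compartment has no free slots (hypothetical name)."""


class Bulkhead:
    """Caps concurrent calls to one downstream so it cannot starve the others."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self):
        # Fail fast instead of queueing: a waiting caller still holds a thread.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(self.name)
        try:
            yield
        finally:
            self._slots.release()
```

Each downstream gets its own instance, for example `payments = Bulkhead("payments", 10)`, and calls are wrapped in `with payments.slot():`. When payments is slow and all ten slots are stuck, further payment calls fail immediately while calls guarded by other bulkheads keep flowing.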

Prevention has a hard ceiling. You can pay for as many replicas as you like, but the failures that matter most are usually the ones nobody anticipated, and those slip through prevention by definition. Which is why the other three jobs exist.

3.2. Absorb

Absorbing failure means continuing to do useful work while something is wrong. The techniques in this category share a common shape: instead of crashing or hanging, the system does something less ambitious but still valuable.

Graceful degradation is the high-level pattern. When the recommendation service is down, the product page still loads with a static fallback. When the search backend is slow, the site falls back to a simpler keyword match. The user gets a worse experience instead of no experience.
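In code, the recommendation example comes down to catching the failure and returning something less ambitious. This is a minimal sketch; `STATIC_FALLBACK` and `recommendations_or_fallback` are hypothetical names, and a real service would also log the failure and record a metric.

```python
from typing import Callable, List

# Hypothetical static list shown when live recommendations are unavailable.
STATIC_FALLBACK: List[str] = ["bestseller-1", "bestseller-2", "bestseller-3"]


def recommendations_or_fallback(fetch: Callable[[], List[str]]) -> List[str]:
    """Return live recommendations, or the static list if the fetch fails."""
    try:
        return fetch()
    except Exception:
        # Broad catch is deliberate in this sketch: for this widget, any failure
        # of the recommendation call should degrade the page, not break it.
        return list(STATIC_FALLBACK)
```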

Retries handle transient failures, which most failures turn out to be. The non-obvious part is that naive retries make the problem worse: a thundering herd of clients all retrying at the same moment can amplify a small outage into a large one. Real retry logic uses exponential backoff with jitter, and a budget for how many attempts are worth making.
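A minimal sketch of that retry shape, assuming the caller can tell transient errors apart from permanent ones. `TransientError`, the default delays, and the attempt budget are illustrative assumptions. The jitter variant used here draws each delay uniformly between zero and the exponential ceiling (often called "full jitter"); other variants exist.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(
    op: Callable[[], T],
    max_attempts: int = 4,      # the retry budget
    base_delay: float = 0.1,    # seconds; ceiling doubles on every attempt
    max_delay: float = 5.0,     # upper bound on any single delay
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call op(), retrying transient failures with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget spent: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Randomising the delay spreads clients out so they do not
            # all retry at the same instant.
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")
```

`sleep` and `rng` are parameters only so the behavior can be tested without waiting; in normal use the defaults apply.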

Circuit breakers prevent the retry pattern from going wrong in the other direction. If a downstream is clearly failing, hammering it with requests wastes capacity on both sides. A circuit breaker watches the failure rate and, past a threshold, stops sending traffic for a cooldown period, then probes cautiously to see if recovery has happened. The name and the job are borrowed wholesale from electrical engineering.

[Figure: Circuit breaker state machine. Closed (traffic flows) moves to Open (traffic blocked) when failure rate > threshold. Open moves to Half-open (probe one) when the cooldown expires. Half-open returns to Closed if the probe succeeds, or to Open if the probe fails. Goal: stop hammering a clearly-failing downstream; probe cautiously before resuming.]
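The closed, open, and half-open states can be sketched in a few dozen lines. This version makes simplifying assumptions that I want to be explicit about: it trips on a count of consecutive failures rather than a measured failure rate, it is not thread-safe, and it does not limit the half-open state to a single in-flight probe under concurrency. The class and parameter names are illustrative.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self.state = "closed"

    def call(self, op: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.cooldown_seconds:
                raise CircuitOpenError("downstream is marked unhealthy")
            self.state = "half-open"  # cooldown expired: let one probe through
        try:
            result = op()
        except Exception:
            self._record_failure()
            raise
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

The `clock` parameter exists so the cooldown can be tested without waiting; the default is a monotonic clock, which is unaffected by wall-clock adjustments.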

Bulkheads and timeouts keep one slow call from consuming all the resources. A connection pool of unbounded size is a bug; a request handler without a timeout is a bug. Both will eventually find a way to take the system down.
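The timeout half can be sketched with the standard library's `asyncio.wait_for`, which cancels the awaited call when the deadline passes. The wrapper name and the fallback parameter are my own illustrative choices.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_timeout(
    op: Callable[[], Awaitable[T]],
    timeout_seconds: float,
    fallback: T,
) -> T:
    """Await op(), but give up after timeout_seconds and return the fallback."""
    try:
        # wait_for cancels the underlying task on timeout, so the slow call
        # stops holding resources on this side of the connection.
        return await asyncio.wait_for(op(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return fallback
```

Whether to return a fallback or raise is a design choice; returning a value suits optional features, while a required dependency should usually surface the timeout to the caller.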

Load shedding is the brute-force absorb technique: when the system is past capacity, deliberately reject requests rather than accept all of them and serve none well. The companion concept, backpressure, propagates the “I am full” signal up the call chain so callers slow down rather than pile on.
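One simple way to shed load is a bounded queue that refuses new work when it is full. The sketch below assumes a single process with a worker draining the queue; `LoadShedder` and its limit are hypothetical. The `False` return is the "I am full" signal, which a caller would typically turn into a fast rejection the client can back off from.

```python
import queue
from typing import Any


class LoadShedder:
    """Accepts work up to a fixed backlog, then rejects instead of queueing."""

    def __init__(self, max_pending: int) -> None:
        if max_pending < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, which defeats the point.
            raise ValueError("max_pending must be at least 1")
        self._pending: "queue.Queue[Any]" = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def submit(self, request: Any) -> bool:
        """Return True if the request was accepted, False if it was shed."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def next_request(self) -> Any:
        """Called by the worker; raises queue.Empty if nothing is pending."""
        return self._pending.get_nowait()
```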

3.3. Detect

You cannot respond to a failure you do not know about. Detection is the discipline of knowing what the system is doing in enough detail to notice when it stops doing the right thing, ideally before users do.

The umbrella term is observability: the ability to ask new questions about a running system without shipping new code. The three classical pillars are logs (what happened), metrics (how much, how often, how fast), and traces (the path of one request through a distributed system). On modern systems these are increasingly emitted through a single instrumentation API, OpenTelemetry.

Detection has a hard problem at its core: most of the time, most signals look normal, and “normal” itself shifts with traffic patterns, deploys, and seasons. Static thresholds catch the obvious cases and miss the subtle ones; on a busy system they also generate enough false alarms to numb the on-call. The way out is to learn what normal looks like and alert on departures from it, which is the topic of anomaly detection and AIOps.
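The simplest form of "learn what normal looks like" is a rolling baseline: compare each new value against the mean and standard deviation of a recent window. The sketch below is a toy under stated assumptions (one metric, no seasonality, anomalous points are still added to the window), and the window size and threshold are arbitrary; production anomaly detection is considerably more involved.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags values that sit far from the recent rolling baseline."""

    def __init__(self, window: int = 30, z_threshold: float = 3.0, min_samples: int = 5) -> None:
        self._values: deque = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Record a value and return True if it departs from the baseline."""
        is_anomaly = False
        if len(self._values) >= self.min_samples:
            baseline = mean(self._values)
            spread = pstdev(self._values)
            if spread > 0:
                is_anomaly = abs(value - baseline) / spread > self.z_threshold
            else:
                # Perfectly flat history: any change at all is a departure.
                is_anomaly = value != baseline
        self._values.append(value)
        return is_anomaly
```

Because the baseline moves with the window, a value that would be alarming against last week's traffic but is ordinary against the last half hour does not page anyone, which is the property a static threshold lacks.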

The detection layer also feeds the next one. The same signals that page on-call are what an engineer uses to decide whether to roll back, fail over, or wait, which is where recovery starts.

3.4. Recover

Recovery is the work of getting back to normal after a failure has already affected the system. The earlier three jobs reduce how often you need it; you still need it, because the failure mode you did not anticipate is the one you will eventually face.

Failover is the automatic version: when the primary fails, traffic shifts to a standby. This requires that the standby is actually warm (a cold replica that takes ten minutes to start is not a failover, it is a slow restart) and that the failover path itself has been tested under realistic conditions, not just in the design doc.

Disaster recovery is the broader plan for the worst cases: a region going dark, a database getting corrupted, ransomware encrypting the production cluster. The valuable artifacts are recovery time objective (RTO, how long you can be down) and recovery point objective (RPO, how much data you can afford to lose), both expressed as numbers the business has signed off on, with backup and restoration procedures that have been rehearsed against those numbers.

[Figure: RTO and RPO on a failure timeline. Timeline points: last good backup, failure happens, service restored. RPO (data lost): how much you can afford to lose, spanning last good backup to failure. RTO (downtime): how long you can be down, spanning failure to service restored. Both are durations the business signs off on, then the team rehearses against.]
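Since both objectives are durations, a rehearsal can be scored against them with plain timestamp arithmetic. The sketch below is one way to record a drill; `RecoveryDrill` and `meets_objectives` are hypothetical names, and the timestamps in the usage note are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryDrill:
    last_good_backup: datetime
    failure_at: datetime
    restored_at: datetime

    @property
    def data_loss_window(self) -> timedelta:
        """Writes made in this window are gone after restore (compare to RPO)."""
        return self.failure_at - self.last_good_backup

    @property
    def downtime(self) -> timedelta:
        """How long the service was unavailable (compare to RTO)."""
        return self.restored_at - self.failure_at


def meets_objectives(drill: RecoveryDrill, rpo: timedelta, rto: timedelta) -> bool:
    """True only if the drill stayed within both signed-off objectives."""
    return drill.data_loss_window <= rpo and drill.downtime <= rto
```

For example, a drill with a backup at 02:00, a failure at 02:45, and service restored at 03:30 has a 45-minute data-loss window and 45 minutes of downtime. Against an RPO of one hour and an RTO of 30 minutes it fails, and the failing number tells the team which procedure to work on.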

Runbooks are the human side. When something is on fire at 3am, the operator should not be deriving the response from first principles. A runbook for a known failure mode (database failover, regional shift, certificate rotation) is a script someone wrote calmly during business hours, so the panicked version of themselves at 3am does not have to invent it.

The pattern across all three: recovery only counts if it has been practiced. A failover that has never been triggered, a backup that has never been restored, a runbook that has never been walked through are theatrical artifacts, not capabilities. Practice is a habit, and habits are a property of the team, which is the next piece.

4. Design for failure as a default

Every prevent/absorb/detect/recover technique above is hollow if the team building the system does not believe failures will happen. The most common resilience anti-pattern is not a missing circuit breaker or a thin runbook; it is a codebase, a roadmap, and a culture that quietly assume the happy path.

The assumption shows up in code that does not handle a downstream error because “that service is always up”, in a deploy that has no rollback path because “the change is small”, in a feature shipped without monitoring because “we’ll add it if there’s a problem”. Each is a local decision that saves an hour today and costs a day in six months. Resilient codebases are the ones where engineers, by default, ask what happens when the call fails and write the code for that branch before the code for the happy one. It’s a small reordering with a big downstream effect.

The mindset has a name in distributed systems: design for failure. Assume the disk fills, the network partitions, the dependency hangs, the deploy half-succeeds, the clock skews, the message arrives twice. Then design the system so each of those is a normal day rather than an incident. This is not pessimism; it is the realistic prior. At any scale beyond a toy system, every one of those things will happen to you, often in combinations you would not have predicted.

The cultural correlate is blameless postmortems: when something does go wrong, the response is to understand the failure and improve the system, not to find the engineer who pressed the button. A team that punishes individuals for incidents will get fewer reported incidents, which is the opposite of the goal. Treat incidents as the system telling you where to invest, and the platform gets more resilient over time.

Resilience is more team property than code property. The code is downstream of how the people building it think.

5. Verifying resilience by breaking things on purpose

A resilient design is a hypothesis. Until something breaks the system in a real way and the system responds the way the design predicted, you do not actually know whether the design works. The only honest answer is to break things on purpose, in controlled conditions, and watch.

That practice is chaos engineering. The original version, Netflix’s Chaos Monkey, killed production VMs at random during business hours; later tools expand the failure modes (network latency, packet loss, full region outages) and apply them to staging or production with appropriate guardrails. The point is empirical. The system either survives the injected failure or it does not, and either way you have learned something you could not have learned by reading the architecture diagram.

I cover the full picture, including the prerequisites most teams skip before they should be doing this at all, in the chaos engineering post. The short version for this post: chaos engineering is what turns “we believe the system is resilient” into “we have evidence the system is resilient”. Without it, the rest of the work in this post is half-finished.

6. The rest of the Reliability Engineering topic

The rest of the Reliability Engineering topic on this site fills in the parts this post deliberately keeps shallow.

Resilience is not a single technique, a single tool, or a single team’s job. It is the shape of a system, and of the people building it, that takes failure seriously enough to plan for it. The other posts in the topic are the pieces; this one is the frame.