Estimated reading time: 9 minutes · By: Editorial Team · Published: January 22, 2024
Overview
In distributed systems, failures are not exceptional events — they are normal operating conditions. The difference between a resilient platform and a fragile one is not whether things break, but how the system behaves when they do.
Key takeaways
- Design for failure from day one; retrofitting resilience is significantly harder.
- Circuit breakers prevent cascading failures more effectively than retries alone.
- Bulkheads limit blast radius so one failing component cannot sink the whole system.
- Observability is the prerequisite for everything else — you cannot fix what you cannot see.
Retries and exponential backoff
The simplest resilience pattern is retrying a failed operation. Retries handle transient failures — a momentary network glitch, a brief overload on a downstream service — without user-visible errors.
Retry with exponential backoff and jitter
Immediate retries can worsen overload conditions. Exponential backoff spaces retries at increasing intervals:
- Attempt 1: immediate
- Attempt 2: wait 1 second
- Attempt 3: wait 2 seconds
- Attempt 4: wait 4 seconds
Jitter adds randomness to the backoff interval so that multiple clients don't retry in synchronized waves, which would recreate the overload.
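A minimal sketch of this pattern in Python, using "full jitter" (sleeping a random fraction of the backoff window); the exception type and parameter values are placeholders, not a specific library's API:

```python
import random
import time

class TransientError(Exception):
    """Placeholder for whatever exception type signals a retryable failure."""

def call_with_retries(operation, max_attempts=4, base_delay=1.0, max_delay=30.0):
    """Call operation(), retrying transient failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure to the caller
            # Exponential backoff: 1 s, 2 s, 4 s, ... capped at max_delay.
            backoff = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Full jitter: sleep a random fraction of the backoff window so that
            # many clients do not retry in synchronized waves.
            time.sleep(random.uniform(0, backoff))
```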
Retry limits and idempotency
Set a maximum retry count and a total timeout. Retry only operations that are safe to repeat (idempotent). A payment charge is not idempotent — retrying it without a deduplication key risks a double charge.
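To illustrate the deduplication-key idea, here is a hedged sketch that reuses the call_with_retries helper from the backoff example; payment_client, its charge method, and the idempotency_key parameter are hypothetical names, not a real payment API:

```python
import uuid

def charge_once(payment_client, account_id, amount_cents):
    # Generate the key once, before the first attempt, and reuse it on every retry.
    # A server that deduplicates on this key processes the charge at most once.
    idempotency_key = str(uuid.uuid4())
    return call_with_retries(  # helper from the backoff sketch above
        lambda: payment_client.charge(
            account=account_id,
            amount=amount_cents,
            idempotency_key=idempotency_key,  # hypothetical parameter name
        )
    )
```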
Circuit breakers
When a downstream service is failing, retrying every request wastes resources and delays the caller's failure response. A circuit breaker tracks failure rates and stops sending requests to a service that is clearly unavailable.
Circuit breaker states
- Closed — requests pass through normally; failures are counted, and when they exceed a threshold the circuit "trips" to open
- Open — requests fail immediately without attempting the downstream call
- Half-open — after a cooldown period, a trial request is allowed through; if it succeeds, the circuit closes; if it fails, the circuit returns to open
Circuit breakers are available as libraries (Resilience4j for Java, Polly for .NET, opossum for Node.js) and as service mesh features (Istio, Envoy).
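For illustration only, here is a stripped-down breaker in Python that tracks consecutive failures and walks through the three states above; it does not mirror any of the libraries named, which add rolling failure windows, thread safety, and metrics:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = "half-open"  # cooldown elapsed: allow a trial request
            else:
                raise RuntimeError("circuit open: failing fast without calling downstream")
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.state == "half-open" or self.failure_count >= self.failure_threshold:
                self.state = "open"       # trip (or re-open) and start the cooldown
                self.opened_at = time.monotonic()
            raise
        else:
            self.failure_count = 0
            self.state = "closed"         # success closes the circuit
            return result
```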
Bulkheads
A bulkhead isolates failures in one component from spreading to others, the same way watertight compartments in a ship prevent a single breach from sinking the vessel.
Thread pool isolation
Assign separate thread pools or connection pools to different downstream dependencies. If the database becomes slow and exhausts its pool, API calls to third-party services continue to function on their own pool.
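A minimal sketch of this idea with Python's standard library: one bounded executor per dependency, so exhausting one pool cannot starve the other. The pool sizes, timeouts, and the two downstream functions are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

# One bounded pool per downstream dependency. If the database slows down and its
# 20 workers fill up, calls to the partner API keep running on their own pool.
db_pool = ThreadPoolExecutor(max_workers=20, thread_name_prefix="db")
partner_api_pool = ThreadPoolExecutor(max_workers=10, thread_name_prefix="partner-api")

def fetch_order(order_id):
    # query_orders_table is a placeholder for the real database call.
    return db_pool.submit(query_orders_table, order_id).result(timeout=2.0)

def fetch_shipping_quote(order_id):
    # call_shipping_api is a placeholder for the real third-party call.
    return partner_api_pool.submit(call_shipping_api, order_id).result(timeout=2.0)
```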
Service-level isolation
In a microservices architecture, a bulkhead can mean deploying a dedicated instance of a service for high-priority traffic so that a surge in low-priority requests cannot degrade critical paths.
Timeouts
Every network call must have a timeout. A call that never returns holds a thread, a connection, and often a user session. Without timeouts, a slow dependency causes threads to accumulate until the calling service is fully occupied — a pattern called thread exhaustion.
Set timeouts at every level:
- Connection timeout — how long to wait to establish a connection
- Read timeout — how long to wait for data once connected
- Overall request timeout — the total budget for the entire operation including retries
Timeout values should be derived from observed p99 latencies, not guessed.
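As a sketch with the Python requests library, which accepts separate connect and read timeouts; the overall budget, retries included, has to be enforced by the caller. The values below are illustrative, not recommendations:

```python
import time
import requests

CONNECT_TIMEOUT = 0.5   # seconds to establish the connection
READ_TIMEOUT = 2.0      # seconds to wait for data once connected
TOTAL_BUDGET = 5.0      # overall budget for the operation, retries included

def get_with_budget(url):
    deadline = time.monotonic() + TOTAL_BUDGET
    while True:
        try:
            return requests.get(url, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
        except requests.exceptions.RequestException:
            # Backoff omitted for brevity; see the retry sketch earlier.
            if time.monotonic() >= deadline:
                raise  # total budget exhausted; surface the failure
```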
Fallbacks and graceful degradation
When a dependency fails, the system should have a defined behavior rather than propagating the error to end-users. Common fallback strategies:
- Cached response — serve the last known-good value, marked as potentially stale
- Default value — return an empty list or zero rather than an error
- Degraded feature — disable the affected feature rather than failing the whole page
- Fail-open — in non-critical paths, allow the operation to proceed even when a check or supporting dependency is unavailable, rather than blocking it
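A sketch of the cached-response strategy: serve the last known-good value, flagged as stale, and fall back to a default when no cached value exists. The in-memory cache and the fetch_live callable are placeholders for a real cache and downstream call:

```python
last_known_good = {}  # in practice a TTL cache or external store, keyed per user/segment

def recommendations_with_fallback(user_id, fetch_live):
    try:
        fresh = fetch_live(user_id)  # placeholder for the real downstream call
        last_known_good[user_id] = fresh
        return {"items": fresh, "stale": False}
    except Exception:
        if user_id in last_known_good:
            # Cached response: last known-good value, marked as potentially stale.
            return {"items": last_known_good[user_id], "stale": True}
        # Default value: an empty list beats an error page.
        return {"items": [], "stale": True}
```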
Observability
Resilience patterns contain failures; observability surfaces them. Without visibility into what is failing, where, and how often, you cannot tell whether even well-designed resilience mechanisms are doing their job.
The three pillars
- Metrics — numeric time-series data: request rates, error rates, latency percentiles (p50, p95, p99), resource utilization
- Logs — structured event records with context: request IDs, user IDs, operation names, durations, error codes
- Traces — end-to-end records of a request's path through a distributed system, showing which service added latency where
Use a correlation ID (a unique identifier generated at the edge of the system) that flows through every service call for a given request, linking metrics, logs, and traces together.
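A sketch of propagating a correlation ID and attaching it to structured logs; the X-Correlation-ID header name and the JSON log shape are common conventions assumed here, not a required standard:

```python
import json
import logging
import uuid

logger = logging.getLogger("checkout")

def handle_request(headers):
    # Reuse the ID from the caller if present, otherwise mint one at the edge.
    correlation_id = headers.get("X-Correlation-ID") or str(uuid.uuid4())

    # Every log line carries the ID, so logs, metrics, and traces can be joined on it.
    logger.info(json.dumps({
        "event": "checkout.started",
        "correlation_id": correlation_id,
    }))

    # Forward the same ID on every downstream call.
    return {"X-Correlation-ID": correlation_id}
```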
SLOs and error budgets
Define service level objectives (SLOs) as the target reliability for each user-facing operation. An error budget is the acceptable amount of failure implied by the SLO — if the SLO is 99.9% availability, the error budget is 0.1% of requests.
Error budgets make reliability conversations concrete: when the budget is consumed, new feature work pauses until reliability is restored.
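The arithmetic is simple enough to show directly; for example, a 99.9% availability SLO over a 30-day window:

```python
SLO = 0.999                    # target availability
WINDOW_MINUTES = 30 * 24 * 60  # 30-day window = 43,200 minutes

error_budget_fraction = 1 - SLO                            # 0.1% of requests (or time)
error_budget_minutes = WINDOW_MINUTES * error_budget_fraction

print(f"{error_budget_minutes:.1f} minutes of downtime per 30 days")  # ~43.2 minutes
```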
Chaos engineering
Chaos engineering validates that resilience patterns work as designed by intentionally injecting failures in a controlled way.
A basic chaos practice:
- Identify a hypothesis: "If the recommendation service becomes unavailable, the homepage will still load with a default product list."
- Inject the failure in a staging or low-traffic production environment.
- Observe whether the hypothesis holds.
- Fix any unexpected behaviors before the failure occurs naturally.
Tools: Chaos Monkey (Netflix), Gremlin, AWS Fault Injection Simulator.
Conclusion
Resilience is not a feature to add after a system works — it is a design requirement from the start. The patterns in this article are well-understood and available in every major language and platform. The investment in applying them is measured in hours; the cost of not applying them is measured in outages.