By: Editorial Team · Published: March 5, 2024 · Estimated reading time: 9 minutes

Overview

Scalability is not a single property — it is the combination of design decisions that allow a system to handle more work without a proportional increase in cost, latency, or failure rate. Most scalability problems are predictable and avoidable when addressed during design rather than under production pressure.

Key takeaways

  • Horizontal scaling is more resilient than vertical scaling for most workloads.
  • Statelessness is the enabling property for horizontal scaling.
  • Caching, applied at the right layer, is the highest-return optimization available.
  • Database scaling is the hardest problem; deferring it is a common and costly mistake.

Load distribution

A system that runs on a single server has a single point of failure and a hard ceiling on capacity. Distributing load across multiple instances removes both constraints.

Horizontal vs. vertical scaling

Vertical scaling means giving a single server more CPU, memory, or storage. It is simple to implement but has hard physical limits and creates a single failure point.

Horizontal scaling means adding more servers and distributing work across them. It is more complex to implement, but its capacity ceiling is far higher, and it degrades gracefully when individual nodes fail.

Most modern cloud architectures default to horizontal scaling with identical stateless instances behind a load balancer.

Load balancer patterns

  • Round robin — requests are distributed sequentially across instances; works well when requests are similar in cost
  • Least connections — routes to the instance with the fewest active connections; better for variable-cost requests
  • Sticky sessions — routes a user to the same instance for the duration of their session; useful when statelessness is not yet achievable but introduces coupling
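The first two strategies can be sketched in a few lines. This is an illustrative sketch, not a real load balancer: the class and instance names (`RoundRobinBalancer`, `LeastConnectionsBalancer`, `"app-1"`) are hypothetical.

```python
import itertools

class RoundRobinBalancer:
    """Distributes requests sequentially across instances."""
    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Routes each request to the instance with the fewest active connections."""
    def __init__(self, instances):
        self.active = {inst: 0 for inst in instances}

    def pick(self):
        # Choose the instance currently handling the least work.
        instance = min(self.active, key=self.active.get)
        self.active[instance] += 1
        return instance

    def release(self, instance):
        # Called when a request completes.
        self.active[instance] -= 1
```

Round robin needs no feedback from the instances; least connections requires the balancer to track when each request finishes, which is why it handles variable-cost requests better.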

Statelessness

A stateless service does not store any information about previous requests. Each request contains all the data the service needs to respond. This property is what makes horizontal scaling straightforward.

Making a service stateless

Move all persistent state out of the application tier:

  • Session data — move to a shared cache like Redis
  • Uploaded files — move to object storage like S3
  • Database connections — use a connection pool shared across instances
  • Configuration — read from environment variables or a configuration service, not local files

Once the service is stateless, adding or removing instances requires no coordination.
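The session-data move can be sketched as a store keyed by session ID. A plain dict stands in for the shared cache here so the sketch is self-contained; in production the same interface would wrap Redis or Memcached so every instance sees the same sessions. All names (`SharedSessionStore`, `save`, `load`) are hypothetical.

```python
import json
import time

class SharedSessionStore:
    """Session data keyed by session ID, held outside the application tier.

    A dict stands in for the shared cache; the interface mirrors what a
    Redis-backed store would expose (set with TTL, get, lazy expiry).
    """
    def __init__(self):
        self._store = {}  # session_id -> (serialized payload, expiry timestamp)

    def save(self, session_id, data, ttl_seconds=3600):
        self._store[session_id] = (json.dumps(data), time.time() + ttl_seconds)

    def load(self, session_id):
        entry = self._store.get(session_id)
        if entry is None:
            return None
        payload, expires_at = entry
        if time.time() >= expires_at:
            del self._store[session_id]  # expire lazily on read
            return None
        return json.loads(payload)
```

Because every request carries a session ID and the store lives outside the instances, any instance can serve any request, and sticky sessions become unnecessary.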

Caching

Caching stores the result of an expensive operation so future requests can skip the computation. It is the most impactful single optimization for read-heavy systems.

Cache layers

  • CDN (edge cache) — serves static assets and full pages from locations near the user; reduces origin server load dramatically
  • Application cache — stores the results of database queries or API calls in memory (Redis, Memcached)
  • Database query cache — offered natively by some databases; useful but harder to invalidate correctly
  • Client cache — HTTP Cache-Control headers allow browsers to cache responses locally
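The application-cache layer is commonly implemented with the cache-aside pattern: check the cache, fall back to the expensive loader on a miss, and store the result with a TTL. A minimal in-memory sketch, assuming a hypothetical `CacheAside` wrapper (a real deployment would back this with Redis or Memcached):

```python
import time

class CacheAside:
    """Cache-aside: consult the cache first, run the loader on a miss,
    and remember the result for a fixed TTL."""
    def __init__(self, loader, ttl_seconds=60):
        self.loader = loader          # the expensive operation (DB query, API call)
        self.ttl = ttl_seconds
        self._entries = {}            # key -> (value, expiry timestamp)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.time():
            self.hits += 1
            return entry[0]
        self.misses += 1
        value = self.loader(key)      # only runs on a miss or expired entry
        self._entries[key] = (value, time.time() + self.ttl)
        return value
```

The hit/miss counters matter in practice: a cache with a low hit rate adds latency and complexity without reducing load.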

Cache invalidation

The hardest part of caching is knowing when to expire entries. Common strategies:

  • TTL-based expiry — entries expire after a fixed time; simple but can serve stale data
  • Write-through invalidation — entries are invalidated when the underlying data changes; more accurate but requires coordination
  • Versioned keys — cache entries include a version identifier; old versions are ignored rather than deleted

Database scaling

The database tier is the most common bottleneck in scaled systems. Planning for database scale early prevents the most painful migrations.

Read replicas

For read-heavy workloads, directing all writes to a primary instance and reads to one or more replicas can multiply read capacity without changing the data model. Replication lag must be accounted for in applications that need consistent reads.
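The routing decision can be sketched as a small router that pins consistency-sensitive reads to the primary, one common way to account for replication lag. The names (`ReplicaRouter`, `read_your_writes`) are hypothetical:

```python
import random

class ReplicaRouter:
    """Sends writes to the primary and reads to a randomly chosen
    replica. Reads that must observe the caller's own recent writes
    can be pinned to the primary to sidestep replication lag."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def for_write(self):
        return self.primary

    def for_read(self, read_your_writes=False):
        if read_your_writes or not self.replicas:
            return self.primary
        return random.choice(self.replicas)
```

Adding a replica is then a one-line configuration change; the application code never needs to know how many replicas exist.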

Sharding

Sharding partitions data horizontally across multiple database instances, each holding a subset of rows. It scales write capacity but introduces complexity: cross-shard queries are expensive, and rebalancing shards as data grows requires careful planning.
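The simplest shard-routing scheme hashes the key modulo the shard count. A sketch (the function name `shard_for` is hypothetical; a stable digest is used because Python's built-in `hash()` is randomized per process):

```python
import hashlib

def shard_for(key: str, shard_count: int) -> int:
    """Maps a key to a shard deterministically via a stable hash."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % shard_count
```

This sketch also shows why rebalancing is hard: changing `shard_count` remaps almost every key, which is why production systems often use consistent hashing or a fixed number of virtual shards mapped to physical instances.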

CQRS (Command Query Responsibility Segregation)

CQRS separates the data models used for reads and writes. The write model is optimized for consistency; the read model is optimized for query performance. This pattern works well in high-scale systems but adds significant complexity and is not appropriate for most applications below a certain scale.
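The separation can be illustrated in miniature: commands mutate a normalized write model and emit events, and a denormalized read view is projected from those events. This is a single-process sketch with hypothetical names (`ProductCatalog`, `set_price`, `price_of`); in a real CQRS system the two models typically live in separate stores updated asynchronously.

```python
class ProductCatalog:
    """Minimal CQRS sketch: the write model is normalized by product ID;
    the read view is denormalized for the queries the UI actually runs."""
    def __init__(self):
        self._write_model = {}  # product_id -> {"name": ..., "price": ...}
        self._read_view = {}    # name -> price (query-optimized projection)

    # Command side: validate and record the change, then emit an event.
    def set_price(self, product_id, name, price):
        self._write_model[product_id] = {"name": name, "price": price}
        self._apply_event({"name": name, "price": price})

    # Projection: keeps the read view in sync with emitted events.
    def _apply_event(self, event):
        self._read_view[event["name"]] = event["price"]

    # Query side: reads never touch the write model.
    def price_of(self, name):
        return self._read_view.get(name)
```

Even in this toy form, the cost is visible: two models to maintain and a projection to keep correct, which is why the pattern only pays off at significant scale.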

Connection pooling

Database connections are expensive to establish. A connection pool maintains a set of open connections and reuses them across requests. Without pooling, a spike in traffic can exhaust database connection limits and cause cascading failures. Use a connection pool in every application that talks to a database.
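A pool can be sketched with a bounded queue: connections are created once, handed out on demand, and a blocking acquire with a timeout replaces unbounded connection creation under load. This sketch (the `ConnectionPool` name is hypothetical; production code would use a library such as the pool built into most database drivers) uses SQLite only so the example is self-contained:

```python
import queue
import sqlite3

class ConnectionPool:
    """Maintains a fixed set of open connections and reuses them.
    When the pool is exhausted, acquire() blocks up to a timeout
    instead of opening new connections without bound."""
    def __init__(self, factory, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())  # pay the connection cost once, up front

    def acquire(self, timeout=5.0):
        # Raises queue.Empty if no connection frees up in time --
        # a fast, visible failure rather than a silent pile-up.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

The bounded queue is the point: a traffic spike translates into bounded waiting at the application tier rather than connection exhaustion at the database.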

Design for failure

Scalable systems are also resilient systems. Build with the assumption that any component will fail eventually:

  • Retries with backoff — transient failures should be retried with exponential backoff and jitter, not immediately
  • Circuit breakers — stop calling a failing service rather than accumulating queued requests
  • Timeouts — every network call needs a timeout; hanging calls block threads and cascade failures
  • Graceful degradation — define what the system should do when a dependency is unavailable
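The first pattern above, retries with exponential backoff and jitter, can be sketched in a few lines. The function name `retry_with_backoff` and its parameters are hypothetical; real code would catch only the exceptions known to be transient rather than a bare `Exception`:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retries an operation on failure with exponential backoff and
    full jitter, so retrying clients spread out instead of hammering
    a recovering service in lockstep."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```

The jitter is not optional decoration: without it, every client that failed at the same moment retries at the same moment, turning a brief outage into a synchronized retry storm.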

Conclusion

Scalability comes from designing around the constraints that bite first: stateful services, single database instances, and absent caching. Addressing these early — before traffic materializes — is dramatically cheaper than retrofitting them under load.
