I used to think hashing was simple. You take a key, run it through a hash function, mod it by the number of nodes, and you're done — hash(key) % N. It distributes keys evenly, it's fast, and it works perfectly.
Until I watched a cache cluster lose 80% of its keys during a routine scale-up.
We added one node to a 3-node Redis cache. Simple change, right? N went from 3 to 4. But because of how modulo hashing works, almost every key now mapped to a different node. The cache hit rate dropped from 92% to 8% in seconds. The database buckled under the sudden load spike. It was a bad afternoon.
That's when I learned about consistent hashing. And once I understood it, I started seeing it everywhere — in distributed caches, database sharding, Kafka partitions, and CDN routing. This post is the guide I wish I had that day.
The problem with traditional hashing
Let's start with what most of us reach for first: hash(key) % N.
```python
def get_node(key, nodes):
    return nodes[hash(key) % len(nodes)]
```
It's simple and distributes keys evenly when N is fixed. But the moment N changes — a node is added, a node crashes, a node is removed for maintenance — the modulo shifts for almost every key.
With 3 nodes, hash(k5) % 3 = 2, so k5 goes to Node 2. Add a 4th node, and hash(k5) % 4 could be anything. In practice, adding a node to an N-node cluster remaps roughly N/(N+1) of all keys (removing one is similarly bad). With 3 nodes, that's 75%. With 100 nodes, it's over 99%.
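You can see the damage with a quick sketch (the key count and node counts are arbitrary):

```python
def modulo_node(key, n_nodes):
    return hash(key) % n_nodes

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(1 for k in keys if modulo_node(k, 3) != modulo_node(k, 4))
print(f"{moved / len(keys):.0%} of keys remapped")  # ~75% going from 3 to 4 nodes
```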
In a cache, that means a massive cascade of cache misses. In a sharded database, that means data is suddenly on the wrong shard and queries break. This is the problem consistent hashing solves.
What consistent hashing actually is
The core idea: instead of hashing keys to node indices, you hash both keys and nodes onto a circular space (the "hash ring").
Here's how it works:
- Imagine a circle representing the full output range of the hash function (e.g., 0 to 2³²−1 for a 32-bit hash, or 0 to 2¹²⁸−1 for MD5).
- Place each node on the circle at position hash(node_id).
- Place each key on the circle at position hash(key).
- To find which node owns a key, walk clockwise on the circle until you hit a node. That's the owner.
When a node is added or removed, only the keys that were mapped to that node (and now map to its clockwise neighbor) need to move. Everything else stays put.
With 3 nodes, adding a 4th remaps roughly 1/4 of the keys — not 3/4. That's the entire point.
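Here's a bare-bones sketch of that claim, with no virtual nodes yet (node names and key count are arbitrary):

```python
import bisect
import hashlib

def position(s):
    # Stable ring position (Python's built-in hash() is salted per process).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes):
    ring = {position(n): n for n in nodes}
    return sorted(ring), ring

def owner(key, positions, ring):
    # Clockwise walk: first node position at or after the key, wrapping at the top.
    idx = bisect.bisect_left(positions, position(key))
    return ring[positions[idx % len(positions)]]

keys = [f"user:{i}" for i in range(100_000)]
pos3, ring3 = build_ring(["node-a", "node-b", "node-c"])
pos4, ring4 = build_ring(["node-a", "node-b", "node-c", "node-d"])
moved = sum(1 for k in keys if owner(k, pos3, ring3) != owner(k, pos4, ring4))
print(f"{moved / len(keys):.0%} of keys remapped")
# ~1/4 in expectation; with only one position per node the actual arc
# can be skewed, which is exactly what virtual nodes fix (next section).
```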
The virtual node trick
There's a subtle problem with the basic approach. If you only have a few nodes, their positions on the ring might be uneven. One node could end up responsible for 60% of the keys just because of where its hash landed. This is called hash skew.
The fix is virtual nodes (also called "vnodes"). Each physical node is represented by many (often 100–200) virtual nodes spread around the ring.
When you add a physical node, you add 100+ virtual nodes at different positions. This smooths out the distribution dramatically. Each physical node ends up owning roughly 1/N of the ring, even with a small number of nodes.
Cassandra, Amazon's Dynamo, and Memcached client libraries (via the ketama algorithm) all use virtual nodes or a close variant. In Cassandra, num_tokens defaulted to 256 virtual nodes per physical node for years (newer releases default to 16). You don't need to implement this yourself — most libraries handle it — but it's important to understand why it's there.
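A quick way to see the effect is to measure what fraction of keys each physical node owns, with and without virtual nodes. A sketch (node names, key count, and replica counts are arbitrary):

```python
import bisect
import hashlib
from collections import Counter

def position(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def ownership(nodes, replicas, keys):
    # Build a ring with `replicas` virtual nodes per physical node,
    # then count what fraction of keys lands on each physical node.
    ring = {position(f"{n}#{i}"): n for n in nodes for i in range(replicas)}
    positions = sorted(ring)
    counts = Counter(
        ring[positions[bisect.bisect_left(positions, position(k)) % len(positions)]]
        for k in keys
    )
    return {n: round(counts[n] / len(keys), 3) for n in nodes}

keys = [f"user:{i}" for i in range(50_000)]
print(ownership(["node-a", "node-b", "node-c"], 1, keys))    # often badly skewed
print(ownership(["node-a", "node-b", "node-c"], 150, keys))  # each close to 1/3
```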
When to actually use consistent hashing
Here's where it gets practical. Consistent hashing isn't a magic bullet for everything. It solves a specific problem: minimizing key remapping when the set of nodes changes.
Database sharding
This is the classic use case. You have data sharded across multiple database instances, and you want to add or remove shards without migrating most of your data.
| Without consistent hashing | With consistent hashing |
|---|---|
| Adding a shard → rebalance ~99% of data | Adding a shard → rebalance ~1/N of data |
| Shrinking cluster = massive migration | Shrinking cluster = minimal migration |
When it fits: Your shard key is stable (user_id, tenant_id), and you need to add/remove shards regularly. MongoDB's hashed sharding applies the same idea: it hashes the shard key so data spreads evenly and chunks can be rebalanced with minimal movement.
When it doesn't: If your shard count is fixed and you use range-based sharding instead (e.g., user_id 0–9999 → shard A, 10000–19999 → shard B), consistent hashing adds complexity without benefit.
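For contrast, range-based routing is nothing more than a boundary lookup. A sketch with made-up boundaries and shard names:

```python
import bisect

# Exclusive upper bound of each shard's user_id range.
BOUNDARIES = [10_000, 20_000, 30_000]
SHARDS = ["shard-a", "shard-b", "shard-c"]

def shard_for(user_id):
    # user_id 0-9999 -> shard-a, 10000-19999 -> shard-b, and so on.
    # (Ids at or above the last boundary would need a new shard.)
    return SHARDS[bisect.bisect_right(BOUNDARIES, user_id)]

assert shard_for(9_999) == "shard-a"
assert shard_for(10_000) == "shard-b"
```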
Distributed caching (Redis, Memcached)
This is where I learned the lesson. A cache cluster using modulo hashing will have a catastrophic miss storm on any topology change.
With consistent hashing:
- A node failure means only its keys are lost (they remap to the next node clockwise)
- Adding a node only steals keys from one neighbor
- The rest of the cache stays warm
Real example: You have a 10-node Redis cluster caching user profiles. Node 7 crashes. With modulo hashing over the 9 remaining nodes, almost every key remaps, so the entire cache is effectively cold. With consistent hashing, only ~10% of keys (the ones that were on Node 7) are lost. The other 90% stay cached.
Important note: This assumes a client-side consistent hashing implementation or a proxy that supports it. Redis Cluster uses a different approach (hash slots), but the principle is similar. AWS ElastiCache for Memcached supports consistent hashing natively.
Kafka partition assignment
Kafka doesn't use consistent hashing by that name, but its partition assignment logic follows the same principle. When a consumer joins or leaves a consumer group, Kafka rebalances partitions across the group, and the goal is to move only the partitions that have to move rather than reshuffling everything.
Kafka's sticky assignment strategies (the cooperative variant arrived in Kafka 2.4) are essentially consistent hashing for consumers: they keep existing assignments stable and only reassign what's necessary.
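Opting into the cooperative sticky strategy is a one-line consumer setting. A sketch assuming the confluent-kafka Python client; the broker address, group id, and topic are placeholders:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "profile-cache-workers",     # placeholder
    # Incremental rebalancing: keep existing assignments, move only what must move.
    "partition.assignment.strategy": "cooperative-sticky",
})
consumer.subscribe(["user-profile-events"])  # placeholder topic
```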
When it fits: Any system where you have partitioned data and consumers that can come and go.
CDN and load balancer routing
Content delivery networks use consistent hashing to route requests to origin servers. If you're building a custom load balancer that needs to route sessions to specific backend servers (for sticky sessions), consistent hashing lets you add/remove backends without breaking most existing sessions.
Real example: You're routing user requests to backend servers based on user_id. With consistent hashing, if you add a new backend server, most users still hit the same server they were talking to before. Their in-memory session data is still there.
What it doesn't help with
Consistent hashing is not a solution for:
- Hot keys — if one key gets 90% of the traffic, it doesn't matter how you hash it; that node will be overloaded. You need a different strategy, such as key splitting or a dedicated cache (see the sketch after this list).
- Data replication — consistent hashing tells you where a key lives, not how many copies exist. You still need a replication strategy.
- Strong consistency — knowing which node owns a key doesn't guarantee that node has the latest value. That's a separate problem (consensus, quorums, etc.).
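For the hot-key case, key splitting looks something like this. It's a minimal sketch: a plain dict stands in for the cache cluster, and the replica count and key suffix format are made up. In a real cluster each suffixed key hashes to a different ring position, and therefore usually to a different node.

```python
import random

REPLICAS = 8  # arbitrary: number of copies to spread a known-hot key across

def write_hot(cache, key, value):
    # Fan the value out so any replica can serve a read.
    for i in range(REPLICAS):
        cache[f"{key}#{i}"] = value

def read_hot(cache, key):
    # Each read picks a random replica, spreading load across nodes.
    return cache[f"{key}#{random.randrange(REPLICAS)}"]

cache = {}
write_hot(cache, "celebrity:42", {"followers": 10_000_000})
print(read_hot(cache, "celebrity:42"))
```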
The trade-offs nobody tells you about
Memory overhead of virtual nodes
More virtual nodes = better distribution, but also more memory to track the ring. With 200 virtual nodes per physical node and 100 physical nodes, you're tracking 20,000 points on the ring. This is fine for most systems, but it's not free.
Lookup complexity
Finding the right node means searching the ring. A naive implementation is O(N), but a sorted structure (a sorted array with binary search, a balanced tree, or a skip list) gets you O(log N). In practice, even with thousands of virtual nodes, this cost is negligible.
It's not always available out of the box
Many tools don't expose consistent hashing directly. Redis Cluster uses hash slots (16384 fixed slots, which is conceptually similar). PostgreSQL sharding extensions (like Citus) handle this internally. You might need to implement it yourself if you're building a custom system.
A minimal implementation
Here's a working Python implementation to make it concrete:
```python
import hashlib
import bisect

class ConsistentHash:
    def __init__(self, nodes=None, virtual_replicas=150):
        self.virtual_replicas = virtual_replicas
        self.ring = []       # sorted list of hash positions
        self.node_map = {}   # hash position → node
        if nodes:
            for node in nodes:
                self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.virtual_replicas):
            vnode = f"{node}#{i}"
            h = self._hash(vnode)
            bisect.insort(self.ring, h)
            self.node_map[h] = node

    def remove_node(self, node):
        for i in range(self.virtual_replicas):
            vnode = f"{node}#{i}"
            h = self._hash(vnode)
            idx = bisect.bisect_left(self.ring, h)
            if idx < len(self.ring) and self.ring[idx] == h:
                del self.ring[idx]
                del self.node_map[h]

    def get_node(self, key):
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect.bisect_left(self.ring, h)
        if idx == len(self.ring):
            idx = 0  # wrap around
        return self.node_map[self.ring[idx]]
```
The key insight: bisect.bisect_left finds the next clockwise node on the ring. The ring wraps around (if idx == len(ring), go back to position 0).
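To tie this back to the cache story, here's a usage sketch with the class above, simulating the 10-node crash scenario from earlier (node and key names are arbitrary):

```python
ch = ConsistentHash(nodes=[f"redis-{i}" for i in range(10)])
keys = [f"profile:{i}" for i in range(100_000)]
before = {k: ch.get_node(k) for k in keys}

ch.remove_node("redis-7")  # simulate the crash
moved = sum(1 for k in keys if ch.get_node(k) != before[k])
print(f"{moved / len(keys):.0%} of keys remapped")  # ~10%; the rest stay warm
```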
How to choose: a quick decision guide
| Your situation | Use consistent hashing? |
|---|---|
| Cache cluster that scales up/down | Yes — prevents cache miss storms |
| Database sharding with changing shard count | Yes — minimizes data migration |
| Kafka partition assignment | Built-in — sticky assignment does this |
| CDN/load balancer routing | Yes — stable session affinity |
| Fixed shard count, range-based | No — simpler approaches work |
| Single-node or active-passive setup | No — overkill |
| Hot key problem | No — need key splitting instead |
The mistake I actually made
Going back to that cache incident: the team knew we needed to scale. We added a node to the Redis cluster. The problem wasn't adding the node — it was that our client library used hash(key) % len(nodes) to decide which Redis instance to talk to.
When we added the 4th node, every client suddenly computed different targets for almost every key. The fix was to switch to a client library that supported consistent hashing (or in our case, we migrated to Redis Cluster which handles this differently with hash slots).
The real lesson: if your system needs to add or remove nodes at runtime, and you're using a hash-based routing strategy, consistent hashing isn't optional. It's the difference between a smooth scale-up and a production incident.
Summary
- Traditional hashing (hash % N) remaps almost all keys when N changes — catastrophic for caches and sharded databases
- Consistent hashing maps keys and nodes onto a ring; only keys near the changed node need to remap
- Virtual nodes fix hash skew by giving each physical node many positions on the ring
- Use it for: distributed caches, database sharding, CDN routing, consumer group rebalancing
- Don't use it for: fixed topologies, hot key problems, or when you need strong consistency guarantees
- Most mature distributed systems (Cassandra, DynamoDB, Memcached, Kafka) use it internally — you're probably already relying on it
If you're building a distributed system where nodes come and go, consistent hashing is one of those patterns that pays for itself the first time you need to scale.
References
- Consistent Hashing and Random Trees (Karger et al., 1997) — the original paper that introduced consistent hashing at MIT
- Dynamo: Amazon's Highly Available Key-value Store — uses consistent hashing with virtual nodes in production
- Cassandra Architecture — Data Distribution — how Cassandra uses consistent hashing with 256 vnodes per node
- Redis Cluster Specification — Redis uses hash slots (16384) as a similar alternative to consistent hashing
- Martin Kleppmann — Designing Data-Intensive Applications — chapter 6 covers partitioning strategies including consistent hashing
- Kafka Consumer Rebalance Protocol — sticky partition assignment follows consistent hashing principles
- AWS ElastiCache for Memcached — Consistent Hashing — how AWS implements it for managed Memcached