Concurrent Hash Map: Sharded Lock Design

Key Insights

  • Sharded locking reduces contention by dividing a hash map into independent segments, each with its own mutex, allowing N concurrent writers instead of one
  • The optimal shard count typically ranges from 2x to 4x your expected thread count, balancing contention reduction against memory overhead and cache efficiency
  • Cross-shard operations like iteration and size tracking require careful design—either accept approximate results or implement strict lock ordering to prevent deadlocks

The Problem with Single-Lock Hash Maps

When you wrap a standard hash map with a single mutex, you create a serialization point that destroys concurrent performance. Every read and every write must acquire the same lock, meaning your 32-core server effectively becomes single-threaded for map operations.

Consider a caching layer handling 100,000 requests per second. If each request needs a cache lookup, and your lock acquisition takes just 100 nanoseconds under contention, you’ve already burned 10 milliseconds per second just waiting for locks. As thread count increases, contention grows non-linearly—threads spend more time spinning or sleeping than doing actual work.

The fundamental issue is that a single lock protects data that doesn’t need unified protection. Keys “foo” and “bar” are independent. There’s no reason a thread writing to “foo” should block a thread reading “bar.” Sharded locking exploits this independence.
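For contrast, here is a minimal sketch of the single-lock baseline we're improving on (the benchmark section later constructs it via NewSingleLockMap; this sketch is one plausible shape for it):

```go
import "sync"

// SingleLockMap is the baseline: one RWMutex guarding the entire map.
// Every operation, on any key, serializes on this single lock.
type SingleLockMap[K comparable, V any] struct {
	mu    sync.RWMutex
	items map[K]V
}

func NewSingleLockMap[K comparable, V any]() *SingleLockMap[K, V] {
	return &SingleLockMap[K, V]{items: make(map[K]V)}
}

func (m *SingleLockMap[K, V]) Get(key K) (V, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	v, ok := m.items[key]
	return v, ok
}

func (m *SingleLockMap[K, V]) Put(key K, value V) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.items[key] = value
}
```

Even with the read/write split of RWMutex, every writer still excludes all readers across the whole map, which is exactly the serialization point sharding removes.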

Sharded Lock Architecture Overview

The sharded lock design partitions your hash map into N independent segments, each functioning as a complete mini-hash-map with its own mutex. When a thread needs to access a key, it hashes the key, uses modulo arithmetic to select a shard, and only locks that specific shard.

If you have 16 shards and 16 threads accessing uniformly distributed keys, the probability of any two threads contending drops dramatically compared to a single lock. You’ve transformed one hot lock into 16 cooler ones.

package shardmap

import (
    "hash/fnv"
    "sync"
)

const DefaultShardCount = 32

type Shard[K comparable, V any] struct {
    mu    sync.RWMutex
    items map[K]V
}

type ShardedMap[K comparable, V any] struct {
    shards     []*Shard[K, V]
    shardCount uint64
    hashFunc   func(K) uint64
}

func New[K comparable, V any](shardCount int, hashFunc func(K) uint64) *ShardedMap[K, V] {
    if shardCount <= 0 {
        shardCount = DefaultShardCount
    }
    
    shards := make([]*Shard[K, V], shardCount)
    for i := range shards {
        shards[i] = &Shard[K, V]{
            items: make(map[K]V),
        }
    }
    
    return &ShardedMap[K, V]{
        shards:     shards,
        shardCount: uint64(shardCount),
        hashFunc:   hashFunc,
    }
}

func (m *ShardedMap[K, V]) getShard(key K) *Shard[K, V] {
    hash := m.hashFunc(key)
    return m.shards[hash%m.shardCount]
}

// Helper for string keys
func StringHashFunc(key string) uint64 {
    h := fnv.New64a()
    h.Write([]byte(key))
    return h.Sum64()
}

Notice the use of sync.RWMutex instead of a plain sync.Mutex. This allows concurrent reads within the same shard—a significant optimization for read-heavy workloads. Each shard sits behind a pointer, so shard structs are separate heap allocations rather than adjacent elements of one array; this makes false sharing between neighboring shards' mutexes less likely, though Go gives no cache-line alignment guarantee, so it is a mitigation rather than a cure.

Implementing Core Operations

The beauty of sharded maps lies in operation simplicity. Each method follows the same pattern: compute shard, acquire lock, perform operation, release lock.

func (m *ShardedMap[K, V]) Get(key K) (V, bool) {
    shard := m.getShard(key)
    shard.mu.RLock()
    val, ok := shard.items[key]
    shard.mu.RUnlock()
    return val, ok
}

func (m *ShardedMap[K, V]) Put(key K, value V) {
    shard := m.getShard(key)
    shard.mu.Lock()
    shard.items[key] = value
    shard.mu.Unlock()
}

func (m *ShardedMap[K, V]) Delete(key K) bool {
    shard := m.getShard(key)
    shard.mu.Lock()
    _, existed := shard.items[key]
    delete(shard.items, key)
    shard.mu.Unlock()
    return existed
}

func (m *ShardedMap[K, V]) GetOrPut(key K, value V) (V, bool) {
    shard := m.getShard(key)
    
    // Try read lock first (optimistic)
    shard.mu.RLock()
    if val, ok := shard.items[key]; ok {
        shard.mu.RUnlock()
        return val, true
    }
    shard.mu.RUnlock()
    
    // Go's RWMutex cannot upgrade a read lock in place: release it,
    // then acquire the write lock (another goroutine may insert in between)
    shard.mu.Lock()
    defer shard.mu.Unlock()
    
    // Double-check after acquiring write lock
    if val, ok := shard.items[key]; ok {
        return val, true
    }
    
    shard.items[key] = value
    return value, false
}

The GetOrPut method demonstrates a common pattern: optimistic read followed by conditional write. This avoids write locks when the key already exists, which matters in caching scenarios where most calls hit existing entries.

Choosing the Right Shard Count

Shard count selection involves balancing three concerns: lock contention, memory overhead, and cache behavior.

Too few shards means threads still compete frequently. With 4 shards and 32 threads, you average 8 threads per shard—better than 32 on one lock, but still significant contention.

Too many shards wastes memory on lock structures and map overhead. Each shard carries the fixed cost of a mutex (typically 8-24 bytes depending on implementation) plus the hash map’s internal bookkeeping. With 1,000 shards holding 100 keys each, you’re paying substantial overhead.

Cache effects matter too. If shards are too numerous and sparsely populated, you get poor cache locality. If they’re too few and densely packed, you might still benefit from the reduced working set per operation.

My rule of thumb: start with 2-4x your expected concurrent thread count, rounded up to a power of two. Powers of two allow bitwise AND instead of modulo for shard selection, which is marginally faster:

// Fast shard selection when shardCount is power of 2
func (m *ShardedMap[K, V]) getShardFast(key K) *Shard[K, V] {
    hash := m.hashFunc(key)
    return m.shards[hash&(m.shardCount-1)] // Bitwise AND instead of modulo
}

For a typical web server with 16-32 request-handling goroutines, 32-64 shards works well. Measure under realistic load before optimizing further.
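Rather than trusting callers to pass a power of two, the constructor could round up itself. A small helper (hypothetical; nextPowerOfTwo is not part of the code above):

```go
// nextPowerOfTwo rounds n up to the nearest power of two, so that
// shard selection can safely use the bitwise-AND fast path.
func nextPowerOfTwo(n int) int {
	p := 1
	for p < n {
		p <<= 1
	}
	return p
}
```

Calling New with nextPowerOfTwo(4 * runtime.NumCPU()) then keeps the 2-4x rule of thumb while guaranteeing the bitwise fast path is valid.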

Handling Cross-Shard Operations

Single-key operations are straightforward. Cross-shard operations—Size(), Keys(), Clear()—require more thought.

The naive approach locks all shards:

func (m *ShardedMap[K, V]) Size() int {
    total := 0
    for _, shard := range m.shards {
        shard.mu.RLock()
        total += len(shard.items)
        shard.mu.RUnlock()
    }
    return total
}

This is technically safe, but it yields an inconsistent point-in-time count: by the time you lock the last shard, other threads may already have modified the shards you counted first. For many applications, this approximation is acceptable.

If you need a consistent snapshot, you must lock all shards simultaneously:

func (m *ShardedMap[K, V]) ConsistentSnapshot() map[K]V {
    // Lock all shards in order (prevents deadlock)
    for _, shard := range m.shards {
        shard.mu.RLock()
    }
    
    // Copy all data
    result := make(map[K]V)
    for _, shard := range m.shards {
        for k, v := range shard.items {
            result[k] = v
        }
    }
    
    // Unlock in reverse order (convention, not strictly necessary)
    for i := len(m.shards) - 1; i >= 0; i-- {
        m.shards[i].mu.RUnlock()
    }
    
    return result
}

Lock ordering is critical. If thread A locks shards [0, 1, 2] while thread B locks [2, 1, 0], deadlock becomes possible. Always acquire shard locks in index order.

For frequently needed size tracking, consider an atomic counter updated on each Put/Delete:

type ShardedMapWithSize[K comparable, V any] struct {
    *ShardedMap[K, V]
    size atomic.Int64
}

func (m *ShardedMapWithSize[K, V]) Put(key K, value V) {
    shard := m.getShard(key)
    shard.mu.Lock()
    _, existed := shard.items[key]
    shard.items[key] = value
    shard.mu.Unlock()
    
    if !existed {
        m.size.Add(1)
    }
}

Performance Comparison and Benchmarks

Here’s a simple benchmark comparing single-lock, sharded, and sync.Map implementations:

func BenchmarkMaps(b *testing.B) {
    threadCounts := []int{1, 4, 8, 16, 32}
    
    for _, threads := range threadCounts {
        b.Run(fmt.Sprintf("SingleLock-%d", threads), func(b *testing.B) {
            m := NewSingleLockMap[string, int]()
            benchmarkMap(b, m, threads)
        })
        
        b.Run(fmt.Sprintf("Sharded32-%d", threads), func(b *testing.B) {
            m := New[string, int](32, StringHashFunc)
            benchmarkMap(b, m, threads)
        })
        
        b.Run(fmt.Sprintf("SyncMap-%d", threads), func(b *testing.B) {
            m := &sync.Map{}
            benchmarkSyncMap(b, m, threads)
        })
    }
}

func benchmarkMap(b *testing.B, m MapInterface[string, int], threads int) {
    var wg sync.WaitGroup
    opsPerThread := b.N / threads
    
    b.ResetTimer()
    for t := 0; t < threads; t++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for i := 0; i < opsPerThread; i++ {
                key := fmt.Sprintf("key-%d-%d", id, i%1000)
                if i%10 == 0 {
                    m.Put(key, i)
                } else {
                    m.Get(key)
                }
            }
        }(t)
    }
    wg.Wait()
}

Typical results show sharded maps scaling nearly linearly up to the shard count, while single-lock maps plateau quickly. At 32 threads with 32 shards, expect 10-20x throughput improvement over a single mutex for mixed read-write workloads.

When to Use (and When Not To)

Use sharded maps when:

  • You have high concurrent access (8+ threads regularly accessing the map)
  • Keys are well-distributed (poor hash functions defeat sharding)
  • Operations are predominantly single-key
  • You can tolerate approximate cross-shard operations

Consider alternatives when:

  • Read-heavy workloads dominate: A single sync.RWMutex might suffice
  • You need lock-free guarantees: Look at lock-free hash maps or sync.Map
  • Your language provides optimized concurrent maps: Java’s ConcurrentHashMap, .NET’s ConcurrentDictionary
  • Cross-shard consistency is critical: You’ll pay the cost of global locking anyway

Go’s sync.Map is optimized for two patterns: write-once-read-many and disjoint key sets per goroutine. For general-purpose concurrent caching with mixed reads and writes, a well-tuned sharded map typically outperforms it.

Sharded locking isn’t clever—it’s a straightforward application of reducing contention through partitioning. That simplicity is its strength. The code is easy to understand, debug, and maintain. When you need concurrent map access and can’t use a language-provided solution, sharded locking should be your default choice.
