Concurrent Queue: Lock-Free MPMC Queue


Key Insights

  • Lock-free MPMC queues eliminate mutex contention by using atomic compare-and-swap operations, enabling multiple threads to enqueue and dequeue simultaneously without blocking each other.
  • A bounded ring buffer with per-slot sequence numbers provides the foundation for a practical lock-free design, allowing producers and consumers to coordinate through atomic state transitions rather than locks.
  • Lock-free doesn’t mean wait-free—threads may spin under contention, and cache line padding is essential to prevent false sharing from destroying your performance gains.

Introduction to MPMC Queues

Multi-Producer Multi-Consumer (MPMC) queues are fundamental building blocks in concurrent systems. Thread pools use them to distribute work. Event systems route messages through them. Logging frameworks buffer writes before flushing to disk. Any time multiple threads need to safely exchange data, a concurrent queue sits at the center.

The traditional approach uses a mutex to protect the queue’s internal state. One thread acquires the lock, does its work, and releases it. Simple and correct, but under high contention, threads spend more time waiting for locks than doing actual work.

Lock-free queues take a different approach. Instead of mutual exclusion, they use atomic operations to coordinate access. Multiple threads can make progress simultaneously, and no thread can block another indefinitely. The result is dramatically better throughput under contention—exactly the scenario where you need performance most.

The Problem with Locking

Consider a basic thread-safe queue using a mutex:

template <typename T>
class MutexQueue {
    std::queue<T> queue_;
    std::mutex mutex_;
    
public:
    void enqueue(T value) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(value));
    }
    
    std::optional<T> dequeue() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }
};

This works correctly but has serious problems at scale. When eight threads hammer this queue, seven are always waiting. The mutex becomes a serialization point—you’ve effectively single-threaded your concurrent system.

Worse, operating system scheduling introduces unpredictability. A thread holding the lock might get preempted, blocking all other threads until it resumes. Priority inversion can occur when a high-priority thread waits on a lock held by a low-priority thread. These issues compound under load, creating latency spikes that are difficult to diagnose and hard to eliminate as long as a mutex sits on the hot path.

Core Concepts: CAS and Memory Ordering

Lock-free programming relies on Compare-And-Swap (CAS), an atomic operation that updates a value only if it matches an expected value. If another thread modified the value first, CAS fails, and you retry with the new value.

template <typename T>
bool compare_and_swap(std::atomic<T>& target, T& expected, T desired) {
    return target.compare_exchange_weak(
        expected, 
        desired,
        std::memory_order_acq_rel,  // success ordering
        std::memory_order_acquire   // failure ordering
    );
}

// Typical CAS loop pattern
void atomic_increment(std::atomic<int>& counter) {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(
            expected, 
            expected + 1,
            std::memory_order_release,
            std::memory_order_relaxed)) {
        // expected is updated with current value on failure
        // loop retries with new expected value
    }
}

Memory ordering determines what other threads observe. memory_order_acquire ensures subsequent reads see all writes that happened before a corresponding release. memory_order_release ensures prior writes are visible to threads that acquire. Getting this wrong leads to data races that manifest only under specific timing conditions—the worst kind of bug.

The ABA problem is a classic pitfall: thread A reads value X, gets preempted, thread B changes X to Y then back to X, thread A resumes and CAS succeeds even though the state changed. Bounded queues with sequence numbers sidestep this issue elegantly.

Bounded Ring Buffer Design

The most practical lock-free MPMC queue uses a bounded ring buffer. Fixed size eliminates memory allocation in the hot path and naturally handles the ABA problem through sequence numbers.

template <typename T, size_t Capacity>
class LockFreeMPMCQueue {
    static_assert((Capacity & (Capacity - 1)) == 0, 
                  "Capacity must be power of 2");
    
    struct Slot {
        std::atomic<size_t> sequence;
        // Raw, suitably aligned bytes; T is constructed in place by enqueue.
        // (std::aligned_storage works too, but is deprecated since C++23.)
        alignas(T) unsigned char storage[sizeof(T)];
        
        T* ptr() { return reinterpret_cast<T*>(storage); }
    };
    
    static constexpr size_t CACHE_LINE = 64;
    
    alignas(CACHE_LINE) std::atomic<size_t> head_{0};
    alignas(CACHE_LINE) std::atomic<size_t> tail_{0};
    alignas(CACHE_LINE) Slot slots_[Capacity];
    
public:
    LockFreeMPMCQueue() {
        for (size_t i = 0; i < Capacity; ++i) {
            slots_[i].sequence.store(i, std::memory_order_relaxed);
        }
    }
    
    ~LockFreeMPMCQueue() {
        // Clean up any remaining elements
        T value;
        while (dequeue(value)) {}
    }
    
    // Declared here; the definitions appear out of line below
    bool enqueue(const T& value);
    bool dequeue(T& value);
    
    static constexpr size_t capacity() { return Capacity; }
};

Each slot has a sequence number indicating its state. When sequence == position, the slot is ready for a producer. When sequence == position + 1, it contains data ready for a consumer. This state machine coordinates access without locks.

The power-of-two capacity enables fast modulo operations using bitwise AND. The alignas(CACHE_LINE) declarations prevent false sharing between head and tail—we’ll discuss why this matters shortly.

Implementation: Enqueue and Dequeue Operations

The enqueue operation claims a slot by atomically advancing the tail, then writes data and updates the sequence:

template <typename T, size_t Capacity>
bool LockFreeMPMCQueue<T, Capacity>::enqueue(const T& value) {
    size_t pos = tail_.load(std::memory_order_relaxed);
    
    for (;;) {
        Slot& slot = slots_[pos & (Capacity - 1)];
        size_t seq = slot.sequence.load(std::memory_order_acquire);
        intptr_t diff = static_cast<intptr_t>(seq) - 
                        static_cast<intptr_t>(pos);
        
        if (diff == 0) {
            // Slot is ready for writing
            if (tail_.compare_exchange_weak(
                    pos, pos + 1, 
                    std::memory_order_relaxed)) {
                // We claimed this slot
                new (slot.ptr()) T(value);
                slot.sequence.store(pos + 1, std::memory_order_release);
                return true;
            }
            // CAS failed, another producer claimed it, retry
        } else if (diff < 0) {
            // Queue is full
            return false;
        } else {
            // Slot was claimed by another producer, reload tail
            pos = tail_.load(std::memory_order_relaxed);
        }
    }
}

template <typename T, size_t Capacity>
bool LockFreeMPMCQueue<T, Capacity>::dequeue(T& value) {
    size_t pos = head_.load(std::memory_order_relaxed);
    
    for (;;) {
        Slot& slot = slots_[pos & (Capacity - 1)];
        size_t seq = slot.sequence.load(std::memory_order_acquire);
        intptr_t diff = static_cast<intptr_t>(seq) - 
                        static_cast<intptr_t>(pos + 1);
        
        if (diff == 0) {
            // Slot contains data ready for reading
            if (head_.compare_exchange_weak(
                    pos, pos + 1,
                    std::memory_order_relaxed)) {
                // We claimed this slot
                value = std::move(*slot.ptr());
                slot.ptr()->~T();
                slot.sequence.store(
                    pos + Capacity, 
                    std::memory_order_release);
                return true;
            }
            // CAS failed, another consumer claimed it, retry
        } else if (diff < 0) {
            // Queue is empty
            return false;
        } else {
            // Slot was claimed by another consumer, reload head
            pos = head_.load(std::memory_order_relaxed);
        }
    }
}

The key insight is the sequence number dance. After a consumer finishes, it sets the sequence to pos + Capacity, which will equal the position when a producer wraps around to reuse the slot. This creates a clean handoff without any possibility of ABA issues.

Pitfalls and Edge Cases

False sharing is the silent performance killer. When head and tail share a cache line, every update invalidates the cache on all cores, forcing expensive memory fetches:

// BAD: head and tail on same cache line
struct BadQueue {
    std::atomic<size_t> head;  // These are likely adjacent
    std::atomic<size_t> tail;  // in memory
};

// GOOD: 64-byte alignment keeps head and tail on separate cache lines
struct GoodQueue {
    alignas(64) std::atomic<size_t> head;
    alignas(64) std::atomic<size_t> tail;
    // alignas(64) on tail already forces 64 bytes of padding after head,
    // so explicit pad arrays are unnecessary
};

compare_exchange_weak can fail spuriously even when the comparison should succeed. This is intentional—it allows more efficient implementation on some architectures. Always use it in a loop.

Starvation is possible under extreme contention. A thread might repeatedly lose CAS races and never make progress. In practice, this is rarely a problem because CAS failures mean other threads are succeeding—work is getting done.

Memory reclamation in unbounded lock-free structures is notoriously difficult (see hazard pointers, epoch-based reclamation). Bounded queues avoid this entirely—another reason to prefer them.

Benchmarks and When to Use

Here’s a simple benchmark comparing queue implementations:

void benchmark_queue(auto& queue, int producers, int consumers, 
                     int ops_per_thread) {
    // Assumes producers == consumers; otherwise total items produced and
    // consumed don't match and the consumer threads never finish.
    std::atomic<bool> start{false};
    std::vector<std::thread> threads;
    
    for (int i = 0; i < producers; ++i) {
        threads.emplace_back([&, i] {
            while (!start.load()) {}  // spin until all threads are ready
            for (int j = 0; j < ops_per_thread; ++j) {
                while (!queue.enqueue(i * ops_per_thread + j)) {
                    std::this_thread::yield();  // queue full, back off
                }
            }
        });
    }
    
    for (int i = 0; i < consumers; ++i) {
        threads.emplace_back([&] {
            while (!start.load()) {}
            int value;
            int consumed = 0;
            while (consumed < ops_per_thread) {
                if (queue.dequeue(value)) consumed++;
                else std::this_thread::yield();  // queue empty, back off
            }
        });
    }
    
    auto begin = std::chrono::high_resolution_clock::now();
    start.store(true);
    for (auto& t : threads) t.join();
    auto end = std::chrono::high_resolution_clock::now();
    
    auto ms = std::chrono::duration_cast<
        std::chrono::milliseconds>(end - begin).count();
    if (ms == 0) ms = 1;  // guard against divide-by-zero on tiny runs
    std::cout << "Throughput: " 
              << (static_cast<long long>(producers) * ops_per_thread * 1000 / ms) 
              << " ops/sec\n";
}

Typical results with 8 producers and 8 consumers show lock-free queues achieving 3-10x higher throughput than mutex-based alternatives. The gap widens with more threads.

Use lock-free MPMC when: You have multiple producers AND multiple consumers, contention is high, and you can tolerate bounded capacity.

Use mutex-based queues when: Simplicity matters more than peak performance, contention is low, or you need unbounded capacity.

Use SPSC (single-producer single-consumer) queues when: Your architecture allows it—they’re simpler and faster than MPMC.

Lock-free programming is powerful but demanding. Get the fundamentals right, benchmark under realistic conditions, and don’t optimize prematurely. A well-tuned lock-free MPMC queue can transform your system’s throughput characteristics—but only if contention was your actual bottleneck.
