Lock-Free Data Structures: CAS-Based Algorithms

Key Insights

  • Lock-free data structures eliminate deadlocks and reduce contention but introduce complexity through retry loops, the ABA problem, and memory reclamation challenges that require careful handling.
  • Compare-And-Swap (CAS) forms the foundation of lock-free programming, but understanding memory ordering semantics is essential to avoid subtle bugs that only manifest under high concurrency.
  • Lock-free structures aren’t universally faster than mutex-based alternatives—they excel in high-contention scenarios with short critical sections but can underperform when contention is low or operations are complex.

The Problem with Locks

Traditional mutex-based synchronization works well until it doesn’t. Deadlocks emerge when multiple threads acquire locks in different orders. Priority inversion occurs when a high-priority thread waits for a low-priority thread holding a lock. Contention bottlenecks appear when many threads compete for the same lock, serializing what should be parallel work.

Lock-free programming offers an alternative. A data structure is lock-free if at least one thread makes progress in a finite number of steps, regardless of what other threads do. This is weaker than wait-free (every thread makes progress) but stronger than obstruction-free (progress only when running in isolation).

Lock-free structures make sense when you have high contention, short critical sections, and can’t tolerate the latency spikes that lock convoys cause. They don’t make sense when operations are complex, contention is low, or you lack the expertise to implement and verify them correctly. The complexity cost is real—bugs in lock-free code often manifest only under specific timing conditions, making them notoriously difficult to debug.

Compare-And-Swap Fundamentals

CAS is an atomic instruction that compares a memory location to an expected value and, only if they match, replaces it with a new value. The entire operation happens atomically—no other thread can observe an intermediate state.

At the hardware level, x86 provides CMPXCHG (with the LOCK prefix), ARM has LDXR/STXR pairs, and other architectures offer similar primitives. These typically achieve atomicity by holding the cache line in exclusive ownership for the duration of the operation; on x86 a locked instruction also acts as a full memory barrier, while ARM's load-linked/store-conditional pairs allow weaker orderings to be chosen.

Memory ordering matters critically. C++ provides several options through std::memory_order:

#include <atomic>

template<typename T>
class AtomicCounter {
    std::atomic<T> value_{0};
    
public:
    T increment() {
        T expected = value_.load(std::memory_order_relaxed);
        while (!value_.compare_exchange_weak(
            expected,
            expected + 1,
            std::memory_order_release,  // success ordering
            std::memory_order_relaxed   // failure ordering
        )) {
            // expected is updated to current value on failure
            // retry with new expected value
        }
        return expected + 1;
    }
    
    T get() const {
        return value_.load(std::memory_order_acquire);
    }
};

The compare_exchange_weak variant can spuriously fail even when the comparison should succeed, but it’s faster on some architectures. Use compare_exchange_strong when you’re not already in a retry loop. The pattern above—load, compute, CAS, retry on failure—is the fundamental building block of lock-free algorithms.

Building a Lock-Free Stack (Treiber Stack)

The Treiber stack, published in 1986, remains the canonical example of a lock-free data structure. It’s a singly-linked list where push and pop operations modify only the head pointer using CAS.

#include <atomic>
#include <optional>

template<typename T>
class TreiberStack {
    struct Node {
        T data;
        Node* next;
        
        explicit Node(T value) : data(std::move(value)), next(nullptr) {}
    };
    
    std::atomic<Node*> head_{nullptr};
    
public:
    void push(T value) {
        Node* new_node = new Node(std::move(value));
        new_node->next = head_.load(std::memory_order_relaxed);
        
        while (!head_.compare_exchange_weak(
            new_node->next,
            new_node,
            std::memory_order_release,
            std::memory_order_relaxed
        )) {
            // new_node->next is updated to current head on failure
            // loop retries with correct next pointer
        }
    }
    
    std::optional<T> pop() {
        Node* old_head = head_.load(std::memory_order_acquire);
        
        while (old_head != nullptr) {
            Node* new_head = old_head->next;
            
            if (head_.compare_exchange_weak(
                old_head,
                new_head,
                std::memory_order_acquire,
                std::memory_order_relaxed
            )) {
                T value = std::move(old_head->data);
                // WARNING: Can't safely delete old_head yet. Another thread
                // may still be reading it, and reusing its address causes ABA
                // delete old_head;
                return value;
            }
            // old_head is updated to current head on failure
        }
        
        return std::nullopt;
    }
};

Notice the commented-out delete. We’ll address why this is dangerous in the ABA problem section. The key insight is that both operations only modify the head pointer atomically—there’s no window where the structure is in an inconsistent state.

Lock-Free Queue (Michael-Scott Queue)

A lock-free queue is more complex because it has two ends. The Michael-Scott queue uses separate head and tail pointers, with a sentinel node to simplify edge cases.

template<typename T>
class MSQueue {
    struct Node {
        std::atomic<T*> data;
        std::atomic<Node*> next;
        
        Node() : data(nullptr), next(nullptr) {}
    };
    
    std::atomic<Node*> head_;
    std::atomic<Node*> tail_;
    
public:
    MSQueue() {
        Node* sentinel = new Node();
        head_.store(sentinel, std::memory_order_relaxed);
        tail_.store(sentinel, std::memory_order_relaxed);
    }
    
    void enqueue(T value) {
        Node* new_node = new Node();
        T* new_data = new T(std::move(value));
        new_node->data.store(new_data, std::memory_order_relaxed);
        
        while (true) {
            Node* tail = tail_.load(std::memory_order_acquire);
            Node* next = tail->next.load(std::memory_order_acquire);
            
            if (tail == tail_.load(std::memory_order_acquire)) {
                if (next == nullptr) {
                    // Tail is pointing to last node, try to link new node
                    if (tail->next.compare_exchange_weak(
                        next, new_node,
                        std::memory_order_release,
                        std::memory_order_relaxed
                    )) {
                        // Try to swing tail to new node (may fail, that's OK)
                        tail_.compare_exchange_strong(
                            tail, new_node,
                            std::memory_order_release,
                            std::memory_order_relaxed
                        );
                        return;
                    }
                } else {
                    // Tail is lagging, help advance it
                    tail_.compare_exchange_weak(
                        tail, next,
                        std::memory_order_release,
                        std::memory_order_relaxed
                    );
                }
            }
        }
    }
    
    std::optional<T> dequeue() {
        while (true) {
            Node* head = head_.load(std::memory_order_acquire);
            Node* tail = tail_.load(std::memory_order_acquire);
            Node* next = head->next.load(std::memory_order_acquire);
            
            if (head == head_.load(std::memory_order_acquire)) {
                if (head == tail) {
                    if (next == nullptr) {
                        return std::nullopt;  // Queue is empty
                    }
                    // Tail is lagging, help advance it
                    tail_.compare_exchange_weak(
                        tail, next,
                        std::memory_order_release,
                        std::memory_order_relaxed
                    );
                } else {
                    T* data = next->data.load(std::memory_order_relaxed);
                    if (head_.compare_exchange_weak(
                        head, next,
                        std::memory_order_release,
                        std::memory_order_relaxed
                    )) {
                        T value = std::move(*data);
                        delete data;
                        // WARNING: Can't safely delete head yet
                        return value;
                    }
                }
            }
        }
    }
};

The “helping” mechanism is crucial: if a thread sees that the tail pointer is lagging (there’s a node after tail), it helps advance the tail before proceeding. This ensures the queue remains in a consistent state even if a thread stalls mid-operation.

The ABA Problem

Here’s the subtle bug lurking in our stack implementation. Consider this sequence:

  1. Thread A reads head = X, prepares to pop
  2. Thread A is preempted
  3. Thread B pops X, pops Y, pushes X back (X is recycled)
  4. Thread A resumes, CAS succeeds (head is still X), but X->next is now garbage

The pointer value matched, but the node’s contents changed. This is the ABA problem.

Tagged pointers solve this by packing a version counter with the pointer:

#include <atomic>
#include <cstdint>

template<typename T>
class TaggedPointer {
    // Pack pointer (48 bits on x86-64) with counter (16 bits)
    std::atomic<uint64_t> packed_;
    
    static constexpr uint64_t PTR_MASK = 0x0000FFFFFFFFFFFF;
    static constexpr uint64_t TAG_MASK = 0xFFFF000000000000;
    static constexpr int TAG_SHIFT = 48;
    
public:
    struct Snapshot {
        T* ptr;
        uint16_t tag;
        
        uint64_t pack() const {
            return (static_cast<uint64_t>(tag) << TAG_SHIFT) |
                   (reinterpret_cast<uint64_t>(ptr) & PTR_MASK);
        }
    };
    
    Snapshot load(std::memory_order order = std::memory_order_seq_cst) const {
        uint64_t value = packed_.load(order);
        return {
            reinterpret_cast<T*>(value & PTR_MASK),
            static_cast<uint16_t>((value & TAG_MASK) >> TAG_SHIFT)
        };
    }
    
    bool compare_exchange_weak(
        Snapshot& expected,
        T* desired,
        std::memory_order success = std::memory_order_seq_cst,
        std::memory_order failure = std::memory_order_seq_cst
    ) {
        uint64_t expected_packed = expected.pack();
        uint64_t desired_packed = 
            (static_cast<uint64_t>(expected.tag + 1) << TAG_SHIFT) |
            (reinterpret_cast<uint64_t>(desired) & PTR_MASK);
        
        if (packed_.compare_exchange_weak(expected_packed, desired_packed, 
                                          success, failure)) {
            return true;
        }
        
        expected = {
            reinterpret_cast<T*>(expected_packed & PTR_MASK),
            static_cast<uint16_t>((expected_packed & TAG_MASK) >> TAG_SHIFT)
        };
        return false;
    }
};

Every successful CAS increments the tag, so even if the pointer value recycles, the tag won’t match. This works on x86-64 where only 48 bits are used for virtual addresses, giving us 16 bits for the counter.

Memory Reclamation Strategies

Tagged pointers prevent ABA but don’t solve memory reclamation. Epoch-based reclamation (EBR) offers a practical solution:

#include <atomic>
#include <cstdint>
#include <functional>
#include <vector>

class EpochManager {
    std::atomic<uint64_t> global_epoch_{0};
    static thread_local uint64_t local_epoch_;
    static thread_local bool in_critical_section_;
    static thread_local std::vector<std::function<void()>> retire_lists_[3];
    
public:
    void enter_critical_section() {
        local_epoch_ = global_epoch_.load(std::memory_order_acquire);
        in_critical_section_ = true;
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }
    
    void exit_critical_section() {
        std::atomic_thread_fence(std::memory_order_release);
        in_critical_section_ = false;
    }
    
    // Retired nodes carry their own deleter, so the manager
    // doesn't need to know their concrete type
    void retire(std::function<void()> deleter) {
        retire_lists_[local_epoch_ % 3].push_back(std::move(deleter));
    }
    
    void try_reclaim() {
        uint64_t epoch = global_epoch_.load(std::memory_order_relaxed);
        // If all threads have moved past epoch-2, reclaim that list
        // (Simplified: real implementation tracks per-thread epochs)
        for (auto& deleter : retire_lists_[(epoch + 1) % 3]) {
            deleter();
        }
        retire_lists_[(epoch + 1) % 3].clear();
        global_epoch_.fetch_add(1, std::memory_order_release);
    }
};

The idea: nodes aren’t freed immediately but “retired” to a list. Once all threads have advanced past the epoch when the node was retired, it’s safe to reclaim. This amortizes reclamation cost and avoids the complexity of hazard pointers.

Performance Considerations and Tradeoffs

Lock-free structures shine under specific conditions: high thread counts, short operations, and when you can’t tolerate worst-case latency from lock convoys. But they’re not magic.

Cache line contention kills performance. When multiple threads CAS the same atomic variable, they bounce the cache line between cores. Padding structures to cache line boundaries helps:

struct alignas(64) PaddedCounter {
    std::atomic<uint64_t> value;
    char padding[64 - sizeof(std::atomic<uint64_t>)];
};

Benchmark before committing. In low-contention scenarios, a simple std::mutex often outperforms lock-free alternatives because the uncontended mutex path is highly optimized (often just a single atomic operation). Lock-free structures pay their complexity tax on every operation; mutexes only pay under contention.

Consider using proven implementations: folly::MPMCQueue, boost::lockfree::queue, or Intel TBB’s concurrent containers. These have been battle-tested and handle edge cases you haven’t thought of yet. Roll your own only when you have specific requirements these can’t meet—and even then, invest heavily in testing with tools like ThreadSanitizer.
