Compare-and-Swap: Lock-Free Primitive


Key Insights

  • Compare-and-swap (CAS) is a hardware-supported atomic primitive that enables lock-free programming by allowing threads to update shared state without blocking, but it requires careful handling of retry loops and the ABA problem.
  • CAS outperforms traditional locks under low contention but degrades under high contention—choosing between them depends on your specific workload characteristics and critical section complexity.
  • Modern languages provide CAS through standardized APIs (std::atomic, AtomicReference, compare_exchange), but building correct lock-free data structures remains challenging and should only be attempted when proven necessary.

What is Compare-and-Swap (CAS)?

Compare-and-swap is an atomic CPU instruction that performs three operations as a single, indivisible unit: read a memory location, compare it against an expected value, and write a new value only if the comparison succeeds. The operation returns whether the swap occurred.

This atomicity guarantee is the foundation of lock-free programming. Without it, a read-modify-write sequence could be interrupted by another thread, leading to lost updates or corrupted state. CAS gives you a building block for coordinating concurrent access without blocking threads.

Here’s the conceptual operation:

function compare_and_swap(address, expected, desired):
    atomic:
        current = *address
        if current == expected:
            *address = desired
            return true
        else:
            return false

The critical insight is that CAS can fail. When another thread modifies the value between your read and your CAS attempt, the comparison fails. Your code must handle this—typically by retrying in a loop. This retry pattern is fundamental to all CAS-based algorithms.
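The retry pattern applies to any read-modify-write, not just counters. As a sketch, here is a hypothetical `atomic_store_max` helper (the name and use case are illustrative, not from a standard library) that keeps retrying until its candidate is installed or no longer needed:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the CAS retry pattern: atomically raise `slot` to `candidate`
// if the candidate is larger than the current value.
inline void atomic_store_max(std::atomic<int64_t>& slot, int64_t candidate) {
    int64_t current = slot.load(std::memory_order_relaxed);
    // On CAS failure, `current` is refreshed with the value another thread
    // wrote, and the loop re-checks whether the update is still needed.
    while (candidate > current &&
           !slot.compare_exchange_weak(current, candidate,
                                       std::memory_order_relaxed)) {
    }
}
```

Note that a failed CAS is not wasted work here: the refreshed `current` may already exceed `candidate`, in which case the loop exits without writing at all.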

Hardware Support and Memory Semantics

CAS isn’t a software abstraction—it’s a CPU instruction. On x86/x64, the CMPXCHG instruction performs compare-and-swap, with the LOCK prefix guaranteeing atomicity across cores (on modern CPUs this locks the affected cache line rather than the whole memory bus). ARM uses a different approach: load-linked/store-conditional pairs (LDXR/STXR) that detect interference from other cores rather than excluding it up front.

Here’s what CAS looks like at the instruction level:

// x86-64 inline assembly for CAS
static inline bool cas_x86(uint64_t* ptr, uint64_t expected, uint64_t desired) {
    uint64_t prev;
    __asm__ __volatile__(
        "lock cmpxchgq %2, %1"
        : "=a"(prev), "+m"(*ptr)
        : "r"(desired), "0"(expected)
        : "memory"
    );
    return prev == expected;
}

// ARM64 equivalent using load-exclusive/store-exclusive
static inline bool cas_arm64(uint64_t* ptr, uint64_t expected, uint64_t desired) {
    uint64_t prev;
    uint32_t success;
    __asm__ __volatile__(
        "1: ldxr %0, [%2]\n"
        "   cmp %0, %3\n"
        "   b.ne 2f\n"
        "   stxr %w1, %4, [%2]\n"
        "   cbnz %w1, 1b\n"
        "2:"
        : "=&r"(prev), "=&r"(success)
        : "r"(ptr), "r"(expected), "r"(desired)
        : "memory", "cc"
    );
    return prev == expected;
}

Memory ordering matters. A CAS operation typically implies a full memory barrier, ensuring that memory operations before the CAS are visible to other threads after a successful swap. Most language-level APIs let you specify weaker orderings (memory_order_relaxed, memory_order_acquire, memory_order_release) when you don’t need full sequential consistency—but getting this wrong causes subtle bugs that only manifest under specific timing conditions.

CAS vs Traditional Locks

Mutexes block threads. When a thread can’t acquire a lock, the OS deschedules it, context-switches to another thread, and later wakes the blocked thread. This works well for long critical sections but adds significant overhead for short ones.

CAS-based algorithms don’t block. A thread that fails a CAS simply retries, staying on the CPU. This eliminates context-switch overhead and avoids priority inversion (where a high-priority thread waits for a low-priority thread holding a lock).

The trade-offs:

Factor                  Mutex                 CAS
----------------------  --------------------  ----------------
Low contention          Good                  Excellent
High contention         Degrades gracefully   Spins wastefully
Critical section size   Any                   Must be tiny
Complexity              Simple                Error-prone
Priority inversion      Possible              Impossible
Fairness                Configurable          None

Under high contention, CAS loops burn CPU cycles retrying failed operations. Mutexes handle this better by putting waiting threads to sleep. Hybrid approaches exist—spinning briefly before blocking—but add complexity.
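One hedged sketch of such a hybrid: spin on the CAS a bounded number of times, then yield to the scheduler instead of burning further cycles. The spin threshold below is an arbitrary placeholder, not a tuned value:

```cpp
#include <atomic>
#include <thread>

// Illustrative spin-then-yield acquisition of a boolean flag.
inline void hybrid_lock(std::atomic<bool>& flag) {
    int spins = 0;
    bool expected = false;
    while (!flag.compare_exchange_weak(expected, true,
                                       std::memory_order_acquire,
                                       std::memory_order_relaxed)) {
        expected = false;
        if (++spins > 100) {               // arbitrary cutoff for the sketch
            std::this_thread::yield();     // back off; let the holder run
            spins = 0;
        }
    }
}
```

Production mutexes (e.g. adaptive mutexes, futex-based locks) take this idea further by parking the thread in the kernel after spinning, which is what recovers the "degrades gracefully" behavior in the table above.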

Building Lock-Free Data Structures

CAS enables lock-free data structures through a pattern: read current state, compute new state, attempt CAS, retry on failure. Here’s a lock-free stack in C++:

#include <atomic>
#include <memory>
#include <optional>
#include <utility>

template<typename T>
class LockFreeStack {
private:
    struct Node {
        T data;
        Node* next;
        Node(T value) : data(std::move(value)), next(nullptr) {}
    };
    
    std::atomic<Node*> head{nullptr};

public:
    void push(T value) {
        Node* new_node = new Node(std::move(value));
        new_node->next = head.load(std::memory_order_relaxed);
        
        // Retry loop: keep trying until CAS succeeds
        while (!head.compare_exchange_weak(
            new_node->next,  // expected (updated on failure)
            new_node,        // desired
            std::memory_order_release,
            std::memory_order_relaxed
        )) {
            // new_node->next is automatically updated to current head
        }
    }
    
    std::optional<T> pop() {
        Node* old_head = head.load(std::memory_order_acquire);
        
        while (old_head != nullptr) {
            if (head.compare_exchange_weak(
                old_head,
                old_head->next,
                std::memory_order_acquire,
                std::memory_order_relaxed
            )) {
                T value = std::move(old_head->data);
                delete old_head;  // Danger: ABA problem lurks here
                return value;
            }
            // old_head is updated to current head on failure
        }
        return std::nullopt;
    }
};

The compare_exchange_weak variant may fail spuriously on some architectures (notably ARM) but is faster in loops. Use compare_exchange_strong when you need exactly one attempt.

Note the comment about ABA—this code has a bug we’ll address next.

The ABA Problem

The ABA problem is the classic CAS pitfall. Consider this sequence:

  1. Thread 1 reads head = A, prepares to pop
  2. Thread 1 is preempted
  3. Thread 2 pops A, pops B, pushes A back (same pointer, possibly different node)
  4. Thread 1 resumes, CAS succeeds (head is still A)
  5. Thread 1 follows A->next, which now points somewhere invalid

The CAS succeeded because the pointer value matched, but the underlying state changed. Thread 1’s old_head->next is stale.

// Demonstrating the tagged-pointer solution to the ABA problem
#include <atomic>
#include <cstdint>
#include <optional>
#include <utility>

template<typename T>
class ABAFreeStack {
private:
    struct Node {
        T data;
        Node* next;
    };
    
    // Pack pointer and counter into a single atomic unit
    struct TaggedPtr {
        Node* ptr;
        uintptr_t tag;  // Incremented on every modification
        
        bool operator==(const TaggedPtr& other) const {
            return ptr == other.ptr && tag == other.tag;
        }
    };
    
    // Requires 128-bit CAS on 64-bit systems (CMPXCHG16B on x86-64)
    std::atomic<TaggedPtr> head{{nullptr, 0}};

public:
    void push(T value) {
        Node* new_node = new Node{std::move(value), nullptr};
        TaggedPtr old_head = head.load(std::memory_order_relaxed);
        TaggedPtr new_head;
        
        do {
            new_node->next = old_head.ptr;
            new_head = {new_node, old_head.tag + 1};
        } while (!head.compare_exchange_weak(
            old_head, new_head,
            std::memory_order_release,
            std::memory_order_relaxed
        ));
    }
    
    std::optional<T> pop() {
        TaggedPtr old_head = head.load(std::memory_order_acquire);
        TaggedPtr new_head;
        
        while (old_head.ptr != nullptr) {
            new_head = {old_head.ptr->next, old_head.tag + 1};
            if (head.compare_exchange_weak(
                old_head, new_head,
                std::memory_order_acquire,
                std::memory_order_relaxed
            )) {
                T value = std::move(old_head.ptr->data);
                // Still need safe memory reclamation here
                return value;
            }
        }
        return std::nullopt;
    }
};

The tag increments on every modification, so even if the pointer returns to A, the tag differs. This requires double-width CAS (128-bit on 64-bit systems), which most modern CPUs support.
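Whether a double-width atomic is actually lock-free is platform- and compiler-dependent, and C++17 lets you check at compile time. A small sketch (the `TaggedPtr` layout mirrors the example above; note GCC/Clang on x86-64 may need `-mcx16` before CMPXCHG16B is used):

```cpp
#include <atomic>
#include <cstdint>

// A pointer-plus-counter pair, two machine words wide on 64-bit targets.
struct TaggedPtr {
    void* ptr;
    uintptr_t tag;
};

// true only if std::atomic<TaggedPtr> never falls back to an internal lock;
// if false, "lock-free" code built on it silently isn't.
constexpr bool double_width_lock_free =
    std::atomic<TaggedPtr>::is_always_lock_free;
```

When this reports false, the runtime emulates the 128-bit CAS with a hidden lock, which defeats the purpose; checking it (or `is_lock_free()` at runtime) is worth doing before shipping a tagged-pointer design.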

Other solutions include hazard pointers (threads publish pointers they’re accessing) and epoch-based reclamation (defer deletion until all threads have passed a barrier).

Language-Level APIs

Every major language provides CAS primitives. Here’s an atomic counter in three languages:

// C++: std::atomic
#include <atomic>
#include <cstdint>

class AtomicCounter {
    std::atomic<int64_t> value{0};
public:
    int64_t increment() {
        int64_t current = value.load(std::memory_order_relaxed);
        while (!value.compare_exchange_weak(
            current, current + 1,
            std::memory_order_relaxed
        )) {}
        return current + 1;
    }
};
// Rust: std::sync::atomic
use std::sync::atomic::{AtomicI64, Ordering};

struct AtomicCounter {
    value: AtomicI64,
}

impl AtomicCounter {
    fn increment(&self) -> i64 {
        let mut current = self.value.load(Ordering::Relaxed);
        loop {
            match self.value.compare_exchange_weak(
                current,
                current + 1,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return current + 1,
                // On failure, reuse the observed value instead of reloading
                Err(actual) => current = actual,
            }
        }
    }
}
}
// Java: java.util.concurrent.atomic
import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounter {
    private final AtomicLong value = new AtomicLong(0);
    
    public long increment() {
        long current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current + 1));
        return current + 1;
    }
}

For simple operations like increment, use the built-in fetch_add / incrementAndGet methods—they’re optimized and less error-prone than manual CAS loops.
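In C++ terms, the whole CAS loop above collapses to one call. A sketch of the preferred form (on x86-64 this compiles to a single LOCK XADD, with no retry loop at all):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<int64_t> counter{0};

// fetch_add returns the value *before* the addition, so add 1 to report
// the post-increment value, matching the CAS-loop versions above.
int64_t increment() {
    return counter.fetch_add(1, std::memory_order_relaxed) + 1;
}
```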

When to Use (and Avoid) CAS

Use CAS when:

  • Contention is low and operations are short
  • You need non-blocking progress guarantees
  • You’re implementing simple state machines (e.g., initialization flags)
  • Lock overhead dominates your critical section
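The "initialization flag" case in the list above is worth a sketch: a single `compare_exchange_strong` elects exactly one winner, and everyone else waits for the result. This is illustrative only; in real code prefer `std::call_once` or a function-local static, which implement the same idea:

```cpp
#include <atomic>

// 0 = uninitialized, 1 = initializing, 2 = ready
std::atomic<int> init_state{0};

template <typename F>
void init_once(F init) {
    int expected = 0;
    // strong variant: we want exactly one attempt, not a retry loop
    if (init_state.compare_exchange_strong(expected, 1,
                                           std::memory_order_acquire)) {
        init();  // only the thread that won the CAS runs the initializer
        init_state.store(2, std::memory_order_release);
    } else {
        // Losers wait until the winner publishes the ready state.
        while (init_state.load(std::memory_order_acquire) != 2) { /* spin */ }
    }
}
```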

Avoid CAS when:

  • Contention is high (spinning wastes CPU)
  • Critical sections are complex (CAS only works for single-location updates)
  • Correctness is more important than performance (locks are simpler to reason about)
  • Higher-level abstractions exist (use ConcurrentHashMap, not a hand-rolled lock-free map)

The honest advice: unless you’re writing infrastructure code or have profiler data showing lock contention, use standard synchronization primitives. Lock-free programming is notoriously difficult to get right, and the performance benefits rarely justify the complexity and bug risk.

When you do need CAS, start with language-provided concurrent collections before rolling your own. They’re written by experts, tested extensively, and handle edge cases you haven’t thought of.
