Compare-and-Swap: Lock-Free Primitive
Key Insights
- Compare-and-swap (CAS) is a hardware-supported atomic primitive that enables lock-free programming by allowing threads to update shared state without blocking, but it requires careful handling of retry loops and the ABA problem.
- CAS outperforms traditional locks under low contention but degrades under high contention—choosing between them depends on your specific workload characteristics and critical section complexity.
- Modern languages provide CAS through standardized APIs (`std::atomic`, `AtomicReference`, `compare_exchange`), but building correct lock-free data structures remains challenging and should only be attempted when proven necessary.
What is Compare-and-Swap (CAS)?
Compare-and-swap is an atomic CPU instruction that performs three operations as a single, indivisible unit: read a memory location, compare it against an expected value, and write a new value only if the comparison succeeds. The operation returns whether the swap occurred.
This atomicity guarantee is the foundation of lock-free programming. Without it, a read-modify-write sequence could be interrupted by another thread, leading to lost updates or corrupted state. CAS gives you a building block for coordinating concurrent access without blocking threads.
Here’s the conceptual operation:
```
function compare_and_swap(address, expected, desired):
    atomic:
        current = *address
        if current == expected:
            *address = desired
            return true
        else:
            return false
```
The critical insight is that CAS can fail. When another thread modifies the value between your read and your CAS attempt, the comparison fails. Your code must handle this—typically by retrying in a loop. This retry pattern is fundamental to all CAS-based algorithms.
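As a concrete instance of the retry pattern, here is a sketch of an atomic-maximum update in C++ (the `atomic_max` name is ours, not a standard API):

```cpp
#include <atomic>
#include <algorithm>

// Update `target` to max(target, candidate) using the CAS retry pattern.
// compare_exchange_weak writes the current value back into `observed` on
// failure, so each iteration retries against the freshest value.
int atomic_max(std::atomic<int>& target, int candidate) {
    int observed = target.load();
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate)) {
        // observed now holds the latest value; the loop re-checks the condition
    }
    return std::max(observed, candidate);
}
```

Note that the loop can exit two ways: the CAS succeeds, or another thread has already installed a value at least as large, making the update unnecessary.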
Hardware Support and Memory Semantics
CAS isn’t a software abstraction—it’s a CPU instruction. On x86/x64, the CMPXCHG instruction performs compare-and-swap with the LOCK prefix ensuring atomicity across cores. ARM uses a different approach: load-linked/store-conditional pairs (LDXR/STXR) that detect interference rather than locking the bus.
Here’s what CAS looks like at the instruction level:
```c
#include <stdbool.h>
#include <stdint.h>

// x86-64 inline assembly for CAS
static inline bool cas_x86(uint64_t* ptr, uint64_t expected, uint64_t desired) {
    uint64_t prev;
    __asm__ __volatile__(
        "lock cmpxchgq %2, %1"
        : "=a"(prev), "+m"(*ptr)
        : "r"(desired), "0"(expected)
        : "memory"
    );
    return prev == expected;
}

// ARM64 equivalent using load-exclusive/store-exclusive
static inline bool cas_arm64(uint64_t* ptr, uint64_t expected, uint64_t desired) {
    uint64_t prev;
    uint32_t success;
    __asm__ __volatile__(
        "1: ldxr %0, [%2]\n"
        "   cmp %0, %3\n"
        "   b.ne 2f\n"
        "   stxr %w1, %4, [%2]\n"
        "   cbnz %w1, 1b\n"
        "2:"
        : "=&r"(prev), "=&r"(success)
        : "r"(ptr), "r"(expected), "r"(desired)
        : "memory", "cc"
    );
    return prev == expected;
}
```
Memory ordering matters. A CAS operation typically implies a full memory barrier, ensuring that memory operations before the CAS are visible to other threads after a successful swap. Most language-level APIs let you specify weaker orderings (memory_order_relaxed, memory_order_acquire, memory_order_release) when you don’t need full sequential consistency—but getting this wrong causes subtle bugs that only manifest under specific timing conditions.
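To make the orderings concrete, here is a minimal spinlock sketch: the winning CAS uses acquire so the critical section observes the previous owner's writes, and unlock uses release to publish this owner's writes to the next. The `SpinLock` name is illustrative:

```cpp
#include <atomic>

// Minimal spinlock built on CAS. Acquire ordering on the successful CAS
// pairs with the release store in unlock(), forming a happens-before edge
// between consecutive critical sections.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = false;  // compare_exchange wrote the observed value back
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```

A relaxed CAS here would let critical-section reads and writes leak past the lock boundaries, which is exactly the class of timing-dependent bug described above.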
CAS vs Traditional Locks
Mutexes block threads. When a thread can’t acquire a lock, the OS deschedules it, context-switches to another thread, and later wakes the blocked thread. This works well for long critical sections but adds significant overhead for short ones.
CAS-based algorithms don’t block. A thread that fails a CAS simply retries, staying on the CPU. This eliminates context-switch overhead and avoids priority inversion (where a high-priority thread waits for a low-priority thread holding a lock).
The trade-offs:
| Factor | Mutex | CAS |
|---|---|---|
| Low contention | Good | Excellent |
| High contention | Degrades gracefully | Spins wastefully |
| Critical section size | Any | Must be tiny |
| Complexity | Simple | Error-prone |
| Priority inversion | Possible | Impossible |
| Fairness | Configurable | None |
Under high contention, CAS loops burn CPU cycles retrying failed operations. Mutexes handle this better by putting waiting threads to sleep. Hybrid approaches exist—spinning briefly before blocking—but add complexity.
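One such hybrid can be sketched as a bounded spin followed by yielding the CPU; production implementations (e.g., futex-based mutexes) instead park the waiting thread in the kernel. The `HybridLock` name and spin bound below are illustrative:

```cpp
#include <atomic>
#include <thread>

// Hybrid acquire: spin a bounded number of times (cheap when the lock is
// held briefly), then yield instead of burning further CPU cycles.
class HybridLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (int spins = 0; ; ++spins) {
            bool expected = false;
            if (locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed))
                return;
            if (spins >= 100)
                std::this_thread::yield();  // back off under contention
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```

The spin bound trades latency for CPU usage: too low and short waits pay a scheduler round-trip; too high and long waits waste cycles.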
Building Lock-Free Data Structures
CAS enables lock-free data structures through a pattern: read current state, compute new state, attempt CAS, retry on failure. Here’s a lock-free stack in C++:
```cpp
#include <atomic>
#include <optional>
#include <utility>

template<typename T>
class LockFreeStack {
private:
    struct Node {
        T data;
        Node* next;
        Node(T value) : data(std::move(value)), next(nullptr) {}
    };
    std::atomic<Node*> head{nullptr};
public:
    void push(T value) {
        Node* new_node = new Node(std::move(value));
        new_node->next = head.load(std::memory_order_relaxed);
        // Retry loop: keep trying until CAS succeeds
        while (!head.compare_exchange_weak(
            new_node->next,               // expected (updated on failure)
            new_node,                     // desired
            std::memory_order_release,
            std::memory_order_relaxed)) {
            // new_node->next is automatically updated to the current head
        }
    }
    std::optional<T> pop() {
        Node* old_head = head.load(std::memory_order_acquire);
        while (old_head != nullptr) {
            if (head.compare_exchange_weak(
                old_head,
                old_head->next,
                std::memory_order_acquire,
                std::memory_order_relaxed)) {
                T value = std::move(old_head->data);
                delete old_head;          // Danger: ABA problem lurks here
                return value;
            }
            // old_head is updated to the current head on failure
        }
        return std::nullopt;
    }
};
```
The `compare_exchange_weak` variant may fail spuriously on some architectures (notably ARM, where it maps to LL/SC) but compiles to tighter code inside retry loops. Use `compare_exchange_strong` when you are not already retrying in a loop and a spurious failure would be costly.
Note the comment about ABA—this code has a bug we’ll address next.
The ABA Problem
The ABA problem is the classic CAS pitfall. Consider this sequence:
- Thread 1 reads head = A, prepares to pop
- Thread 1 is preempted
- Thread 2 pops A, pops B, pushes A back (same pointer, possibly different node)
- Thread 1 resumes, CAS succeeds (head is still A)
- Thread 1 follows `A->next`, which now points somewhere invalid
The CAS succeeded because the pointer value matched, but the underlying state changed. Thread 1’s old_head->next is stale.
```cpp
#include <atomic>
#include <cstdint>
#include <optional>
#include <utility>

// Defeating ABA with tagged (versioned) pointers
template<typename T>
class ABAFreeStack {
private:
    struct Node {
        T data;
        Node* next;
    };
    // Pack pointer and counter into a single atomic unit
    struct TaggedPtr {
        Node* ptr;
        uintptr_t tag;  // incremented on every modification
        bool operator==(const TaggedPtr& other) const {
            return ptr == other.ptr && tag == other.tag;
        }
    };
    // Requires 128-bit CAS on 64-bit systems (CMPXCHG16B on x86-64)
    std::atomic<TaggedPtr> head{{nullptr, 0}};
public:
    void push(T value) {
        Node* new_node = new Node{std::move(value), nullptr};
        TaggedPtr old_head = head.load(std::memory_order_relaxed);
        TaggedPtr new_head;
        do {
            new_node->next = old_head.ptr;
            new_head = {new_node, old_head.tag + 1};
        } while (!head.compare_exchange_weak(
            old_head, new_head,
            std::memory_order_release,
            std::memory_order_relaxed));
    }
    std::optional<T> pop() {
        TaggedPtr old_head = head.load(std::memory_order_acquire);
        TaggedPtr new_head;
        while (old_head.ptr != nullptr) {
            new_head = {old_head.ptr->next, old_head.tag + 1};
            if (head.compare_exchange_weak(
                old_head, new_head,
                std::memory_order_acquire,
                std::memory_order_relaxed)) {
                T value = std::move(old_head.ptr->data);
                // Still need safe memory reclamation here
                return value;
            }
        }
        return std::nullopt;
    }
};
```
The tag increments on every modification, so even if the pointer returns to A, the tag differs. This requires double-width CAS (128-bit on 64-bit systems), which most modern CPUs support.
Other solutions include hazard pointers (threads publish pointers they’re accessing) and epoch-based reclamation (defer deletion until all threads have passed a barrier).
Language-Level APIs
Every major language provides CAS primitives. Here’s an atomic counter in three languages:
```cpp
// C++: std::atomic
#include <atomic>
#include <cstdint>

class AtomicCounter {
    std::atomic<int64_t> value{0};
public:
    int64_t increment() {
        int64_t current = value.load(std::memory_order_relaxed);
        while (!value.compare_exchange_weak(
            current, current + 1,
            std::memory_order_relaxed)) {}
        return current + 1;
    }
};
```
```rust
// Rust: std::sync::atomic
use std::sync::atomic::{AtomicI64, Ordering};

struct AtomicCounter {
    value: AtomicI64,
}

impl AtomicCounter {
    fn increment(&self) -> i64 {
        loop {
            let current = self.value.load(Ordering::Relaxed);
            match self.value.compare_exchange_weak(
                current,
                current + 1,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return current + 1,
                Err(_) => continue,
            }
        }
    }
}
```
```java
// Java: java.util.concurrent.atomic
import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounter {
    private final AtomicLong value = new AtomicLong(0);

    public long increment() {
        long current;
        do {
            current = value.get();
        } while (!value.compareAndSet(current, current + 1));
        return current + 1;
    }
}
```
For simple operations like increment, use the built-in `fetch_add` / `incrementAndGet` methods—they’re optimized and less error-prone than manual CAS loops.
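For example, in C++ the manual loop above collapses to a single `fetch_add` call, which on x86-64 typically compiles to one LOCK XADD instruction with no retry loop at all:

```cpp
#include <atomic>
#include <cstdint>

std::atomic<int64_t> counter{0};

// fetch_add returns the previous value, so add 1 to report the new count.
int64_t increment() {
    return counter.fetch_add(1, std::memory_order_relaxed) + 1;
}
```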
When to Use (and Avoid) CAS
Use CAS when:
- Contention is low and operations are short
- You need non-blocking progress guarantees
- You’re implementing simple state machines (e.g., initialization flags)
- Lock overhead dominates your critical section
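The initialization-flag case is a good fit because it is a single-word state machine: exactly one thread should win the transition from uninitialized to initializing. A sketch (the `State` enum and function name are ours):

```cpp
#include <atomic>

enum class State { Uninit, Initializing, Ready };

std::atomic<State> state{State::Uninit};

// Exactly one caller wins the Uninit -> Initializing transition and should
// perform the initialization; everyone else sees Initializing or Ready.
bool try_begin_init() {
    State expected = State::Uninit;
    return state.compare_exchange_strong(expected, State::Initializing);
}
```

Here `compare_exchange_strong` is the right choice: there is no retry loop, and a spurious failure would incorrectly tell a caller that someone else is initializing.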
Avoid CAS when:
- Contention is high (spinning wastes CPU)
- Critical sections are complex (CAS only works for single-location updates)
- Correctness is more important than performance (locks are simpler to reason about)
- Higher-level abstractions exist (use `ConcurrentHashMap`, not a hand-rolled lock-free map)
The honest advice: unless you’re writing infrastructure code or have profiler data showing lock contention, use standard synchronization primitives. Lock-free programming is notoriously difficult to get right, and the performance benefits rarely justify the complexity and bug risk.
When you do need CAS, start with language-provided concurrent collections before rolling your own. They’re written by experts, tested extensively, and handle edge cases you haven’t thought of.