Atomic Operations: Hardware-Level Synchronization
Key Insights
- Atomic operations leverage CPU cache coherency protocols and memory barriers to guarantee indivisible operations without operating system involvement, making them significantly faster than locks for simple operations under low contention.
- Memory ordering semantics (acquire, release, relaxed, sequential consistency) are as important as atomicity itself—incorrect ordering causes subtle bugs that only manifest on specific architectures or under heavy load.
- Lock-free data structures using atomics aren’t automatically faster than mutex-based alternatives; contention patterns, cache line sharing, and operation complexity determine which approach wins in practice.
Introduction: Why Atomicity Matters
Consider a simple counter increment: counter++. This single line compiles to at least three CPU operations—load, add, store. Between any of these steps, another thread can intervene, leading to lost updates. You’ve probably reached for a mutex to solve this, but mutexes involve system calls, context switches, and kernel transitions. For a single integer increment, that’s massive overhead.
Hardware designers recognized this problem decades ago. Modern CPUs provide instructions that execute indivisibly from the perspective of all other processors in the system. These atomic operations form the foundation of all synchronization primitives, including the mutexes you’re already using.
Understanding atomics isn’t just about micro-optimization. It’s about knowing what guarantees your code actually has and building correct concurrent systems from first principles.
How Atomic Operations Work at the Hardware Level
When you execute an atomic operation, the CPU must ensure no other processor can observe or modify the target memory location mid-operation. This happens through two mechanisms: bus locking and cache locking.
Bus locking is the older, heavier approach. The CPU asserts a LOCK signal on the memory bus, preventing all other processors from accessing memory. This works but kills parallelism across the entire system.
Cache locking is what modern CPUs actually use. The MESI protocol (Modified, Exclusive, Shared, Invalid) maintains cache coherency across cores. When a core needs exclusive access to a cache line for an atomic operation, it broadcasts an invalidation message. Other cores must invalidate their copies before the operation proceeds.
Here’s what happens during an atomic compare-and-swap:
- The executing core acquires the cache line in Exclusive or Modified state
- Other cores invalidate their copies of that line
- The operation executes entirely within the cache
- The result becomes visible to other cores through the coherency protocol
Memory barriers (fences) complement atomicity by controlling operation ordering. A store barrier ensures all prior stores complete before subsequent ones. A load barrier does the same for loads. Full barriers combine both. Without barriers, the CPU and compiler freely reorder operations for performance—often breaking your concurrent algorithms.
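The fence behavior described above can be sketched with C11's standalone `atomic_thread_fence`. This is a minimal illustration, not a recommended pattern (paired acquire/release operations are usually clearer); `payload`, `published`, and the function names are hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                        // plain, non-atomic data
static atomic_bool published = false;

// Writer: the release fence ensures the payload store completes
// before the flag store becomes visible to other cores.
void publish(int value) {
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&published, true, memory_order_relaxed);
}

// Reader: returns 1 and fills *out only once the flag is observed;
// the acquire fence ensures the subsequent payload read sees the
// writer's store.
int try_consume(int *out) {
    if (!atomic_load_explicit(&published, memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return 1;
}
```

The fences pair up across threads: the reader's acquire fence synchronizes with the writer's release fence through the flag, so the non-atomic `payload` read is safe.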
Common Atomic Primitives
Compare-and-Swap (CAS)
CAS is the workhorse of lock-free programming. It atomically compares a memory location to an expected value and, only if they match, replaces it with a new value.
#include <stdatomic.h>
#include <stdbool.h>

// Atomic compare-and-swap
bool cas_example(atomic_int *target, int expected, int desired) {
    return atomic_compare_exchange_strong(target, &expected, desired);
}
// Using CAS to implement atomic max
void atomic_max(atomic_int *target, int value) {
    int current = atomic_load(target);
    while (value > current) {
        if (atomic_compare_exchange_weak(target, &current, value)) {
            break;
        }
        // On failure, CAS reloads the latest value into `current`
    }
}
Use compare_exchange_weak in loops—it may spuriously fail but is faster on some architectures. Use compare_exchange_strong when you need a single attempt.
Fetch-and-Add
Atomically adds to a value and returns the original. Simpler and faster than CAS for counters.
long atomic_counter_increment(atomic_long *counter) {
    return atomic_fetch_add(counter, 1);
}

// Equivalent using GCC intrinsics
long atomic_counter_increment_gcc(long *counter) {
    return __atomic_fetch_add(counter, 1, __ATOMIC_SEQ_CST);
}
Test-and-Set
The simplest atomic primitive—sets a flag and returns its previous value. Often used for spinlocks.
typedef atomic_flag spinlock_t;

void spin_lock(spinlock_t *lock) {
    while (atomic_flag_test_and_set(lock)) {
        // Spin until we acquire the lock
    }
}

void spin_unlock(spinlock_t *lock) {
    atomic_flag_clear(lock);
}
Load-Linked/Store-Conditional (LL/SC)
ARM and other RISC architectures use LL/SC instead of CAS. Load-linked reads a value and sets a reservation. Store-conditional writes only if the reservation is still valid.
// Portable C; on ARM64 the compiler lowers this CAS loop to
// LL/SC instructions (ldxr/stxr)
int llsc_increment(atomic_int *target) {
    int old_val, new_val;
    do {
        old_val = atomic_load_explicit(target, memory_order_relaxed);
        new_val = old_val + 1;
    } while (!atomic_compare_exchange_weak_explicit(
                 target, &old_val, new_val,
                 memory_order_release, memory_order_relaxed));
    return new_val;
}
Memory Ordering and Consistency Models
Atomicity guarantees the operation is indivisible. Memory ordering guarantees when that operation becomes visible relative to other operations. These are orthogonal concerns.
// BROKEN: Missing synchronization
atomic_int data = 0;
atomic_int ready = 0;

// Thread 1
void producer(void) {
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_relaxed); // BUG!
}

// Thread 2
void consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0);
    int value = atomic_load_explicit(&data, memory_order_relaxed);
    // value might be 0! Stores can be reordered.
}
The fix uses acquire/release semantics:
// CORRECT: Proper memory ordering
void producer_fixed(void) {
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&ready, 1, memory_order_release); // Release barrier
}

void consumer_fixed(void) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0); // Acquire barrier
    int value = atomic_load_explicit(&data, memory_order_relaxed);
    // value is guaranteed to be 42
}
Release ensures all prior writes are visible before the release store. Acquire ensures all subsequent reads see writes that happened before the corresponding release. Together, they create a synchronization point.
Use memory_order_seq_cst (the default) when in doubt—it provides the strongest guarantees but may sacrifice performance.
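One case where acquire/release is genuinely not enough is the classic store/load pattern (Dekker-style mutual exclusion): each thread stores its own flag and then loads the other's. Under acquire/release alone, both loads may return 0; `memory_order_seq_cst` imposes a single total order over all four operations, so at least one thread must observe the other's store. A minimal sketch (function names are illustrative):

```c
#include <stdatomic.h>

static atomic_int flag0, flag1;

// Each side announces intent, then checks the peer. With seq_cst,
// the four operations fall into one total order, so the thread
// whose store comes later in that order must see the peer's flag.
int thread0_sees_peer(void) {
    atomic_store_explicit(&flag0, 1, memory_order_seq_cst);
    return atomic_load_explicit(&flag1, memory_order_seq_cst);
}

int thread1_sees_peer(void) {
    atomic_store_explicit(&flag1, 1, memory_order_seq_cst);
    return atomic_load_explicit(&flag0, memory_order_seq_cst);
}
```

With both orderings weakened to release/acquire, the outcome where both functions return 0 becomes legal on weakly ordered hardware, which is exactly the bug seq_cst exists to rule out.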
Building Lock-Free Data Structures
Here’s a lock-free stack using CAS:
#include <atomic>

template<typename T>
class LockFreeStack {
private:
    struct Node {
        T data;
        Node* next;
        Node(const T& d) : data(d), next(nullptr) {}
    };
    std::atomic<Node*> head{nullptr};

public:
    void push(const T& data) {
        Node* new_node = new Node(data);
        new_node->next = head.load(std::memory_order_relaxed);
        while (!head.compare_exchange_weak(
                   new_node->next, new_node,
                   std::memory_order_release,
                   std::memory_order_relaxed)) {
            // CAS failed; new_node->next updated to current head
        }
    }

    bool pop(T& result) {
        Node* old_head = head.load(std::memory_order_acquire);
        while (old_head != nullptr) {
            if (head.compare_exchange_weak(
                    old_head, old_head->next,
                    std::memory_order_acquire,
                    std::memory_order_relaxed)) {
                result = old_head->data;
                delete old_head; // Danger: ABA problem, plus use-after-free
                                 // for threads still reading this node!
                return true;
            }
        }
        return false;
    }
};
The ABA Problem: Between reading old_head and the CAS, another thread might pop the node, free it, allocate a new node at the same address, and push it. The CAS succeeds despite the stack being in a different state.
Solutions include hazard pointers, epoch-based reclamation, or tagged pointers (using unused address bits as a counter).
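The tagged-pointer idea can be sketched by packing a version counter next to a node index into a single CAS-able 64-bit word; using indices into a preallocated pool instead of real pointers is a common variant of this trick. The names, pool layout, and 32/32 split here are illustrative, not a production design:

```c
#include <stdatomic.h>
#include <stdint.h>

#define IDX_NONE UINT32_MAX  // sentinel index meaning "empty stack"

// Pack a 32-bit node index and a 32-bit version tag into one word.
static inline uint64_t pack(uint32_t idx, uint32_t tag) {
    return ((uint64_t)tag << 32) | idx;
}
static inline uint32_t unpack_idx(uint64_t w) { return (uint32_t)w; }
static inline uint32_t unpack_tag(uint64_t w) { return (uint32_t)(w >> 32); }

struct node { int value; uint32_t next; };  // next is an index, not a pointer
static struct node pool[64];
static _Atomic uint64_t head = (uint64_t)UINT32_MAX;  // idx = IDX_NONE, tag = 0

void tagged_push(uint32_t idx, int value) {
    pool[idx].value = value;
    uint64_t old = atomic_load(&head);
    do {
        pool[idx].next = unpack_idx(old);
    } while (!atomic_compare_exchange_weak(&head, &old,
                 pack(idx, unpack_tag(old))));
}

// Pop bumps the tag, so a recycled index produces a different word
// and a stale CAS fails -- exactly what defeats ABA.
int tagged_pop(int *out) {
    uint64_t old = atomic_load(&head);
    while (unpack_idx(old) != IDX_NONE) {
        uint32_t idx = unpack_idx(old);
        uint64_t desired = pack(pool[idx].next, unpack_tag(old) + 1);
        if (atomic_compare_exchange_weak(&head, &old, desired)) {
            *out = pool[idx].value;
            return 1;
        }
    }
    return 0;
}
```

A 32-bit tag can still wrap around in theory; real implementations either use a double-width CAS over a full pointer plus counter or accept the astronomically small wraparound window.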
Performance Considerations and Trade-offs
#include <atomic>
#include <mutex>
#include <thread>
#include <chrono>
#include <vector>
#include <iostream>

std::atomic<long> atomic_counter{0};
long mutex_counter = 0;
std::mutex counter_mutex;

void benchmark_atomic(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        atomic_counter.fetch_add(1, std::memory_order_relaxed);
    }
}

void benchmark_mutex(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex);
        mutex_counter++;
    }
}

int main() {
    const int iterations = 1000000;
    const int thread_counts[] = {1, 2, 4, 8};

    for (int num_threads : thread_counts) {
        atomic_counter = 0;
        mutex_counter = 0;

        // Benchmark atomic
        auto start = std::chrono::high_resolution_clock::now();
        std::vector<std::thread> threads;
        for (int i = 0; i < num_threads; ++i) {
            threads.emplace_back(benchmark_atomic, iterations / num_threads);
        }
        for (auto& t : threads) t.join();
        auto atomic_time = std::chrono::high_resolution_clock::now() - start;

        // Benchmark mutex
        threads.clear();
        start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < num_threads; ++i) {
            threads.emplace_back(benchmark_mutex, iterations / num_threads);
        }
        for (auto& t : threads) t.join();
        auto mutex_time = std::chrono::high_resolution_clock::now() - start;

        std::cout << "Threads: " << num_threads
                  << " Atomic: " << std::chrono::duration_cast<std::chrono::milliseconds>(atomic_time).count() << "ms"
                  << " Mutex: " << std::chrono::duration_cast<std::chrono::milliseconds>(mutex_time).count() << "ms\n";
    }
}
Typical results show atomics winning at low thread counts, but the gap narrows under high contention. False sharing—when atomic variables on the same cache line cause invalidation storms—can make atomics slower than expected. Align atomic variables to cache line boundaries (typically 64 bytes).
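The cache-line alignment advice above can be sketched in C11 with `alignas`: pad each per-thread counter to its own line so updates by different threads never invalidate each other's cached copies. The 64-byte figure and the names here are assumptions; the true line size is architecture-specific:

```c
#include <stdatomic.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  // assumed; query the target CPU for the real value

// alignas pads the struct so each counter occupies a full cache
// line, eliminating false sharing between neighboring counters.
struct padded_counter {
    alignas(CACHE_LINE) atomic_long value;
};

static struct padded_counter per_thread[8];

void bump(int tid) {
    atomic_fetch_add_explicit(&per_thread[tid].value, 1,
                              memory_order_relaxed);
}
```

Without the `alignas`, adjacent elements of `per_thread` would share a line and relaxed increments from different threads would still ping-pong the line between cores.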
Practical Guidelines and When to Use What
Use mutexes when:
- Protecting complex operations or multiple variables
- Critical sections contain blocking operations
- Code clarity matters more than microseconds
Use atomics when:
- Operating on single values (counters, flags, pointers)
- Lock-free progress guarantees are required
- Profiling shows lock contention is a bottleneck
Use lock-free structures when:
- You’ve proven mutexes are the bottleneck
- You need progress guarantees (real-time systems)
- You’re willing to invest in correctness verification
Common mistakes:
- Using relaxed ordering without understanding the consequences
- Assuming atomic operations are always faster
- Forgetting that compound operations (check-then-act) need CAS loops
- Ignoring the ABA problem in lock-free structures
Start with mutexes. Profile. Switch to atomics only where data proves it matters. Lock-free data structures are a last resort, not a first choice.