Rust Atomic Types: Lock-Free Concurrency
Key Insights
- Atomic operations provide lock-free synchronization primitives that outperform mutexes for simple shared state, but require careful memory ordering choices to ensure correctness across CPU architectures.
- Memory ordering (Relaxed, Acquire, Release, AcqRel, SeqCst) trades performance for synchronization guarantees: Relaxed is fastest but provides no ordering, while SeqCst ensures a total order at the cost of performance.
- Lock-free programming with atomics eliminates blocking but introduces subtle bugs like the ABA problem, making it suitable only when profiling proves mutexes are a bottleneck.
Introduction to Atomics and Lock-Free Programming
Atomic operations are indivisible read-modify-write operations that execute without interference from other threads. Unlike mutexes that use operating system primitives to block threads, atomics use CPU instructions to ensure thread-safe access to shared memory without locks.
The performance difference matters when contention is high or operations are frequent. Consider a simple counter accessed by multiple threads:
use std::sync::{Arc, Mutex};
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Mutex-based counter
fn mutex_counter() {
    let counter = Arc::new(Mutex::new(0u64));
    let mut handles = vec![];
    for _ in 0..4 {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..1_000_000 {
                let mut num = counter.lock().unwrap();
                *num += 1;
            }
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
}

// Atomic counter
fn atomic_counter() {
    let counter = Arc::new(AtomicU64::new(0));
    let mut handles = vec![];
    for _ in 0..4 {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..1_000_000 {
                counter.fetch_add(1, Ordering::Relaxed);
            }
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
}
The atomic version typically runs 5-10x faster under heavy contention. A contended mutex can involve syscalls, context switches, and kernel scheduling; an atomic increment executes as a single CPU instruction.
Use atomics for simple operations on primitive types: counters, flags, pointers. Use mutexes when you need to protect compound operations or complex data structures.
Understanding Memory Ordering
Modern CPUs reorder memory operations for performance. Without constraints, one thread might observe writes from another thread in unexpected orders. Memory ordering specifies synchronization guarantees.
Rust provides five ordering levels:
- Relaxed: No ordering guarantees, only atomicity
- Acquire: Prevents later reads/writes from moving before this operation
- Release: Prevents earlier reads/writes from moving after this operation
- AcqRel: Combines Acquire and Release
- SeqCst: Sequential consistency—total ordering across all threads
For a simple counter where you only care about the final value, Relaxed suffices:
use std::sync::atomic::{AtomicU64, Ordering};

static REQUESTS: AtomicU64 = AtomicU64::new(0);

fn handle_request() {
    // Just increment; order relative to other memory operations doesn't matter
    REQUESTS.fetch_add(1, Ordering::Relaxed);
}

fn get_total_requests() -> u64 {
    REQUESTS.load(Ordering::Relaxed)
}
For producer-consumer patterns, use Acquire/Release to establish happens-before relationships:
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// Producer thread
fn producer() {
    DATA.store(42, Ordering::Relaxed);
    // Release ensures the DATA write is visible before READY becomes true
    READY.store(true, Ordering::Release);
}

// Consumer thread
fn consumer() {
    // Acquire pairs with the Release store: once READY is true, DATA is visible
    while !READY.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let value = DATA.load(Ordering::Relaxed);
    assert_eq!(value, 42); // Guaranteed to see 42
}
The Release store synchronizes with the Acquire load, ensuring the consumer sees the data write. Without proper ordering, the CPU might reorder operations and the consumer could read stale data.
Core Atomic Types and Operations
Rust’s std::sync::atomic module provides atomic types for common primitives:
use std::sync::atomic::*;

// AtomicBool for flags
static SHUTDOWN: AtomicBool = AtomicBool::new(false);

fn signal_shutdown() {
    SHUTDOWN.store(true, Ordering::Release);
}

fn worker_loop() {
    while !SHUTDOWN.load(Ordering::Acquire) {
        // Do work
    }
}

// AtomicU64 for counters
static BYTES_SENT: AtomicU64 = AtomicU64::new(0);

fn send_data(bytes: u64) {
    BYTES_SENT.fetch_add(bytes, Ordering::Relaxed);
}

// AtomicUsize for indices
static NEXT_ID: AtomicUsize = AtomicUsize::new(0);

fn allocate_id() -> usize {
    NEXT_ID.fetch_add(1, Ordering::Relaxed)
}
The compare_exchange operation is fundamental for lock-free algorithms. It atomically compares a value and swaps if equal:
use std::sync::atomic::{AtomicPtr, Ordering};
use std::ptr;

struct Node {
    value: i32,
    next: *mut Node,
}

struct Stack {
    head: AtomicPtr<Node>,
}

impl Stack {
    fn push(&self, value: i32) {
        let new_node = Box::into_raw(Box::new(Node {
            value,
            next: ptr::null_mut(),
        }));
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*new_node).next = head; }
            // Try to swap: if head is unchanged, install new_node
            match self.head.compare_exchange(
                head,
                new_node,
                Ordering::Release, // Success ordering
                Ordering::Relaxed, // Failure ordering
            ) {
                Ok(_) => return,    // Success
                Err(_) => continue, // Retry: head changed under us
            }
        }
    }
}
This lock-free stack push uses compare_exchange to atomically update the head pointer only if it hasn’t changed since we loaded it.
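For symmetry, pop can be sketched the same way; the push is repeated here so the example stands alone. Note the hedge: freeing the popped node is only sound in this single-threaded demo. Concurrent readers would require hazard pointers or epoch-based reclamation, and the raw pointer comparison is exactly where the ABA problem discussed below bites.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    value: i32,
    next: *mut Node,
}

struct Stack {
    head: AtomicPtr<Node>,
}

impl Stack {
    fn new() -> Self {
        Stack { head: AtomicPtr::new(ptr::null_mut()) }
    }

    fn push(&self, value: i32) {
        let new_node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*new_node).next = head; }
            if self.head
                .compare_exchange(head, new_node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }

    // Pop the head node, or None if the stack is empty. Reclaiming the node
    // with Box::from_raw is only sound here because no other thread can still
    // hold a reference to it in this single-threaded demo.
    fn pop(&self) -> Option<i32> {
        loop {
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                return None;
            }
            let next = unsafe { (*head).next };
            if self.head
                .compare_exchange(head, next, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                let node = unsafe { Box::from_raw(head) };
                return Some(node.value);
            }
        }
    }
}

fn main() {
    let stack = Stack::new();
    stack.push(1);
    stack.push(2);
    assert_eq!(stack.pop(), Some(2));
    assert_eq!(stack.pop(), Some(1));
    assert_eq!(stack.pop(), None);
    println!("ok");
}
```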
Building Lock-Free Data Structures
Lock-free data structures avoid blocking but require careful design. Here’s a single-producer, single-consumer queue:
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const CAPACITY: usize = 1024;

struct SPSCQueue<T> {
    buffer: Vec<UnsafeCell<Option<T>>>,
    head: AtomicUsize, // Consumer reads here
    tail: AtomicUsize, // Producer writes here
}

// Sound only under the single-producer, single-consumer contract:
// exactly one thread calls push and exactly one thread calls pop
unsafe impl<T: Send> Sync for SPSCQueue<T> {}

impl<T> SPSCQueue<T> {
    fn new() -> Self {
        let mut buffer = Vec::with_capacity(CAPACITY);
        for _ in 0..CAPACITY {
            buffer.push(UnsafeCell::new(None));
        }
        SPSCQueue {
            buffer,
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        }
    }

    // Called only from the producer thread
    fn push(&self, value: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed);
        let next_tail = (tail + 1) % CAPACITY;
        // Check if full (one slot stays empty to distinguish full from empty)
        if next_tail == self.head.load(Ordering::Acquire) {
            return Err(value);
        }
        unsafe { *self.buffer[tail].get() = Some(value); }
        self.tail.store(next_tail, Ordering::Release);
        Ok(())
    }

    // Called only from the consumer thread
    fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        // Check if empty
        if head == self.tail.load(Ordering::Acquire) {
            return None;
        }
        let value = unsafe { (*self.buffer[head].get()).take() };
        self.head.store((head + 1) % CAPACITY, Ordering::Release);
        value
    }
}
The Acquire/Release pairs ensure the producer's write to the buffer is visible to the consumer before the consumer observes the updated tail index. The queue also deliberately leaves one slot empty, so head == tail always means empty and never full.
A spinlock demonstrates atomic compare_exchange for mutual exclusion:
use std::cell::UnsafeCell;
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub fn new(data: T) -> Self {
        SpinLock {
            locked: AtomicBool::new(false),
            data: UnsafeCell::new(data),
        }
    }

    pub fn lock(&self) -> SpinLockGuard<T> {
        // Acquire on the winning swap: we see every write the previous
        // holder made before its Release store
        while self.locked.swap(true, Ordering::Acquire) {
            std::hint::spin_loop();
        }
        SpinLockGuard { lock: self }
    }
}

pub struct SpinLockGuard<'a, T> {
    lock: &'a SpinLock<T>,
}

// The guard is the only path to the data, so access is exclusive
impl<T> Deref for SpinLockGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { &*self.lock.data.get() }
    }
}

impl<T> DerefMut for SpinLockGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        unsafe { &mut *self.lock.data.get() }
    }
}

impl<T> Drop for SpinLockGuard<'_, T> {
    fn drop(&mut self) {
        self.lock.locked.store(false, Ordering::Release);
    }
}
ABA Problem and Solutions
The ABA problem occurs when a value changes from A to B and back to A, fooling compare_exchange into thinking nothing changed:
// Thread 1 reads head (A)
let head = stack.head.load(Ordering::Acquire);

// Thread 2 pops A, pops B, then pushes A's node back.
// head is A again, but it is logically a different A!

// Thread 1's compare_exchange succeeds when it should fail
stack.head.compare_exchange(head, new_node, ...); // Oops!
Solutions include tagged pointers with generation counters:
use std::sync::atomic::{AtomicU64, Ordering};

// Pack pointer and counter into 64 bits:
// lower 48 bits hold the pointer, upper 16 bits hold the counter
struct TaggedPtr(AtomicU64);

impl TaggedPtr {
    fn new(ptr: *mut Node) -> Self {
        TaggedPtr(AtomicU64::new(ptr as u64))
    }

    fn load(&self) -> (*mut Node, u16) {
        let packed = self.0.load(Ordering::Acquire);
        let ptr = (packed & 0xFFFF_FFFF_FFFF) as *mut Node;
        let tag = (packed >> 48) as u16;
        (ptr, tag)
    }

    fn compare_exchange(
        &self,
        current_ptr: *mut Node,
        current_tag: u16,
        new_ptr: *mut Node,
    ) -> Result<(), ()> {
        let current = (current_ptr as u64) | ((current_tag as u64) << 48);
        let new_tag = current_tag.wrapping_add(1);
        let new = (new_ptr as u64) | ((new_tag as u64) << 48);
        self.0
            .compare_exchange(current, new, Ordering::Release, Ordering::Relaxed)
            .map(|_| ())
            .map_err(|_| ())
    }
}
Each modification increments the tag, preventing ABA since the tag will differ even if the pointer matches.
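A small sketch (invented addresses, no real allocation) shows the mechanism: once the tag has been bumped, a CAS that captured the old tag fails even though the pointer bits are identical:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const PTR_MASK: u64 = 0xFFFF_FFFF_FFFF;

// Pack a 48-bit pointer value with a 16-bit tag, as described above
fn pack(ptr_bits: u64, tag: u16) -> u64 {
    (ptr_bits & PTR_MASK) | ((tag as u64) << 48)
}

fn main() {
    // Hypothetical address standing in for a real heap pointer
    let addr: u64 = 0x7f00_dead_b000 & PTR_MASK;
    let slot = AtomicU64::new(pack(addr, 0));

    // A stale reader captured (addr, tag 0) here...
    let stale = pack(addr, 0);

    // ...meanwhile another thread removed and re-inserted the same address,
    // bumping the tag each time: 0 -> 1 -> 2.
    slot.store(pack(addr, 2), Ordering::Release);

    // The stale CAS now fails even though the pointer bits match,
    // which is exactly what defeats ABA.
    let result = slot.compare_exchange(stale, pack(addr, 3), Ordering::Release, Ordering::Relaxed);
    assert!(result.is_err());
    println!("stale CAS rejected");
}
```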
Performance Considerations and Best Practices
Choose memory ordering based on synchronization needs, not habit. SeqCst is safe but slow. Relaxed is fast but provides no ordering. Profile to find bottlenecks.
Avoid false sharing—when atomic variables share cache lines with other data:
use std::sync::atomic::AtomicU64;

// Bad: adjacent atomics share a cache line (typically 64 bytes)
struct Counters {
    counter1: AtomicU64,
    counter2: AtomicU64,
}

// Good: pad each counter to its own cache line
#[repr(align(64))]
struct PaddedCounter {
    counter: AtomicU64,
}

struct PaddedCounters {
    counter1: PaddedCounter,
    counter2: PaddedCounter,
}
Without padding, writes to counter1 invalidate counter2’s cache line on other cores, causing performance degradation.
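The layout claim can be checked directly; this sketch assumes the common 64-byte cache line:

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

#[repr(align(64))]
struct PaddedCounter {
    counter: AtomicU64,
}

fn main() {
    // An unpadded atomic is 8 bytes, so eight of them fit in one
    // (assumed 64-byte) cache line and contend with each other.
    assert_eq!(size_of::<AtomicU64>(), 8);
    // The padded wrapper is both aligned to and sized as a full cache
    // line, so adjacent counters can never share one.
    assert_eq!(align_of::<PaddedCounter>(), 64);
    assert_eq!(size_of::<PaddedCounter>(), 64);
    println!("ok");
}
```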
Benchmark different orderings for your workload:
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

fn bench_ordering() {
    let counter = AtomicU64::new(0);

    let start = Instant::now();
    for _ in 0..10_000_000 {
        counter.fetch_add(1, Ordering::Relaxed);
    }
    println!("Relaxed: {:?}", start.elapsed());

    let start = Instant::now();
    for _ in 0..10_000_000 {
        counter.fetch_add(1, Ordering::SeqCst);
    }
    println!("SeqCst: {:?}", start.elapsed());
}
On x86, Relaxed and SeqCst read-modify-write operations like fetch_add compile to the same locked instruction, so they perform similarly. On ARM and other weakly ordered architectures, weaker orderings map to cheaper instructions and the difference can be significant.
Conclusion and Further Resources
Atomics enable lock-free concurrency for specific use cases: counters, flags, simple queues. They’re faster than mutexes but harder to use correctly. Start with SeqCst for correctness, optimize to weaker orderings only when profiling proves necessary.
For production lock-free code, use battle-tested libraries like crossbeam rather than rolling your own. The crossbeam-epoch crate solves memory reclamation in lock-free structures. For specialized needs, explore seqlock for reader-writer scenarios and hazard pointers for safe memory reclamation.
Lock-free programming is a sharp tool. Use it when mutexes demonstrably bottleneck your application, not as a premature optimization. Most concurrent programs work fine with mutexes and channels.