Memory Ordering: Sequential Consistency and Relaxed
Key Insights
- Sequential consistency (SeqCst) provides the strongest guarantee—all threads observe operations in the same total order—but comes with significant performance overhead on weakly-ordered architectures like ARM.
- Relaxed ordering guarantees only atomicity, not ordering, making it suitable for independent counters but dangerous when coordinating between variables.
- Start with SeqCst for correctness, then relax ordering constraints only after profiling proves it necessary and careful analysis confirms safety.
Why Memory Ordering Matters
Your CPU is lying to you. That neat sequence of instructions you wrote? The processor executes them out of order, speculatively, and across multiple cores that each have their own view of memory. Compilers make it worse—they’ll happily reorder your stores and loads if it makes the generated code faster.
For single-threaded code, this is invisible. The hardware and compiler maintain the illusion of sequential execution. But the moment you share data between threads, that illusion shatters. A write that “happens before” a read in your source code might actually complete after it from another thread’s perspective.
Memory ordering is how we tell the compiler and CPU which reorderings are acceptable and which would break our program. Get it wrong, and you’ll have bugs that appear only under load, only on certain architectures, or only on Tuesdays during a full moon.
The Memory Model Foundation
A memory model defines the rules for how memory operations become visible across threads. It answers questions like: if thread A writes to variable X, when does thread B see that write? Can thread B see writes to X and Y in a different order than thread A performed them?
The key concept is the happens-before relationship. If operation A happens-before operation B, then A’s effects are guaranteed to be visible when B executes. Without an explicit happens-before relationship, you have a data race, and the behavior is undefined.
Memory orderings exist on a spectrum:
- Relaxed: Only atomicity is guaranteed. No ordering constraints.
- Acquire/Release: Creates happens-before relationships between specific pairs of operations.
- Sequential Consistency: All threads agree on a single total order of all SeqCst operations.
Stronger orderings mean more constraints on the hardware, which means more memory barriers and less optimization opportunity. The art is choosing the weakest ordering that still guarantees correctness.
Sequential Consistency (SeqCst)
Sequential consistency is the model most programmers intuitively expect. All threads observe all operations in some consistent total order that respects each thread’s program order. If you write A then B, every other thread sees A before B. If thread 1 writes X and thread 2 writes Y, all threads agree on whether X or Y happened first.
Here’s a classic pattern using SeqCst for flag-based synchronization:
use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};
use std::thread;

static DATA: AtomicI32 = AtomicI32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::SeqCst);
        READY.store(true, Ordering::SeqCst);
    });

    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::SeqCst) {
            // spin
        }
        assert_eq!(DATA.load(Ordering::SeqCst), 42);
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
With SeqCst, this code is correct. The consumer will always see DATA as 42 when READY is true, because all threads observe the stores in the same order the producer performed them.
The C++ equivalent uses std::memory_order_seq_cst:
#include <atomic>
#include <thread>
#include <cassert>

std::atomic<int> data{0};
std::atomic<bool> ready{false};

void producer() {
    data.store(42, std::memory_order_seq_cst);
    ready.store(true, std::memory_order_seq_cst);
}

void consumer() {
    while (!ready.load(std::memory_order_seq_cst)) {
        // spin
    }
    assert(data.load(std::memory_order_seq_cst) == 42);
}
The cost? On x86, SeqCst stores require an MFENCE instruction or a locked operation, which can cost 20-100 cycles. On ARM, you need full barriers (DMB ISH) that prevent all reordering. For hot paths, this adds up.
Relaxed Ordering
Relaxed ordering provides the minimum guarantee: the operation is atomic (no torn reads or writes), but there’s no ordering constraint with respect to other operations. The compiler and CPU can reorder relaxed operations freely.
This sounds dangerous, and it is—for coordination. But for independent operations like counters, relaxed ordering is both safe and fast:
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

static COUNTER: AtomicU64 = AtomicU64::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1_000_000 {
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    println!("Final count: {}", COUNTER.load(Ordering::Relaxed));
}
This code is correct. Each increment is atomic, so no updates are lost. We don’t care about the order of increments—only that they all happen. The final load will see all increments because joining the threads establishes happens-before relationships.
On ARM, relaxed operations compile to plain loads and stores with no barriers. On x86, relaxed loads and stores also compile to plain instructions; the architecture's strong (TSO) model already gives every load acquire semantics and every store release semantics, so there is nothing left to strip away. Either way, relaxed is as fast as it gets.
The Danger Zone: When Relaxed Goes Wrong
Here’s where relaxed ordering betrays you. Consider the classic store-buffer litmus test:
use std::sync::atomic::{AtomicI32, Ordering};
use std::thread;

static X: AtomicI32 = AtomicI32::new(0);
static Y: AtomicI32 = AtomicI32::new(0);

fn main() {
    for _ in 0..1_000_000 {
        X.store(0, Ordering::Relaxed);
        Y.store(0, Ordering::Relaxed);

        let t1 = thread::spawn(|| {
            X.store(1, Ordering::Relaxed);
            Y.load(Ordering::Relaxed)
        });
        let t2 = thread::spawn(|| {
            Y.store(1, Ordering::Relaxed);
            X.load(Ordering::Relaxed)
        });

        let r1 = t1.join().unwrap();
        let r2 = t2.join().unwrap();

        if r1 == 0 && r2 == 0 {
            println!("Both threads saw 0! This 'impossible' outcome happened.");
            break;
        }
    }
}
Intuitively, this should be impossible. Thread 1 stores X=1 then reads Y. Thread 2 stores Y=1 then reads X. At least one of them should see the other’s store, right?
Wrong. On ARM (and other weakly-ordered architectures), both threads can read 0. Each CPU’s store sits in a store buffer, invisible to other cores, while the loads execute. The stores eventually become visible, but by then both loads have completed.
On x86, you won’t see this particular outcome due to its stronger memory model, but other litmus tests will fail. The lesson: if you test only on x86, you’re testing on easy mode.
With SeqCst, this outcome is impossible. The total order guarantees that if thread 1’s store happens before thread 2’s load (in the global order), thread 2 sees it.
Choosing the Right Ordering
Here’s my decision framework:
Start with SeqCst. It’s the easiest to reason about and matches programmer intuition. Make your code correct first.
Profile under realistic load. Atomic operations are rarely the bottleneck. I’ve seen developers spend days optimizing atomics that account for 0.1% of runtime.
If atomics are hot, consider Acquire/Release first. This middle ground provides happens-before relationships without the full cost of SeqCst. The producer-consumer example above works correctly with Release on stores and Acquire on loads:
DATA.store(42, Ordering::Release);
READY.store(true, Ordering::Release);
// ...
while !READY.load(Ordering::Acquire) {}
assert_eq!(DATA.load(Ordering::Acquire), 42);
Strictly, the happens-before edge comes from pairing the Release store to READY with the Acquire load of READY; once that edge exists, the DATA accesses themselves could be downgraded to Relaxed.
Use Relaxed only for truly independent operations. Counters, statistics, and progress indicators are good candidates. Anything involving coordination between multiple variables is not.
Document your reasoning. Future you (or your colleagues) will not remember why you chose Relaxed. Leave a comment explaining the safety argument.
Practical Takeaways
Memory ordering is one of those areas where getting clever costs more than it saves. Here are the rules I follow:
Prefer higher-level abstractions. Channels, mutexes, and concurrent data structures handle ordering correctly. Use them unless you have a compelling reason not to.
Benchmark before optimizing. The difference between SeqCst and Relaxed is often nanoseconds. That matters in a tight loop executed billions of times; it doesn’t matter in code that runs occasionally.
Test on ARM. If you’re targeting multiple architectures—and you probably are, given the rise of ARM servers and Apple Silicon—test on the weakest memory model you’ll encounter. x86’s strong model hides bugs that ARM will expose.
Use tools. ThreadSanitizer catches many ordering bugs. Loom (for Rust) exhaustively tests concurrent code under different interleavings. The C++ standard library has std::atomic_thread_fence for when you need explicit barriers.
When in doubt, choose correctness. A program that’s 10% slower but always works beats a program that’s fast but occasionally corrupts data. You can always optimize later; you can’t un-corrupt a database.
Memory ordering is subtle, but it’s not magic. The rules exist, they’re documented, and with practice, you’ll develop intuition for when relaxed ordering is safe and when it’s asking for trouble. Start conservative, measure, and relax constraints only when you understand exactly why it’s safe to do so.