Spinlock: Busy-Wait Synchronization

Key Insights

  • Spinlocks trade CPU cycles for latency—they’re optimal when lock hold times are shorter than the cost of a context switch (typically under 1-2 microseconds)
  • The naive spin loop destroys performance at scale; real implementations use ticket locks, MCS queues, or exponential backoff to prevent cache line thrashing
  • Spinlocks belong in kernel code, interrupt handlers, and ultra-low-latency paths—if you’re reaching for one in application code, you probably need to reconsider your design

Introduction to Spinlocks

A spinlock is exactly what it sounds like: a lock that spins. When a thread tries to acquire a spinlock that’s already held, it doesn’t go to sleep and wait for the operating system to wake it up. Instead, it sits in a tight loop, repeatedly checking whether the lock has become available.

This seems wasteful, and often it is. But blocking locks carry their own costs. When a thread blocks, the kernel must save its state, update scheduler data structures, potentially flush caches, and later reverse the entire process when the lock becomes available. On modern hardware, this context switch overhead runs anywhere from 1 to 10 microseconds.

If your critical section takes 50 nanoseconds, paying 5 microseconds to block is absurd. You’d rather burn a few hundred CPU cycles spinning than invoke the kernel’s full scheduling machinery. That’s the spinlock’s value proposition: trade CPU time for latency when you know the wait will be short.

How Spinlocks Work

At its core, a spinlock is an atomic variable and a loop. The lock operation attempts to atomically change the variable from “unlocked” to “locked.” If successful, the thread proceeds. If not, it tries again. And again. And again.

The atomic operation is crucial. Without it, two threads could both read “unlocked,” both write “locked,” and both believe they hold the lock. Modern CPUs provide atomic instructions like test-and-set, compare-and-swap, and atomic exchange specifically for this purpose.

Here’s a minimal spinlock implementation in C:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag flag;
} spinlock_t;

void spinlock_init(spinlock_t *lock) {
    atomic_flag_clear(&lock->flag);
}

void spinlock_acquire(spinlock_t *lock) {
    while (atomic_flag_test_and_set_explicit(&lock->flag, memory_order_acquire)) {
        // Spin until we acquire the lock
        // On x86, PAUSE instruction reduces power and improves performance
        __builtin_ia32_pause();
    }
}

void spinlock_release(spinlock_t *lock) {
    atomic_flag_clear_explicit(&lock->flag, memory_order_release);
}

The memory ordering matters. memory_order_acquire on the test-and-set ensures that no reads or writes in the critical section can be reordered before the acquisition, and that the acquiring thread sees every write the previous holder made. memory_order_release on unlock ensures all of our writes inside the critical section are visible before another thread can acquire the lock.
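To see the pairing at work, here's a hedged sketch (the run_demo name and iteration counts are ours) in which two threads increment a plain, non-atomic counter under the spinlock; the acquire/release ordering is exactly what makes those plain increments safe:

```c
#include <pthread.h>
#include <stdatomic.h>

// Two threads increment an ordinary (non-atomic) counter under the spinlock.
// Each acquirer is guaranteed to see every write the previous holder made.

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;
static long counter;  // plain variable, protected only by the lock

static void lock(void) {
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;  // spin
}

static void unlock(void) {
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock();
        counter++;  // no data race: exclusion plus ordering from the lock
        unlock();
    }
    return NULL;
}

long run_demo(void) {
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;  // 200000 iff the lock provides exclusion and visibility
}
```

Without the acquire/release pair (say, with memory_order_relaxed everywhere), the increments could be torn or lost even though the flag itself is atomic.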

The PAUSE instruction (exposed via the compiler intrinsic) tells the CPU we’re in a spin loop. This reduces power consumption and, on hyperthreaded cores, yields execution resources to the sibling thread.
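The intrinsic is x86-specific. A common pattern, sketched below, is to hide the hint behind a per-architecture wrapper (the cpu_relax name follows the Linux kernel's convention; this particular wrapper is our own):

```c
// Portable spin-hint wrapper: a sketch, not a complete port list.
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();                 // x86 PAUSE
#elif defined(__aarch64__)
    __asm__ __volatile__("yield");          // ARM64 spin-loop hint
#else
    __asm__ __volatile__("" ::: "memory");  // fallback: compiler barrier only
#endif
}
```

The fallback does nothing at the CPU level, but the compiler barrier at least keeps the spin loop from being optimized into something pathological.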

When to Use Spinlocks

Spinlocks make sense in specific circumstances:

Short critical sections: If the protected code runs in under a microsecond, spinning is almost always faster than blocking. Database engines use spinlocks to protect buffer pool page latches because the typical operation—checking a flag, incrementing a counter—takes nanoseconds.

Multi-core systems: On a single core, a spinlock is worse than useless. If the lock holder isn’t running, it can’t release the lock, and the spinner will burn its entire time slice accomplishing nothing. With multiple cores, the holder can run simultaneously with the spinner.

Interrupt and kernel contexts: In interrupt handlers, you often can’t block—there’s no thread context to save. The Linux kernel uses spinlocks extensively in these situations. When you disable interrupts and grab a spinlock, you’re guaranteed the holder is running on another CPU (not preempted on yours), so the wait is bounded.

Known contention patterns: If you know that contention is rare and brief, spinlocks avoid the futex syscalls and scheduler involvement that a contended mutex incurs. (An uncontended mutex stays in userspace too, so the gap only matters when threads actually collide.)

The crossover point varies by system, but a reasonable rule of thumb: if your critical section exceeds 1-2 microseconds, or if contention is high, use a blocking lock.
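One middle ground worth knowing: spin a bounded number of times on pthread_mutex_trylock, then fall back to blocking. This is a sketch of the idea behind adaptive mutexes (SPIN_LIMIT is an arbitrary assumption here; glibc's PTHREAD_MUTEX_ADAPTIVE_NP tunes its spin count from past behavior):

```c
#include <pthread.h>

#define SPIN_LIMIT 100  // assumption: real implementations tune this

void adaptive_lock(pthread_mutex_t *m) {
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;                  // acquired while spinning: no syscall
#if defined(__x86_64__) || defined(__i386__)
        __builtin_ia32_pause();      // spin-loop hint on x86
#endif
    }
    pthread_mutex_lock(m);           // still contended: let the kernel block us
}
```

Short holds get spinlock-like latency; long holds degrade gracefully into a normal blocking wait instead of burning a core.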

Performance Characteristics and Pitfalls

Spinlocks have sharp edges. Used incorrectly, they’ll devastate your system’s performance.

CPU consumption: A spinning thread consumes 100% of its core. In cloud environments, you’re paying for those cycles. In battery-powered devices, you’re draining power. In latency-sensitive systems, you might be starving other work.

Cache line bouncing: The naive implementation above has a brutal scaling problem. Every spinner repeatedly writes to the same cache line (the lock variable). On a multi-socket system, this cache line bounces between cores via the interconnect, consuming memory bandwidth and adding latency.
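The standard first fix is test-and-test-and-set (TTAS): spin on a plain load and only attempt the atomic write when the lock looks free. A minimal sketch:

```c
#include <stdatomic.h>
#include <stdbool.h>

// TTAS: waiters spin on a read-only load. Reads hit the locally cached
// copy of the line, so spinning generates no invalidation traffic.

typedef struct {
    atomic_bool locked;
} ttas_lock_t;

void ttas_acquire(ttas_lock_t *lock) {
    for (;;) {
        // Attempt the (cache-line-invalidating) write only when the lock
        // last looked free.
        if (!atomic_exchange_explicit(&lock->locked, true,
                                      memory_order_acquire))
            return;
        // Read-only spin: the line stays in the shared state.
        while (atomic_load_explicit(&lock->locked, memory_order_relaxed))
            ;  // a PAUSE hint belongs here too
    }
}

void ttas_release(ttas_lock_t *lock) {
    atomic_store_explicit(&lock->locked, false, memory_order_release);
}
```

TTAS fixes steady-state spinning but still produces a burst of exchanges the moment the lock is released; the ticket and MCS locks described later address that.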

Priority inversion: If a low-priority thread holds a spinlock and a high-priority thread spins waiting for it, but a medium-priority thread preempts the low-priority holder, the high-priority thread waits indefinitely. This is why real-time systems treat spinlocks carefully.

Here’s a single-threaded benchmark comparing spinlock and mutex overhead across different critical section lengths (it measures uncontended cost; a contended comparison needs multiple threads):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 1000000

/* spinlock_t and the spinlock_* functions from the earlier listing */
spinlock_t spin;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

volatile int shared_counter = 0;

void work_simulation(int cycles) {
    for (volatile int i = 0; i < cycles; i++) {
        // Simulate work inside critical section
    }
}

double benchmark_spinlock(int work_cycles) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    for (int i = 0; i < ITERATIONS; i++) {
        spinlock_acquire(&spin);
        shared_counter++;
        work_simulation(work_cycles);
        spinlock_release(&spin);
    }
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

double benchmark_mutex(int work_cycles) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&mutex);
        shared_counter++;
        work_simulation(work_cycles);
        pthread_mutex_unlock(&mutex);
    }
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main() {
    spinlock_init(&spin);
    
    int work_cycles[] = {0, 10, 100, 1000, 10000};
    
    printf("Work Cycles | Spinlock (s) | Mutex (s)\n");
    printf("------------|--------------|----------\n");
    
    for (int i = 0; i < 5; i++) {
        double spin_time = benchmark_spinlock(work_cycles[i]);
        double mutex_time = benchmark_mutex(work_cycles[i]);
        printf("%11d | %12.4f | %9.4f\n", 
               work_cycles[i], spin_time, mutex_time);
    }
    
    return 0;
}

As written, this runs on one thread, so the two locks come out close: it measures raw acquire/release cost, not contention. Run the same loops from several threads and the expected pattern emerges: spinlocks win at low work cycles, and mutexes win as critical section length and contention increase.

Spinlock Variants and Optimizations

The basic test-and-set spinlock doesn’t scale. Better algorithms exist.

Ticket locks provide fairness. Each arriving thread takes a ticket number and waits until that number is served. No starvation, predictable ordering:

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

void ticket_lock_init(ticket_lock_t *lock) {
    atomic_store(&lock->next_ticket, 0);
    atomic_store(&lock->now_serving, 0);
}

void ticket_lock_acquire(ticket_lock_t *lock) {
    unsigned int my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1, memory_order_relaxed);
    
    while (atomic_load_explicit(&lock->now_serving, 
                                 memory_order_acquire) != my_ticket) {
        __builtin_ia32_pause();
    }
}

void ticket_lock_release(ticket_lock_t *lock) {
    /* Only the holder writes now_serving, so a relaxed read of our own
       earlier value is safe; the release store publishes the handoff. */
    unsigned int current = atomic_load_explicit(
        &lock->now_serving, memory_order_relaxed);
    atomic_store_explicit(&lock->now_serving, current + 1, 
                          memory_order_release);
}

MCS locks go further by giving each waiter its own cache line to spin on, eliminating cache line bouncing entirely. Each thread spins on a local variable; the previous holder writes to it when releasing.

Exponential backoff reduces contention by having spinners wait progressively longer between attempts:

void spinlock_acquire_backoff(spinlock_t *lock) {
    int backoff = 1;
    const int max_backoff = 1024;
    
    while (atomic_flag_test_and_set_explicit(&lock->flag, 
                                              memory_order_acquire)) {
        for (int i = 0; i < backoff; i++) {
            __builtin_ia32_pause();
        }
        backoff = (backoff < max_backoff) ? backoff * 2 : max_backoff;
    }
}

Spinlocks in Practice

The Linux kernel uses spinlocks pervasively. The spinlock_t type protects scheduler data structures, device registers, and countless kernel subsystems. The kernel’s implementation automatically disables preemption (and optionally interrupts) to prevent the holder from being scheduled out.

Database engines like MySQL’s InnoDB use spinlocks for buffer pool management. PostgreSQL’s LWLock infrastructure includes spin-wait phases before falling back to semaphore-based blocking.

In Rust, the parking_lot crate’s locks spin briefly before parking the waiting thread, which makes them faster than the standard library’s mutex under low contention. In C++, std::atomic_flag is the usual building block for a spinlock, though the standard library doesn’t provide one directly.

Here’s a reader-writer spinlock that allows multiple concurrent readers:

typedef struct {
    atomic_int state;  // -1 = write locked, 0 = unlocked, >0 = reader count
} rwspinlock_t;

void rwspin_read_lock(rwspinlock_t *lock) {
    while (1) {
        int current = atomic_load_explicit(&lock->state, memory_order_relaxed);
        if (current >= 0) {
            if (atomic_compare_exchange_weak_explicit(
                    &lock->state, &current, current + 1,
                    memory_order_acquire, memory_order_relaxed)) {
                return;
            }
        }
        __builtin_ia32_pause();
    }
}

void rwspin_read_unlock(rwspinlock_t *lock) {
    atomic_fetch_sub_explicit(&lock->state, 1, memory_order_release);
}

void rwspin_write_lock(rwspinlock_t *lock) {
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(
            &lock->state, &expected, -1,
            memory_order_acquire, memory_order_relaxed)) {
        expected = 0;
        __builtin_ia32_pause();
    }
}

void rwspin_write_unlock(rwspinlock_t *lock) {
    atomic_store_explicit(&lock->state, 0, memory_order_release);
}

Note the reader bias: a continuous stream of readers keeps the count positive and can starve a waiting writer. Writer-preference variants add a “writer waiting” bit that stops new readers from entering.

Conclusion

Spinlocks are a precision tool. They excel when critical sections are measured in nanoseconds, when you’re on bare metal or in kernel space, and when blocking overhead dominates your performance profile.

For most application code, mutexes are the right choice. They handle the complex cases—long waits, priority inheritance, debugging support—that spinlocks ignore. But when you need the absolute minimum latency for the shortest possible critical section, and you understand the costs, spinlocks deliver.

The decision framework is straightforward: measure your critical section duration, understand your contention patterns, and benchmark both approaches. If spinlocks win by enough to matter, use them. If not, stick with the abstractions your runtime provides.
