Work Stealing: Load Balancing in Thread Pools

Key Insights

  • Work stealing solves the fundamental problem of unpredictable task durations by letting idle threads dynamically take work from busy ones, achieving near-optimal load balancing without central coordination.
  • The Chase-Lev deque enables lock-free work stealing where the owning thread operates on one end while thieves steal from the other, minimizing contention in the common case.
  • Work stealing shines for recursive, divide-and-conquer workloads but adds overhead for uniform, predictable tasks—choose your scheduling strategy based on your workload characteristics.

The Problem with Static Work Distribution

Thread pools typically distribute work using a shared queue: tasks go in, worker threads pull them out. This works fine when tasks take roughly the same time. But reality is messier. Parse one JSON document in microseconds, another in milliseconds. Process one image that’s 100KB, the next that’s 10MB.

Static partitioning—dividing N tasks among M threads upfront—fails spectacularly here. One thread finishes early and sits idle while another grinds through a monster task. You’ve paid for parallelism you’re not using.
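The imbalance is easy to quantify with a toy makespan model (the helper names and task durations below are illustrative, not from any real pool): greedy dynamic assignment, which work stealing approximates, beats upfront chunking whenever expensive tasks cluster together.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Makespan = time until the last thread finishes.
// Static: tasks split into contiguous equal-count chunks upfront.
int64_t static_makespan(const std::vector<int64_t>& tasks, int threads) {
    int64_t worst = 0;
    size_t chunk = (tasks.size() + threads - 1) / threads;
    for (size_t i = 0; i < tasks.size(); i += chunk) {
        int64_t sum = 0;
        for (size_t j = i; j < std::min(i + chunk, tasks.size()); j++)
            sum += tasks[j];
        worst = std::max(worst, sum);
    }
    return worst;
}

// Dynamic: each task goes to the currently least-loaded thread,
// approximating what stealing achieves at runtime.
int64_t dynamic_makespan(const std::vector<int64_t>& tasks, int threads) {
    std::vector<int64_t> load(threads, 0);
    for (int64_t t : tasks)
        *std::min_element(load.begin(), load.end()) += t;
    return *std::max_element(load.begin(), load.end());
}
```

With durations {10, 10, 10, 10, 1, 1, 1, 1} on two threads, static chunking puts all four heavy tasks on one thread (makespan 40) while greedy assignment interleaves them (makespan 22).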

Work stealing flips the model. Instead of a central queue, each thread maintains its own local queue. When a thread runs out of work, it doesn’t idle—it steals from a busy neighbor. Load balances dynamically, automatically, without central coordination.

How Work Stealing Works

The core data structure is a double-ended queue (deque) per worker thread. The owning thread pushes and pops from the bottom (LIFO order), while thieves steal from the top (FIFO order).

This asymmetry is deliberate. The owner works on recently-pushed tasks, which are likely still in cache. Thieves take older tasks from the top, which are typically larger (in recursive algorithms, earlier tasks haven’t been subdivided yet).

// Conceptual deque interface (single-threaded; bounds checks and
// buffer wraparound omitted)
typedef struct {
    Task** buffer;
    int64_t top;    // Thieves steal from here
    int64_t bottom; // Owner pushes/pops here
} WorkStealingDeque;

void push_bottom(WorkStealingDeque* deque, Task* task) {
    deque->buffer[deque->bottom] = task;
    deque->bottom++;
}

Task* pop_bottom(WorkStealingDeque* deque) {
    if (deque->bottom <= deque->top) return NULL;
    deque->bottom--;
    return deque->buffer[deque->bottom];
}

Task* steal(WorkStealingDeque* deque) {
    int64_t top = deque->top;
    if (top >= deque->bottom) return NULL;
    Task* task = deque->buffer[top];
    deque->top = top + 1;
    return task;
}

This naive version has race conditions everywhere. The real challenge is making it thread-safe without destroying performance.
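The straightforward fix is to wrap every operation in a mutex. This sketch (a baseline, not the approach the rest of the article builds on) is correct but makes every steal serialize against the owner's hot path:

```cpp
#include <deque>
#include <mutex>
#include <optional>

// Thread-safe but lock-based: push, pop, and steal all take the same
// mutex, so thieves contend with the owner on every single operation.
template <typename T>
class LockedDeque {
    std::deque<T> items_;
    std::mutex mu_;

public:
    void push_bottom(T item) {
        std::lock_guard<std::mutex> lock(mu_);
        items_.push_back(std::move(item));
    }

    std::optional<T> pop_bottom() {  // owner: LIFO end
        std::lock_guard<std::mutex> lock(mu_);
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.back());
        items_.pop_back();
        return item;
    }

    std::optional<T> steal() {  // thieves: FIFO end
        std::lock_guard<std::mutex> lock(mu_);
        if (items_.empty()) return std::nullopt;
        T item = std::move(items_.front());
        items_.pop_front();
        return item;
    }
};
```

The semantics match the naive version: the owner takes the newest item, thieves take the oldest. The cost is a lock acquisition on the owner's fast path even when no thief is anywhere near.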

The Chase-Lev Deque

The Chase-Lev algorithm (2005) provides a lock-free work-stealing deque. The key insight: the owner thread never contends with itself, and stealing is rare enough that we can handle contention with atomic compare-and-swap.

#include <atomic>
#include <cstdint>
#include <vector>

template<typename T>
class ChaselevDeque {
    std::atomic<int64_t> top_{0};
    std::atomic<int64_t> bottom_{0};
    std::atomic<std::vector<T>*> buffer_;
    
public:
    ChaselevDeque(size_t capacity) {
        buffer_.store(new std::vector<T>(capacity));
    }

    ~ChaselevDeque() { delete buffer_.load(); }
    
    void push(T item) {
        int64_t b = bottom_.load(std::memory_order_relaxed);
        int64_t t = top_.load(std::memory_order_acquire);
        auto* buf = buffer_.load(std::memory_order_relaxed);
        
        if (b - t >= static_cast<int64_t>(buf->size())) {
            // Resize needed - omitted for brevity
        }
        
        (*buf)[b % buf->size()] = item;
        std::atomic_thread_fence(std::memory_order_release);
        bottom_.store(b + 1, std::memory_order_relaxed);
    }
    
    T pop() {
        int64_t b = bottom_.load(std::memory_order_relaxed) - 1;
        bottom_.store(b, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        int64_t t = top_.load(std::memory_order_relaxed);
        
        if (t <= b) {
            // Non-empty
            auto* buf = buffer_.load(std::memory_order_relaxed);
            T item = (*buf)[b % buf->size()];
            if (t == b) {
                // Last element - race with stealers
                if (!top_.compare_exchange_strong(t, t + 1,
                        std::memory_order_seq_cst,
                        std::memory_order_relaxed)) {
                    // Lost race
                    bottom_.store(b + 1, std::memory_order_relaxed);
                    return T{}; // Empty marker
                }
                bottom_.store(b + 1, std::memory_order_relaxed);
            }
            return item;
        }
        // Empty
        bottom_.store(b + 1, std::memory_order_relaxed);
        return T{};
    }
    
    T steal() {
        int64_t t = top_.load(std::memory_order_acquire);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        int64_t b = bottom_.load(std::memory_order_acquire);
        
        if (t < b) {
            auto* buf = buffer_.load(std::memory_order_relaxed);
            T item = (*buf)[t % buf->size()];
            if (!top_.compare_exchange_strong(t, t + 1,
                    std::memory_order_seq_cst,
                    std::memory_order_relaxed)) {
                // Lost race with another stealer or owner
                return T{};
            }
            return item;
        }
        return T{};
    }
};

The memory ordering is critical. The seq_cst fence between decrementing bottom and reading top in pop() prevents a subtle race where both owner and thief think they can take the last element.
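That guarantee is the classic store-buffering pattern, and it can be demonstrated in isolation (this is a standalone litmus test, not the deque itself): each thread stores a flag, crosses a seq_cst fence, then reads the other flag. The fences make it impossible for both threads to read zero—the same way the fence in pop() ensures that either the owner sees the thief's claim on top or the thief sees the decremented bottom.

```cpp
#include <atomic>
#include <thread>

// Store-buffering litmus test. With seq_cst fences between each
// thread's store and its load, at least one thread must observe the
// other's store; both reading 0 is forbidden by the C++ memory model.
bool never_both_zero(int trials) {
    for (int i = 0; i < trials; i++) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread a([&] {
            x.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_relaxed);
        });
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) return false;  // forbidden outcome
    }
    return true;
}
```

Replace those fences with relaxed ones and the forbidden outcome appears on real hardware—which is exactly the scenario where owner and thief each believe the last element is theirs.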

Stealing Strategies

When a thread needs to steal, which victim does it choose? Two main approaches:

Random stealing picks a victim uniformly at random. It's simple and provably efficient: the expected number of steal attempts before finding work is O(P), where P is the number of processors. But it ignores machine topology.

Neighbor stealing tries nearby threads first, exploiting NUMA locality and shared caches. Better for memory-bound workloads, but can cause “stealing storms” where all threads pile onto one busy neighbor.

#include <algorithm>
#include <chrono>
#include <random>
#include <thread>
#include <vector>

class StealingStrategy {
    std::vector<ChaselevDeque<Task*>*>& deques_;

public:
    explicit StealingStrategy(std::vector<ChaselevDeque<Task*>*>& deques)
        : deques_(deques) {}

    Task* steal_random(size_t my_id) {
        // Per-thread generator: no sharing, no locking
        thread_local std::mt19937 rng{std::random_device{}()};
        std::uniform_int_distribution<size_t> dist(0, deques_.size() - 1);

        size_t attempts = 0;
        while (attempts < deques_.size() * 2) {
            size_t victim = dist(rng);
            if (victim != my_id) {
                Task* task = deques_[victim]->steal();
                if (task) return task;
            }
            attempts++;

            // Exponential backoff after a full round of failures,
            // with the shift capped so the sleep stays bounded
            if (attempts > deques_.size()) {
                size_t shift = std::min<size_t>(attempts - deques_.size(), 10);
                std::this_thread::sleep_for(
                    std::chrono::microseconds(1u << shift));
            }
        }
        return nullptr;
    }
};

The backoff is essential. Without it, idle threads spin aggressively, wasting power and memory bandwidth. Exponential backoff lets the system settle when work is genuinely exhausted.
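Neighbor stealing differs only in how victims are ordered. A minimal sketch of the selection logic (the function name is mine, and "adjacent thread ids are NUMA-close" is an assumption about how threads are pinned):

```cpp
#include <cstddef>
#include <vector>

// Victim order for neighbor stealing: probe threads at increasing ring
// distance from my_id, so presumably cache- and NUMA-close workers are
// tried first. Only the victim-selection logic; steal() is unchanged.
std::vector<size_t> neighbor_order(size_t my_id, size_t num_threads) {
    std::vector<size_t> order;
    for (size_t d = 1; d < num_threads; d++)
        order.push_back((my_id + d) % num_threads);
    return order;
}
```

For thread 2 of 4 this yields the probe order {3, 0, 1}. The same backoff logic applies once the full ring has been tried without success.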

Implementing a Work-Stealing Thread Pool

Here’s a complete, minimal work-stealing pool:

#include <atomic>
#include <deque>
#include <functional>
#include <thread>
#include <vector>

class WorkStealingPool {
    std::vector<std::thread> workers_;
    // ChaselevDeque holds atomics, so it is neither copyable nor movable;
    // std::deque can grow without relocating existing elements
    std::deque<ChaselevDeque<std::function<void()>>> deques_;
    std::atomic<bool> shutdown_{false};
    std::atomic<int> active_tasks_{0};
    
public:
    explicit WorkStealingPool(size_t num_threads) {
        for (size_t i = 0; i < num_threads; i++)
            deques_.emplace_back(1024);
        for (size_t i = 0; i < num_threads; i++) {
            workers_.emplace_back([this, i] { worker_loop(i); });
        }
    }
    
    void submit(std::function<void()> task) {
        active_tasks_.fetch_add(1, std::memory_order_relaxed);
        // Round-robin submission to distribute initial work
        static std::atomic<size_t> next_worker{0};
        size_t target = next_worker.fetch_add(1) % deques_.size();
        deques_[target].push(std::move(task));
    }
    
    void wait() {
        while (active_tasks_.load(std::memory_order_acquire) > 0) {
            std::this_thread::yield();
        }
    }
    
    ~WorkStealingPool() {
        shutdown_.store(true, std::memory_order_release);
        for (auto& w : workers_) w.join();
    }
    
private:
    void worker_loop(size_t id) {
        while (!shutdown_.load(std::memory_order_acquire)) {
            // Try local queue first
            auto task = deques_[id].pop();
            
            // If empty, try stealing
            if (!task) {
                task = try_steal(id);
            }
            
            if (task) {
                task();
                active_tasks_.fetch_sub(1, std::memory_order_release);
            } else {
                std::this_thread::yield();
            }
        }
    }
    
    std::function<void()> try_steal(size_t my_id) {
        for (size_t i = 0; i < deques_.size(); i++) {
            if (i != my_id) {
                auto task = deques_[i].steal();
                if (task) return task;
            }
        }
        return nullptr;
    }
};

This implementation handles the basics: local-first execution, stealing on idle, and graceful shutdown. Production implementations add task priorities, per-thread submission handles, and more sophisticated termination detection.

Performance Characteristics and Trade-offs

Work stealing isn’t free. Each steal attempt involves atomic operations and potential cache line bouncing. For workloads with perfectly uniform tasks, a simple shared queue with good locking often wins.

Work stealing excels when:

  • Task durations vary significantly (10x+ difference)
  • Tasks spawn subtasks (recursive algorithms)
  • You can’t predict execution time upfront

Work stealing hurts when:

  • Tasks are tiny (overhead dominates)
  • Tasks are uniform (no imbalance to fix)
  • Memory bandwidth is the bottleneck (stealing thrashes caches)

// Benchmark: Parallel merge sort - ideal for work stealing
void parallel_merge_sort(WorkStealingPool& pool, int* arr, int n) {
    if (n <= 1024) {
        std::sort(arr, arr + n);  // small subarrays: sort serially
        return;
    }
    
    int mid = n / 2;
    std::atomic<int> done{0};
    
    pool.submit([&] {
        parallel_merge_sort(pool, arr, mid);
        done.fetch_add(1);
    });
    pool.submit([&] {
        parallel_merge_sort(pool, arr + mid, n - mid);
        done.fetch_add(1);
    });
    
    // Caveat: this spin-wait blocks a worker thread. If every worker
    // ends up spinning on children that no free worker can run, the
    // pool deadlocks; production fork/join pools have the waiter pop
    // or steal pending tasks instead of yielding.
    while (done.load() < 2) std::this_thread::yield();
    std::inplace_merge(arr, arr + mid, arr + n);
}

In benchmarks, work stealing typically shows 10-30% improvement over static partitioning for irregular workloads. For recursive divide-and-conquer, the improvement can be 2-3x because work stealing naturally handles the tree-shaped task graph.

Real-World Implementations

Java’s ForkJoinPool pioneered mainstream work stealing. It’s optimized for fork/join parallelism where tasks spawn subtasks. The RecursiveTask and RecursiveAction abstractions make it easy to use.

Rayon (Rust) uses work stealing for its parallel iterators. When you write vec.par_iter().map(...).collect(), Rayon splits the work and uses stealing to balance load. The implementation is highly optimized for Rust’s ownership model.

Tokio (Rust async runtime) uses work stealing for its multi-threaded scheduler. Each worker thread has a local queue of async tasks, and idle workers steal from busy ones. This is crucial for handling I/O-bound workloads where some connections are busier than others.

Intel TBB provides tbb::task_arena with work stealing. It’s particularly optimized for nested parallelism and integrates with Intel’s threading tools.

All of these use variations of Chase-Lev with additional optimizations: better cache alignment, NUMA awareness, and adaptive stealing strategies.

The lesson from these implementations: work stealing is a proven technique for dynamic load balancing. If you’re building a thread pool for irregular workloads, it should be your default choice. Just measure first—for uniform workloads, simpler is better.
