Green Threads: User-Space Thread Scheduling
Key Insights
- Green threads enable spawning millions of concurrent tasks with minimal memory overhead by moving thread scheduling from the kernel into user space, trading OS-level preemption for application-controlled context switching.
- The critical challenge in green thread implementations isn’t scheduling—it’s handling blocking I/O without stalling the entire runtime, which requires integration with OS event notification systems like epoll or io_uring.
- Green threads excel for I/O-bound workloads with massive concurrency requirements, but they introduce complexity around FFI calls, CPU-bound tasks, and debugging that you must plan for upfront.
What Are Green Threads?
Green threads are threads scheduled entirely in user space rather than by the operating system kernel. Your application maintains its own scheduler, manages its own thread control blocks, and decides when to switch between execution contexts—all without kernel involvement.
The term “green” comes from the Green Team at Sun Microsystems, who implemented this model in early Java versions before native threading became reliable across platforms. The concept predates that name, though. Erlang’s lightweight processes (1986) and various Lisp implementations used similar approaches decades ago.
Today, green threads power some of the most scalable systems in production. Go’s goroutines, Erlang’s processes, and Rust’s async tasks all implement variations of this pattern. When you hear about a web server handling millions of concurrent connections, green threads are usually involved.
Kernel Threads vs. Green Threads: The Trade-offs
Kernel threads carry significant overhead. Each thread requires a kernel data structure, a dedicated stack (typically 1-8 MB), and context switches that involve transitioning to kernel mode, saving full CPU state, and potentially flushing TLB entries.
Green threads flip these costs. A green thread might have a 2 KB stack. Context switching happens entirely in user space—no syscalls, no mode transitions. The scheduler is just application code.
Here’s a concrete comparison:
use std::thread;
use std::time::Instant;

fn benchmark_os_threads(count: usize) {
    let start = Instant::now();
    let handles: Vec<_> = (0..count)
        .map(|_| thread::spawn(|| {
            // Minimal work
            std::hint::black_box(1 + 1);
        }))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    println!("OS threads ({}): {:?}", count, start.elapsed());
}
async fn async_task() {
    tokio::task::yield_now().await;
}

// Note: #[tokio::main] only attaches to a zero-argument main, so the
// benchmark itself is a plain async fn driven from main.
async fn benchmark_green_threads(count: usize) {
    let start = Instant::now();
    let handles: Vec<_> = (0..count)
        .map(|_| tokio::spawn(async_task()))
        .collect();
    for handle in handles {
        handle.await.unwrap();
    }
    println!("Green threads ({}): {:?}", count, start.elapsed());
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    benchmark_os_threads(10_000);
    benchmark_green_threads(10_000).await;
}
// Typical results on modern hardware:
// OS threads (10000): 1.2s, ~80GB virtual memory
// Green threads (10000): 15ms, ~50MB memory
// OS threads (100000): fails or takes 30s+
// Green threads (100000): 150ms, ~500MB memory
The difference becomes stark at scale. You simply cannot spawn 100,000 OS threads on most systems—you’ll hit memory limits or kernel resource caps. Green threads make this trivial.
Anatomy of a User-Space Scheduler
A minimal green thread runtime needs three components: a structure representing each thread’s state, a collection of runnable threads, and logic to switch between them.
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STACK_SIZE 8192
#define MAX_THREADS 256

typedef enum { READY, RUNNING, BLOCKED, DEAD } ThreadState;

typedef struct {
    uint64_t rsp;                       // Stack pointer
    uint64_t rbp;                       // Base pointer
    uint64_t rbx, r12, r13, r14, r15;   // Callee-saved registers
    void *stack_base;
    ThreadState state;
    int id;
} GreenThread;

typedef struct {
    GreenThread threads[MAX_THREADS];
    int current;
    int count;
} Scheduler;

static Scheduler scheduler = {0};

void scheduler_init(void) {
    scheduler.current = -1;
    scheduler.count = 0;
}

int thread_create(void (*entry)(void)) {
    if (scheduler.count >= MAX_THREADS) return -1;
    GreenThread *t = &scheduler.threads[scheduler.count];
    t->id = scheduler.count;
    t->state = READY;
    t->stack_base = aligned_alloc(16, STACK_SIZE);
    if (!t->stack_base) return -1;
    // Set up the initial stack frame. The entry address sits where a
    // return address would, so the context switch's final ret jumps into
    // the thread. Placing it one slot below the 16-byte-aligned top
    // leaves rsp = 8 (mod 16) on entry, as the SysV ABI expects after a
    // call instruction.
    uint64_t *stack_top = (uint64_t *)((char *)t->stack_base + STACK_SIZE);
    stack_top[-1] = 0;                  // Padding slot preserves alignment
    stack_top[-2] = (uint64_t)entry;    // "Return address"
    t->rsp = (uint64_t)&stack_top[-2];
    return scheduler.count++;
}

// Round-robin scheduling
int schedule_next(void) {
    if (scheduler.count == 0) return -1;
    int start = (scheduler.current + 1) % scheduler.count;
    for (int i = 0; i < scheduler.count; i++) {
        int idx = (start + i) % scheduler.count;
        if (scheduler.threads[idx].state == READY) {
            return idx;
        }
    }
    return -1;  // No runnable threads
}
This scheduler uses cooperative multitasking—threads must explicitly yield control. Preemptive user-space scheduling is possible using signals (SIGALRM or SIGVTALRM), but it adds complexity and potential for subtle bugs.
Stack Management and Context Switching
The context switch is where green threads get interesting. You need to save the current thread’s registers and stack pointer, then restore the next thread’s state.
# x86-64 context switch (AT&T syntax; '#' starts a comment in GNU as)
# void context_switch(GreenThread *old, GreenThread *new)
# old in %rdi, new in %rsi
.text
.global context_switch
context_switch:
    # Save callee-saved registers to old thread
    mov %rsp, 0(%rdi)    # Save stack pointer
    mov %rbp, 8(%rdi)    # Save base pointer
    mov %rbx, 16(%rdi)
    mov %r12, 24(%rdi)
    mov %r13, 32(%rdi)
    mov %r14, 40(%rdi)
    mov %r15, 48(%rdi)
    # Restore callee-saved registers from new thread
    mov 0(%rsi), %rsp    # Restore stack pointer
    mov 8(%rsi), %rbp
    mov 16(%rsi), %rbx
    mov 24(%rsi), %r12
    mov 32(%rsi), %r13
    mov 40(%rsi), %r14
    mov 48(%rsi), %r15
    ret                  # "Return" to the new thread's saved location
Stack allocation is a design decision with real trade-offs. Fixed-size stacks are simple but wasteful. Segmented stacks (used by early Go) allow growth but have performance cliffs at segment boundaries. Modern Go uses copying stacks—when a goroutine needs more space, the runtime allocates a larger stack and copies everything over.
Handling Blocking Operations
Here’s the hard problem: what happens when a green thread calls read() on a socket with no data available? The entire OS thread blocks, freezing all green threads scheduled on it.
The solution is to never let green threads make blocking syscalls directly. Instead, wrap I/O operations to check readiness first, yielding if the operation would block:
use std::io::{self, Read};
use std::os::unix::io::AsRawFd;

// `SchedulerHandle` stands in for the runtime's scheduler API; a real
// implementation would provide it, along with the non-blocking
// `try_read` used below.
pub struct NonBlockingReader<R: Read + AsRawFd> {
    inner: R,
    scheduler: SchedulerHandle,
}

impl<R: Read + AsRawFd> NonBlockingReader<R> {
    pub fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        loop {
            // Try a non-blocking read (the fd is in O_NONBLOCK mode)
            match self.try_read(buf) {
                Ok(n) => return Ok(n),
                Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {
                    // Register interest with the event loop
                    self.scheduler.register_read(self.inner.as_raw_fd());
                    // Yield to the scheduler - we'll be woken when data arrives
                    self.scheduler.yield_current();
                }
                Err(e) => return Err(e),
            }
        }
    }
}
This pattern requires integration with the OS event notification system. On Linux, you register file descriptors with epoll; when they become ready, the scheduler marks the corresponding green threads as runnable.
Most production runtimes use M:N threading: M green threads multiplexed across N OS threads. This lets you utilize multiple CPU cores while keeping green thread overhead low. When one OS thread blocks on a syscall that can’t be made non-blocking (like file I/O on Linux), other OS threads continue running green threads.
Real-World Implementations
Go’s goroutines use a work-stealing scheduler. Each OS thread (M in Go’s terminology) runs goroutines on behalf of a P, a scheduling context that owns a local run queue. When a P’s queue empties, it steals work from other Ps. Goroutines start with 2 KB stacks that grow as needed. The runtime automatically moves goroutines between OS threads when they block.
Rust’s tokio takes a different approach. It’s built on async/await, which transforms async functions into state machines at compile time. There’s no stack per task—state lives in heap-allocated futures. This is more memory-efficient but requires explicit async annotations.
Erlang’s processes are the most isolated. Each process has its own heap, and communication happens only through message passing. This enables per-process garbage collection and makes the system incredibly fault-tolerant—one process crashing can’t corrupt another’s memory.
Python’s greenlet provides the primitive for libraries like gevent. It’s cooperative-only and requires monkey-patching the standard library to make I/O non-blocking. Since the GIL lets only one OS thread execute Python bytecode at a time, M:N threading offers little benefit here.
When to Use Green Threads
Green threads shine when you have thousands of concurrent connections doing mostly I/O. Web servers, proxies, chat systems, game servers—anything where connections spend most of their time waiting.
Avoid green threads when:
- Your workload is CPU-bound. Green threads don’t parallelize computation. You need OS threads or processes for that.
- You call into C libraries that block. FFI calls can stall your entire runtime. Either use a thread pool for blocking FFI or find async-compatible libraries.
- Debugging matters more than performance. Stack traces in green thread systems are notoriously confusing. Profiling is harder. Race conditions become more subtle.
The decision framework is straightforward: if you need more than ~1000 concurrent tasks and they’re I/O-bound, green threads are probably right. If you’re under that threshold, OS threads are simpler and the overhead doesn’t matter. If you’re CPU-bound, use OS threads or processes regardless of concurrency level.
Green threads aren’t magic—they’re a trade-off. You’re accepting runtime complexity, debugging difficulty, and ecosystem constraints in exchange for massive concurrency at low cost. Make that trade consciously.