Threads: OS Threads and User-Space Threads
Key Insights
- OS threads provide true parallelism and preemptive scheduling but cost 1-8MB of stack memory each, making millions of concurrent connections impractical without user-space threading abstractions.
- User-space threads (green threads) enable massive concurrency through cooperative scheduling and small stacks, but blocking system calls can stall entire worker threads unless the runtime handles them specially.
- Modern high-performance systems use M:N hybrid models where a runtime multiplexes thousands of lightweight tasks across a small pool of OS threads, combining the best of both approaches.
Introduction to Threading Models
Every backend engineer eventually confronts the same question: how do I handle 100,000 concurrent connections without spinning up 100,000 OS threads? The answer lies in understanding the fundamental distinction between kernel-managed and user-managed threads.
Concurrency and parallelism are related but distinct concepts. Concurrency means dealing with multiple tasks that can make progress independently—your web server handling requests from different clients. Parallelism means executing multiple tasks simultaneously on different CPU cores. OS threads give you both. User-space threads give you concurrency, with parallelism achieved by spreading work across multiple OS threads underneath.
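To make the distinction concrete, here is a small Go sketch (burn and run are illustrative helpers, not a standard API): the same four concurrent CPU-bound tasks are run first on one OS thread, then across all cores. Both runs are concurrent; only the second is parallel.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// burn does pure CPU work, so tasks can only overlap with real parallelism.
func burn() {
	n := 0
	for i := 0; i < 100_000_000; i++ {
		n += i
	}
	_ = n
}

// run executes `tasks` CPU-bound goroutines and times them to completion.
func run(tasks int) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			burn()
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	// Concurrency without parallelism: all goroutines share one OS thread.
	runtime.GOMAXPROCS(1)
	fmt.Println("GOMAXPROCS=1:", run(4))

	// Concurrency with parallelism: goroutines spread across all cores.
	runtime.GOMAXPROCS(runtime.NumCPU())
	fmt.Printf("GOMAXPROCS=%d: %v\n", runtime.NumCPU(), run(4))
}
```

On a multi-core machine the second run finishes several times faster, even though the program's structure is identical.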
The threading model you choose determines your application’s memory footprint, latency characteristics, and operational complexity. Get it wrong, and you’ll either exhaust system resources or leave performance on the table.
OS Threads (Kernel Threads) Deep Dive
Operating system threads are the fundamental unit of execution that the kernel scheduler manages. When you create a pthread on Linux or a thread on Windows, you’re asking the kernel to allocate a new execution context with its own stack, register state, and scheduling metadata.
The kernel uses preemptive scheduling—it can interrupt any thread at any time to give another thread CPU time. This happens via timer interrupts, typically every 1-10 milliseconds. The scheduler maintains run queues, handles priority, and ensures fairness across all threads in the system.
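You can watch this preemption happen from user space. The sketch below (Linux-only, since it reads /proc/self/status; readCtxtSwitches is an illustrative helper) busy-loops for a while, then prints how often the kernel switched the process out voluntarily versus forcibly:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readCtxtSwitches parses the voluntary/nonvoluntary context-switch
// counters that the Linux kernel exposes in /proc/self/status.
func readCtxtSwitches() (voluntary, involuntary string) {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return "?", "?" // not on Linux, or /proc unavailable
	}
	for _, line := range strings.Split(string(data), "\n") {
		if v, ok := strings.CutPrefix(line, "voluntary_ctxt_switches:"); ok {
			voluntary = strings.TrimSpace(v)
		}
		if v, ok := strings.CutPrefix(line, "nonvoluntary_ctxt_switches:"); ok {
			involuntary = strings.TrimSpace(v)
		}
	}
	return
}

func main() {
	// Burn CPU long enough for the scheduler's timer tick to interrupt us.
	n := 0
	for i := 0; i < 1_000_000_000; i++ {
		n += i
	}
	_ = n
	v, nv := readCtxtSwitches()
	fmt.Printf("voluntary: %s, nonvoluntary (preemptions): %s\n", v, nv)
}
```

The nonvoluntary count is exactly the preemption described above: the thread never yielded, yet the kernel took the CPU away anyway.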
The cost of this abstraction is significant. Each thread requires:
- Stack memory: Linux defaults to 8MB virtual address space per thread (though physical pages are allocated on demand)
- Kernel data structures: a task_struct with scheduling metadata and per-thread signal state (file descriptor tables are shared across a process’s threads)
- Context switch overhead: saving and restoring registers, cache and branch-predictor pollution, plus TLB flushes when switching between processes
Here’s how to measure these costs directly:
```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>
#include <unistd.h>

#define NUM_THREADS 1000

void* thread_work(void* arg) {
    // Simulate some work, then exit
    usleep(100000); // 100ms
    return NULL;
}

long get_memory_kb() {
    struct rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_maxrss; // peak resident set size, in KB on Linux
}

int main() {
    pthread_t threads[NUM_THREADS];
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    // Reduce stack size to 64KB to allow more threads
    pthread_attr_setstacksize(&attr, 64 * 1024);

    long mem_before = get_memory_kb();
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&threads[i], &attr, thread_work, NULL) != 0) {
            fprintf(stderr, "Failed to create thread %d\n", i);
            exit(1);
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    long mem_after = get_memory_kb();

    double elapsed_ms = (end.tv_sec - start.tv_sec) * 1000.0 +
                        (end.tv_nsec - start.tv_nsec) / 1000000.0;
    printf("Created %d threads in %.2f ms\n", NUM_THREADS, elapsed_ms);
    printf("Memory increase: %ld KB (%.2f KB per thread)\n",
           mem_after - mem_before,
           (double)(mem_after - mem_before) / NUM_THREADS);

    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    pthread_attr_destroy(&attr);
    return 0;
}
```
On a typical Linux system, creating 1,000 threads takes 50-200ms and consumes roughly 70-100KB per thread even with reduced stack sizes. Try creating 100,000 threads and watch your system grind to a halt.
User-Space Threads (Green Threads)
User-space threads flip the model: instead of the kernel managing execution contexts, your application’s runtime does. The runtime maintains its own scheduler, allocates small stacks (often 2-8KB initially), and switches between tasks without involving the kernel.
The key difference is cooperative scheduling. User-space threads yield control at well-defined points—I/O boundaries, channel operations, or explicit yields. Unlike the kernel, most runtimes never preempt a running task mid-computation (Go, which added asynchronous preemption in 1.14, is a notable exception).
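A minimal sketch of this cooperative turn-taking, using Go's explicit runtime.Gosched yield on a single OS thread (pingPong is an illustrative helper): two tasks alternate only because each one hands the thread back after every step.

```go
package main

import (
	"fmt"
	"runtime"
)

// pingPong runs two tasks on one OS thread; each yields explicitly after
// every step, so they take turns without any preemption.
func pingPong(steps int) []string {
	runtime.GOMAXPROCS(1)
	events := make(chan string, steps*2) // buffered: sends never block
	done := make(chan struct{})
	go func() {
		for i := 0; i < steps; i++ {
			events <- "worker"
			runtime.Gosched() // explicit yield: hand the thread back
		}
		close(done)
	}()
	for i := 0; i < steps; i++ {
		events <- "main"
		runtime.Gosched()
	}
	<-done
	close(events)
	var order []string
	for e := range events {
		order = append(order, e)
	}
	return order
}

func main() {
	fmt.Println(pingPong(3))
}
```

Remove the Gosched calls and one task typically runs several steps in a row before the other gets the thread—the scheduler only intervenes at the points the tasks give it.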
Go’s goroutines are the canonical example of user-space threads done well:
```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	numGoroutines := 100000
	var memBefore runtime.MemStats
	runtime.ReadMemStats(&memBefore)
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(numGoroutines)
	for i := 0; i < numGoroutines; i++ {
		go func() {
			defer wg.Done()
			// Simulate work
			time.Sleep(100 * time.Millisecond)
		}()
	}
	elapsed := time.Since(start)
	var memAfter runtime.MemStats
	runtime.ReadMemStats(&memAfter)
	fmt.Printf("Created %d goroutines in %v\n", numGoroutines, elapsed)
	// Goroutine stacks live outside the regular heap, so measure
	// StackInuse rather than Alloc.
	fmt.Printf("Stack memory increase: %.2f MB (%.2f KB per goroutine)\n",
		float64(memAfter.StackInuse-memBefore.StackInuse)/(1024*1024),
		float64(memAfter.StackInuse-memBefore.StackInuse)/float64(numGoroutines)/1024)
	fmt.Printf("GOMAXPROCS (OS threads running Go code): %d\n", runtime.GOMAXPROCS(0))
	wg.Wait()
}
```
This creates 100,000 goroutines in under 100ms, using roughly 2-4KB per goroutine. The Go runtime multiplexes all of these across just a handful of OS threads (typically matching your CPU core count).
There are two main approaches to user-space thread implementation:
Stackful coroutines (Go, Erlang): Each task has its own stack that can grow dynamically. Function calls work normally. The runtime copies or relocates stacks as needed.
Stackless coroutines (Rust async, JavaScript): The compiler transforms each task into a state machine. There is no separate stack—the task’s state lives in a heap-allocated future (or promise) object. This is more memory-efficient but requires language-level support.
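Go itself uses stackful goroutines, but a hand-rolled sketch in Go can show what an async compiler generates for a stackless coroutine: the task's "stack" collapses into a struct, and each resumption point becomes a case in a state machine (countdown and poll are illustrative names, not any real API).

```go
package main

import "fmt"

// countdown is a hand-written state machine standing in for what an
// async/await compiler generates: its entire "stack" is these two fields.
type countdown struct {
	state int
	n     int
}

// poll advances the task one step and reports whether it finished.
// Each case is the code between two suspension points.
func (c *countdown) poll() (done bool) {
	switch c.state {
	case 0:
		fmt.Println("starting at", c.n)
		c.state = 1
		return false // suspended: an executor will resume us later
	case 1:
		c.n--
		fmt.Println("tick:", c.n)
		if c.n == 0 {
			c.state = 2
			return true
		}
		return false
	default:
		return true
	}
}

func main() {
	task := &countdown{n: 3}
	// An executor is, at its core, a loop that polls tasks to completion.
	for !task.poll() {
	}
	fmt.Println("done")
}
```

This is why stackless tasks are so cheap: the whole suspended computation is one small heap object, with no stack to allocate, grow, or relocate.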
The Hybrid Approach: M:N Scheduling
Modern runtimes use M:N scheduling: M user-space threads multiplexed across N OS threads. This captures the memory efficiency of green threads while still achieving true parallelism.
The runtime maintains a pool of worker threads (usually one per CPU core) and a global queue of runnable tasks. When a task blocks on I/O, the runtime parks it and picks up another task on the same worker thread. Work-stealing algorithms let idle workers grab tasks from busy workers’ queues.
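A toy version of that scheduler makes the mechanics visible (an illustrative sketch, not Tokio's or Go's actual implementation—production schedulers use lock-free deques, not mutexes): every task starts on worker 0's queue, and idle workers steal from their peers' heads.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// worker owns a private run queue; idle workers steal from peers.
type worker struct {
	mu    sync.Mutex
	tasks []func()
}

func (w *worker) push(t func()) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.tasks = append(w.tasks, t)
}

// pop takes from the tail: LIFO keeps hot data in cache for the owner.
func (w *worker) pop() func() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.tasks) == 0 {
		return nil
	}
	t := w.tasks[len(w.tasks)-1]
	w.tasks = w.tasks[:len(w.tasks)-1]
	return t
}

// steal takes from the head: thieves get the oldest (coldest) work.
func (w *worker) steal() func() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if len(w.tasks) == 0 {
		return nil
	}
	t := w.tasks[0]
	w.tasks = w.tasks[1:]
	return t
}

// runSchedule loads numTasks onto worker 0, lets numWorkers drain them
// by popping locally and stealing when idle, and returns tasks executed.
func runSchedule(numWorkers, numTasks int) int64 {
	workers := make([]*worker, numWorkers)
	for i := range workers {
		workers[i] = &worker{}
	}
	var completed atomic.Int64
	for i := 0; i < numTasks; i++ {
		workers[0].push(func() { completed.Add(1) })
	}
	var wg sync.WaitGroup
	for id := 0; id < numWorkers; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for completed.Load() < int64(numTasks) {
				t := workers[id].pop()
				for v := 0; t == nil && v < numWorkers; v++ {
					if v != id {
						t = workers[v].steal() // local queue empty
					}
				}
				if t != nil {
					t()
				}
			}
		}(id)
	}
	wg.Wait()
	return completed.Load()
}

func main() {
	fmt.Printf("completed %d tasks\n", runSchedule(4, 1000))
}
```

The LIFO-pop/FIFO-steal split is the design choice that matters: owners reuse warm cache lines, while thieves take old work that was going cold anyway.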
Rust’s Tokio runtime makes this configuration explicit:
```rust
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Barrier;

#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    let num_tasks = 100_000;
    let barrier = Arc::new(Barrier::new(num_tasks + 1));
    let start = Instant::now();
    let mut handles = Vec::with_capacity(num_tasks);
    for _ in 0..num_tasks {
        let barrier = Arc::clone(&barrier);
        handles.push(tokio::spawn(async move {
            // Simulate async I/O
            tokio::time::sleep(Duration::from_millis(100)).await;
            barrier.wait().await;
        }));
    }
    let spawn_time = start.elapsed();
    // Wait for every task to finish its sleep and reach the barrier
    barrier.wait().await;
    let total_time = start.elapsed();
    println!("Spawned {} tasks in {:?}", num_tasks, spawn_time);
    println!("All tasks reached the barrier in {:?} on 4 worker threads", total_time);
    for handle in handles {
        handle.await.unwrap();
    }
}
```
The worker_threads = 4 argument to the #[tokio::main] attribute explicitly sets the OS thread pool size. All 100,000 async tasks share these four threads. When a task awaits I/O, Tokio’s reactor (built on epoll/kqueue/io_uring) handles the readiness notification and reschedules the task when data arrives.
Trade-offs and Performance Characteristics
The choice between threading models involves real trade-offs:
Blocking I/O is poison for green threads. If a goroutine makes a blocking syscall (like reading a regular file, where O_NONBLOCK has no effect), the OS thread running it stalls. Go mitigates this by detecting the stall and spinning up additional OS threads, but heavy file I/O can still make thread counts balloon. Rust async runtimes cannot detect blocking calls at all—a blocking call in an async context silently stalls a worker thread—so Tokio provides spawn_blocking to offload such work to a dedicated thread pool.
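You can watch Go's mitigation kick in directly (a Linux-only sketch, since it calls syscall.Nanosleep; threadCount and measure are illustrative helpers): park a batch of goroutines in a raw blocking syscall that the netpoller cannot see, and the runtime backs them with extra OS threads.

```go
package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"sync/atomic"
	"syscall"
	"time"
)

// threadCount reports how many OS threads the Go runtime has created so far.
func threadCount() int {
	return pprof.Lookup("threadcreate").Count()
}

// measure launches n goroutines into a blocking syscall (Linux nanosleep)
// that the netpoller cannot intercept, then samples the thread count while
// they are all blocked.
func measure(n int) (before, during int) {
	before = threadCount()
	var started atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			started.Add(1)
			ts := syscall.Timespec{Sec: 0, Nsec: 200_000_000} // 200ms
			syscall.Nanosleep(&ts, nil) // truly blocks this OS thread
		}()
	}
	time.Sleep(50 * time.Millisecond) // let goroutines enter the syscall
	during = threadCount()
	wg.Wait()
	return before, during
}

func main() {
	before, during := measure(64)
	fmt.Printf("OS threads before: %d, while 64 goroutines block: %d\n",
		before, during)
}
```

On a typical Linux box the second number jumps by tens of threads: the runtime's monitor notices each stalled thread and hands its scheduler slot to a fresh one—exactly the thread explosion described above.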
CPU-bound work needs OS threads. User-space schedulers are cooperative: a CPU-bound task that never yields can starve other tasks on the same worker. Go historically inserted preemption points only at function calls; since Go 1.14 the runtime can also interrupt tight loops via asynchronous preemption, but most async runtimes (Tokio included) have no such safety net.
Debugging is harder. Stack traces for green threads are runtime-dependent. Traditional tools like gdb don’t understand goroutine stacks without special support. Profilers need runtime integration.
Here’s a benchmark showing the I/O throughput difference:
```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

func benchmarkConcurrentRequests(numRequests int, concurrency int) time.Duration {
	sem := make(chan struct{}, concurrency) // bounds in-flight requests
	var wg sync.WaitGroup
	client := &http.Client{
		Timeout: 10 * time.Second,
	}
	start := time.Now()
	for i := 0; i < numRequests; i++ {
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			resp, err := client.Get("https://httpbin.org/delay/0")
			if err == nil {
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	// Warm up
	benchmarkConcurrentRequests(10, 10)
	for _, concurrency := range []int{10, 100, 1000} {
		elapsed := benchmarkConcurrentRequests(1000, concurrency)
		fmt.Printf("1000 requests @ %d concurrency: %v (%.1f req/s)\n",
			concurrency, elapsed, 1000/elapsed.Seconds())
	}
}
```
With green threads, you can push concurrency to thousands without memory concerns. Try the same with raw pthreads and you’ll hit limits quickly.
Practical Guidance: Choosing the Right Model
Use this decision framework:
Choose OS threads when:
- Your workload is primarily CPU-bound
- You need predictable, preemptive scheduling
- You’re calling into libraries that make blocking syscalls
- Debugging simplicity matters more than maximum concurrency
Choose user-space threads (via Go, Erlang, or async runtimes) when:
- You’re building network services handling many concurrent connections
- I/O wait time dominates CPU time
- Memory efficiency matters (embedded systems, high-density deployments)
- Your language ecosystem has mature async support
Emerging models worth watching:
Java 21’s Virtual Threads bring green threads to the JVM. The JDK’s blocking operations were reworked so that when a virtual thread blocks, the runtime unmounts it from its carrier thread and schedules another virtual thread in its place (though in early releases, blocking inside a synchronized block could still pin the carrier).
Linux’s io_uring provides a way to make traditionally blocking operations (file I/O, network I/O) truly asynchronous at the kernel level, reducing the need for thread pools to handle blocking work.
The trend is clear: the industry is converging on M:N models with sophisticated runtimes. But understanding the underlying mechanics—why context switches cost microseconds, why stacks consume memory, why cooperative scheduling requires explicit yields—makes you a better engineer regardless of which abstraction you use.