Python multiprocessing: True Parallelism

Key Insights

  • Python’s Global Interpreter Lock (GIL) prevents true parallelism in threads, but the multiprocessing module bypasses this by spawning separate Python interpreters, enabling genuine CPU-bound parallel execution.
  • Process pools with Pool.map() provide the cleanest abstraction for embarrassingly parallel workloads, but you must understand pickling constraints and memory overhead to avoid common pitfalls.
  • Choose multiprocessing for CPU-bound tasks, threading for I/O-bound tasks with shared state, and asyncio for high-concurrency I/O—there’s no universal “best” approach.

The GIL Problem

Python’s Global Interpreter Lock is the elephant in the room for anyone trying to speed up CPU-intensive code. The GIL is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode simultaneously. This means your multi-threaded Python code runs on a single CPU core, no matter how many threads you spawn.

Let’s prove this with a benchmark:

import threading
import multiprocessing
import time

def cpu_intensive_task(n):
    """Simulate CPU-bound work with prime calculation."""
    count = 0
    for i in range(2, n):
        if all(i % j != 0 for j in range(2, int(i ** 0.5) + 1)):
            count += 1
    return count

def benchmark_sequential(iterations, n):
    start = time.perf_counter()
    for _ in range(iterations):
        cpu_intensive_task(n)
    return time.perf_counter() - start

def benchmark_threading(iterations, n):
    start = time.perf_counter()
    threads = [threading.Thread(target=cpu_intensive_task, args=(n,)) 
               for _ in range(iterations)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

def benchmark_multiprocessing(iterations, n):
    start = time.perf_counter()
    processes = [multiprocessing.Process(target=cpu_intensive_task, args=(n,)) 
                 for _ in range(iterations)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    iterations, n = 4, 50000
    
    print(f"Sequential:      {benchmark_sequential(iterations, n):.2f}s")
    print(f"Threading:       {benchmark_threading(iterations, n):.2f}s")
    print(f"Multiprocessing: {benchmark_multiprocessing(iterations, n):.2f}s")

On a 4-core machine, typical results look like:

Sequential:      8.42s
Threading:       8.67s
Multiprocessing: 2.31s

Threading is actually slower than sequential due to context-switching overhead. Multiprocessing achieves near-linear speedup because each process has its own GIL and runs on a separate core.
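The same benchmark can also be written against the higher-level concurrent.futures API, which wraps process pools in a Future-based interface. A minimal sketch (the prime-counting task is repeated here so the snippet stands alone):

```python
from concurrent.futures import ProcessPoolExecutor
import time

def cpu_intensive_task(n):
    """Count primes below n (same workload as the benchmark above)."""
    count = 0
    for i in range(2, n):
        if all(i % j != 0 for j in range(2, int(i ** 0.5) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=4) as executor:
        # executor.map pickles each argument, runs tasks in worker
        # processes, and yields results in submission order
        results = list(executor.map(cpu_intensive_task, [50_000] * 4))
    print(f"ProcessPoolExecutor: {time.perf_counter() - start:.2f}s")
```

ProcessPoolExecutor is built on top of multiprocessing, so the performance characteristics are essentially the same; the main gains are a uniform interface with ThreadPoolExecutor and first-class Future objects.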

Multiprocessing Fundamentals

The multiprocessing module creates entirely separate Python interpreter processes. Each process has its own memory space, GIL, and Python runtime. This isolation is both the source of its power and its complexity.

import multiprocessing
import os

def worker(name):
    print(f"Worker {name}: PID={os.getpid()}, Parent PID={os.getppid()}")
    return f"Result from {name}"

if __name__ == "__main__":
    print(f"Main process: PID={os.getpid()}")
    
    # Create processes
    p1 = multiprocessing.Process(target=worker, args=("A",))
    p2 = multiprocessing.Process(target=worker, args=("B",))
    
    # Start processes (non-blocking)
    p1.start()
    p2.start()
    
    # Wait for completion
    p1.join()
    p2.join()
    
    print(f"Process A exit code: {p1.exitcode}")
    print(f"Process B exit code: {p2.exitcode}")

The lifecycle is straightforward: create a Process object with a target function, call start() to spawn the subprocess, and join() to block until completion. The exitcode attribute tells you whether the process succeeded (0), failed (a positive value), or was killed by a signal (a negative value, -N for signal N).
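join() also accepts a timeout, which is the standard way to reap a worker that hangs. A sketch (the sleep and the 0.5s timeout are illustrative values):

```python
import multiprocessing
import time

def hung_task():
    time.sleep(10)  # stands in for a worker that never finishes in time

if __name__ == "__main__":
    p = multiprocessing.Process(target=hung_task)
    p.start()
    p.join(timeout=0.5)   # gives up waiting after 0.5s
    if p.is_alive():
        p.terminate()     # SIGTERM on Unix, TerminateProcess on Windows
        p.join()          # always join after terminate to reap the process
    print(f"Exit code: {p.exitcode}")  # negative: killed by that signal number
```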

Process Pools for Parallel Workloads

Manually managing processes is tedious. For most parallel workloads, Pool provides a cleaner abstraction that manages a fixed number of worker processes.

import multiprocessing
from pathlib import Path
import hashlib

def compute_file_hash(filepath):
    """Compute SHA-256 hash of a file."""
    hasher = hashlib.sha256()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            hasher.update(chunk)
    return filepath, hasher.hexdigest()

def process_files_parallel(file_paths, num_workers=None):
    """Process files in parallel using a process pool."""
    with multiprocessing.Pool(processes=num_workers) as pool:
        results = pool.map(compute_file_hash, file_paths)
    return dict(results)

if __name__ == "__main__":
    files = list(Path("/usr/lib").glob("*.so*"))[:20]
    
    hashes = process_files_parallel(files, num_workers=4)
    for path, hash_val in list(hashes.items())[:3]:
        print(f"{path.name}: {hash_val[:16]}...")

The context manager ensures proper cleanup. Pool.map() distributes work across workers and collects results in order. For more control, use apply_async():

def process_with_callbacks(file_paths):
    results = {}
    errors = []
    
    def on_success(result):
        filepath, hash_val = result
        results[filepath] = hash_val
    
    def on_error(exc):
        errors.append(str(exc))
    
    with multiprocessing.Pool(4) as pool:
        async_results = [
            pool.apply_async(
                compute_file_hash, 
                (fp,), 
                callback=on_success,
                error_callback=on_error
            )
            for fp in file_paths
        ]
        
        # Wait for all tasks
        for ar in async_results:
            ar.wait()
    
    return results, errors
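When you would rather consume results as they complete instead of in submission order, imap_unordered() streams them back lazily. A minimal sketch with a trivial squaring task:

```python
import multiprocessing

def square(n):
    return n * n

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        # Yields results in completion order, one at a time, so the
        # parent can start consuming before the whole batch finishes
        for result in pool.imap_unordered(square, range(8)):
            print(result)
```

Its sibling imap() preserves submission order while still yielding lazily, a useful middle ground when the input is too large to materialize with map().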

Inter-Process Communication

Since processes don’t share memory by default, you need explicit mechanisms to exchange data. Queue is the workhorse for most scenarios.

import multiprocessing
import time
import random

def producer(queue, num_items):
    """Generate work items and put them on the queue."""
    for i in range(num_items):
        item = {"id": i, "data": random.randint(1, 100)}
        queue.put(item)
        print(f"Produced: {item}")
        time.sleep(0.1)
    
    # Signal completion
    queue.put(None)

def consumer(queue, results_queue, name):
    """Process items from the queue."""
    while True:
        item = queue.get()
        if item is None:
            queue.put(None)  # Propagate sentinel for other consumers
            break
        
        # Simulate processing
        result = item["data"] ** 2
        results_queue.put({"id": item["id"], "result": result})
        print(f"Consumer {name} processed item {item['id']}: {result}")

if __name__ == "__main__":
    work_queue = multiprocessing.Queue()
    results_queue = multiprocessing.Queue()
    
    producer_proc = multiprocessing.Process(
        target=producer, args=(work_queue, 10)
    )
    consumers = [
        multiprocessing.Process(target=consumer, args=(work_queue, results_queue, i))
        for i in range(3)
    ]
    
    producer_proc.start()
    for c in consumers:
        c.start()
    
    producer_proc.join()
    for c in consumers:
        c.join()
    
    # Collect results -- empty() is only approximate in general, but every
    # producer and consumer has been joined, so all puts have completed
    results = []
    while not results_queue.empty():
        results.append(results_queue.get())
    
    print(f"\nCollected {len(results)} results")
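For one-to-one communication, Pipe is lighter-weight than Queue: it returns a pair of connection objects, and each end can both send and receive. A minimal sketch:

```python
from multiprocessing import Process, Pipe

def echo_upper(conn):
    msg = conn.recv()        # blocks until the other end sends
    conn.send(msg.upper())   # reply over the same connection
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()  # duplex by default
    p = Process(target=echo_upper, args=(child_conn,))
    p.start()
    parent_conn.send("hello")
    print(parent_conn.recv())  # HELLO
    p.join()
```

Unlike Queue, a Pipe connection is not safe for multiple simultaneous readers or writers on the same end, so stick with Queue for fan-out/fan-in patterns.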

For simple shared values, Value and Array expose ctypes objects in shared memory. Both come with an associated lock by default, but that lock only serializes individual reads and writes; compound operations like += are still racy:

from multiprocessing import Process, Value, Array

def modify_shared(counter, arr):
    counter.value += 1
    for i in range(len(arr)):
        arr[i] *= 2

if __name__ == "__main__":
    counter = Value('i', 0)  # 'i' = signed integer
    arr = Array('d', [1.0, 2.0, 3.0])  # 'd' = double
    
    processes = [Process(target=modify_shared, args=(counter, arr)) for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    
    print(f"Counter: {counter.value}")  # May not be 4 due to race condition!
    print(f"Array: {list(arr)}")

Synchronization Primitives

The previous example has a race condition. Multiple processes reading and incrementing counter.value simultaneously can lose updates. Use Lock to fix this:

from multiprocessing import Process, Value, Lock

def safe_increment(counter, lock, iterations):
    for _ in range(iterations):
        with lock:
            counter.value += 1

if __name__ == "__main__":
    counter = Value('i', 0)
    lock = Lock()
    iterations = 10000
    
    processes = [
        Process(target=safe_increment, args=(counter, lock, iterations))
        for _ in range(4)
    ]
    
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    
    expected = 4 * iterations
    print(f"Counter: {counter.value} (expected: {expected})")
    assert counter.value == expected, "Race condition detected!"
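When the only shared state is the Value itself, a separate Lock is unnecessary: the lock a Value allocates by default is reachable via get_lock(). A sketch:

```python
from multiprocessing import Process, Value

def locked_increment(counter, iterations):
    for _ in range(iterations):
        with counter.get_lock():  # the lock Value allocates by default
            counter.value += 1

if __name__ == "__main__":
    counter = Value('i', 0)
    processes = [Process(target=locked_increment, args=(counter, 10000))
                 for _ in range(4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(counter.value)  # 40000
```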

For complex shared state, Manager provides synchronized versions of Python objects:

from multiprocessing import Process, Manager

def worker(shared_dict, shared_list, worker_id):
    shared_dict[f"worker_{worker_id}"] = worker_id * 10
    shared_list.append(worker_id)

if __name__ == "__main__":
    with Manager() as manager:
        shared_dict = manager.dict()
        shared_list = manager.list()
        
        processes = [
            Process(target=worker, args=(shared_dict, shared_list, i))
            for i in range(4)
        ]
        
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        
        print(f"Dict: {dict(shared_dict)}")
        print(f"List: {list(shared_list)}")

Common Pitfalls and Best Practices

Pickling failures are the most common multiprocessing headache. Everything passed to a subprocess must be serializable:

import multiprocessing

def process_with_func(func, data):
    return func(data)

def double(x):
    return x * 2

if __name__ == "__main__":
    # This will fail!
    transform = lambda x: x * 2
    
    with multiprocessing.Pool(2) as pool:
        try:
            result = pool.apply(process_with_func, (transform, 10))
        except Exception as e:
            print(f"Failed: {e}")
            # Can't pickle <function <lambda> at 0x...>
    
    # Solution: a named function defined at module level, which pickles
    # by reference and can be looked up again in child processes
    with multiprocessing.Pool(2) as pool:
        result = pool.apply(process_with_func, (double, 10))
        print(f"Success: {result}")

Always use the if __name__ == "__main__" guard. With the spawn start method (the default on Windows and, since Python 3.8, on macOS), child processes re-import the main module. Without the guard, process creation runs again at import time, and multiprocessing aborts with a RuntimeError rather than spawn processes endlessly.
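You can also pin the start method explicitly. get_context() returns a namespace bound to one method without touching the process-global default (a sketch; "spawn" is available everywhere, "fork" only on Unix):

```python
import multiprocessing

def send_greeting(q):
    q.put("child ran")

def main():
    # "spawn" starts a fresh interpreter: slower than fork, but it avoids
    # problems with inherited threads, locks, and file descriptors
    ctx = multiprocessing.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=send_greeting, args=(q,))
    p.start()
    msg = q.get()  # read before join so the queue's pipe can drain
    p.join()
    return msg

if __name__ == "__main__":
    print(main())
```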

Memory overhead is real. Each process duplicates the Python interpreter and any data you pass to it. For large datasets, consider memory-mapped files or passing file paths instead of data.
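Another option on Python 3.8+ is multiprocessing.shared_memory, which exposes a named byte buffer that processes attach to instead of copying. A minimal sketch:

```python
from multiprocessing import Process, shared_memory

def fill_byte(shm_name):
    # Attach to the existing block by name -- no data is copied
    shm = shared_memory.SharedMemory(name=shm_name)
    shm.buf[0] = 42
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=1024)
    p = Process(target=fill_byte, args=(shm.name,))
    p.start()
    p.join()
    print(shm.buf[0])  # 42
    shm.close()
    shm.unlink()  # release the block exactly once, from the owning process
```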

When to Use What

Here’s a practical decision framework with a side-by-side comparison:

import asyncio
import threading
import multiprocessing
import time
import urllib.request

URLS = ["https://httpbin.org/delay/1"] * 4

# I/O-bound: asyncio wins (urllib is blocking, so offload it to a thread pool)
async def fetch_async(url):
    loop = asyncio.get_running_loop()  # get_event_loop() is deprecated in coroutines
    return await loop.run_in_executor(None, urllib.request.urlopen, url)

async def asyncio_approach():
    tasks = [fetch_async(url) for url in URLS]
    return await asyncio.gather(*tasks)

# I/O-bound: threading works too
def fetch_sync(url):
    return urllib.request.urlopen(url).read()

def threading_approach():
    threads = [threading.Thread(target=fetch_sync, args=(url,)) for url in URLS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# CPU-bound: multiprocessing wins
def cpu_work(n):
    return sum(i * i for i in range(n))

def multiprocessing_approach():
    with multiprocessing.Pool(4) as pool:
        return pool.map(cpu_work, [10_000_000] * 4)

if __name__ == "__main__":
    # I/O-bound comparison
    start = time.perf_counter()
    asyncio.run(asyncio_approach())
    print(f"Asyncio (I/O):        {time.perf_counter() - start:.2f}s")
    
    start = time.perf_counter()
    threading_approach()
    print(f"Threading (I/O):      {time.perf_counter() - start:.2f}s")
    
    # CPU-bound: only multiprocessing helps
    start = time.perf_counter()
    multiprocessing_approach()
    print(f"Multiprocessing (CPU): {time.perf_counter() - start:.2f}s")

Use multiprocessing when: Your workload is CPU-bound (number crunching, image processing, data transformation), and you need to utilize multiple cores.

Use threading when: Your workload is I/O-bound with shared mutable state that’s awkward to serialize, or you’re integrating with libraries that aren’t async-compatible.

Use asyncio when: You have high-concurrency I/O (thousands of network connections) and can use async-compatible libraries.

Multiprocessing has overhead—process creation, serialization, and memory duplication. For tasks under 100ms, that overhead often exceeds the parallelization benefit. Profile first, parallelize second.
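One lever for amortizing that overhead is Pool.map's chunksize parameter, which batches many tasks into each pickled message. A sketch where the per-task work (a hypothetical tiny_task) is deliberately cheaper than the IPC needed to dispatch it:

```python
import multiprocessing
import time

def tiny_task(n):
    return n * n  # far cheaper than the round trip that dispatches it

def timed_map(chunksize):
    with multiprocessing.Pool(4) as pool:
        start = time.perf_counter()
        pool.map(tiny_task, range(100_000), chunksize=chunksize)
        return time.perf_counter() - start

if __name__ == "__main__":
    # Small chunks mean many round trips; large chunks mean few
    print(f"chunksize=1:    {timed_map(1):.2f}s")
    print(f"chunksize=1000: {timed_map(1000):.2f}s")
```

When chunksize is omitted, map() picks a heuristic value based on the input length and worker count, which is usually reasonable but worth overriding for very cheap tasks.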
