Python - Generators and Yield

Key Insights

• Generators provide memory-efficient iteration by producing values on-demand rather than storing entire sequences in memory, making them essential for processing large datasets or infinite sequences.
• The yield keyword transforms a function into a generator, suspending execution and maintaining state between calls, enabling powerful patterns like pipelines and coroutines.
• Generator expressions offer a concise syntax for simple generators, while yield from enables elegant delegation to sub-generators for composing complex iteration logic.

Understanding Generators vs Regular Functions

Generators are functions that return an iterator object, producing values lazily using the yield keyword. Unlike regular functions that execute completely and return a single value, generators pause execution at each yield, maintaining their state for the next iteration.

# Regular function - loads everything into memory
def get_numbers_list(n):
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result

# Generator function - produces values on demand
def get_numbers_generator(n):
    for i in range(n):
        yield i ** 2

# Memory comparison
import sys
list_result = get_numbers_list(1000)
gen_result = get_numbers_generator(1000)

print(f"List size: {sys.getsizeof(list_result)} bytes")  # ~9000 bytes
print(f"Generator size: {sys.getsizeof(gen_result)} bytes")  # ~112 bytes

The generator maintains minimal memory footprint regardless of how many values it will eventually produce. This becomes critical when working with millions of records or infinite sequences.
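For instance, a generator can represent an infinite sequence, which no list could hold at all. A minimal sketch, consuming only as many values as needed via itertools.islice:

```python
from itertools import islice

def squares():
    """Yield square numbers indefinitely; memory use stays constant."""
    n = 0
    while True:
        yield n ** 2
        n += 1

# Take just the first five values from an infinite stream
first_five = list(islice(squares(), 5))
print(first_five)  # [0, 1, 4, 9, 16]
```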

Generator Execution Flow

When you call a generator function, it returns a generator object without executing the function body. Execution begins only when you request the first value.

def debug_generator():
    print("Starting generator")
    yield 1
    print("Between first and second yield")
    yield 2
    print("Between second and third yield")
    yield 3
    print("Generator exhausted")

gen = debug_generator()  # No output - function hasn't started
print("Generator created")

print(next(gen))  # Prints: Starting generator, then 1
print(next(gen))  # Prints: Between first and second yield, then 2
print(next(gen))  # Prints: Between second and third yield, then 3
# next(gen) would raise StopIteration

Each next() call resumes execution from the last yield, maintaining local variables and execution state. This state preservation enables powerful patterns.
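Because local variables survive between yields, a generator can carry arbitrary state. A classic illustration is a Fibonacci generator, where a and b persist across next() calls:

```python
def fibonacci():
    """Yield Fibonacci numbers forever; a and b are preserved between calls."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

fib = fibonacci()
first_eight = [next(fib) for _ in range(8)]
print(first_eight)  # [0, 1, 1, 2, 3, 5, 8, 13]
```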

Practical Use Cases

Processing Large Files

Generators excel at processing files that don’t fit in memory:

def read_large_file(file_path, chunk_size=1024):
    """Read file in chunks without loading entire file"""
    with open(file_path, 'r') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            yield chunk

def process_log_lines(file_path):
    """Process log file line by line"""
    with open(file_path, 'r') as file:
        for line in file:
            if 'ERROR' in line:
                yield line.strip()

# Usage
for error_line in process_log_lines('app.log'):
    # Process one line at a time - constant memory usage
    print(error_line)

Creating Data Pipelines

Chain generators to create processing pipelines where each stage transforms data:

def read_csv(file_path):
    """Generator: read CSV rows"""
    with open(file_path, 'r') as f:
        next(f)  # Skip header
        for line in f:
            yield line.strip().split(',')

def filter_active_users(rows):
    """Generator: filter for active users"""
    for row in rows:
        if row[2] == 'active':  # status column
            yield row

def extract_emails(rows):
    """Generator: extract email addresses"""
    for row in rows:
        yield row[1]  # email column

# Pipeline composition
pipeline = extract_emails(
    filter_active_users(
        read_csv('users.csv')
    )
)

# Process one record at a time through entire pipeline
for email in pipeline:
    send_notification(email)

Each stage processes one item at a time, maintaining minimal memory footprint regardless of file size.
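To see this laziness in a form you can run without a CSV on disk, here is a self-contained variant of the same pipeline over in-memory rows (the sample data and column layout are illustrative):

```python
def filter_active(rows):
    """Generator stage: pass through only rows whose status is 'active'."""
    for row in rows:
        if row[2] == 'active':
            yield row

def extract_emails(rows):
    """Generator stage: project out the email column."""
    for row in rows:
        yield row[1]

rows = [
    ('alice', 'alice@example.com', 'active'),
    ('bob', 'bob@example.com', 'inactive'),
    ('carol', 'carol@example.com', 'active'),
]

# Each row is pulled through both stages one at a time
emails = list(extract_emails(filter_active(iter(rows))))
print(emails)  # ['alice@example.com', 'carol@example.com']
```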

Generator Expressions

Generator expressions provide concise syntax for simple generators, similar to list comprehensions but with lazy evaluation:

# List comprehension - creates entire list in memory
squares_list = [x**2 for x in range(1000000)]

# Generator expression - creates iterator
squares_gen = (x**2 for x in range(1000000))

# Use in functions expecting iterables
total = sum(x**2 for x in range(1000000))

# Filtering with conditions
even_squares = (x**2 for x in range(1000) if x % 2 == 0)

# Nested iteration
pairs = ((x, y) for x in range(10) for y in range(10) if x < y)

Generator expressions are ideal for one-time iteration where you don’t need to store intermediate results.
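One caveat worth remembering: like all generators, a generator expression is exhausted after a single pass, so a second iteration yields nothing:

```python
squares = (x ** 2 for x in range(3))
first_pass = list(squares)   # [0, 1, 4]
second_pass = list(squares)  # [] -- the generator is already exhausted
print(first_pass, second_pass)
```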

Advanced Patterns with yield from

The yield from statement delegates to another generator, simplifying code that chains or flattens iterables:

def flatten(nested_list):
    """Recursively flatten nested lists"""
    for item in nested_list:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

data = [1, [2, 3, [4, 5]], 6, [7, [8, 9]]]
print(list(flatten(data)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Without yield from, you'd need:
def flatten_manual(nested_list):
    for item in nested_list:
        if isinstance(item, list):
            for sub_item in flatten_manual(item):
                yield sub_item
        else:
            yield item

Delegating to multiple generators:

def read_multiple_files(*file_paths):
    """Read lines from multiple files sequentially"""
    for path in file_paths:
        yield from process_log_lines(path)

# Equivalent to chaining
for line in read_multiple_files('app1.log', 'app2.log', 'app3.log'):
    process(line)
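The standard library offers the same chaining behavior: itertools.chain is roughly equivalent to a loop of yield from statements. A small sketch over in-memory iterables:

```python
from itertools import chain

def concat(*iterables):
    """Yield items from each iterable in turn, via yield from."""
    for it in iterables:
        yield from it

merged = list(concat([1, 2], (3, 4), range(5, 7)))
print(merged)  # [1, 2, 3, 4, 5, 6]

# The same result using the standard library directly
print(list(chain([1, 2], (3, 4), range(5, 7))))
```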

Sending Values and Two-Way Communication

Generators can receive values through the send() method, enabling coroutine-like behavior:

def running_average():
    """Calculate running average of sent values"""
    total = 0
    count = 0
    average = None
    
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)  # Prime the generator

print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0

Implementing a stateful filter:

def threshold_filter(initial_threshold):
    """Filter values above dynamic threshold"""
    threshold = initial_threshold
    
    while True:
        value = yield
        if value is None:  # Allow threshold updates
            threshold = yield  # Wait for new threshold
        elif value > threshold:
            print(f"Accepted: {value}")
        else:
            print(f"Rejected: {value}")

filter_gen = threshold_filter(50)
next(filter_gen)  # Prime

filter_gen.send(60)  # Accepted: 60
filter_gen.send(40)  # Rejected: 40
filter_gen.send(None)  # Signal threshold change
filter_gen.send(30)  # Update threshold to 30
filter_gen.send(40)  # Accepted: 40
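When you are finished with a coroutine-style generator, close() shuts it down cleanly: it raises GeneratorExit at the paused yield, which lets a try/finally block release resources. A minimal sketch:

```python
def accumulator():
    """Collect sent values; run cleanup code when closed."""
    values = []
    try:
        while True:
            # Yield the current count, then wait for the next sent value
            values.append((yield len(values)))
    finally:
        print(f"Closing with {len(values)} values collected")

acc = accumulator()
next(acc)        # Prime: pauses at the first yield
acc.send('a')    # Returns 1
acc.send('b')    # Returns 2
acc.close()      # Raises GeneratorExit inside; the finally block runs
```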

Performance Considerations

Generators trade CPU cycles for memory efficiency. Benchmark when performance matters:

import time

def benchmark_iteration(n):
    # List approach
    start = time.perf_counter()
    data = [x**2 for x in range(n)]
    result = sum(data)
    list_time = time.perf_counter() - start
    
    # Generator approach
    start = time.perf_counter()
    result = sum(x**2 for x in range(n))
    gen_time = time.perf_counter() - start
    
    print(f"List: {list_time:.4f}s")
    print(f"Generator: {gen_time:.4f}s")

benchmark_iteration(1000000)

Use generators when:

  • Working with large datasets that don’t fit in memory
  • Processing streams or infinite sequences
  • Building data pipelines with multiple transformation stages
  • You only need to iterate once

Use lists when:

  • You need random access to elements
  • Multiple iterations over the same data are required
  • The dataset is small enough to fit comfortably in memory
  • You need to know the length before processing

Generators are a fundamental tool for writing memory-efficient Python code. They enable processing unlimited data streams with constant memory usage and provide elegant solutions for complex iteration patterns.
