Python - Generators and Yield
Key Insights
• Generators provide memory-efficient iteration by producing values on-demand rather than storing entire sequences in memory, making them essential for processing large datasets or infinite sequences.
• The yield keyword transforms a function into a generator, suspending execution and maintaining state between calls, enabling powerful patterns like pipelines and coroutines.
• Generator expressions offer a concise syntax for simple generators, while yield from enables elegant delegation to sub-generators for composing complex iteration logic.
Understanding Generators vs Regular Functions
Generators are functions that return an iterator object, producing values lazily using the yield keyword. Unlike regular functions that execute completely and return a single value, generators pause execution at each yield, maintaining their state for the next iteration.
# Regular function - loads everything into memory
def get_numbers_list(n):
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result

# Generator function - produces values on demand
def get_numbers_generator(n):
    for i in range(n):
        yield i ** 2

# Memory comparison
import sys
list_result = get_numbers_list(1000)
gen_result = get_numbers_generator(1000)
print(f"List size: {sys.getsizeof(list_result)} bytes")       # ~9000 bytes
print(f"Generator size: {sys.getsizeof(gen_result)} bytes")   # ~112 bytes
The generator maintains minimal memory footprint regardless of how many values it will eventually produce. This becomes critical when working with millions of records or infinite sequences.
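This property is what makes infinite sequences practical. A minimal sketch (the `infinite_squares` name is illustrative, not from the original): the generator below never terminates on its own, yet `itertools.islice` can take a bounded slice from it without ever materializing the full sequence.

```python
from itertools import islice

def infinite_squares():
    """Yield squares indefinitely; memory use stays constant."""
    n = 0
    while True:
        yield n ** 2
        n += 1

# islice consumes only as many values as requested
first_five = list(islice(infinite_squares(), 5))
print(first_five)  # [0, 1, 4, 9, 16]
```

A plain list could never represent this sequence; the generator produces each value only when asked.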
Generator Execution Flow
When you call a generator function, it returns a generator object without executing the function body. Execution begins only when you request the first value.
def debug_generator():
    print("Starting generator")
    yield 1
    print("Between first and second yield")
    yield 2
    print("Between second and third yield")
    yield 3
    print("Generator exhausted")

gen = debug_generator()  # No output - function body hasn't started
print("Generator created")
print(next(gen))  # Prints: Starting generator, then 1
print(next(gen))  # Prints: Between first and second yield, then 2
print(next(gen))  # Prints: Between second and third yield, then 3
# A fourth next(gen) would print "Generator exhausted", then raise StopIteration
Each next() call resumes execution from the last yield, maintaining local variables and execution state. This state preservation enables powerful patterns.
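A classic illustration of this state preservation is a Fibonacci generator: the locals `a` and `b` persist between calls, so each `next()` picks up exactly where the previous one left off.

```python
def fibonacci():
    """The local state (a, b) survives across next() calls."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

fib = fibonacci()
first_eight = [next(fib) for _ in range(8)]
print(first_eight)  # [0, 1, 1, 2, 3, 5, 8, 13]
```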
Practical Use Cases
Processing Large Files
Generators excel at processing files that don’t fit in memory:
def read_large_file(file_path, chunk_size=1024):
    """Read file in chunks without loading entire file"""
    with open(file_path, 'r') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            yield chunk

def process_log_lines(file_path):
    """Process log file line by line"""
    with open(file_path, 'r') as file:
        for line in file:
            if 'ERROR' in line:
                yield line.strip()

# Usage
for error_line in process_log_lines('app.log'):
    # Process one line at a time - constant memory usage
    print(error_line)
Creating Data Pipelines
Chain generators to create processing pipelines where each stage transforms data:
def read_csv(file_path):
    """Generator: read CSV rows"""
    with open(file_path, 'r') as f:
        next(f)  # Skip header
        for line in f:
            yield line.strip().split(',')

def filter_active_users(rows):
    """Generator: filter for active users"""
    for row in rows:
        if row[2] == 'active':  # status column
            yield row

def extract_emails(rows):
    """Generator: extract email addresses"""
    for row in rows:
        yield row[1]  # email column

# Pipeline composition
pipeline = extract_emails(
    filter_active_users(
        read_csv('users.csv')
    )
)

# Process one record at a time through entire pipeline
for email in pipeline:
    send_notification(email)
Each stage processes one item at a time, maintaining minimal memory footprint regardless of file size.
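For simple stages, the same pipeline can be written as chained generator expressions. A sketch using in-memory data (the column layout, with email at index 1 and status at index 2, mirrors the hypothetical `users.csv` above):

```python
lines = [
    "id,email,status",
    "1,alice@example.com,active",
    "2,bob@example.com,inactive",
    "3,carol@example.com,active",
]

rows = (line.split(',') for line in lines[1:])        # parse stage (skip header)
active = (row for row in rows if row[2] == 'active')  # filter stage
emails = (row[1] for row in active)                   # extract stage

# Nothing runs until the pipeline is consumed
result = list(emails)
print(result)  # ['alice@example.com', 'carol@example.com']
```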
Generator Expressions
Generator expressions provide concise syntax for simple generators, similar to list comprehensions but with lazy evaluation:
# List comprehension - creates entire list in memory
squares_list = [x**2 for x in range(1000000)]
# Generator expression - creates iterator
squares_gen = (x**2 for x in range(1000000))
# Use in functions expecting iterables
total = sum(x**2 for x in range(1000000))
# Filtering with conditions
even_squares = (x**2 for x in range(1000) if x % 2 == 0)
# Nested iteration
pairs = ((x, y) for x in range(10) for y in range(10) if x < y)
Generator expressions are ideal for one-time iteration where you don’t need to store intermediate results.
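The "one-time" caveat matters: once consumed, a generator is exhausted and silently yields nothing on a second pass, a common source of bugs.

```python
gen = (x * x for x in range(3))
first_pass = list(gen)
second_pass = list(gen)  # already exhausted - no error, just empty
print(first_pass, second_pass)  # [0, 1, 4] []
```

If you need to iterate the same data twice, materialize it with `list()` first.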
Advanced Patterns with yield from
The yield from statement delegates to another generator, simplifying code that chains or flattens iterables:
def flatten(nested_list):
    """Recursively flatten nested lists"""
    for item in nested_list:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

data = [1, [2, 3, [4, 5]], 6, [7, [8, 9]]]
print(list(flatten(data)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Without yield from, you'd need:
def flatten_manual(nested_list):
    for item in nested_list:
        if isinstance(item, list):
            for sub_item in flatten_manual(item):
                yield sub_item
        else:
            yield item
Delegating to multiple generators:
def read_multiple_files(*file_paths):
    """Read matching lines from multiple files sequentially"""
    for path in file_paths:
        yield from process_log_lines(path)  # defined earlier

# Equivalent to chaining
for line in read_multiple_files('app1.log', 'app2.log', 'app3.log'):
    process(line)
Sending Values and Two-Way Communication
Generators can receive values through the send() method, enabling coroutine-like behavior:
def running_average():
    """Calculate running average of sent values"""
    total = 0
    count = 0
    average = None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)  # Prime the generator
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0
Implementing a stateful filter:
def threshold_filter(initial_threshold):
    """Filter values above dynamic threshold"""
    threshold = initial_threshold
    while True:
        value = yield
        if value is None:  # Allow threshold updates
            threshold = yield  # Wait for new threshold
        elif value > threshold:
            print(f"Accepted: {value}")
        else:
            print(f"Rejected: {value}")

filter_gen = threshold_filter(50)
next(filter_gen)       # Prime
filter_gen.send(60)    # Accepted: 60
filter_gen.send(40)    # Rejected: 40
filter_gen.send(None)  # Signal threshold change
filter_gen.send(30)    # Update threshold to 30
filter_gen.send(40)    # Accepted: 40
Performance Considerations
Generators trade CPU cycles for memory efficiency. Benchmark when performance matters:
import time

def benchmark_iteration(n):
    # List approach
    start = time.perf_counter()
    data = [x**2 for x in range(n)]
    result = sum(data)
    list_time = time.perf_counter() - start

    # Generator approach
    start = time.perf_counter()
    result = sum(x**2 for x in range(n))
    gen_time = time.perf_counter() - start

    print(f"List: {list_time:.4f}s")
    print(f"Generator: {gen_time:.4f}s")

benchmark_iteration(1000000)
Use generators when:
- Working with large datasets that don’t fit in memory
- Processing streams or infinite sequences
- Building data pipelines with multiple transformation stages
- You only need to iterate once
Use lists when:
- You need random access to elements
- Multiple iterations over the same data are required
- The dataset is small enough to fit comfortably in memory
- You need to know the length before processing
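When several of the list-side criteria apply, the usual pattern is to materialize the generator once and then work with the list. A small sketch:

```python
squares = (x * x for x in range(5))
data = list(squares)  # materialize once when multiple passes are needed

# Both passes below would fail on the raw generator (second would see nothing)
total = sum(data)
largest = max(data)
print(total, largest)  # 30 16
```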
Generators are a fundamental tool for writing memory-efficient Python code. They enable processing unlimited data streams with constant memory usage and provide elegant solutions for complex iteration patterns.