Python Generators: yield and Generator Expressions
Key Insights
- Generators use lazy evaluation to process data on-demand, consuming memory proportional to a single item rather than the entire dataset—critical when working with large files or infinite sequences
- The yield keyword transforms functions into iterators that maintain state between calls, enabling elegant solutions for streaming data and pipeline architectures
- Generator expressions provide syntactic sugar for simple generators, but understanding yield unlocks advanced patterns like coroutines and bidirectional data flow
Understanding Generators and Memory Efficiency
Generators are Python’s solution to memory-efficient iteration. Unlike lists that store all elements in memory simultaneously, generators produce values on-the-fly, one at a time. This lazy evaluation model means you can process datasets larger than available RAM or work with infinite sequences without crashing your program.
Consider the memory difference:
import sys
# List comprehension - stores everything in memory
numbers_list = [x * 2 for x in range(1000000)]
print(f"List size: {sys.getsizeof(numbers_list):,} bytes") # ~8,000,000 bytes
# Generator expression - stores only the generator object
numbers_gen = (x * 2 for x in range(1000000))
print(f"Generator size: {sys.getsizeof(numbers_gen)} bytes") # ~112 bytes
The list consumes roughly 8MB while the generator uses barely 100 bytes. The generator doesn’t compute or store values until you request them. This matters when processing log files, database results, or any data stream where you don’t need random access.
Use generators when:
- Processing large datasets sequentially
- Working with streams (files, network data, sensors)
- Implementing infinite sequences
- Building data transformation pipelines
Stick with lists when you need random access, multiple iterations, or the full dataset fits comfortably in memory.
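The trade-off is easy to see side by side. A quick sketch contrasting the two (variable names here are illustrative):

```python
squares_list = [x * x for x in range(5)]
squares_gen = (x * x for x in range(5))

# Lists support random access and repeated passes
print(squares_list[2])    # 4
print(sum(squares_list))  # 30
print(sum(squares_list))  # 30 again - lists survive multiple iterations

# A generator supports exactly one pass and no indexing
print(sum(squares_gen))   # 30 - this pass consumes it
print(sum(squares_gen))   # 0 - now exhausted
# squares_gen[2] would raise TypeError: 'generator' object is not subscriptable
```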
The yield Keyword: Pausing Function Execution
The yield keyword transforms an ordinary function into a generator. When Python encounters yield, it returns the value to the caller but preserves the function’s state—local variables, instruction pointer, everything. The next time you call the generator, execution resumes right after the yield statement.
Here’s a Fibonacci generator demonstrating state preservation:
def fibonacci(limit):
    a, b = 0, 1
    count = 0
    while count < limit:
        yield a
        a, b = b, a + b
        count += 1

# Usage
for num in fibonacci(10):
    print(num, end=' ')  # 0 1 1 2 3 5 8 13 21 34
Each yield produces one Fibonacci number. The variables a, b, and count persist between calls. Compare this to a function with return, which would exit completely and lose all state.
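To make the pause-and-resume behavior concrete, here is a minimal countdown generator stepped by hand with next() (countdown is an illustrative name, not part of the examples above):

```python
def countdown(n):
    """Yield n, n-1, ..., 1, pausing after each yield."""
    while n > 0:
        yield n
        n -= 1

ticker = countdown(3)
print(next(ticker))  # 3 - runs until the first yield, then pauses
print(next(ticker))  # 2 - resumes after the yield; n was preserved
print(next(ticker))  # 1
```

Each next() call runs the function body only as far as the next yield; the local variable n carries over between calls.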
Generators excel at reading large files without loading them entirely into memory:
def read_large_file(filepath):
    """Read file line-by-line without loading it all into memory."""
    with open(filepath, 'r') as file:
        for line in file:
            yield line.strip()

# Process a 10GB log file with constant memory usage
for line in read_large_file('massive_log.txt'):
    if 'ERROR' in line:
        print(line)
You can also build custom iterators. Here's a simplified range() reimplemented (positive steps only):
def custom_range(start, stop, step=1):
    current = start
    while current < stop:
        yield current
        current += step

# Works like range() for positive steps
for i in custom_range(0, 10, 2):
    print(i, end=' ')  # 0 2 4 6 8
Generator Expressions: Concise Syntax
Generator expressions use parentheses instead of square brackets, providing a compact alternative to generator functions for simple cases. They follow the same syntax as list comprehensions but produce generators.
# List comprehension - builds entire list immediately
squares_list = [x**2 for x in range(1000000)]
# Generator expression - computes values on demand
squares_gen = (x**2 for x in range(1000000))
# Both work the same in loops
for square in squares_gen:
    if square > 1000:
        break
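A convenient detail: when a generator expression is the sole argument to a function, the extra parentheses can be dropped. Aggregation functions then consume the generator lazily, one value at a time:

```python
# Generator expression passed directly - no extra parentheses needed
total = sum(x**2 for x in range(10))
print(total)  # 285

# any() short-circuits: it stops consuming at the first True
has_big = any(x**2 > 50 for x in range(10))
print(has_big)  # True
```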
Generator expressions shine in data pipelines where you chain transformations:
# Read file, filter lines, extract data, transform
log_file = open('access.log', 'r')
lines = (line.strip() for line in log_file)
error_lines = (line for line in lines if 'ERROR' in line)
timestamps = (line.split()[0] for line in error_lines)
# Nothing executes until you iterate
for timestamp in timestamps:
    print(timestamp)
Each generator in the chain processes one item at a time. Memory usage remains constant regardless of file size.
Benchmark the memory difference:
import sys
import time
def benchmark_memory():
    # List comprehension
    start = time.time()
    data_list = [x**2 for x in range(10000000)]
    list_time = time.time() - start
    list_size = sys.getsizeof(data_list)

    # Generator expression
    start = time.time()
    data_gen = (x**2 for x in range(10000000))
    gen_time = time.time() - start
    gen_size = sys.getsizeof(data_gen)

    print(f"List: {list_size:,} bytes, {list_time:.4f}s")
    print(f"Generator: {gen_size} bytes, {gen_time:.6f}s")

benchmark_memory()
# List: 89,095,160 bytes, 0.8234s
# Generator: 112 bytes, 0.000002s
Advanced Generator Methods
Generators implement the iterator protocol with methods beyond basic iteration. The .send() method enables bidirectional communication, turning generators into coroutines.
def running_average():
    total = 0
    count = 0
    average = None
    while True:
        value = yield average
        if value is None:
            break
        total += value
        count += 1
        average = total / count

# Two-way communication
avg = running_average()
next(avg)            # Prime the generator
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0
The generator receives values via yield expressions and sends results back. You must call next() first to advance to the first yield statement—this is called “priming” the generator.
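Skipping the priming step is a common mistake; Python rejects a send() to a generator that has not yet reached its first yield. A minimal illustration (echo is a hypothetical name):

```python
def echo():
    while True:
        received = yield
        print(f"got {received}")

gen = echo()
try:
    gen.send('hello')   # sending before the first yield fails
except TypeError as exc:
    print(exc)          # "can't send non-None value to a just-started generator"

next(gen)               # prime: run up to the first yield
gen.send('hello')       # prints: got hello
```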
Other methods include:
- .throw(exception): raises the given exception inside the generator at the paused yield
- .close(): raises GeneratorExit inside the generator so it can clean up resources
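A short sketch of both methods; to keep the output testable it records events in a list rather than printing from the generator (guarded and the event list are illustrative names):

```python
def guarded(events):
    try:
        while True:
            yield "data"
    except ValueError:
        events.append("caught")
    finally:
        events.append("cleanup")  # runs on close() as well as normal exit

log = []
g = guarded(log)
next(g)
g.close()        # GeneratorExit at the paused yield; only finally runs
print(log)       # ['cleanup']

log2 = []
g2 = guarded(log2)
next(g2)
try:
    g2.throw(ValueError)  # caught inside; the generator then finishes
except StopIteration:
    pass
print(log2)      # ['caught', 'cleanup']
```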
When a generator is exhausted, it raises StopIteration. Python's for loops handle this automatically, but manual iteration with next() requires handling the exception:
gen = (x for x in range(3))
print(next(gen)) # 0
print(next(gen)) # 1
print(next(gen)) # 2
print(next(gen)) # StopIteration exception
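Instead of wrapping every next() call in try/except, the two-argument form of next() returns a default value once the generator is exhausted:

```python
gen = (x for x in range(3))
print(next(gen, -1))  # 0
print(next(gen, -1))  # 1
print(next(gen, -1))  # 2
print(next(gen, -1))  # -1 - default returned instead of raising StopIteration
```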
Real-World Applications
Log File Parser
Process multi-gigabyte log files with minimal memory:
def parse_log_file(filepath):
    """Parse log entries and extract structured data."""
    with open(filepath, 'r') as f:
        for line in f:
            if not line.strip():
                continue
            parts = line.split(' - ')
            if len(parts) >= 3:
                yield {
                    'timestamp': parts[0],
                    'level': parts[1],
                    'message': parts[2].strip()
                }

# Filter and process without loading entire file
for entry in parse_log_file('app.log'):
    if entry['level'] == 'ERROR':
        send_alert(entry)
Data Transformation Pipeline
Chain generators for ETL operations:
def read_csv(filepath):
    """Read CSV file line by line."""
    with open(filepath, 'r') as f:
        next(f)  # Skip header
        for line in f:
            yield line.strip().split(',')

def filter_valid(rows):
    """Filter out invalid rows."""
    for row in rows:
        if len(row) >= 3 and row[2].isdigit():
            yield row

def transform(rows):
    """Transform data structure."""
    for row in rows:
        yield {
            'name': row[0],
            'email': row[1],
            'age': int(row[2])
        }

# Pipeline processes one row at a time
pipeline = transform(filter_valid(read_csv('users.csv')))
for user in pipeline:
    save_to_database(user)
Infinite Sequences
Generate unlimited values without memory concerns:
def prime_numbers():
    """Generate an infinite sequence of prime numbers."""
    def is_prime(n):
        if n < 2:
            return False
        for i in range(2, int(n**0.5) + 1):
            if n % i == 0:
                return False
        return True

    num = 2
    while True:
        if is_prime(num):
            yield num
        num += 1

# Take the first 10 primes
primes = prime_numbers()
first_ten = [next(primes) for _ in range(10)]
print(first_ten)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
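Rather than building a list of next() calls by hand, itertools.islice takes a lazy window from any iterator, including infinite ones. A small sketch using a hypothetical naturals() counter:

```python
import itertools

def naturals():
    """Infinite counter: 0, 1, 2, ..."""
    n = 0
    while True:
        yield n
        n += 1

# islice takes a lazy window from an infinite iterator
first_five = list(itertools.islice(naturals(), 5))
print(first_five)  # [0, 1, 2, 3, 4]

# The same pattern works for prime_numbers() or any other generator
evens = (n for n in naturals() if n % 2 == 0)
print(list(itertools.islice(evens, 4)))  # [0, 2, 4, 6]
```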
Performance and Best Practices
Generators optimize memory but add slight computational overhead. Benchmark before optimizing:
import time
def benchmark_processing():
    data_range = range(10000000)

    # List approach
    start = time.time()
    result_list = [x * 2 for x in data_range if x % 2 == 0]
    total = sum(result_list)
    list_time = time.time() - start

    # Generator approach
    start = time.time()
    result_gen = (x * 2 for x in data_range if x % 2 == 0)
    total = sum(result_gen)
    gen_time = time.time() - start

    print(f"List: {list_time:.4f}s")
    print(f"Generator: {gen_time:.4f}s")

benchmark_processing()
# List: 1.2341s
# Generator: 1.1876s (slightly faster, much less memory)
Critical pitfall: Generators exhaust after one iteration:
gen = (x for x in range(5))
print(list(gen))  # [0, 1, 2, 3, 4]
print(list(gen))  # [] - exhausted!

# Solution: if you need multiple passes, materialize the data as a list
# before the first pass (or recreate the generator for each pass)
data = list(x for x in range(5))
print(data)  # [0, 1, 2, 3, 4] - lists can be iterated repeatedly
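When two consumers need the same stream but materializing it as one big list is undesirable, itertools.tee can split an iterator into independent copies (it buffers consumed values internally, so it pays off most when the copies advance roughly in step):

```python
import itertools

gen = (x * x for x in range(5))
first_pass, second_pass = itertools.tee(gen, 2)

print(list(first_pass))   # [0, 1, 4, 9, 16]
print(list(second_pass))  # [0, 1, 4, 9, 16] - tee buffered the values
```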
Best practices:
- Use generators for single-pass sequential processing
- Convert to lists when you need random access or multiple iterations
- Chain generators for complex transformations instead of intermediate lists
- Prime coroutine-style generators with next() before sending values
- Close generators explicitly if they manage resources
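One more pattern worth knowing when chaining generators: if a stage only relays another iterable's values, yield from (available since Python 3.3) replaces the explicit loop. A minimal sketch with illustrative helper names (relay and flatten are not from the examples above):

```python
def relay(rows):
    """Explicit relay loop."""
    for row in rows:
        yield row

def relay_delegating(rows):
    """Equivalent using yield from."""
    yield from rows

def flatten(nested):
    """Lazily flatten one level of nesting."""
    for sub in nested:
        yield from sub

print(list(relay_delegating([1, 2])))        # [1, 2]
print(list(flatten([[1, 2], [3], [4, 5]])))  # [1, 2, 3, 4, 5]
```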
Generators are fundamental to writing memory-efficient Python. Master yield and generator expressions to build scalable data processing pipelines that handle datasets of any size.