Python - Writing Efficient Data Processing Code

Key Insights

  • Choosing the right data structure can yield 100x+ performance improvements—sets and dictionaries offer O(1) lookups compared to O(n) for lists
  • Generators and lazy evaluation let you process datasets larger than available memory by yielding items one at a time instead of loading everything upfront
  • Always profile before optimizing; intuition about bottlenecks is often wrong, and premature optimization wastes time on code that doesn’t matter

Introduction: Why Efficiency Matters in Data Processing

Python’s reputation for being “slow” is both overstated and misunderstood. Yes, pure Python loops are slower than compiled languages. But most data processing bottlenecks come from poor algorithmic choices, not the language itself.

I’ve seen data pipelines that took 8 hours drop to 15 minutes after applying the techniques in this article. No Cython, no Rust extensions—just better Python.

The key insight: Python gives you access to highly optimized C libraries (NumPy, Pandas) and built-in functions. Your job is to stay out of the interpreter’s way and let these tools do the heavy lifting.

That said, don’t optimize prematurely. If your script runs in 2 seconds and you run it once a week, leave it alone. Focus on code that runs frequently, processes large datasets, or sits in critical paths.

Choosing the Right Data Structures

The single biggest performance win in data processing comes from using the right data structure. This isn’t about clever algorithms—it’s about understanding time complexity.

Lists are for ordered sequences, not lookups. Checking if an item exists in a list requires scanning every element (O(n)). With a million items, that’s a million comparisons per lookup.

Sets and dictionaries use hash tables. Lookups are O(1) on average—constant time regardless of size.

Here’s a concrete benchmark:

import time
import random

# Generate test data
data = list(range(1_000_000))
search_items = [random.randint(0, 2_000_000) for _ in range(10_000)]

# List lookup
data_list = data
start = time.perf_counter()
results_list = [item in data_list for item in search_items]
list_time = time.perf_counter() - start

# Set lookup
data_set = set(data)
start = time.perf_counter()
results_set = [item in data_set for item in search_items]
set_time = time.perf_counter() - start

print(f"List lookup: {list_time:.2f}s")
print(f"Set lookup:  {set_time:.4f}s")
print(f"Speedup:     {list_time / set_time:.0f}x")

Typical output:

List lookup: 142.35s
Set lookup:  0.0012s
Speedup:     118625x

That’s not a typo. Over 100,000x faster.

The collections module provides specialized containers that solve common patterns:

from collections import defaultdict, Counter, deque

words = "the quick brown fox jumps over the lazy dog the".split()

# defaultdict: no need to check if key exists
word_positions = defaultdict(list)
for i, word in enumerate(words):
    word_positions[word].append(i)  # No KeyError

# Counter: counting made trivial
word_counts = Counter(words)
top_10 = word_counts.most_common(10)

# deque: O(1) append/pop from both ends
recent_items = deque(maxlen=100)  # Auto-discards old items

Generator Expressions and Lazy Evaluation

When processing large datasets, memory often becomes the bottleneck before CPU does. Generators solve this by producing values on-demand instead of materializing entire collections.

Compare these two approaches for processing a large CSV:

# Memory-hungry: loads entire file into memory
def process_csv_eager(filepath):
    with open(filepath) as f:
        lines = f.readlines()  # All lines in memory
    
    records = []
    for line in lines[1:]:  # Skip header
        fields = line.strip().split(',')
        if float(fields[2]) > 1000:  # Filter condition
            records.append(transform(fields))
    return records

# Memory-efficient: processes one line at a time
def process_csv_lazy(filepath):
    with open(filepath) as f:
        next(f)  # Skip header
        for line in f:
            fields = line.strip().split(',')
            if float(fields[2]) > 1000:
                yield transform(fields)

The eager version loads a 10GB file entirely into memory. The lazy version never holds more than one line.

Generator expressions provide the same benefit in a compact syntax:

# List comprehension: creates list in memory
squares = [x**2 for x in range(10_000_000)]  # ~80MB for the pointer array alone; the int objects add hundreds more

# Generator expression: creates an iterator
squares = (x**2 for x in range(10_000_000))  # ~200 bytes, regardless of size
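You can confirm the gap with sys.getsizeof. One caveat: it measures only the container or generator object itself, not the integers it references, so the true list footprint is even larger than it reports.

```python
import sys

# Materializes one million ints up front
squares_list = [x**2 for x in range(1_000_000)]

# Stores only the generator's frame, no matter the range
squares_gen = (x**2 for x in range(1_000_000))

print(sys.getsizeof(squares_list))  # millions of bytes (pointer array alone)
print(sys.getsizeof(squares_gen))   # a couple hundred bytes
```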

The itertools module extends this with powerful combinators:

import itertools

# Process in chunks
def chunked(iterable, size):
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

# Chain multiple files without loading all
all_records = itertools.chain.from_iterable(
    process_csv_lazy(f) for f in file_list
)
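A quick sanity check of chunked, redefined here so the snippet runs on its own:

```python
import itertools

def chunked(iterable, size):
    # Yield successive lists of up to `size` items; the last may be shorter
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

print(list(chunked(range(10), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```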

Vectorization with NumPy and Pandas

Python loops are slow because each iteration involves type checking, function calls, and interpreter overhead. Vectorized operations push the loop into optimized C code.

Here’s a common anti-pattern—iterating over DataFrame rows:

import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({
    'price': np.random.uniform(10, 100, 1_000_000),
    'quantity': np.random.randint(1, 100, 1_000_000),
    'discount': np.random.uniform(0, 0.3, 1_000_000)
})

# SLOW: Row-by-row iteration
def calculate_totals_slow(df):
    totals = []
    for idx, row in df.iterrows():
        total = row['price'] * row['quantity'] * (1 - row['discount'])
        totals.append(total)
    df['total'] = totals
    return df

# FAST: Vectorized operations
def calculate_totals_fast(df):
    df['total'] = df['price'] * df['quantity'] * (1 - df['discount'])
    return df

Benchmark results on 1 million rows:

%timeit calculate_totals_slow(df.copy())
# 45.2 s ± 1.3 s per loop

%timeit calculate_totals_fast(df.copy())
# 12.3 ms ± 0.5 ms per loop

# Speedup: ~3,700x

The rule: if you’re writing “for idx, row in df.iterrows()”, you’re probably doing it wrong.

For conditional logic, use np.where or np.select:

# Instead of looping with if/else
df['category'] = np.select(
    [df['total'] < 100, df['total'] < 1000],
    ['small', 'medium'],
    default='large'
)
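np.where covers the simpler two-way branch. Here’s a minimal sketch with made-up totals:

```python
import numpy as np

totals = np.array([50.0, 250.0, 5000.0])

# Element-wise if/else, evaluated in C: 'small' where the condition
# holds, 'large' everywhere else
flag = np.where(totals < 100, 'small', 'large')
print(flag.tolist())  # ['small', 'large', 'large']
```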

Leveraging Built-in Functions and Libraries

Python’s built-in functions are implemented in C and optimized heavily. They consistently outperform equivalent Python loops.

numbers = list(range(1_000))
items = numbers

def condition(x):
    return x > 990

# Slow: manual loop
total = 0
for x in numbers:
    total += x

# Fast: built-in sum
total = sum(numbers)

# Slow: manual any check
found = False
for x in items:
    if condition(x):
        found = True
        break

# Fast: any() with generator
found = any(condition(x) for x in items)

The operator module provides function versions of operators, avoiding lambda overhead:

from operator import itemgetter, attrgetter

records = [
    {'name': 'Alice', 'score': 85},
    {'name': 'Bob', 'score': 92},
    {'name': 'Charlie', 'score': 78}
]

# Slower: lambda creates new function object each call
sorted_lambda = sorted(records, key=lambda x: x['score'])

# Faster: itemgetter is implemented in C
sorted_itemgetter = sorted(records, key=itemgetter('score'))

For objects with attributes, attrgetter provides the same benefit:

# Sort objects by attribute
sorted_users = sorted(users, key=attrgetter('created_at'))

# Multiple keys
sorted_users = sorted(users, key=attrgetter('department', 'name'))
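Since users isn’t defined above, here’s a self-contained sketch with a hypothetical User dataclass:

```python
from dataclasses import dataclass
from operator import attrgetter

@dataclass
class User:
    department: str
    name: str

users = [User('eng', 'Bob'), User('ops', 'Alice'), User('eng', 'Alice')]

# Sort by department, then name, no lambda needed
ordered = sorted(users, key=attrgetter('department', 'name'))
print([(u.department, u.name) for u in ordered])
# [('eng', 'Alice'), ('eng', 'Bob'), ('ops', 'Alice')]
```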

Profiling and Identifying Bottlenecks

Never optimize based on intuition. Profile first.

import cProfile
import pstats

def expensive_operation(i):
    # Stand-in for real per-item work
    return sum(j * j for j in range(i % 100))

def aggregate(values):
    return sum(values)

def slow_function():
    result = []
    for i in range(100000):
        result.append(expensive_operation(i))
    return aggregate(result)

# Profile the function
profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# Print sorted by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 functions

For line-by-line analysis, use line_profiler:

# Install: pip install line_profiler
# Add @profile decorator to function, then run:
# kernprof -l -v script.py

@profile
def process_data(items):
    results = []
    for item in items:           # Line 4: 0.1s
        parsed = parse(item)     # Line 5: 2.3s  <- bottleneck!
        validated = validate(parsed)  # Line 6: 0.4s
        results.append(validated)
    return results

For memory issues, memory_profiler shows allocation per line:

# Install: pip install memory_profiler
from memory_profiler import profile

@profile
def memory_hungry():
    data = [x**2 for x in range(1000000)]  # +76 MiB
    filtered = [x for x in data if x % 2]  # +38 MiB
    return sum(filtered)

Practical Patterns for Batch Processing

For files too large to fit in memory, process in chunks:

import csv
from tqdm import tqdm
import os

def process_large_csv(filepath, chunk_size=10000):
    file_size = os.path.getsize(filepath)

    with open(filepath, 'rb') as fb, \
         tqdm(total=file_size, unit='B', unit_scale=True) as pbar:

        def lines():
            # Track progress by raw bytes read; f.tell() raises OSError
            # while iterating a text-mode file, so read binary and decode
            for raw in fb:
                pbar.update(len(raw))
                yield raw.decode('utf-8')

        reader = csv.DictReader(lines())

        chunk = []
        for row in reader:
            chunk.append(row)

            if len(chunk) >= chunk_size:
                yield from process_chunk(chunk)
                chunk = []

        if chunk:  # Don't forget the last partial chunk
            yield from process_chunk(chunk)

def process_chunk(rows):
    # Your processing logic here
    for row in rows:
        yield transform(row)

For CPU-bound work, use multiprocessing:

from multiprocessing import Pool
from functools import partial

def process_file(filepath, config):
    # CPU-intensive processing
    return result

def parallel_process(filepaths, config, workers=4):
    process_fn = partial(process_file, config=config)
    
    with Pool(workers) as pool:
        results = pool.map(process_fn, filepaths)
    
    return results

For I/O-bound work (API calls, database queries), consider asyncio:

import asyncio
import aiohttp

async def fetch_data(session, url):
    async with session.get(url) as response:
        return await response.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_data(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# Run it
results = asyncio.run(fetch_all(url_list))

The difference matters: multiprocessing adds overhead for process creation and inter-process communication. It’s worth it for CPU-bound work but wasteful for I/O-bound tasks where async shines.

Write correct code first. Measure to find bottlenecks. Then apply these techniques surgically where they matter. That’s how you write efficient data processing code in Python.
