Data Pipeline Patterns That Actually Work
Common patterns for building reliable data pipelines without over-engineering.
Key Insights
- Every pipeline stage should be idempotent — use overwrite-by-partition so re-runs produce the same result
- Route bad records to a dead letter queue instead of dropping them silently; you’ll need them for debugging
- Track records processed, processing time, error rate, and data freshness for every pipeline in production
Data pipelines fail. The difference between a good pipeline and a bad one is how gracefully it handles failure and how easy it is to debug.
Idempotent Writes
Every pipeline stage should be safe to re-run. Use overwrite-by-partition instead of append:
(
df.write
.mode("overwrite")
.partitionBy("date")
.parquet("s3://warehouse/events/")
)
If a job fails halfway through, re-running it produces the same result.
Schema Validation
Catch schema issues at the boundary, not deep in your transformations:
EXPECTED_SCHEMA = {
"user_id": str,
"event_type": str,
"timestamp": datetime,
"amount": float,
}
def validate_schema(df, expected):
for col, dtype in expected.items():
if col not in df.columns:
raise ValueError(f"Missing column: {col}")
Dead Letter Queues
Don’t drop bad records silently. Route them somewhere for inspection:
good_records, bad_records = [], []
for record in batch:
try:
validated = parse_and_validate(record)
good_records.append(validated)
except ValidationError as e:
bad_records.append({"record": record, "error": str(e)})
process(good_records)
send_to_dlq(bad_records)
Checkpointing
For long-running pipelines, save progress so you can resume:
last_checkpoint = load_checkpoint("pipeline_x")
new_data = source.read(since=last_checkpoint)
result = transform(new_data)
write(result)
save_checkpoint("pipeline_x", new_data.max_timestamp)
Monitoring
Track these metrics for every pipeline:
- Records processed per run
- Processing time per stage
- Error rate and error types
- Data freshness (time since last successful run)
Simple patterns, applied consistently, produce reliable pipelines.