Data Pipeline Patterns That Actually Work

Common patterns for building reliable data pipelines without over-engineering.

Key Insights

  • Every pipeline stage should be idempotent — use overwrite-by-partition so re-runs produce the same result
  • Route bad records to a dead letter queue instead of dropping them silently; you’ll need them for debugging
  • Track records processed, processing time, error rate, and data freshness for every pipeline in production

Data pipelines fail. The difference between a good pipeline and a bad one is how gracefully it handles failure and how easy it is to debug.

Idempotent Writes

Every pipeline stage should be safe to re-run. Use overwrite-by-partition instead of append:

(
    df.write
    .mode("overwrite")
    .partitionBy("date")
    .parquet("s3://warehouse/events/")
)

If a job fails halfway through, re-running it produces the same result.
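The same idea works outside Spark. Here is a minimal sketch of overwrite-by-partition on a local filesystem; the paths and the `write_partition` helper are illustrative, not part of any library:

```python
import json
import shutil
from pathlib import Path

def write_partition(root: str, date: str, records: list[dict]) -> None:
    """Overwrite the partition for `date`: delete any old output, then rewrite."""
    part = Path(root) / f"date={date}"
    if part.exists():
        shutil.rmtree(part)  # drop partial output from a failed run
    part.mkdir(parents=True)
    (part / "part-0000.json").write_text(
        "\n".join(json.dumps(r) for r in records)
    )

# Re-running with the same input yields identical output:
write_partition("/tmp/events", "2024-01-01", [{"user_id": "a", "amount": 1.0}])
write_partition("/tmp/events", "2024-01-01", [{"user_id": "a", "amount": 1.0}])
```

Appending, by contrast, would leave duplicates behind after the second call.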

Schema Validation

Catch schema issues at the boundary, not deep in your transformations:

from datetime import datetime

EXPECTED_SCHEMA = {
    "user_id": str,
    "event_type": str,
    "timestamp": datetime,
    "amount": float,
}

def validate_schema(df, expected):
    # Assumes a pandas-style DataFrame: df.columns lists names, df[col] is iterable.
    for col, dtype in expected.items():
        if col not in df.columns:
            raise ValueError(f"Missing column: {col}")
        bad = [v for v in df[col] if v is not None and not isinstance(v, dtype)]
        if bad:
            raise TypeError(f"Column {col!r}: expected {dtype.__name__}, got e.g. {bad[0]!r}")
Dead Letter Queues

Don’t drop bad records silently. Route them somewhere for inspection:

good_records, bad_records = [], []
for record in batch:
    try:
        validated = parse_and_validate(record)
        good_records.append(validated)
    except ValidationError as e:
        bad_records.append({"record": record, "error": str(e)})

process(good_records)
send_to_dlq(bad_records)
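`send_to_dlq` is left undefined above. A minimal sketch that appends bad records to a local JSONL file — the path is illustrative; in production the destination might be an S3 prefix or a message queue:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

DLQ_PATH = Path("/tmp/pipeline_dlq.jsonl")  # illustrative location

def send_to_dlq(bad_records: list[dict]) -> None:
    """Append failed records with a timestamp so nothing is lost silently."""
    if not bad_records:
        return
    stamp = datetime.now(timezone.utc).isoformat()
    with DLQ_PATH.open("a") as f:
        for item in bad_records:
            f.write(json.dumps({"failed_at": stamp, **item}) + "\n")

send_to_dlq([{"record": {"user_id": None}, "error": "user_id is required"}])
```

JSONL keeps each record independently parseable, so one corrupt entry never blocks inspection of the rest.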

Checkpointing

For long-running pipelines, save progress so you can resume:

last_checkpoint = load_checkpoint("pipeline_x")
new_data = source.read(since=last_checkpoint)
result = transform(new_data)
write(result)
save_checkpoint("pipeline_x", new_data.max_timestamp)
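`load_checkpoint` and `save_checkpoint` can be as simple as a JSON file per pipeline. A sketch, assuming a local checkpoint directory (the path and file layout are illustrative):

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("/tmp/checkpoints")  # illustrative location

def load_checkpoint(pipeline: str, default=None):
    """Return the last saved watermark for this pipeline, or `default`."""
    path = CHECKPOINT_DIR / f"{pipeline}.json"
    if not path.exists():
        return default
    return json.loads(path.read_text())["watermark"]

def save_checkpoint(pipeline: str, watermark) -> None:
    """Persist the watermark; call this only after the write has succeeded."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    path = CHECKPOINT_DIR / f"{pipeline}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"watermark": watermark}))
    tmp.replace(path)  # atomic rename: a crash never leaves a half-written file

save_checkpoint("pipeline_x", "2024-01-01T12:00:00")
```

The write-then-rename step matters: if the process dies mid-save, the previous checkpoint survives intact and the next run simply reprocesses a little extra data — which is safe, because the writes are idempotent.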

Monitoring

Track these metrics for every pipeline:

  • Records processed per run
  • Processing time per stage
  • Error rate and error types
  • Data freshness (time since last successful run)
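A minimal sketch of collecting the first three of these per run (the names and structure are illustrative; freshness is typically derived from the timestamp of the last successful run, stored alongside your checkpoints):

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunMetrics:
    records_processed: int = 0
    errors: dict = field(default_factory=dict)         # error type -> count
    stage_seconds: dict = field(default_factory=dict)  # stage name -> duration

def run_stage(metrics: RunMetrics, name: str, fn, records):
    """Apply `fn` to each record, counting successes, errors, and elapsed time."""
    start = time.monotonic()
    out = []
    for r in records:
        try:
            out.append(fn(r))
            metrics.records_processed += 1
        except Exception as e:
            kind = type(e).__name__
            metrics.errors[kind] = metrics.errors.get(kind, 0) + 1
    metrics.stage_seconds[name] = time.monotonic() - start
    return out

metrics = RunMetrics()
parsed = run_stage(metrics, "parse", int, ["1", "2", "x"])
```

After the run, `metrics` holds everything a dashboard or alert needs: two records processed, one `ValueError`, and the parse stage's duration.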

Simple patterns, applied consistently, produce reliable pipelines.
