Data Pipeline Patterns That Actually Work

Common patterns for building reliable data pipelines without over-engineering.

Key Insights

  • Every pipeline stage should be idempotent — use overwrite-by-partition so re-runs produce the same result
  • Route bad records to a dead letter queue instead of dropping them silently; you’ll need them for debugging
  • Track records processed, processing time, error rate, and data freshness for every pipeline in production

Data pipelines fail. The difference between a good pipeline and a bad one is how gracefully it handles failure and how easy it is to debug.

Idempotent Writes

Every pipeline stage should be safe to re-run. Use overwrite-by-partition instead of append:

(
    df.write
    .mode("overwrite")
    .partitionBy("date")
    .parquet("s3://warehouse/events/")
)

If a job fails halfway through, re-running it produces the same result.
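The same idea works outside Spark. Here is a minimal sketch of overwrite-by-partition on a local filesystem; the paths and the `write_partition` helper are illustrative, not part of any library:

```python
import json
import shutil
from pathlib import Path

def write_partition(root: str, date: str, records: list[dict]) -> None:
    """Overwrite the partition for `date`: delete any old output, then rewrite."""
    part = Path(root) / f"date={date}"
    if part.exists():
        shutil.rmtree(part)  # drop partial output from a failed run
    part.mkdir(parents=True)
    (part / "part-0000.json").write_text(
        "\n".join(json.dumps(r) for r in records)
    )

# Re-running with the same input yields identical output:
write_partition("/tmp/events", "2024-01-01", [{"user_id": "a", "amount": 1.0}])
write_partition("/tmp/events", "2024-01-01", [{"user_id": "a", "amount": 1.0}])
```

Appending, by contrast, would leave duplicates behind after the second call.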

Schema Validation

Catch schema issues at the boundary, not deep in your transformations:

from datetime import datetime

EXPECTED_SCHEMA = {
    "user_id": str,
    "event_type": str,
    "timestamp": datetime,
    "amount": float,
}

def validate_schema(df, expected):
    # Assumes a pandas-style DataFrame: df.columns lists names, df[col] is iterable.
    for col, dtype in expected.items():
        if col not in df.columns:
            raise ValueError(f"Missing column: {col}")
        bad = [v for v in df[col] if v is not None and not isinstance(v, dtype)]
        if bad:
            raise TypeError(f"Column {col!r}: expected {dtype.__name__}, got e.g. {bad[0]!r}")
Dead Letter Queues

Don’t drop bad records silently. Route them somewhere for inspection:

good_records, bad_records = [], []
for record in batch:
    try:
        validated = parse_and_validate(record)
        good_records.append(validated)
    except ValidationError as e:
        bad_records.append({"record": record, "error": str(e)})

process(good_records)
send_to_dlq(bad_records)
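`send_to_dlq` is left undefined above. A minimal sketch that appends bad records to a local JSONL file — the path is illustrative; in production the destination might be an S3 prefix or a message queue:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

DLQ_PATH = Path("/tmp/pipeline_dlq.jsonl")  # illustrative location

def send_to_dlq(bad_records: list[dict]) -> None:
    """Append failed records with a timestamp so nothing is lost silently."""
    if not bad_records:
        return
    stamp = datetime.now(timezone.utc).isoformat()
    with DLQ_PATH.open("a") as f:
        for item in bad_records:
            f.write(json.dumps({"failed_at": stamp, **item}) + "\n")

send_to_dlq([{"record": {"user_id": None}, "error": "user_id is required"}])
```

JSONL keeps each record independently parseable, so one corrupt entry never blocks inspection of the rest.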

Checkpointing

For long-running pipelines, save progress so you can resume:

last_checkpoint = load_checkpoint("pipeline_x")
new_data = source.read(since=last_checkpoint)
result = transform(new_data)
write(result)
save_checkpoint("pipeline_x", new_data.max_timestamp)
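`load_checkpoint` and `save_checkpoint` can be as simple as a JSON file per pipeline. A sketch, assuming a local checkpoint directory (the path and file layout are illustrative):

```python
import json
from pathlib import Path

CHECKPOINT_DIR = Path("/tmp/checkpoints")  # illustrative location

def load_checkpoint(pipeline: str, default=None):
    """Return the last saved watermark for this pipeline, or `default`."""
    path = CHECKPOINT_DIR / f"{pipeline}.json"
    if not path.exists():
        return default
    return json.loads(path.read_text())["watermark"]

def save_checkpoint(pipeline: str, watermark) -> None:
    """Persist the watermark; call this only after the write has succeeded."""
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    path = CHECKPOINT_DIR / f"{pipeline}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"watermark": watermark}))
    tmp.replace(path)  # atomic rename: a crash never leaves a half-written file

save_checkpoint("pipeline_x", "2024-01-01T12:00:00")
```

The write-then-rename step matters: if the process dies mid-save, the previous checkpoint survives intact and the next run simply reprocesses a little extra data — which is safe, because the writes are idempotent.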

Monitoring

Track these metrics for every pipeline:

  • Records processed per run
  • Processing time per stage
  • Error rate and error types
  • Data freshness (time since last successful run)
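A minimal sketch of collecting the first three of these per run (the names and structure are illustrative; freshness is typically derived from the timestamp of the last successful run, stored alongside your checkpoints):

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunMetrics:
    records_processed: int = 0
    errors: dict = field(default_factory=dict)         # error type -> count
    stage_seconds: dict = field(default_factory=dict)  # stage name -> duration

def run_stage(metrics: RunMetrics, name: str, fn, records):
    """Apply `fn` to each record, counting successes, errors, and elapsed time."""
    start = time.monotonic()
    out = []
    for r in records:
        try:
            out.append(fn(r))
            metrics.records_processed += 1
        except Exception as e:
            kind = type(e).__name__
            metrics.errors[kind] = metrics.errors.get(kind, 0) + 1
    metrics.stage_seconds[name] = time.monotonic() - start
    return out

metrics = RunMetrics()
parsed = run_stage(metrics, "parse", int, ["1", "2", "x"])
```

After the run, `metrics` holds everything a dashboard or alert needs: two records processed, one `ValueError`, and the parse stage's duration.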

Simple patterns, applied consistently, produce reliable pipelines.
