Kubernetes Jobs and CronJobs: Batch Processing
Key Insights
- Jobs guarantee task completion with configurable parallelism and retry logic, while CronJobs automate recurring batch workloads using familiar cron syntax—choose Jobs for one-time processing and CronJobs for scheduled operations.
- Parallelism strategies dramatically impact performance: use fixed completion counts for known workloads, indexed jobs for distributed processing with coordination, and work queues for dynamic task distribution.
- Production batch workloads require TTL controllers for automatic cleanup, proper backoffLimit configuration to prevent infinite retries, and concurrencyPolicy settings to avoid resource contention from overlapping CronJob executions.
Introduction to Batch Processing in Kubernetes
Kubernetes excels at running long-lived services, but batch processing represents an equally important workload pattern. Unlike Deployments that maintain a desired number of continuously running pods, Jobs and CronJobs execute tasks to completion and terminate.
Use Jobs when you need guaranteed completion of finite tasks: data migrations, report generation, image processing, or ETL pipelines. Use CronJobs for recurring scheduled work: nightly backups, hourly data synchronization, or periodic cleanup operations. Reserve Deployments for services that should run indefinitely and handle ongoing traffic.
The fundamental difference is lifecycle management. Deployments replace failed pods to maintain availability. Jobs retry failed pods until success criteria are met, then stop. This distinction matters for resource utilization, cost optimization, and operational patterns.
Kubernetes Jobs Fundamentals
A Job creates one or more pods and ensures a specified number complete successfully. The Job controller tracks completion and manages retries according to your configuration.
Key Job specifications:
- completions: Number of successful pod completions required (default: 1)
- parallelism: Maximum pods running concurrently (default: 1)
- backoffLimit: Maximum retry attempts before marking the Job as failed (default: 6)
- restartPolicy: Must be OnFailure or Never (not Always as in Deployments)
Here’s a basic Job that processes data files:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
spec:
  completions: 1
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: processor
        image: myapp/data-processor:v1.2
        command: ["python", "process_data.py"]
        args: ["--input", "/data/raw", "--output", "/data/processed"]
        env:
        - name: BATCH_SIZE
          value: "1000"
        volumeMounts:
        - name: data-volume
          mountPath: /data
      volumes:
      - name: data-volume
        persistentVolumeClaim:
          claimName: data-pvc
```
This Job runs once to completion. If the pod fails, Kubernetes retries it up to 4 times. With restartPolicy: OnFailure, the failed container is restarted in the same pod; with Never, Kubernetes creates a new pod for each retry.
Job Patterns and Parallelism
Parallelism enables processing large workloads faster by running multiple pods simultaneously. Three primary patterns exist:
Fixed Completion Count: Process N items with M workers. Set both completions and parallelism:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-processor
spec:
  completions: 10
  parallelism: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: myapp/worker:v2.0
        command: ["./process_chunk.sh"]
        env:
        - name: CHUNK_SIZE
          value: "500"
```
This creates 10 successful completions using up to 3 concurrent pods. Kubernetes automatically manages pod creation as earlier pods complete.
Indexed Jobs: Each pod receives a unique completion index (0 to completions-1), enabling coordinated parallel processing:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-processor
spec:
  completions: 5
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: myapp/indexed-worker:v1.0
        command: ["python", "process_partition.py"]
        env:
        - name: PARTITION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
```
Each pod can read its completion index from the automatically set JOB_COMPLETION_INDEX environment variable or, as above, from the batch.kubernetes.io/job-completion-index pod annotation. This pattern works well for processing database shards, file ranges, or date partitions.
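As an illustration, a hypothetical process_partition.py (the script name, item count, and partition count are placeholders, not part of any real image) might map the completion index to a contiguous slice of the workload like this:

```python
import os

def partition_bounds(index: int, total_items: int, partitions: int) -> range:
    """Return the item range this completion index is responsible for."""
    size = -(-total_items // partitions)  # ceiling division
    start = index * size
    return range(start, min(start + size, total_items))

if __name__ == "__main__":
    # In Indexed mode the controller injects JOB_COMPLETION_INDEX automatically.
    index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    for item in partition_bounds(index, total_items=1000, partitions=5):
        pass  # process item here
```

With completions: 5, index 0 handles items 0-199, index 4 handles items 800-999, and no two pods touch the same item.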
Work Queue Pattern: Pods pull tasks from an external queue until empty. Set parallelism without completions:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-consumer
spec:
  parallelism: 5
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: consumer
        image: myapp/queue-worker:v1.5
        env:
        - name: QUEUE_URL
          value: "redis://redis-service:6379/tasks"
        - name: QUEUE_EMPTY_WAIT
          value: "30"
```
Workers signal completion by exiting successfully when the queue is empty. This provides dynamic load balancing but requires external queue infrastructure.
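The consumer loop itself can be minimal. This sketch uses a plain deque as a stand-in for the Redis queue (names are illustrative), but it shows the key behavior: pull until empty, then return so the process exits successfully:

```python
from collections import deque
from typing import Callable

def run_worker(queue: deque, process: Callable) -> int:
    """Drain the queue, then return. Exiting with status 0 once no
    work remains is what lets the Job record this pod as succeeded."""
    handled = 0
    while True:
        try:
            task = queue.popleft()
        except IndexError:
            break  # queue drained: exit successfully
        process(task)
        handled += 1
    return handled
```

A real worker would replace popleft with an atomic pop against the shared queue (e.g. Redis LPOP) so two pods never receive the same task.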
CronJobs for Scheduled Tasks
CronJobs wrap Jobs with scheduling logic using standard cron syntax. They’re ideal for recurring maintenance, reporting, and data synchronization tasks.
Basic CronJob for nightly database backups:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: postgres:15-alpine
            command: ["/bin/sh", "-c"]
            args:
            - |
              pg_dump -h postgres-service -U $POSTGRES_USER $POSTGRES_DB | \
                gzip > /backups/backup-$(date +%Y%m%d-%H%M%S).sql.gz
            env:
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username
            - name: PGPASSWORD  # pg_dump reads the password from PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
            - name: POSTGRES_DB
              value: production
            volumeMounts:
            - name: backup-storage
              mountPath: /backups
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup-pvc
```
Control concurrent execution and history retention:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hourly-sync
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid  # Prevent overlapping runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  startingDeadlineSeconds: 300  # Skip if 5+ minutes late
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: sync
            image: myapp/sync-service:v3.1
            command: ["./sync.sh"]
```
concurrencyPolicy options:
- Allow: Multiple Jobs can run simultaneously (default)
- Forbid: Skip the new run if the previous Job is still active
- Replace: Cancel the running Job and start a new one
Error Handling and Retry Logic
Robust error handling prevents resource waste and ensures reliable completion. Configure retry behavior with backoffLimit and activeDeadlineSeconds:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resilient-processor
spec:
  backoffLimit: 6  # Maximum retries
  activeDeadlineSeconds: 3600  # Fail after 1 hour total
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: processor
        image: myapp/processor:v2.3
        command: ["python", "process.py"]
        env:
        - name: MAX_RETRIES
          value: "3"
        - name: RETRY_DELAY
          value: "10"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
```
Kubernetes applies exponential backoff between retries (10s, 20s, 40s, and so on), capped at 6 minutes. activeDeadlineSeconds provides an absolute timeout regardless of retry count; once it expires, the Job and its running pods are terminated.
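A rough sketch of that schedule (the controller also adds jitter in practice, so treat these as nominal values):

```python
def backoff_delays(retries: int, base: float = 10.0, cap: float = 360.0) -> list:
    """Nominal delay before each retry: doubling from 10s, capped at 6 minutes."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]
```

With backoffLimit: 6 the cumulative waiting alone approaches 10 minutes, which is worth keeping in mind when choosing activeDeadlineSeconds.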
Design idempotent jobs that can safely retry. Use unique transaction IDs, check for existing output before processing, and implement proper state management. Avoid jobs that partially complete and can’t resume—they’ll waste resources on every retry.
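A minimal sketch of the check-before-process idea (file names and the transformation are placeholders):

```python
from pathlib import Path

def process_item(item: str, out_dir: Path) -> bool:
    """Process one item unless its output already exists.
    Returns True if work was done, False if the item was skipped."""
    out_file = out_dir / f"{item}.done"
    if out_file.exists():
        return False  # a previous attempt already finished this item
    result = item.upper()  # stand-in for the real transformation
    out_file.write_text(result)  # prefer write-to-temp then atomic rename
    return True
```

On retry, already-finished items are skipped, so a pod that died halfway resumes where the previous attempt stopped instead of redoing everything.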
Monitoring and Troubleshooting
Track Job execution with kubectl commands:
```shell
# List all jobs and their status
kubectl get jobs

# Detailed job information
kubectl describe job data-processor

# View job events
kubectl get events --field-selector involvedObject.name=data-processor

# Get pods created by job
kubectl get pods --selector=job-name=data-processor

# View logs from job pods
kubectl logs job/data-processor

# Follow logs from latest pod
kubectl logs -f job/data-processor --tail=50

# Check CronJob schedule and last run
kubectl get cronjobs
kubectl describe cronjob database-backup
```
Monitor job completion programmatically:
```shell
# Wait for job completion (exits 0 on success)
kubectl wait --for=condition=complete --timeout=600s job/data-processor

# Check for failure
kubectl wait --for=condition=failed --timeout=600s job/data-processor
```
Jobs create events visible in kubectl describe output. Failed jobs remain in the cluster for debugging—inspect pod logs and exit codes to diagnose issues.
Production Best Practices
Production-ready CronJob with comprehensive configuration:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: production-etl
  namespace: data-pipeline
spec:
  schedule: "*/30 * * * *"  # Every 30 minutes
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 2
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 7200  # Auto-delete after 2 hours
      backoffLimit: 3
      activeDeadlineSeconds: 1800
      template:
        metadata:
          labels:
            app: etl-pipeline
            component: data-processor
        spec:
          serviceAccountName: etl-service-account
          restartPolicy: OnFailure
          containers:
          - name: etl
            image: myapp/etl-processor:v4.2.1
            imagePullPolicy: IfNotPresent
            command: ["python", "-u", "etl_pipeline.py"]
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: database-credentials
                  key: connection-string
            - name: S3_BUCKET
              valueFrom:
                configMapKeyRef:
                  name: etl-config
                  key: output-bucket
            resources:
              requests:
                memory: "2Gi"
                cpu: "1000m"
              limits:
                memory: "4Gi"
                cpu: "2000m"
            securityContext:
              runAsNonRoot: true
              runAsUser: 1000
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
            volumeMounts:
            - name: temp
              mountPath: /tmp
          volumes:
          - name: temp
            emptyDir: {}
```
Critical production considerations:
- TTL Controllers: Set ttlSecondsAfterFinished to automatically clean up completed Jobs. Without it, finished Jobs and their pods accumulate indefinitely.
- Resource Limits: Always specify requests and limits. Batch jobs often process variable data sizes; limits prevent resource exhaustion.
- Security Context: Run as a non-root user, disable privilege escalation, and use a read-only root filesystem where possible.
- Service Accounts: Create dedicated service accounts with minimal permissions for accessing secrets, APIs, or cloud resources.
- Monitoring Integration: Add labels for Prometheus scraping, emit metrics from job code, and configure alerting for failed jobs.
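For the metrics piece, one lightweight option is to have the job emit samples in Prometheus text exposition format (for example, to a Pushgateway before exiting). This stdlib-only formatter is an illustrative sketch; a real job would typically use a client library such as prometheus_client instead:

```python
def exposition_line(metric: str, value: float, labels: dict) -> str:
    """Format one sample in Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{label_str}}} {value}"
```

Emitting a counter of processed records per run, labeled with the job name, is usually enough to alert on a pipeline that silently stops making progress.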
Batch processing in Kubernetes provides powerful primitives for reliable task execution. Jobs guarantee completion with configurable parallelism and retry logic. CronJobs automate recurring work with familiar scheduling syntax. Proper configuration of timeouts, resource limits, and cleanup policies ensures efficient, reliable batch workloads in production environments.