Kubernetes Jobs and CronJobs: Batch Processing


Key Insights

  • Jobs guarantee task completion with configurable parallelism and retry logic, while CronJobs automate recurring batch workloads using familiar cron syntax—choose Jobs for one-time processing and CronJobs for scheduled operations.
  • Parallelism strategies dramatically impact performance: use fixed completion counts for known workloads, indexed jobs for distributed processing with coordination, and work queues for dynamic task distribution.
  • Production batch workloads require TTL controllers for automatic cleanup, proper backoffLimit configuration to prevent infinite retries, and concurrencyPolicy settings to avoid resource contention from overlapping CronJob executions.

Introduction to Batch Processing in Kubernetes

Kubernetes excels at running long-lived services, but batch processing represents an equally important workload pattern. Unlike Deployments that maintain a desired number of continuously running pods, Jobs and CronJobs execute tasks to completion and terminate.

Use Jobs when you need guaranteed completion of finite tasks: data migrations, report generation, image processing, or ETL pipelines. Use CronJobs for recurring scheduled work: nightly backups, hourly data synchronization, or periodic cleanup operations. Reserve Deployments for services that should run indefinitely and handle ongoing traffic.

The fundamental difference is lifecycle management. Deployments replace failed pods to maintain availability. Jobs retry failed pods until success criteria are met, then stop. This distinction matters for resource utilization, cost optimization, and operational patterns.

Kubernetes Jobs Fundamentals

A Job creates one or more pods and ensures a specified number complete successfully. The Job controller tracks completion and manages retries according to your configuration.

Key Job specifications:

  • completions: Number of successful pod completions required (default: 1)
  • parallelism: Maximum pods running concurrently (default: 1)
  • backoffLimit: Maximum retry attempts before marking the Job as failed (default: 6)
  • restartPolicy: Must be OnFailure or Never (not Always like Deployments)

Here’s a basic Job that processes data files:

apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
spec:
  completions: 1
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: processor
        image: myapp/data-processor:v1.2
        command: ["python", "process_data.py"]
        args: ["--input", "/data/raw", "--output", "/data/processed"]
        env:
        - name: BATCH_SIZE
          value: "1000"
        volumeMounts:
        - name: data-volume
          mountPath: /data
      volumes:
      - name: data-volume
        persistentVolumeClaim:
          claimName: data-pvc

This Job runs once to completion. If the pod fails, Kubernetes retries it up to 4 times. With restartPolicy: OnFailure, the kubelet restarts the container inside the same pod; with Never, the Job controller creates a new pod for each retry.

Job Patterns and Parallelism

Parallelism enables processing large workloads faster by running multiple pods simultaneously. Three primary patterns exist:

Fixed Completion Count: Process N items with M workers. Set both completions and parallelism:

apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-processor
spec:
  completions: 10
  parallelism: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: myapp/worker:v2.0
        command: ["./process_chunk.sh"]
        env:
        - name: CHUNK_SIZE
          value: "500"

This Job runs pods until 10 complete successfully, with at most 3 running concurrently. Kubernetes automatically creates replacement pods as earlier ones finish.

Indexed Jobs: Each pod receives a unique completion index (0 to completions-1), enabling coordinated parallel processing:

apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-processor
spec:
  completions: 5
  parallelism: 2
  completionMode: Indexed
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: myapp/indexed-worker:v1.0
        command: ["python", "process_partition.py"]
        env:
        - name: PARTITION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']

Kubernetes automatically sets the JOB_COMPLETION_INDEX environment variable in each pod of an Indexed Job; the manifest above also surfaces the same value as PARTITION_INDEX through the downward API annotation. Each pod uses its index to decide which slice of the work to process. This pattern works well for processing database shards, file ranges, or date partitions.
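As a sketch of what an indexed worker might do (the partition names and mapping function here are illustrative, not part of the manifest above), each pod reads its completion index from the environment and selects its slice of the work:

```python
import os

def partition_for_index(index: int, partitions: list[str]) -> str:
    """Map a Job completion index (0..completions-1) to one partition."""
    if not 0 <= index < len(partitions):
        raise ValueError(f"index {index} out of range for {len(partitions)} partitions")
    return partitions[index]

if __name__ == "__main__":
    # Kubernetes sets JOB_COMPLETION_INDEX automatically in Indexed mode;
    # the manifest above also exposes the same value as PARTITION_INDEX.
    index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    partitions = ["shard-0", "shard-1", "shard-2", "shard-3", "shard-4"]  # illustrative
    print(f"processing {partition_for_index(index, partitions)}")
```

Because the index is stable across retries, a pod that fails and restarts picks up exactly the same partition it was assigned before.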

Work Queue Pattern: Pods pull tasks from an external queue until empty. Set parallelism without completions:

apiVersion: batch/v1
kind: Job
metadata:
  name: queue-consumer
spec:
  parallelism: 5
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: consumer
        image: myapp/queue-worker:v1.5
        env:
        - name: QUEUE_URL
          value: "redis://redis-service:6379/tasks"
        - name: QUEUE_EMPTY_WAIT
          value: "30"

Workers signal completion by exiting successfully when the queue is empty. This provides dynamic load balancing but requires external queue infrastructure.
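A minimal sketch of that consumer loop, with the queue abstracted behind a `pop_task` callable (a real worker would pop from Redis or a similar broker; the names here are illustrative):

```python
from typing import Callable, Optional

def consume(pop_task: Callable[[], Optional[str]],
            handle: Callable[[str], None]) -> int:
    """Pull tasks until the queue is empty; return the number processed.

    Returning (and letting the process exit with code 0) is how the worker
    signals success; the Job completes once all parallel pods exit cleanly.
    """
    processed = 0
    while (task := pop_task()) is not None:
        handle(task)
        processed += 1
    return processed

# In-memory stand-in for a shared queue, for illustration only.
tasks = ["task-1", "task-2", "task-3"]
done = []
count = consume(lambda: tasks.pop(0) if tasks else None, done.append)
```

A production worker would also honor a wait-and-recheck interval (like the QUEUE_EMPTY_WAIT setting above) before concluding the queue is truly drained.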

CronJobs for Scheduled Tasks

CronJobs wrap Jobs with scheduling logic using standard cron syntax. They’re ideal for recurring maintenance, reporting, and data synchronization tasks.

Basic CronJob for nightly database backups:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: postgres:15-alpine
            command: ["/bin/sh", "-c"]
            args:
            - |
              pg_dump -h postgres-service -U $POSTGRES_USER $POSTGRES_DB | \
              gzip > /backups/backup-$(date +%Y%m%d-%H%M%S).sql.gz              
            env:
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: username
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
            - name: POSTGRES_DB
              value: production
            volumeMounts:
            - name: backup-storage
              mountPath: /backups
          volumes:
          - name: backup-storage
            persistentVolumeClaim:
              claimName: backup-pvc

Control concurrent execution and history retention:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hourly-sync
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid  # Prevent overlapping runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  startingDeadlineSeconds: 300  # Skip if 5+ minutes late
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: sync
            image: myapp/sync-service:v3.1
            command: ["./sync.sh"]

concurrencyPolicy options:

  • Allow: Multiple Jobs can run simultaneously (default)
  • Forbid: Skip new run if previous Job still active
  • Replace: Cancel running Job and start new one

Error Handling and Retry Logic

Robust error handling prevents resource waste and ensures reliable completion. Configure retry behavior with backoffLimit and activeDeadlineSeconds:

apiVersion: batch/v1
kind: Job
metadata:
  name: resilient-processor
spec:
  backoffLimit: 6  # Maximum retries
  activeDeadlineSeconds: 3600  # Fail after 1 hour total
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: processor
        image: myapp/processor:v2.3
        command: ["python", "process.py"]
        env:
        - name: MAX_RETRIES
          value: "3"
        - name: RETRY_DELAY
          value: "10"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"

Kubernetes applies exponential backoff between retries: 10s, 20s, 40s, up to 6 minutes. The activeDeadlineSeconds provides an absolute timeout regardless of retry count.

Design idempotent jobs that can safely retry. Use unique transaction IDs, check for existing output before processing, and implement proper state management. Avoid jobs that partially complete and can’t resume—they’ll waste resources on every retry.
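One common idempotency technique is to check for existing output before doing any work, and to publish results atomically so a killed pod never leaves partial output behind. A minimal sketch (the paths and the process callable are illustrative):

```python
import tempfile
from pathlib import Path

def process_if_needed(input_path: Path, output_path: Path, process) -> bool:
    """Process the input only if its output doesn't already exist.

    Returns True if work was done, False if a previous attempt already
    finished this item, which makes retries safe to repeat.
    """
    if output_path.exists():
        return False  # a previous attempt already completed this item
    tmp = output_path.with_suffix(output_path.suffix + ".tmp")
    tmp.write_text(process(input_path.read_text()))
    tmp.rename(output_path)  # atomic publish: readers never see partial output
    return True

# Demonstration with a throwaway directory (illustrative paths).
workdir = Path(tempfile.mkdtemp())
(workdir / "in.txt").write_text("raw data")
first = process_if_needed(workdir / "in.txt", workdir / "out.txt", str.upper)
second = process_if_needed(workdir / "in.txt", workdir / "out.txt", str.upper)  # retry: no-op
```

The write-to-temp-then-rename step matters: rename within one filesystem is atomic on POSIX, so a retry either sees no output (and redoes the work) or sees complete output (and skips it), never a half-written file.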

Monitoring and Troubleshooting

Track Job execution with kubectl commands:

# List all jobs and their status
kubectl get jobs

# Detailed job information
kubectl describe job data-processor

# View job events
kubectl get events --field-selector involvedObject.name=data-processor

# Get pods created by job
kubectl get pods --selector=job-name=data-processor

# View logs from job pods
kubectl logs job/data-processor

# Follow logs from latest pod
kubectl logs -f job/data-processor --tail=50

# Check CronJob schedule and last run
kubectl get cronjobs
kubectl describe cronjob database-backup

Monitor job completion programmatically:

# Wait for job completion (exits 0 on success)
kubectl wait --for=condition=complete --timeout=600s job/data-processor

# Check for failure
kubectl wait --for=condition=failed --timeout=600s job/data-processor

Jobs create events visible in kubectl describe output. Failed jobs remain in the cluster for debugging—inspect pod logs and exit codes to diagnose issues.

Production Best Practices

Production-ready CronJob with comprehensive configuration:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: production-etl
  namespace: data-pipeline
spec:
  schedule: "*/30 * * * *"  # Every 30 minutes
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 2
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 7200  # Auto-delete after 2 hours
      backoffLimit: 3
      activeDeadlineSeconds: 1800
      template:
        metadata:
          labels:
            app: etl-pipeline
            component: data-processor
        spec:
          serviceAccountName: etl-service-account
          restartPolicy: OnFailure
          containers:
          - name: etl
            image: myapp/etl-processor:v4.2.1
            imagePullPolicy: IfNotPresent
            command: ["python", "-u", "etl_pipeline.py"]
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: database-credentials
                  key: connection-string
            - name: S3_BUCKET
              valueFrom:
                configMapKeyRef:
                  name: etl-config
                  key: output-bucket
            resources:
              requests:
                memory: "2Gi"
                cpu: "1000m"
              limits:
                memory: "4Gi"
                cpu: "2000m"
            securityContext:
              runAsNonRoot: true
              runAsUser: 1000
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
            volumeMounts:
            - name: temp
              mountPath: /tmp
          volumes:
          - name: temp
            emptyDir: {}

Critical production considerations:

TTL Controllers: Set ttlSecondsAfterFinished to automatically clean up completed Jobs. Without this, Jobs accumulate indefinitely.

Resource Limits: Always specify requests and limits. Batch jobs often process variable data sizes—limits prevent resource exhaustion.

Security Context: Run as non-root user, disable privilege escalation, and use read-only root filesystems when possible.

Service Accounts: Create dedicated service accounts with minimal permissions for accessing secrets, APIs, or cloud resources.

Monitoring Integration: Add labels for Prometheus scraping, emit metrics from job code, and configure alerting for failed jobs.

Batch processing in Kubernetes provides powerful primitives for reliable task execution. Jobs guarantee completion with configurable parallelism and retry logic. CronJobs automate recurring work with familiar scheduling syntax. Proper configuration of timeouts, resource limits, and cleanup policies ensures efficient, reliable batch workloads in production environments.
