Kubernetes Horizontal Pod Autoscaler: Auto-Scaling

Key Insights

  • Horizontal Pod Autoscaler requires resource requests defined in your deployments and a functioning metrics-server to calculate scaling decisions based on actual resource utilization
  • HPA uses a control loop that checks metrics every 15 seconds by default but skips scaling while the ratio of observed to target metric stays within a 10% tolerance, preventing flapping
  • Custom and external metrics unlock sophisticated scaling strategies beyond CPU/memory, enabling autoscaling based on application-specific signals like queue depth or requests per second

Introduction to Horizontal Pod Autoscaling

Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed metrics. In production environments, traffic patterns are rarely constant—you face morning spikes, lunch lulls, and unexpected viral moments. Manual scaling is reactive, error-prone, and wastes engineering time. HPA solves this by continuously monitoring your workloads and adjusting capacity automatically.

Kubernetes offers two primary scaling approaches: horizontal and vertical. Horizontal scaling adds more pod replicas across nodes, distributing load. Vertical scaling (VPA) adjusts CPU and memory resources allocated to existing pods. HPA is generally preferred for stateless applications because it provides better fault tolerance and can scale beyond single-node capacity. Most production systems use HPA as their primary autoscaling mechanism, reserving VPA for workloads with unpredictable resource requirements.

How HPA Works Under the Hood

HPA operates as a control loop running in the Kubernetes control plane. Every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period), the HPA controller queries the metrics server for resource utilization data. It then calculates the desired replica count using this formula:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]

For CPU utilization, if your target is 70% and current average utilization across pods is 140%, HPA calculates: currentReplicas * (140/70) = currentReplicas * 2. The controller rounds up and initiates scaling.
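The core calculation can be sketched in a few lines of Python. This is illustrative and simplified: the real controller also accounts for unready pods, missing metrics, and min/max bounds, all omitted here.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Sketch of the HPA formula:
    desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue).
    The default 10% tolerance suppresses scaling when the ratio is close to 1.0."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no change
    return math.ceil(current_replicas * ratio)

# 3 replicas averaging 140% CPU against a 70% target -> scale to 6
print(desired_replicas(3, 140, 70))  # 6
# 72% against a 70% target is within the 10% tolerance -> stay at 3
print(desired_replicas(3, 72, 70))  # 3
```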

The metrics-server aggregates resource metrics from kubelets running on each node. It exposes these through the Metrics API (metrics.k8s.io), which HPA queries. For custom metrics, you’ll need additional components like the Prometheus Adapter or custom metrics API implementations.

HPA includes tolerance mechanisms to prevent flapping. It skips scaling while the ratio of the current metric value to the target is within 10% of 1.0 (the default tolerance). Additionally, scale-down decisions use a stabilization window, 5 minutes by default (configurable via --horizontal-pod-autoscaler-downscale-stabilization): HPA acts on the highest recommendation from that window rather than reacting to momentary dips.

Here’s a basic HPA targeting 70% CPU utilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Setting Up Your First HPA

Before creating an HPA, ensure metrics-server is installed in your cluster. Verify with:

kubectl get deployment metrics-server -n kube-system

Your deployment must specify resource requests. HPA calculates utilization as a percentage of requested resources—without requests, HPA cannot function. Here’s a properly configured deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: myapp:v1.2.0
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        ports:
        - containerPort: 8080

Create the HPA using either YAML or the imperative command:

# Imperative approach
kubectl autoscale deployment webapp \
  --cpu-percent=70 \
  --min=2 \
  --max=10 \
  -n production

# Declarative approach (recommended)
kubectl apply -f webapp-hpa.yaml

To test your HPA, generate load and observe scaling behavior:

# Watch HPA status
kubectl get hpa webapp-hpa -n production --watch

# In another terminal, generate load
# (assumes a Service named webapp exposes the deployment on port 8080)
kubectl run -it --rm load-generator \
  --image=busybox \
  --restart=Never \
  -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://webapp:8080; done"

# Monitor pod count
kubectl get pods -n production -l app=webapp --watch

You should see the HPA detect increased CPU utilization and scale up pods within 1-2 minutes.

Custom Metrics and Advanced Scaling Strategies

CPU and memory are starting points, but application-aware metrics provide superior autoscaling. Scale based on requests per second, queue depth, or custom business metrics. This requires the Custom Metrics API (custom.metrics.k8s.io) or External Metrics API (external.metrics.k8s.io).

The Prometheus Adapter is the most popular implementation. First, install it in your cluster, then configure it to expose metrics. Here’s an HPA using custom metrics:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa-custom
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue_name: "processing-queue"
      target:
        type: Value
        value: "30"

This HPA scales based on requests per second per pod and an external SQS queue depth metric. When average RPS exceeds 1000 or queue depth exceeds 30 messages, HPA scales up.
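For a Pods-type metric with an AverageValue target, the desired count works out to the total metric value divided by the per-pod target, rounded up. A minimal sketch with hypothetical per-pod request rates:

```python
import math

def pods_metric_replicas(per_pod_values, target_avg):
    """For a Pods metric with an AverageValue target, HPA compares the average
    across pods to the target; equivalently, ceil(total / target)."""
    return math.ceil(sum(per_pod_values) / target_avg)

# 3 pods serving 1500, 1400, and 1300 rps against a 1000 rps/pod target
print(pods_metric_replicas([1500, 1400, 1300], 1000))  # 5
```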

For Prometheus Adapter configuration, you’d create a ConfigMap defining metric queries:

rules:
- seriesQuery: 'http_requests_total{namespace="production",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

HPA Configuration Best Practices

Modern HPA (v2) supports sophisticated behavior configuration. Control scale-up and scale-down rates, stabilization windows, and policies to prevent aggressive scaling:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa-advanced
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      selectPolicy: Min

This configuration allows aggressive scale-up (doubling replicas or adding 4 pods every 15 seconds, whichever is greater) but conservative scale-down (maximum 50% reduction per minute, with a 5-minute stabilization window). The stabilization window prevents premature scale-down during temporary traffic dips.
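The policy selection can be sketched as follows, using the scaleUp values from the example above (a simplified illustration; the real controller also replays changes over each policy's periodSeconds window):

```python
import math

def scale_up_limit(current, percent=100, pods=4):
    """With selectPolicy: Max, the allowed increase per period is the larger of
    the Percent policy (100% here, i.e. doubling) and the Pods policy (4 here)."""
    return current + max(math.ceil(current * percent / 100), pods)

print(scale_up_limit(3))   # small deployments: the Pods policy (add 4) wins -> 7
print(scale_up_limit(10))  # larger deployments: the Percent policy (double) wins -> 20
```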

Set minReplicas based on baseline traffic and redundancy requirements—never use 1 in production. Set maxReplicas considering cluster capacity and cost constraints. Include a buffer for unexpected spikes.

When using multiple metrics, HPA calculates desired replicas for each metric independently and uses the highest value. This ensures all constraints are satisfied.
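Taking the highest per-metric proposal, clamped to the replica bounds, can be sketched as:

```python
import math

def hpa_decision(current, metrics, min_replicas, max_replicas):
    """metrics: list of (current_value, target_value) pairs. HPA computes a
    desired count per metric, takes the highest, and clamps to [min, max]."""
    proposals = [math.ceil(current * cur / tgt) for cur, tgt in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))

# CPU at 90% of a 70% target, memory at 60% of an 80% target, 4 replicas now
print(hpa_decision(4, [(90, 70), (60, 80)], 3, 50))  # CPU wins: ceil(4 * 90/70) = 6
```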

Troubleshooting Common Issues

The most common HPA failure is “unable to get metrics.” Check these systematically:

# Verify metrics-server is running
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml

# Check if metrics are available
kubectl top nodes
kubectl top pods -n production

# Examine HPA status
kubectl describe hpa webapp-hpa -n production

Look for these sections in describe output:

Conditions:
  Type            Status  Reason                   Message
  ----            ------  ------                   -------
  AbleToScale     True    ReadyForNewScale         recommended size matches current size
  ScalingActive   True    ValidMetricFound         the HPA was able to successfully calculate a replica count
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range

If ScalingActive is False, check the message. Common issues:

Missing resource requests: HPA cannot calculate utilization percentage without requests defined.

Metrics server unavailable: Ensure metrics-server pods are running and the API service is registered.

Insufficient metrics: HPA needs metrics from all pods. If pods are just starting, wait 1-2 minutes.

Debug custom metrics with:

# List available custom metrics
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .

# Query specific metric
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .

Monitor HPA decisions in real-time:

kubectl get hpa webapp-hpa -n production --watch

# Check events
kubectl get events -n production --field-selector involvedObject.name=webapp-hpa --sort-by='.lastTimestamp'

Conclusion and Production Considerations

HPA is essential for production Kubernetes deployments running variable workloads. Use it for stateless applications where adding replicas improves capacity linearly. It’s less suitable for stateful applications, databases, or workloads with high startup costs.

Cost implications matter. Aggressive autoscaling can significantly increase cloud bills during traffic spikes. Set reasonable maxReplicas values and monitor spending. Consider using Cluster Autoscaler alongside HPA to add nodes when pods are pending, but be aware of the 3-5 minute node provisioning delay.

Combining HPA with VPA is possible but requires careful configuration—set VPA to “Off” mode for recommendations only, or use VPA for resource requests while HPA manages replica count. Never let both modify the same dimension simultaneously.

For batch processing workloads, consider KEDA (Kubernetes Event-Driven Autoscaling) instead, which can scale to zero and supports event sources like Kafka, RabbitMQ, and cloud queues natively.

Always test your HPA configuration under realistic load before production deployment. Use load testing tools to simulate traffic patterns and verify scaling behavior matches expectations. Monitor HPA metrics and events in your observability stack to understand scaling patterns and optimize configurations over time.
