Kubernetes Horizontal Pod Autoscaler: Auto-Scaling
Key Insights
- Horizontal Pod Autoscaler requires resource requests defined in your deployments and a functioning metrics-server to calculate scaling decisions based on actual resource utilization
- HPA uses a control loop that checks metrics every 15 seconds by default, but only scales when the current-to-target metric ratio deviates from 1.0 by more than 10%, preventing flapping
- Custom and external metrics unlock sophisticated scaling strategies beyond CPU/memory, enabling autoscaling based on application-specific signals like queue depth or requests per second
Introduction to Horizontal Pod Autoscaling
Horizontal Pod Autoscaler (HPA) automatically adjusts the number of pod replicas in a deployment, replica set, or stateful set based on observed metrics. In production environments, traffic patterns are rarely constant—you face morning spikes, lunch lulls, and unexpected viral moments. Manual scaling is reactive, error-prone, and wastes engineering time. HPA solves this by continuously monitoring your workloads and adjusting capacity automatically.
Kubernetes offers two primary scaling approaches: horizontal and vertical. Horizontal scaling adds more pod replicas across nodes, distributing load. Vertical scaling (VPA) adjusts CPU and memory resources allocated to existing pods. HPA is generally preferred for stateless applications because it provides better fault tolerance and can scale beyond single-node capacity. Most production systems use HPA as their primary autoscaling mechanism, reserving VPA for workloads with unpredictable resource requirements.
How HPA Works Under the Hood
HPA operates as a control loop running in the Kubernetes control plane. Every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period), the HPA controller queries the metrics server for resource utilization data. It then calculates the desired replica count using this formula:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
For CPU utilization, if your target is 70% and current average utilization across pods is 140%, HPA calculates: currentReplicas * (140/70) = currentReplicas * 2. The controller rounds up and initiates scaling.
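The arithmetic behind this formula can be sketched in a few lines of Python (a minimal illustration of the calculation, not the controller's actual code):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """Core HPA scaling formula: scale proportionally to metric
    pressure, rounding up so capacity is never under-provisioned."""
    ratio = current_metric / target_metric
    return math.ceil(current_replicas * ratio)

# Target 70% CPU, observed average 140% across 4 pods:
print(desired_replicas(4, 140, 70))  # doubles the replica count -> 8
```

Because of the ceiling, the controller always errs on the side of slightly more capacity when the ratio is not a whole number.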
The metrics-server aggregates resource metrics from kubelets running on each node. It exposes these through the Metrics API (metrics.k8s.io), which HPA queries. For custom metrics, you’ll need additional components like the Prometheus Adapter or custom metrics API implementations.
HPA includes tolerance mechanisms to prevent flapping. It only scales when the current-to-target metric ratio deviates from 1.0 by more than 10% (configurable via --horizontal-pod-autoscaler-tolerance). Additionally, scale-down decisions are smoothed over a 5-minute stabilization window by default (configurable via --horizontal-pod-autoscaler-downscale-stabilization): HPA acts on the highest recommendation seen in that window before removing pods.
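The tolerance check can be sketched like this (illustrative only; the 0.10 default corresponds to the --horizontal-pod-autoscaler-tolerance flag):

```python
TOLERANCE = 0.10  # default value of --horizontal-pod-autoscaler-tolerance

def should_scale(current_metric, target_metric):
    """HPA skips scaling when the current/target ratio is within
    tolerance of 1.0, so small fluctuations cause no churn."""
    ratio = current_metric / target_metric
    return abs(ratio - 1.0) > TOLERANCE

print(should_scale(75, 70))   # ratio ~1.07, within tolerance -> False
print(should_scale(140, 70))  # ratio 2.0, well outside -> True
```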
Here’s a basic HPA targeting 70% CPU utilization:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
Setting Up Your First HPA
Before creating an HPA, ensure metrics-server is installed in your cluster. Verify with:
kubectl get deployment metrics-server -n kube-system
Your deployment must specify resource requests. HPA calculates utilization as a percentage of requested resources—without requests, HPA cannot function. Here’s a properly configured deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: production
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: myapp:v1.2.0
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
        ports:
        - containerPort: 8080
Create the HPA using either YAML or the imperative command:
# Imperative approach
kubectl autoscale deployment webapp \
--cpu-percent=70 \
--min=2 \
--max=10 \
-n production
# Declarative approach (recommended)
kubectl apply -f webapp-hpa.yaml
To test your HPA, generate load and observe scaling behavior:
# Watch HPA status
kubectl get hpa webapp-hpa -n production --watch
# In another terminal, generate load
kubectl run -it --rm load-generator \
--image=busybox \
--restart=Never \
-- /bin/sh -c "while sleep 0.01; do wget -q -O- http://webapp:8080; done"
# Monitor pod count
kubectl get pods -n production -l app=webapp --watch
You should see the HPA detect increased CPU utilization and scale up pods within 1-2 minutes.
Custom Metrics and Advanced Scaling Strategies
CPU and memory are starting points, but application-aware metrics provide superior autoscaling. Scale based on requests per second, queue depth, or custom business metrics. This requires the Custom Metrics API (custom.metrics.k8s.io) or External Metrics API (external.metrics.k8s.io).
The Prometheus Adapter is the most popular implementation. First, install it in your cluster, then configure it to expose metrics. Here’s an HPA using custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa-custom
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue_name: "processing-queue"
      target:
        type: Value
        value: "30"
This HPA scales based on requests per second per pod and an external SQS queue depth metric. When average RPS exceeds 1000 or queue depth exceeds 30 messages, HPA scales up.
For Prometheus Adapter configuration, you’d create a ConfigMap defining metric queries:
rules:
- seriesQuery: 'http_requests_total{namespace="production",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total"
    as: "${1}_per_second"
  metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
HPA Configuration Best Practices
Modern HPA (v2) supports sophisticated behavior configuration. Control scale-up and scale-down rates, stabilization windows, and policies to prevent aggressive scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa-advanced
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      selectPolicy: Min
This configuration allows aggressive scale-up (doubling replicas or adding 4 pods every 15 seconds, whichever is greater) but conservative scale-down (maximum 50% reduction per minute, with a 5-minute stabilization window). The stabilization window prevents premature scale-down during temporary traffic dips.
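The "whichever is greater" policy selection can be sketched as follows (an illustration of selectPolicy semantics under the percent and pods policies above, not the controller source):

```python
import math

def allowed_scale_up(current_replicas, percent=100, pods=4, select="Max"):
    """Each policy yields a maximum allowed replica count per period;
    selectPolicy decides which of those limits applies."""
    by_percent = current_replicas + math.floor(current_replicas * percent / 100)
    by_pods = current_replicas + pods
    if select == "Max":
        return max(by_percent, by_pods)
    return min(by_percent, by_pods)

print(allowed_scale_up(3))   # max(doubling -> 6, +4 pods -> 7) = 7
print(allowed_scale_up(10))  # max(doubling -> 20, +4 pods -> 14) = 20
```

At small replica counts the Pods policy dominates; at larger counts the Percent policy takes over, which is exactly why combining both gives fast but bounded growth.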
Set minReplicas based on baseline traffic and redundancy requirements—never use 1 in production. Set maxReplicas considering cluster capacity and cost constraints. Include a buffer for unexpected spikes.
When using multiple metrics, HPA calculates desired replicas for each metric independently and uses the highest value. This ensures all constraints are satisfied.
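The multi-metric rule reduces to a per-metric calculation followed by a max (a sketch with hypothetical utilization values):

```python
import math

def desired_for_metric(current_replicas, current, target):
    # Same proportional formula HPA applies to a single metric.
    return math.ceil(current_replicas * current / target)

def reconcile(current_replicas, metrics):
    """With multiple metrics, HPA computes a desired count per metric
    and takes the maximum, so every target stays satisfied."""
    return max(desired_for_metric(current_replicas, c, t) for c, t in metrics)

# CPU at 90% vs 70% target, memory at 60% vs 80% target, 4 replicas:
print(reconcile(4, [(90, 70), (60, 80)]))  # CPU drives scaling -> 6
```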
Troubleshooting Common Issues
The most common HPA failure is “unable to get metrics.” Check these systematically:
# Verify metrics-server is running
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
# Check if metrics are available
kubectl top nodes
kubectl top pods -n production
# Examine HPA status
kubectl describe hpa webapp-hpa -n production
Look for these sections in describe output:
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
If ScalingActive is False, check the message. Common issues:
Missing resource requests: HPA cannot calculate utilization percentage without requests defined.
Metrics server unavailable: Ensure metrics-server pods are running and the API service is registered.
Insufficient metrics: HPA needs metrics from all pods. If pods are just starting, wait 1-2 minutes.
Debug custom metrics with:
# List available custom metrics
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
# Query specific metric
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/production/pods/*/http_requests_per_second" | jq .
Monitor HPA decisions in real-time:
kubectl get hpa webapp-hpa -n production --watch
# Check events
kubectl get events -n production --field-selector involvedObject.name=webapp-hpa --sort-by='.lastTimestamp'
Conclusion and Production Considerations
HPA is essential for production Kubernetes deployments running variable workloads. Use it for stateless applications where adding replicas improves capacity linearly. It’s less suitable for stateful applications, databases, or workloads with high startup costs.
Cost implications matter. Aggressive autoscaling can significantly increase cloud bills during traffic spikes. Set reasonable maxReplicas values and monitor spending. Consider using Cluster Autoscaler alongside HPA to add nodes when pods are pending, but be aware of the 3-5 minute node provisioning delay.
Combining HPA with VPA is possible but requires careful configuration—set VPA to “Off” mode for recommendations only, or use VPA for resource requests while HPA manages replica count. Never let both modify the same dimension simultaneously.
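A recommendation-only VPA looks like this (a minimal sketch; the object name is illustrative, and the CRD comes from the separate VPA installation):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Off"   # recommendations only; HPA keeps control of replicas
```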
For batch processing workloads, consider KEDA (Kubernetes Event-Driven Autoscaling) instead, which can scale to zero and supports event sources like Kafka, RabbitMQ, and cloud queues natively.
Always test your HPA configuration under realistic load before production deployment. Use load testing tools to simulate traffic patterns and verify scaling behavior matches expectations. Monitor HPA metrics and events in your observability stack to understand scaling patterns and optimize configurations over time.