Canary Deployment: Gradual Traffic Shifting

Key Insights

  • Canary deployments reduce risk by routing a small percentage of traffic to new versions while monitoring for issues, allowing quick rollback before full exposure
  • Modern service meshes like Istio and tools like Flagger automate traffic shifting and rollback decisions based on real-time metrics, removing most of the need for manual intervention
  • Session affinity and stateful operations require careful handling—use feature flags to decouple code deployment from feature activation when dealing with database migrations

Introduction to Canary Deployments

Canary deployments take their name from the coal miners who brought canaries into mines to detect toxic gases. If the canary stopped singing, miners knew to evacuate. In software deployment, the principle is identical: expose a small group of users to changes first, and if something goes wrong, you’ve limited the blast radius.

The core concept is straightforward. You deploy a new version of your application alongside the current stable version, route a small percentage of traffic to the new version, monitor key metrics, and gradually increase traffic if everything looks healthy. If metrics degrade, you immediately route all traffic back to the stable version.

This approach addresses a fundamental problem with traditional deployment strategies: you can’t predict every failure mode in production. No amount of testing in staging environments perfectly replicates production traffic patterns, data distribution, or edge cases. Canary deployments let production itself validate your changes with minimal risk.

How Canary Deployments Work

The mechanics involve running two versions of your application simultaneously. Your load balancer or service mesh splits traffic between them according to configured weights. You start with a small percentage—typically 5-10%—going to the canary version.
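
To make the mechanism concrete, here is a toy Go sketch of weighted routing: each request goes to the canary with probability weight/100. Real load balancers and meshes do this for you; the function below only illustrates the idea.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickBackend returns "canary" for roughly canaryWeight percent of calls,
// and "stable" for the rest.
func pickBackend(canaryWeight int) string {
	if rand.Intn(100) < canaryWeight {
		return "canary"
	}
	return "stable"
}

func main() {
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickBackend(5)]++
	}
	// With weight 5, roughly 500 of 10000 requests hit the canary.
	fmt.Println("canary share:", counts["canary"])
}
```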

During this initial phase, you monitor metrics intensively. Error rates, response times, CPU usage, memory consumption, and business-specific KPIs all get tracked. If the canary version performs as well as or better than the stable version, you gradually increase the traffic percentage: 10%, 25%, 50%, 75%, and finally 100%.
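
The monitor-then-increase loop can be sketched in Go. This is a hypothetical illustration, not any real tool's code: setWeight and checkHealth stand in for calls to your ingress or mesh API and your metrics backend.

```go
package main

import (
	"fmt"
	"time"
)

// promote walks the canary through increasing traffic weights,
// soaking at each step and rolling back on the first failed check.
func promote(setWeight func(int), checkHealth func() bool, soak time.Duration) bool {
	for _, weight := range []int{5, 10, 25, 50, 75, 100} {
		setWeight(weight) // shift this share of traffic to the canary
		time.Sleep(soak)  // let metrics accumulate at this weight
		if !checkHealth() {
			setWeight(0) // metrics degraded: route everything back to stable
			return false
		}
		fmt.Printf("canary healthy at %d%%\n", weight)
	}
	return true // canary now serves 100% of traffic
}

func main() {
	healthy := promote(
		func(w int) {},              // stub: would call the mesh/ingress API
		func() bool { return true }, // stub: would query Prometheus
		0,                           // no soak time in this demo
	)
	fmt.Println("promoted:", healthy)
}
```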

Here’s a basic NGINX Ingress configuration showing a 95/5 traffic split:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-canary
            port:
              number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress-stable
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-stable
            port:
              number: 80

The decision to promote or roll back depends on predefined success criteria. If error rates spike above a threshold or latency increases beyond acceptable limits, you immediately shift all traffic back to stable. If metrics remain healthy for a defined period, you proceed with the next traffic increment.
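
That promote-or-roll-back decision reduces to a simple predicate. The sketch below is illustrative; the struct fields and thresholds are assumptions, not from any specific tool.

```go
package main

import "fmt"

// Metrics captures the canary's observed health at a point in time.
type Metrics struct {
	ErrorRate float64 // fraction of requests returning 5xx
	P95Millis float64 // 95th-percentile latency in milliseconds
}

// shouldRollBack returns true when the canary breaches either
// predefined success criterion.
func shouldRollBack(m Metrics, maxErrorRate, maxP95 float64) bool {
	return m.ErrorRate > maxErrorRate || m.P95Millis > maxP95
}

func main() {
	healthy := Metrics{ErrorRate: 0.002, P95Millis: 180}
	degraded := Metrics{ErrorRate: 0.04, P95Millis: 900}
	fmt.Println(shouldRollBack(healthy, 0.01, 500))  // false: keep shifting traffic
	fmt.Println(shouldRollBack(degraded, 0.01, 500)) // true: shift back to stable
}
```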

Implementation with Kubernetes

Kubernetes provides the primitives needed for canary deployments through Deployments, Services, and Ingress resources. The pattern involves creating separate Deployments for the stable and canary versions, each fronted by its own Service, with traffic splitting handled at the ingress level.

Here’s a complete Kubernetes setup:

# Stable version deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-stable
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: stable
  template:
    metadata:
      labels:
        app: myapp
        version: stable
    spec:
      containers:
      - name: app
        image: myapp:v1.2.0
        ports:
        - containerPort: 80
---
# Canary version deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      version: canary
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      containers:
      - name: app
        image: myapp:v1.3.0
        ports:
        - containerPort: 80
---
# Service for stable version
apiVersion: v1
kind: Service
metadata:
  name: app-stable
spec:
  selector:
    app: myapp
    version: stable
  ports:
  - port: 80
    targetPort: 80
---
# Service for canary version
apiVersion: v1
kind: Service
metadata:
  name: app-canary
spec:
  selector:
    app: myapp
    version: canary
  ports:
  - port: 80
    targetPort: 80

For more sophisticated traffic management, service meshes like Istio provide fine-grained control:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: app-canary
spec:
  hosts:
  - app.example.com
  http:
  - match:
    - headers:
        user-agent:
          regex: ".*Mobile.*"
    route:
    - destination:
        host: app
        subset: stable
      weight: 100
  - route:
    - destination:
        host: app
        subset: stable
      weight: 95
    - destination:
        host: app
        subset: canary
      weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: app-destination
spec:
  host: app
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary

This Istio configuration demonstrates an important capability: you can route specific user segments differently. In this example, requests whose User-Agent matches Mobile always go to the stable subset, while all other traffic is split 95/5. Note that with Istio, both Deployments sit behind a single Service (the app host in the DestinationRule), and the subsets distinguish the versions by their pod labels.

Automated Traffic Shifting Strategies

Manual traffic shifting works for small teams, but it doesn’t scale. Tools like Flagger and Argo Rollouts automate the entire canary process based on metrics analysis.

Flagger integrates with service meshes and monitors metrics from Prometheus. Here’s a Flagger canary configuration:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        type: cmd
        cmd: "hey -z 1m -q 10 -c 2 http://app-canary/"

This configuration tells Flagger to:

  • Increase canary traffic by 10% every minute
  • Require 99% success rate and sub-500ms latency
  • Run load tests during analysis
  • Roll back if metrics fail 5 consecutive checks
  • Cap canary traffic at 50% before final promotion
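
The stepWeight and maxWeight settings above imply a simple weight schedule. As an illustrative sketch (not Flagger's actual code), the progression can be computed like this:

```go
package main

import "fmt"

// weightSchedule lists the canary traffic weights a stepWeight/maxWeight
// analysis walks through before promotion.
func weightSchedule(stepWeight, maxWeight int) []int {
	var steps []int
	for w := stepWeight; w <= maxWeight; w += stepWeight {
		steps = append(steps, w)
	}
	return steps
}

func main() {
	// stepWeight: 10, maxWeight: 50 from the Canary spec above
	fmt.Println(weightSchedule(10, 50)) // [10 20 30 40 50]
}
```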

Argo Rollouts provides similar capabilities with additional strategies like blue-green and experimentation:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - setWeight: 20
      - pause: {duration: 5m}
      - setWeight: 40
      - pause: {duration: 5m}
      - setWeight: 60
      - pause: {duration: 5m}
      - setWeight: 80
      - pause: {duration: 5m}
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 2
        args:
        - name: service-name
          value: app-canary
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:v1.3.0

Monitoring and Rollback Criteria

Effective canary deployments depend entirely on comprehensive monitoring. You need to define what “healthy” means for your application and automatically detect deviations.

Key metrics to monitor:

  • HTTP error rates: 5xx errors indicate server problems, 4xx might indicate breaking API changes
  • Latency percentiles: P50, P95, and P99 response times
  • Resource utilization: CPU and memory usage patterns
  • Business metrics: Conversion rates, signup completions, checkout success

Here are Prometheus queries for canary health:

# Error rate comparison
- name: canary_error_rate
  query: |
    sum(rate(http_requests_total{status=~"5..", version="canary"}[5m])) 
    / 
    sum(rate(http_requests_total{version="canary"}[5m]))    

# Latency comparison
- name: canary_latency_p95
  query: |
    histogram_quantile(0.95, 
      sum(rate(http_request_duration_seconds_bucket{version="canary"}[5m])) by (le)
    )    

# Comparison alert
- alert: CanaryHighErrorRate
  expr: |
    (sum(rate(http_requests_total{status=~"5..", version="canary"}[5m])) 
     / sum(rate(http_requests_total{version="canary"}[5m])))
    >
    (sum(rate(http_requests_total{status=~"5..", version="stable"}[5m])) 
     / sum(rate(http_requests_total{version="stable"}[5m]))) * 1.5    
  for: 2m
  annotations:
    summary: "Canary error rate 50% higher than stable"

Set rollback thresholds conservatively. A 50% increase in error rate or a 2x increase in P95 latency should trigger immediate rollback. Don’t wait for catastrophic failures.
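
Those two rules (error rate more than 1.5x the stable baseline, or P95 latency more than 2x) can be expressed as a single predicate. This is a sketch with assumed parameter names, not any tool's API:

```go
package main

import "fmt"

// breachesBaseline compares canary metrics against the stable baseline
// using the conservative thresholds suggested above: roll back when the
// error rate exceeds 1.5x stable or P95 latency exceeds 2x stable.
func breachesBaseline(canaryErr, stableErr, canaryP95, stableP95 float64) bool {
	return canaryErr > stableErr*1.5 || canaryP95 > stableP95*2
}

func main() {
	fmt.Println(breachesBaseline(0.010, 0.008, 210, 200)) // false: within tolerance
	fmt.Println(breachesBaseline(0.020, 0.008, 210, 200)) // true: error rate >1.5x stable
}
```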

Best Practices and Common Pitfalls

Session Affinity: If your application maintains session state, ensure users stick to the same version throughout their session. Configure your load balancer with session affinity:

apiVersion: v1
kind: Service
metadata:
  name: app
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: http
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "60"
spec:
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800

Database Migrations: Never couple database schema changes with canary deployments. Use backward-compatible migrations and feature flags:

// Feature flag to control new column usage
func getUserProfile(userID string) (*Profile, error) {
    profile := &Profile{}
    var err error

    if featureflags.IsEnabled("use_new_profile_schema") {
        // Query includes the new column
        err = db.QueryRow(`
            SELECT id, name, email, preferences, new_settings
            FROM users WHERE id = $1`, userID).Scan(
            &profile.ID, &profile.Name, &profile.Email,
            &profile.Preferences, &profile.NewSettings)
    } else {
        // Old query, backward compatible
        err = db.QueryRow(`
            SELECT id, name, email, preferences
            FROM users WHERE id = $1`, userID).Scan(
            &profile.ID, &profile.Name, &profile.Email,
            &profile.Preferences)
    }

    return profile, err
}

Testing Strategy: Don’t rely solely on production canaries. Run comprehensive integration tests, load tests in staging, and chaos engineering experiments before deployment. Canaries catch unexpected issues, not known problems.

Observability: Tag all metrics, logs, and traces with version labels. This makes debugging and comparison trivial:

// Register a counter whose "version" label separates canary from stable
requestCounter := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"status", "method", "version"},
)

// Record each request with the running version attached
requestCounter.With(prometheus.Labels{
    "status":  "200",
    "method":  "GET",
    "version": os.Getenv("APP_VERSION"),
}).Inc()

Conclusion

Canary deployments shine when you need confidence in production changes without risking your entire user base. They’re ideal for user-facing services, high-traffic applications, and any system where downtime is costly.

Use blue-green deployments when you need instant rollback without traffic splitting complexity. Use rolling updates for stateless applications where gradual replacement is acceptable and traffic splitting isn’t necessary.

The key to successful canary deployments is automation. Manual traffic shifting and metric monitoring don’t scale and introduce human error. Invest in proper tooling—Flagger, Argo Rollouts, or equivalent—and define clear success criteria before you deploy.

Start small. Run your first canary with 5% traffic for 10 minutes. Monitor everything. Gradually increase your confidence and your traffic percentages. The goal isn’t zero risk—it’s controlled, measurable risk that you can react to before it becomes a crisis.
