Go Retry Pattern: Exponential Backoff
Key Insights
- Exponential backoff prevents retry storms by progressively increasing wait times between attempts, reducing load on struggling services while maintaining reasonable recovery times
- Adding jitter (random variation) to backoff intervals prevents the thundering herd problem where multiple clients retry simultaneously after failures
- Production retry logic must respect context cancellation, distinguish between retryable and non-retryable errors, and ensure operations are idempotent
Why Retry Logic Matters in Distributed Systems
Distributed systems fail. Networks drop packets, services hit rate limits, databases experience temporary connection issues, and downstream APIs occasionally return 503s. These transient failures are temporary and often resolve within seconds or minutes.
The naive solution—immediately retrying failed operations—creates more problems than it solves. When a service struggles under load, hundreds of clients simultaneously retrying requests amplify the problem, potentially triggering a cascading failure. You need a smarter approach.
Exponential backoff solves this by introducing progressively longer delays between retry attempts. Instead of hammering a failing service, you give it breathing room to recover while still eventually succeeding when the transient issue resolves.
Understanding Exponential Backoff
Exponential backoff increases wait times geometrically between retries. A typical sequence looks like:
- Attempt 1: Immediate
- Attempt 2: Wait 1 second
- Attempt 3: Wait 2 seconds
- Attempt 4: Wait 4 seconds
- Attempt 5: Wait 8 seconds
- Attempt 6: Wait 16 seconds
The wait time follows the formula delay = baseDelay * 2^attempt, where attempt is zero-indexed (so the delay after the first failed attempt uses 2^0). This pattern balances two competing goals: recovering quickly from brief hiccups while avoiding overwhelming services experiencing genuine problems.
The benefits are substantial. Exponential backoff reduces aggregate load on failing services, improves overall success rates by allowing time for recovery, and prevents retry storms that can take down entire systems.
Basic Implementation
Let’s build a simple exponential backoff retry mechanism from scratch. We’ll start with a function that retries an operation with configurable parameters:
package retry

import (
	"fmt"
	"time"
)

type Config struct {
	MaxRetries int
	BaseDelay  time.Duration
}

func WithBackoff(config Config, operation func() error) error {
	var err error
	for attempt := 0; attempt <= config.MaxRetries; attempt++ {
		err = operation()
		if err == nil {
			return nil
		}
		if attempt == config.MaxRetries {
			break
		}

		// delay = BaseDelay * 2^attempt, computed with a bit shift
		delay := config.BaseDelay * time.Duration(1<<attempt)
		fmt.Printf("Attempt %d failed: %v. Retrying in %v...\n",
			attempt+1, err, delay)
		time.Sleep(delay)
	}
	return fmt.Errorf("operation failed after %d attempts: %w",
		config.MaxRetries+1, err)
}
Here’s how you’d use it with an HTTP request:
package main

import (
	"fmt"
	"net/http"
	"time"

	"example.com/yourmodule/retry" // adjust to your own module's import path
)

func main() {
	config := retry.Config{
		MaxRetries: 5,
		BaseDelay:  1 * time.Second,
	}

	err := retry.WithBackoff(config, func() error {
		resp, err := http.Get("https://api.example.com/data")
		if err != nil {
			return err
		}
		defer resp.Body.Close()

		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("unexpected status: %d", resp.StatusCode)
		}
		return nil
	})
	if err != nil {
		fmt.Printf("Request failed: %v\n", err)
	}
}
This basic implementation works, but it has limitations. All clients using identical base delays will retry at the same intervals, potentially creating synchronized retry waves.
Adding Jitter and Context Support
Jitter introduces randomness to backoff intervals, preventing the thundering herd problem. When 1000 clients experience a simultaneous failure, jitter ensures they don’t all retry at exactly the same moment.
Here’s an enhanced implementation with jitter and context support:
package retry

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

type Config struct {
	MaxRetries int
	BaseDelay  time.Duration
	MaxDelay   time.Duration
}

func WithBackoff(ctx context.Context, config Config, operation func() error) error {
	var err error
	for attempt := 0; attempt <= config.MaxRetries; attempt++ {
		// Check context before attempting
		if ctx.Err() != nil {
			return fmt.Errorf("context cancelled: %w", ctx.Err())
		}

		err = operation()
		if err == nil {
			return nil
		}
		if attempt == config.MaxRetries {
			break
		}

		delay := calculateDelayWithJitter(attempt, config)
		fmt.Printf("Attempt %d failed: %v. Retrying in %v...\n",
			attempt+1, err, delay)

		select {
		case <-time.After(delay):
			// Continue to next attempt
		case <-ctx.Done():
			return fmt.Errorf("context cancelled during backoff: %w", ctx.Err())
		}
	}
	return fmt.Errorf("operation failed after %d attempts: %w",
		config.MaxRetries+1, err)
}
func calculateDelayWithJitter(attempt int, config Config) time.Duration {
	// Calculate exponential backoff
	delay := config.BaseDelay * time.Duration(1<<attempt)

	// Apply max delay cap (only if one is configured)
	if config.MaxDelay > 0 && delay > config.MaxDelay {
		delay = config.MaxDelay
	}

	// Add jitter: randomize between 50% and 100% of the calculated delay.
	// Guard the half-delay, since rand.Int63n panics on a non-positive argument.
	half := delay / 2
	if half <= 0 {
		return delay
	}
	return half + time.Duration(rand.Int63n(int64(half)))
}
The jitter calculation draws each delay uniformly from 50% to 100% of the computed exponential value. This small randomization dramatically reduces synchronized retry storms.
Context support is critical for production code. It allows callers to cancel retry operations when requests timeout or users cancel operations:
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

config := retry.Config{
	MaxRetries: 5,
	BaseDelay:  1 * time.Second,
	MaxDelay:   30 * time.Second,
}

err := retry.WithBackoff(ctx, config, func() error {
	return callExternalAPI()
})
Production-Ready Implementation
A production retry package needs sophisticated error handling, configurable retry policies, and observability. Here’s a more complete implementation:
package retry

import (
	"context"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

type RetryableFunc func() error

type Policy struct {
	MaxRetries  int
	BaseDelay   time.Duration
	MaxDelay    time.Duration
	ShouldRetry func(error) bool
	OnRetry     func(attempt int, err error, delay time.Duration)
}

var DefaultPolicy = Policy{
	MaxRetries: 3,
	BaseDelay:  1 * time.Second,
	MaxDelay:   30 * time.Second,
	ShouldRetry: func(err error) bool {
		return true
	},
	OnRetry: func(attempt int, err error, delay time.Duration) {
		// Default: no-op
	},
}
func Do(ctx context.Context, policy Policy, fn RetryableFunc) error {
	var lastErr error
	for attempt := 0; attempt <= policy.MaxRetries; attempt++ {
		if ctx.Err() != nil {
			return ctx.Err()
		}

		lastErr = fn()
		if lastErr == nil {
			return nil
		}

		// Check if error is retryable (a nil ShouldRetry means retry everything)
		if policy.ShouldRetry != nil && !policy.ShouldRetry(lastErr) {
			return fmt.Errorf("non-retryable error: %w", lastErr)
		}
		if attempt == policy.MaxRetries {
			break
		}

		delay := calculateDelay(attempt, policy)
		if policy.OnRetry != nil {
			policy.OnRetry(attempt, lastErr, delay)
		}

		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("max retries exceeded: %w", lastErr)
}
func calculateDelay(attempt int, policy Policy) time.Duration {
	delay := policy.BaseDelay * time.Duration(1<<attempt)
	if policy.MaxDelay > 0 && delay > policy.MaxDelay {
		delay = policy.MaxDelay
	}

	// Jitter: pick a value in [delay/2, delay). Guard the half-delay,
	// since rand.Int63n panics on a non-positive argument.
	half := delay / 2
	if half <= 0 {
		return delay
	}
	return half + time.Duration(rand.Int63n(int64(half)))
}
// IsRetryableHTTPStatus determines if an HTTP status code should be retried
func IsRetryableHTTPStatus(statusCode int) bool {
	// 429 Too Many Requests, plus any 5xx server error
	// (which already includes 503 and 504)
	return statusCode == http.StatusTooManyRequests ||
		statusCode >= 500
}
Using this production-ready package with custom policies:
// HTTPError is assumed to be your own error type carrying the response
// status code; log and metrics are likewise placeholders for your
// logging and instrumentation.
policy := retry.Policy{
	MaxRetries: 5,
	BaseDelay:  500 * time.Millisecond,
	MaxDelay:   10 * time.Second,
	ShouldRetry: func(err error) bool {
		// Don't retry client errors
		var httpErr *HTTPError
		if errors.As(err, &httpErr) {
			return retry.IsRetryableHTTPStatus(httpErr.StatusCode)
		}
		return true
	},
	OnRetry: func(attempt int, err error, delay time.Duration) {
		log.Printf("Retry attempt %d after error: %v (waiting %v)",
			attempt+1, err, delay)
		metrics.IncrementRetryCounter()
	},
}

err := retry.Do(ctx, policy, func() error {
	return makeAPICall()
})
Common Pitfalls and Best Practices
Ensure Idempotency: Retry logic only works safely with idempotent operations. If retrying a request could create duplicate orders or double-charge customers, you need idempotency keys or other safeguards.
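One common safeguard is an idempotency key: the client generates a unique key per logical operation and sends the same key with every retry, letting the server deduplicate. A minimal sketch — the Idempotency-Key header name follows a widely used API convention (Stripe, among others), and your service may expect something different:

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newIdempotencyKey generates a random key the server can use
// to recognize retried copies of the same logical request.
func newIdempotencyKey() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}

func main() {
	// Generate ONE key per logical operation, and reuse it on every retry.
	key := newIdempotencyKey()
	req, _ := http.NewRequest(http.MethodPost, "https://api.example.com/orders",
		bytes.NewReader([]byte(`{"sku":"abc"}`)))
	req.Header.Set("Idempotency-Key", key)
	fmt.Println(req.Header.Get("Idempotency-Key") == key) // true
}
```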
Don’t Retry Everything: Client errors (4xx status codes) typically indicate problems with your request, not transient failures. Retrying a 400 Bad Request or 404 Not Found wastes resources.
Integrate Circuit Breakers: Exponential backoff handles individual request failures. Circuit breakers prevent cascading failures by stopping requests to consistently failing services. Use them together.
Test Your Retry Logic: Table-driven tests verify backoff behavior:
func TestRetryBackoff(t *testing.T) {
	tests := []struct {
		name          string
		failures      int
		expectedCalls int
	}{
		{"succeeds first try", 0, 1},
		{"succeeds after 2 failures", 2, 3},
		{"exhausts retries", 10, 4},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			calls := 0
			policy := retry.Policy{
				MaxRetries:  3,
				BaseDelay:   1 * time.Millisecond,
				MaxDelay:    10 * time.Millisecond,
				ShouldRetry: func(error) bool { return true },
			}

			err := retry.Do(context.Background(), policy, func() error {
				calls++
				if calls <= tt.failures {
					return errors.New("temporary failure")
				}
				return nil
			})

			if calls != tt.expectedCalls {
				t.Errorf("expected %d calls, got %d", tt.expectedCalls, calls)
			}
			if tt.failures > policy.MaxRetries && err == nil {
				t.Error("expected error after exhausting retries")
			}
		})
	}
}
Conclusion
Exponential backoff is essential for building resilient distributed systems. It prevents retry storms, gives struggling services time to recover, and improves overall system reliability.
The key principles: start with reasonable base delays (500ms-2s), add jitter to prevent thundering herds, respect context cancellation, and only retry transient failures. Always ensure your operations are idempotent before adding retry logic.
For production systems, consider battle-tested libraries like cenkalti/backoff or avast/retry-go rather than rolling your own. They handle edge cases and provide additional features like exponential backoff with decorrelated jitter.
Implement retry logic thoughtfully. Combined with circuit breakers, timeouts, and proper monitoring, exponential backoff transforms brittle integrations into robust, self-healing systems.