Go Retry Pattern: Exponential Backoff
Key Insights
- Exponential backoff prevents retry storms by progressively increasing wait times between attempts, reducing load on struggling services while maintaining reasonable recovery times
- Adding jitter (random variation) to backoff intervals prevents the thundering herd problem where multiple clients retry simultaneously after failures
- Production retry logic must respect context cancellation, distinguish between retryable and non-retryable errors, and ensure operations are idempotent
Why Retry Logic Matters in Distributed Systems
Distributed systems fail. Networks drop packets, services hit rate limits, databases experience temporary connection issues, and downstream APIs occasionally return 503s. These transient failures are temporary and often resolve within seconds or minutes.
The naive solution—immediately retrying failed operations—creates more problems than it solves. When a service struggles under load, hundreds of clients simultaneously retrying requests amplify the problem, potentially triggering a cascading failure. You need a smarter approach.
Exponential backoff solves this by introducing progressively longer delays between retry attempts. Instead of hammering a failing service, you give it breathing room to recover while still eventually succeeding when the transient issue resolves.
Understanding Exponential Backoff
Exponential backoff increases wait times geometrically between retries. A typical sequence looks like:
- Attempt 1: Immediate
- Attempt 2: Wait 1 second
- Attempt 3: Wait 2 seconds
- Attempt 4: Wait 4 seconds
- Attempt 5: Wait 8 seconds
- Attempt 6: Wait 16 seconds
The wait time follows the formula delay = baseDelay * 2^attempt, where attempt is zero-indexed (so the delay after the first failed attempt uses 2^0). This pattern balances two competing goals: recovering quickly from brief hiccups while avoiding overwhelming services experiencing genuine problems.
The benefits are substantial. Exponential backoff reduces aggregate load on failing services, improves overall success rates by allowing time for recovery, and prevents retry storms that can take down entire systems.
Basic Implementation
Let’s build a simple exponential backoff retry mechanism from scratch. We’ll start with a function that retries an operation with configurable parameters:
package retry

import (
	"fmt"
	"time"
)

type Config struct {
	MaxRetries int
	BaseDelay  time.Duration
}

func WithBackoff(config Config, operation func() error) error {
	var err error
	for attempt := 0; attempt <= config.MaxRetries; attempt++ {
		err = operation()
		if err == nil {
			return nil
		}
		if attempt == config.MaxRetries {
			break
		}

		// delay = BaseDelay * 2^attempt, computed with a bit shift
		delay := config.BaseDelay * time.Duration(1<<attempt)
		fmt.Printf("Attempt %d failed: %v. Retrying in %v...\n",
			attempt+1, err, delay)
		time.Sleep(delay)
	}
	return fmt.Errorf("operation failed after %d attempts: %w",
		config.MaxRetries+1, err)
}
Here’s how you’d use it with an HTTP request:
package main

import (
	"fmt"
	"net/http"
	"time"

	"example.com/yourmodule/retry" // adjust to your own module's import path
)

func main() {
	config := retry.Config{
		MaxRetries: 5,
		BaseDelay:  1 * time.Second,
	}

	err := retry.WithBackoff(config, func() error {
		resp, err := http.Get("https://api.example.com/data")
		if err != nil {
			return err
		}
		defer resp.Body.Close()

		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("unexpected status: %d", resp.StatusCode)
		}
		return nil
	})
	if err != nil {
		fmt.Printf("Request failed: %v\n", err)
	}
}
This basic implementation works, but it has limitations. All clients using identical base delays will retry at the same intervals, potentially creating synchronized retry waves.
Adding Jitter and Context Support
Jitter introduces randomness to backoff intervals, preventing the thundering herd problem. When 1000 clients experience a simultaneous failure, jitter ensures they don’t all retry at exactly the same moment.
Here’s an enhanced implementation with jitter and context support:
package retry

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

type Config struct {
	MaxRetries int
	BaseDelay  time.Duration
	MaxDelay   time.Duration
}

func WithBackoff(ctx context.Context, config Config, operation func() error) error {
	var err error
	for attempt := 0; attempt <= config.MaxRetries; attempt++ {
		// Check context before attempting
		if ctx.Err() != nil {
			return fmt.Errorf("context cancelled: %w", ctx.Err())
		}

		err = operation()
		if err == nil {
			return nil
		}
		if attempt == config.MaxRetries {
			break
		}

		delay := calculateDelayWithJitter(attempt, config)
		fmt.Printf("Attempt %d failed: %v. Retrying in %v...\n",
			attempt+1, err, delay)

		select {
		case <-time.After(delay):
			// Continue to next attempt
		case <-ctx.Done():
			return fmt.Errorf("context cancelled during backoff: %w", ctx.Err())
		}
	}
	return fmt.Errorf("operation failed after %d attempts: %w",
		config.MaxRetries+1, err)
}
func calculateDelayWithJitter(attempt int, config Config) time.Duration {
	// Calculate exponential backoff
	delay := config.BaseDelay * time.Duration(1<<attempt)

	// Apply max delay cap (only if one is configured)
	if config.MaxDelay > 0 && delay > config.MaxDelay {
		delay = config.MaxDelay
	}

	// Add jitter: randomize between 50% and 100% of the calculated delay.
	// Guard the half-delay, since rand.Int63n panics on a non-positive argument.
	half := delay / 2
	if half <= 0 {
		return delay
	}
	return half + time.Duration(rand.Int63n(int64(half)))
}
The jitter calculation draws each delay uniformly from 50% to 100% of the computed exponential value. This small randomization dramatically reduces synchronized retry storms.
Context support is critical for production code. It allows callers to cancel retry operations when requests timeout or users cancel operations:
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

config := retry.Config{
	MaxRetries: 5,
	BaseDelay:  1 * time.Second,
	MaxDelay:   30 * time.Second,
}

err := retry.WithBackoff(ctx, config, func() error {
	return callExternalAPI()
})
Production-Ready Implementation
A production retry package needs sophisticated error handling, configurable retry policies, and observability. Here’s a more complete implementation:
package retry

import (
	"context"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

type RetryableFunc func() error

type Policy struct {
	MaxRetries  int
	BaseDelay   time.Duration
	MaxDelay    time.Duration
	ShouldRetry func(error) bool
	OnRetry     func(attempt int, err error, delay time.Duration)
}

var DefaultPolicy = Policy{
	MaxRetries: 3,
	BaseDelay:  1 * time.Second,
	MaxDelay:   30 * time.Second,
	ShouldRetry: func(err error) bool {
		return true
	},
	OnRetry: func(attempt int, err error, delay time.Duration) {
		// Default: no-op
	},
}
func Do(ctx context.Context, policy Policy, fn RetryableFunc) error {
	var lastErr error
	for attempt := 0; attempt <= policy.MaxRetries; attempt++ {
		if ctx.Err() != nil {
			return ctx.Err()
		}

		lastErr = fn()
		if lastErr == nil {
			return nil
		}

		// Check if error is retryable (a nil ShouldRetry means retry everything)
		if policy.ShouldRetry != nil && !policy.ShouldRetry(lastErr) {
			return fmt.Errorf("non-retryable error: %w", lastErr)
		}
		if attempt == policy.MaxRetries {
			break
		}

		delay := calculateDelay(attempt, policy)
		if policy.OnRetry != nil {
			policy.OnRetry(attempt, lastErr, delay)
		}

		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("max retries exceeded: %w", lastErr)
}
func calculateDelay(attempt int, policy Policy) time.Duration {
	delay := policy.BaseDelay * time.Duration(1<<attempt)
	if policy.MaxDelay > 0 && delay > policy.MaxDelay {
		delay = policy.MaxDelay
	}

	// Jitter: pick a value in [delay/2, delay). Guard the half-delay,
	// since rand.Int63n panics on a non-positive argument.
	half := delay / 2
	if half <= 0 {
		return delay
	}
	return half + time.Duration(rand.Int63n(int64(half)))
}
// IsRetryableHTTPStatus determines if an HTTP status code should be retried
func IsRetryableHTTPStatus(statusCode int) bool {
	// 429 Too Many Requests, plus any 5xx server error
	// (which already includes 503 and 504)
	return statusCode == http.StatusTooManyRequests ||
		statusCode >= 500
}
Using this production-ready package with custom policies:
// HTTPError is assumed to be your own error type carrying the response
// status code; log and metrics are likewise placeholders for your
// logging and instrumentation.
policy := retry.Policy{
	MaxRetries: 5,
	BaseDelay:  500 * time.Millisecond,
	MaxDelay:   10 * time.Second,
	ShouldRetry: func(err error) bool {
		// Don't retry client errors
		var httpErr *HTTPError
		if errors.As(err, &httpErr) {
			return retry.IsRetryableHTTPStatus(httpErr.StatusCode)
		}
		return true
	},
	OnRetry: func(attempt int, err error, delay time.Duration) {
		log.Printf("Retry attempt %d after error: %v (waiting %v)",
			attempt+1, err, delay)
		metrics.IncrementRetryCounter()
	},
}

err := retry.Do(ctx, policy, func() error {
	return makeAPICall()
})
Common Pitfalls and Best Practices
Ensure Idempotency: Retry logic only works safely with idempotent operations. If retrying a request could create duplicate orders or double-charge customers, you need idempotency keys or other safeguards.
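One common safeguard is an idempotency key: the client generates a unique key per logical operation and sends the same key with every retry, letting the server deduplicate. A minimal sketch — the Idempotency-Key header name follows a widely used API convention (Stripe, among others), and your service may expect something different:

```go
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newIdempotencyKey generates a random key the server can use
// to recognize retried copies of the same logical request.
func newIdempotencyKey() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}

func main() {
	// Generate ONE key per logical operation, and reuse it on every retry.
	key := newIdempotencyKey()
	req, _ := http.NewRequest(http.MethodPost, "https://api.example.com/orders",
		bytes.NewReader([]byte(`{"sku":"abc"}`)))
	req.Header.Set("Idempotency-Key", key)
	fmt.Println(req.Header.Get("Idempotency-Key") == key) // true
}
```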
Don’t Retry Everything: Client errors (4xx status codes) typically indicate problems with your request, not transient failures. Retrying a 400 Bad Request or 404 Not Found wastes resources.
Integrate Circuit Breakers: Exponential backoff handles individual request failures. Circuit breakers prevent cascading failures by stopping requests to consistently failing services. Use them together.
Test Your Retry Logic: Table-driven tests verify backoff behavior:
func TestRetryBackoff(t *testing.T) {
	tests := []struct {
		name          string
		failures      int
		expectedCalls int
	}{
		{"succeeds first try", 0, 1},
		{"succeeds after 2 failures", 2, 3},
		{"exhausts retries", 10, 4},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			calls := 0
			policy := retry.Policy{
				MaxRetries:  3,
				BaseDelay:   1 * time.Millisecond,
				MaxDelay:    10 * time.Millisecond,
				ShouldRetry: func(error) bool { return true },
			}

			err := retry.Do(context.Background(), policy, func() error {
				calls++
				if calls <= tt.failures {
					return errors.New("temporary failure")
				}
				return nil
			})

			if calls != tt.expectedCalls {
				t.Errorf("expected %d calls, got %d", tt.expectedCalls, calls)
			}
			if tt.failures > policy.MaxRetries && err == nil {
				t.Error("expected error after exhausting retries")
			}
		})
	}
}
Conclusion
Exponential backoff is essential for building resilient distributed systems. It prevents retry storms, gives struggling services time to recover, and improves overall system reliability.
The key principles: start with reasonable base delays (500ms-2s), add jitter to prevent thundering herds, respect context cancellation, and only retry transient failures. Always ensure your operations are idempotent before adding retry logic.
For production systems, consider battle-tested libraries like cenkalti/backoff or avast/retry-go rather than rolling your own. They handle edge cases and provide additional features like exponential backoff with decorrelated jitter.
Implement retry logic thoughtfully. Combined with circuit breakers, timeouts, and proper monitoring, exponential backoff transforms brittle integrations into robust, self-healing systems.