Error Handling: Strategies and Best Practices

Key Insights

  • Error handling is a design decision, not an afterthought—choose between fail-fast and fail-safe strategies based on your system’s requirements and failure domains
  • Result types make errors explicit in your function signatures, eliminating the hidden control flow that exceptions create and forcing callers to handle failure cases
  • Every error should carry enough context for debugging while every retry strategy needs limits—unbounded retries and context-free errors are two of the fastest paths to production incidents

Introduction: Why Error Handling Matters

Poor error handling costs more than most teams realize. It manifests as data corruption when partial operations complete without rollback, security vulnerabilities when error messages leak internal details, and terrible user experiences when systems fail silently or with cryptic messages.

Before diving into strategies, let’s clarify terminology. An error is an incorrect or unexpected condition in your program. An exception is a language mechanism for handling errors by transferring control up the call stack. A failure is when a system can no longer fulfill its intended function. These distinctions matter because different situations call for different responses.

The goal of error handling isn’t to prevent all errors—that’s impossible. The goal is to maintain system integrity, provide useful feedback, and enable recovery when things go wrong.

Fail Fast vs. Fail Safe Strategies

The fail-fast philosophy says: when something goes wrong, stop immediately and loudly. Don’t try to continue with potentially corrupted state. This approach prevents cascading failures and makes bugs easier to find.

Fail-safe takes the opposite stance: when something goes wrong, degrade gracefully and keep the system running. This prioritizes availability over correctness.

Neither approach is universally correct. The right choice depends on context:

Fail fast when:

  • You’re at a system boundary (API input, file parsing, configuration loading)
  • Data integrity is paramount
  • Continuing would make the problem worse
  • You’re in development or testing environments

Fail safe when:

  • Partial functionality is better than none
  • The failing component is non-critical
  • You have fallback mechanisms
  • User experience depends on availability

# Fail fast: Input validation at system boundaries
from dataclasses import dataclass
from typing import Optional
import re

class ValidationError(Exception):
    def __init__(self, field: str, message: str):
        self.field = field
        self.message = message
        super().__init__(f"{field}: {message}")

@dataclass
class CreateUserRequest:
    email: str
    password: str
    age: Optional[int] = None
    
    def validate(self) -> None:
        """Fail fast on invalid input - don't let bad data propagate."""
        if not self.email or not re.match(r'^[^@]+@[^@]+\.[^@]+$', self.email):
            raise ValidationError('email', 'Invalid email format')
        
        if not self.password or len(self.password) < 12:
            raise ValidationError('password', 'Password must be at least 12 characters')
        
        if self.age is not None and (self.age < 0 or self.age > 150):
            raise ValidationError('age', 'Age must be between 0 and 150')

# Usage: validate at the boundary, trust internally
# (User and user_repository are assumed to be defined elsewhere in the application)
def create_user(request: CreateUserRequest) -> "User":
    request.validate()  # Fails fast here if invalid
    # From this point, we trust the data is valid
    return user_repository.create(request)

For APIs, fail fast on bad input but fail safe on non-critical downstream services. For background jobs, fail fast and let the job scheduler handle retries. For user-facing apps, fail safe with graceful degradation whenever possible.
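
The fail-safe side of that advice can be sketched in the same style. This is a minimal illustration, assuming a hypothetical non-critical recommendations dependency; `RecommendationServiceError`, `DEFAULT_RECOMMENDATIONS`, and the `fetch` callables are invented for the example and are not part of any real library.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)

class RecommendationServiceError(Exception):
    """Hypothetical error raised when the recommendations service is unavailable."""

# Hypothetical static fallback shown when personalization is unavailable
DEFAULT_RECOMMENDATIONS: List[str] = ["bestseller-1", "bestseller-2"]

def get_recommendations(user_id: str, fetch: Callable[[str], List[str]]) -> List[str]:
    """Fail safe: degrade to defaults if the non-critical dependency fails."""
    try:
        return fetch(user_id)
    except RecommendationServiceError:
        # Log with the traceback so the degradation stays visible, then fall back
        logger.warning(
            "Recommendations unavailable for user %s; using defaults",
            user_id,
            exc_info=True,
        )
        return list(DEFAULT_RECOMMENDATIONS)

# Stand-in dependencies for demonstration
def broken_fetch(user_id: str) -> List[str]:
    raise RecommendationServiceError("service timed out")

def working_fetch(user_id: str) -> List[str]:
    return [f"personalized-for-{user_id}"]
```

Only the one expected failure is caught. Any other exception still propagates, so graceful degradation does not turn into silently swallowed bugs.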

Exception Handling Patterns

The most common exception handling mistake is catching too broadly. Generic catch blocks hide bugs and make debugging a nightmare.

# Bad: Swallows all errors including programming mistakes
try:
    process_payment(order)
except Exception:
    logger.error("Payment failed")
    return None  # What actually happened?

# Good: Catch specific exceptions, handle appropriately
try:
    process_payment(order)
except PaymentDeclinedException as e:
    notify_user_payment_declined(order, e.decline_reason)
    return PaymentResult.declined(e.decline_reason)
except PaymentGatewayTimeoutException as e:
    schedule_payment_retry(order)
    return PaymentResult.pending()
except InsufficientFundsException as e:
    return PaymentResult.insufficient_funds()
# Let unexpected exceptions propagate - they're bugs that need fixing

Build an exception hierarchy that reflects your domain:

class ApplicationError(Exception):
    """Base class for all application-specific errors."""
    def __init__(self, message: str, code: str, details: Optional[dict] = None):
        self.message = message
        self.code = code
        self.details = details or {}
        super().__init__(message)

class ValidationError(ApplicationError):
    """Input validation failures."""
    def __init__(self, field: str, message: str):
        super().__init__(
            message=message,
            code='VALIDATION_ERROR',
            details={'field': field}
        )

class BusinessRuleViolation(ApplicationError):
    """Domain logic violations."""
    pass

class ResourceNotFoundError(ApplicationError):
    """Requested resource doesn't exist."""
    def __init__(self, resource_type: str, resource_id: str):
        super().__init__(
            message=f"{resource_type} with id '{resource_id}' not found",
            code='RESOURCE_NOT_FOUND',
            details={'resource_type': resource_type, 'resource_id': resource_id}
        )

class ExternalServiceError(ApplicationError):
    """Failures in external dependencies."""
    def __init__(self, service: str, message: str, retryable: bool = False):
        super().__init__(
            message=f"{service}: {message}",
            code='EXTERNAL_SERVICE_ERROR',
            details={'service': service, 'retryable': retryable}
        )

The decision tree for exception handling is simple: Can you meaningfully handle this error? If yes, handle it. If no, let it propagate. “Handling” means taking corrective action, not just logging and continuing.
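
As a rough illustration of that decision, the sketch below handles one error that has an obvious corrective action and propagates one that does not. The settings-file scenario, `DEFAULT_SETTINGS`, and `SettingsError` are assumptions made up for this example.

```python
import json
from pathlib import Path

DEFAULT_SETTINGS = {"theme": "light"}  # hypothetical defaults

class SettingsError(Exception):
    """Hypothetical domain error for a settings file that cannot be used."""

def load_settings(path: Path) -> dict:
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # Meaningful handling: a missing file has a sensible corrective action
        return dict(DEFAULT_SETTINGS)

    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # No corrective action is possible here: add context and propagate
        raise SettingsError(f"Corrupt settings file: {path}") from e
```

The missing file is handled because defaults are a real recovery. The corrupt file is re-raised with context because guessing at its contents would hide the problem.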

Result Types and Explicit Error Handling

Exceptions create hidden control flow. A function that throws can exit at any point, and nothing in its signature tells you this. Result types make errors explicit.

from dataclasses import dataclass
from typing import TypeVar, Generic, Callable, Union

T = TypeVar('T')
E = TypeVar('E')
U = TypeVar('U')

@dataclass
class Ok(Generic[T]):
    value: T
    
    def is_ok(self) -> bool:
        return True
    
    def is_err(self) -> bool:
        return False

@dataclass  
class Err(Generic[E]):
    error: E
    
    def is_ok(self) -> bool:
        return False
    
    def is_err(self) -> bool:
        return True

Result = Union[Ok[T], Err[E]]

def map_result(result: Result[T, E], f: Callable[[T], U]) -> Result[U, E]:
    """Transform the success value, pass through errors."""
    if isinstance(result, Ok):
        return Ok(f(result.value))
    return result

def flat_map_result(result: Result[T, E], f: Callable[[T], Result[U, E]]) -> Result[U, E]:
    """Chain operations that might fail."""
    if isinstance(result, Ok):
        return f(result.value)
    return result

# Usage
def parse_int(s: str) -> Result[int, str]:
    try:
        return Ok(int(s))
    except ValueError:
        return Err(f"Cannot parse '{s}' as integer")

def divide(a: int, b: int) -> Result[float, str]:
    if b == 0:
        return Err("Division by zero")
    return Ok(a / b)

# Composing fallible operations
def calculate(a_str: str, b_str: str) -> Result[float, str]:
    a_result = parse_int(a_str)
    if isinstance(a_result, Err):
        return a_result
    
    b_result = parse_int(b_str)
    if isinstance(b_result, Err):
        return b_result
    
    return divide(a_result.value, b_result.value)

Result types shine in functional pipelines and when you want the compiler (or, in Python, a static type checker such as mypy) to flag unhandled errors. They’re verbose in languages without pattern matching, but the explicitness pays off in complex systems.
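
To make the pipeline style concrete, here is one possible rewrite of `calculate` that chains the steps with `flat_map_result` instead of checking each intermediate result by hand. The types are restated in trimmed form so the sketch runs on its own; treat it as an illustration rather than the only way to compose results.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E")
U = TypeVar("U")

@dataclass
class Ok(Generic[T]):
    value: T

@dataclass
class Err(Generic[E]):
    error: E

Result = Union[Ok[T], Err[E]]

def flat_map_result(result: Result[T, E], f: Callable[[T], Result[U, E]]) -> Result[U, E]:
    """Run f on the success value; pass an error through untouched."""
    if isinstance(result, Ok):
        return f(result.value)
    return result

def parse_int(s: str) -> Result[int, str]:
    try:
        return Ok(int(s))
    except ValueError:
        return Err(f"Cannot parse '{s}' as integer")

def divide(a: int, b: int) -> Result[float, str]:
    if b == 0:
        return Err("Division by zero")
    return Ok(a / b)

def calculate(a_str: str, b_str: str) -> Result[float, str]:
    # Each lambda runs only if the previous step returned Ok
    return flat_map_result(
        parse_int(a_str),
        lambda a: flat_map_result(
            parse_int(b_str),
            lambda b: divide(a, b),
        ),
    )
```

The first `Err` short-circuits the rest of the chain, so the caller receives either the final value or the earliest failure.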

Error Context and Observability

Raw stack traces aren’t enough. You need context: what operation was attempted, with what parameters, in what state?

import traceback
import uuid
from datetime import datetime, timezone
from typing import Optional, Any
from dataclasses import dataclass, field

@dataclass
class ErrorContext:
    operation: str
    correlation_id: str
    # Timezone-aware UTC timestamp (datetime.utcnow is deprecated since Python 3.12)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    details: dict = field(default_factory=dict)
    cause: Optional[Exception] = None
    
    def with_detail(self, key: str, value: Any) -> 'ErrorContext':
        new_details = {**self.details, key: value}
        return ErrorContext(
            operation=self.operation,
            correlation_id=self.correlation_id,
            timestamp=self.timestamp,
            details=new_details,
            cause=self.cause
        )

class ContextualError(Exception):
    def __init__(self, message: str, context: ErrorContext):
        self.context = context
        super().__init__(message)
    
    def to_log_dict(self) -> dict:
        return {
            'message': str(self),
            'operation': self.context.operation,
            'correlation_id': self.context.correlation_id,
            'timestamp': self.context.timestamp.isoformat(),
            'details': self.context.details,
            'cause': str(self.context.cause) if self.context.cause else None,
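            # Format from the exception object so this also works outside an except block
            'stack_trace': ''.join(
                traceback.format_exception(type(self), self, self.__traceback__)
            )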
        }

# Usage with context enrichment
def process_order(order_id: str, correlation_id: str) -> None:
    ctx = ErrorContext(operation='process_order', correlation_id=correlation_id)
    ctx = ctx.with_detail('order_id', order_id)
    
    try:
        order = fetch_order(order_id)
        ctx = ctx.with_detail('customer_id', order.customer_id)
        
        inventory = check_inventory(order.items)
        ctx = ctx.with_detail('inventory_check', 'passed')
        
        charge_customer(order)
        
    except PaymentException as e:
        ctx = ErrorContext(
            operation=ctx.operation,
            correlation_id=ctx.correlation_id,
            timestamp=ctx.timestamp,  # keep the time the operation started
            details=ctx.details,
            cause=e
        )
        raise ContextualError(f"Payment failed for order {order_id}", ctx) from e
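
Context only helps if it reaches your logs in a searchable form. The sketch below shows one way a top-level handler might serialize an exception and its context into a single JSON log line. The function name `error_to_log_line` and the field names are assumptions for illustration, not a standard format.

```python
import json
import traceback
from typing import Any

def error_to_log_line(exc: BaseException, correlation_id: str, **details: Any) -> str:
    """Serialize an exception plus its context into one JSON log line."""
    payload = {
        "message": str(exc),
        "error_type": type(exc).__name__,
        "correlation_id": correlation_id,
        "details": details,
        # __cause__ is set by "raise ... from ..."
        "cause": repr(exc.__cause__) if exc.__cause__ else None,
        # Built from the exception object, so no active except block is needed
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }
    # default=str keeps the line serializable if details hold dates, UUIDs, etc.
    return json.dumps(payload, default=str)
```

A caller would typically pass the result to `logger.error(...)` at the outermost boundary, so each failure produces exactly one structured record that can be looked up by `correlation_id`.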

Retry Strategies and Circuit Breakers

Retries are essential for transient failures but dangerous when misconfigured. Always use exponential backoff with jitter to prevent thundering herds.

import random
import time
from typing import TypeVar, Callable, Optional
from dataclasses import dataclass
from enum import Enum

T = TypeVar('T')

class CircuitState(Enum):
    CLOSED = 'closed'      # Normal operation
    OPEN = 'open'          # Failing, reject requests
    HALF_OPEN = 'half_open'  # Testing if service recovered

@dataclass
class RetryConfig:
    max_attempts: int = 3
    base_delay_seconds: float = 1.0
    max_delay_seconds: float = 60.0
    exponential_base: float = 2.0
    jitter_factor: float = 0.1

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self.state = CircuitState.CLOSED
    
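    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # Monotonic clock: unaffected by system clock adjustments
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        # HALF_OPEN: let trial requests through. This simple version does not
        # limit them to one; the next success or failure decides the state.
        return True
    
    def record_success(self) -> None:
        self.failure_count = 0
        self.state = CircuitState.CLOSED
    
    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        # A failed trial request in HALF_OPEN re-opens the circuit immediately
        if (self.state == CircuitState.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = CircuitState.OPEN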

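class CircuitOpenError(Exception):
    """Raised when the circuit breaker rejects a call."""

def retry_with_circuit_breaker(
    operation: Callable[[], T],
    config: RetryConfig,
    circuit: CircuitBreaker,
    retryable_exceptions: tuple = (Exception,)
) -> T:
    last_exception: Optional[Exception] = None
    for attempt in range(config.max_attempts):
        # Re-check before every attempt: earlier failures may have opened the circuit
        if not circuit.can_execute():
            raise CircuitOpenError("Circuit breaker is open") from last_exception
        try:
            result = operation()
            circuit.record_success()
            return result
        except retryable_exceptions as e:
            last_exception = e
            circuit.record_failure()
            
            if attempt < config.max_attempts - 1:
                # Exponential backoff, capped at max_delay_seconds
                delay = min(
                    config.base_delay_seconds * (config.exponential_base ** attempt),
                    config.max_delay_seconds
                )
                jitter = delay * config.jitter_factor * random.random()
                time.sleep(delay + jitter)
    
    if last_exception is None:
        # Only reachable when max_attempts < 1, so the loop never ran
        raise ValueError("max_attempts must be at least 1")
    raise last_exception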
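
To see what the backoff arithmetic produces, the sketch below computes the delay schedule on its own, without sleeping. It mirrors the formula used in the retry loop; the helper names `backoff_delay` and `backoff_schedule` are made up for this example.

```python
import random
from typing import List

def backoff_delay(
    attempt: int,
    base_delay: float = 1.0,
    exponential_base: float = 2.0,
    max_delay: float = 60.0,
    jitter_factor: float = 0.1,
) -> float:
    """Delay before the retry that follows a failed attempt (0-indexed)."""
    # Exponential growth, capped so long outages do not produce huge waits
    delay = min(base_delay * (exponential_base ** attempt), max_delay)
    # Additive jitter: up to jitter_factor * delay extra, to spread clients out
    return delay + delay * jitter_factor * random.random()

def backoff_schedule(attempts: int, **kwargs: float) -> List[float]:
    """Delays for a run of consecutive failures."""
    return [backoff_delay(a, **kwargs) for a in range(attempts)]

# With jitter disabled, the schedule is deterministic:
# backoff_schedule(8, jitter_factor=0.0)
# → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

With the defaults, the seventh delay would be 64 seconds uncapped, so `max_delay` takes over from that point. The 10% additive jitter is a small spread; some systems randomize the whole delay instead, which is a design choice this sketch does not cover.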

Designing Error-Resilient APIs

Consistent error responses make APIs predictable and debuggable:

from dataclasses import dataclass
from typing import Optional, List
from enum import Enum

@dataclass
class APIError:
    code: str
    message: str
    details: Optional[dict] = None
    field: Optional[str] = None

@dataclass
class APIErrorResponse:
    error: APIError
    request_id: str
    
    def to_dict(self) -> dict:
        response = {
            'error': {
                'code': self.error.code,
                'message': self.error.message,
            },
            'request_id': self.request_id
        }
        if self.error.details:
            response['error']['details'] = self.error.details
        if self.error.field:
            response['error']['field'] = self.error.field
        return response

# Map internal errors to API responses
def to_api_error(error: ApplicationError, request_id: str) -> tuple[int, dict]:
    if isinstance(error, ValidationError):
        return 400, APIErrorResponse(
            error=APIError(code=error.code, message=error.message, field=error.details.get('field')),
            request_id=request_id
        ).to_dict()
    elif isinstance(error, ResourceNotFoundError):
        return 404, APIErrorResponse(
            error=APIError(code=error.code, message=error.message),
            request_id=request_id
        ).to_dict()
    else:
        # Don't leak internal details for unexpected errors
        return 500, APIErrorResponse(
            error=APIError(code='INTERNAL_ERROR', message='An unexpected error occurred'),
            request_id=request_id
        ).to_dict()
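
One way to apply this mapping is a single wrapper at the request boundary, so every response carries a `request_id` and unexpected errors never leak details. The sketch below is framework-agnostic and uses trimmed stand-ins; `handle`, `error_body`, and the `view` callable are hypothetical names, not part of any web framework.

```python
import json
import logging
from typing import Callable, Tuple

logger = logging.getLogger(__name__)

class ResourceNotFoundError(Exception):
    """Trimmed stand-in for the application's not-found error."""

def error_body(code: str, message: str, request_id: str) -> dict:
    return {"error": {"code": code, "message": message}, "request_id": request_id}

def handle(view: Callable[[], dict], request_id: str) -> Tuple[int, dict]:
    """Run a view and translate any failure into a consistent error response."""
    try:
        return 200, {"data": view(), "request_id": request_id}
    except ResourceNotFoundError as e:
        # Expected domain error: the message is safe to show to the client
        return 404, error_body("RESOURCE_NOT_FOUND", str(e), request_id)
    except Exception:
        # Unexpected bug: log full details server-side, return a generic message
        logger.exception("Unhandled error for request %s", request_id)
        return 500, error_body("INTERNAL_ERROR", "An unexpected error occurred", request_id)

# Stand-in views for demonstration
def missing_order() -> dict:
    raise ResourceNotFoundError("Order with id '42' not found")

def buggy_view() -> dict:
    return {"ratio": 1 / 0}
```

Catching `Exception` is reasonable here because this is the outermost boundary: the error is logged with its traceback and tied to the `request_id` the client sees, rather than swallowed.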

Error handling isn’t glamorous work, but it’s what separates production-ready systems from prototypes. Invest the time upfront, and your future self—debugging at 2 AM—will thank you.
