Error Handling: Strategies and Best Practices
Key Insights
- Error handling is a design decision, not an afterthought—choose between fail-fast and fail-safe strategies based on your system’s requirements and failure domains
- Result types make errors explicit in your function signatures, eliminating the hidden control flow that exceptions create and forcing callers to handle failure cases
- Every error should carry enough context for debugging while every retry strategy needs limits—unbounded retries and context-free errors are two of the fastest paths to production incidents
Introduction: Why Error Handling Matters
Poor error handling costs more than most teams realize. It manifests as data corruption when partial operations complete without rollback, security vulnerabilities when error messages leak internal details, and terrible user experiences when systems fail silently or with cryptic messages.
Before diving into strategies, let’s clarify terminology. An error is an incorrect or unexpected condition in your program. An exception is a language mechanism for handling errors by transferring control up the call stack. A failure is when a system can no longer fulfill its intended function. These distinctions matter because different situations call for different responses.
The goal of error handling isn’t to prevent all errors—that’s impossible. The goal is to maintain system integrity, provide useful feedback, and enable recovery when things go wrong.
Fail Fast vs. Fail Safe Strategies
The fail-fast philosophy says: when something goes wrong, stop immediately and loudly. Don’t try to continue with potentially corrupted state. This approach prevents cascading failures and makes bugs easier to find.
Fail-safe takes the opposite stance: when something goes wrong, degrade gracefully and keep the system running. This prioritizes availability over correctness.
Neither approach is universally correct. The right choice depends on context:
Fail fast when:
- You’re at a system boundary (API input, file parsing, configuration loading)
- Data integrity is paramount
- Continuing would make the problem worse
- You’re in development or testing environments
Fail safe when:
- Partial functionality is better than none
- The failing component is non-critical
- You have fallback mechanisms
- User experience depends on availability
# Fail fast: Input validation at system boundaries
from dataclasses import dataclass
from typing import Optional
import re


class ValidationError(Exception):
    def __init__(self, field: str, message: str):
        self.field = field
        self.message = message
        super().__init__(f"{field}: {message}")


@dataclass
class CreateUserRequest:
    email: str
    password: str
    age: Optional[int] = None

    def validate(self) -> None:
        """Fail fast on invalid input - don't let bad data propagate."""
        if not self.email or not re.match(r'^[^@]+@[^@]+\.[^@]+$', self.email):
            raise ValidationError('email', 'Invalid email format')
        if not self.password or len(self.password) < 12:
            raise ValidationError('password', 'Password must be at least 12 characters')
        if self.age is not None and (self.age < 0 or self.age > 150):
            raise ValidationError('age', 'Age must be between 0 and 150')


# Usage: validate at the boundary, trust internally.
# User and user_repository stand in for your own model and persistence layer.
def create_user(request: CreateUserRequest) -> "User":
    request.validate()  # Fails fast here if invalid
    # From this point, we trust the data is valid
    return user_repository.create(request)
For APIs, fail fast on bad input but fail safe on non-critical downstream services. For background jobs, fail fast and let the job scheduler handle retries. For user-facing apps, fail safe with graceful degradation whenever possible.
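The fail-safe side of that advice can be sketched as a fallback around a non-critical downstream call. This is only an illustration under stated assumptions: `RecommendationServiceError`, `get_recommendations`, the two fetch functions, and the default list are hypothetical names, not part of any real service.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)

# Hypothetical default shown when the (assumed) recommendation service is down.
DEFAULT_RECOMMENDATIONS: List[str] = ["bestsellers"]


class RecommendationServiceError(Exception):
    """Hypothetical error raised by a non-critical downstream dependency."""


def get_recommendations(user_id: str, fetch: Callable[[str], List[str]]) -> List[str]:
    """Fail safe: degrade to a static default instead of failing the whole page."""
    try:
        return fetch(user_id)
    except RecommendationServiceError:
        # Log with the traceback so the degradation stays visible, then fall back.
        logger.warning("Recommendations unavailable for user %s", user_id, exc_info=True)
        return list(DEFAULT_RECOMMENDATIONS)


def broken_fetch(user_id: str) -> List[str]:
    """Simulates the downstream service being unavailable."""
    raise RecommendationServiceError("service timed out")


def working_fetch(user_id: str) -> List[str]:
    """Simulates a healthy downstream service."""
    return ["item-1", "item-2"]
```

Only the one expected failure type is caught, so programming mistakes inside `fetch` still surface instead of being silently replaced by the default.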
Exception Handling Patterns
The most common exception handling mistake is catching too broadly. Generic catch blocks hide bugs and make debugging a nightmare.
# Bad: Swallows all errors including programming mistakes
try:
    process_payment(order)
except Exception:
    logger.error("Payment failed")
    return None  # What actually happened?

# Good: Catch specific exceptions, handle appropriately
try:
    process_payment(order)
except PaymentDeclinedException as e:
    notify_user_payment_declined(order, e.decline_reason)
    return PaymentResult.declined(e.decline_reason)
except PaymentGatewayTimeoutException:
    schedule_payment_retry(order)
    return PaymentResult.pending()
except InsufficientFundsException:
    return PaymentResult.insufficient_funds()
# Let unexpected exceptions propagate - they're bugs that need fixing
Build an exception hierarchy that reflects your domain:
from typing import Optional


class ApplicationError(Exception):
    """Base class for all application-specific errors."""
    def __init__(self, message: str, code: str, details: Optional[dict] = None):
        self.message = message
        self.code = code
        self.details = details or {}
        super().__init__(message)


class ValidationError(ApplicationError):
    """Input validation failures."""
    def __init__(self, field: str, message: str):
        super().__init__(
            message=message,
            code='VALIDATION_ERROR',
            details={'field': field}
        )


class BusinessRuleViolation(ApplicationError):
    """Domain logic violations."""
    pass


class ResourceNotFoundError(ApplicationError):
    """Requested resource doesn't exist."""
    def __init__(self, resource_type: str, resource_id: str):
        super().__init__(
            message=f"{resource_type} with id '{resource_id}' not found",
            code='RESOURCE_NOT_FOUND',
            details={'resource_type': resource_type, 'resource_id': resource_id}
        )


class ExternalServiceError(ApplicationError):
    """Failures in external dependencies."""
    def __init__(self, service: str, message: str, retryable: bool = False):
        super().__init__(
            message=f"{service}: {message}",
            code='EXTERNAL_SERVICE_ERROR',
            details={'service': service, 'retryable': retryable}
        )
The decision tree for exception handling is simple: Can you meaningfully handle this error? If yes, handle it. If no, let it propagate. “Handling” means taking corrective action, not just logging and continuing.
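One way to read that decision tree in code: a handler either takes corrective action or stays out of the way. The sketch below is illustrative only; `load_settings` and `DEFAULT_SETTINGS` are hypothetical names. A missing file has a well-defined recovery (use defaults), so it is handled. A corrupt file does not, so the parse error is left to propagate.

```python
import json
from typing import Any, Dict

# Hypothetical defaults used when no settings file exists.
DEFAULT_SETTINGS: Dict[str, Any] = {"theme": "light"}


def load_settings(path: str) -> Dict[str, Any]:
    """Handle what has a real recovery; let everything else propagate."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        # Meaningful handling: a missing file has a well-defined fallback.
        return dict(DEFAULT_SETTINGS)
    # json.JSONDecodeError is deliberately not caught: a corrupt file has no
    # safe recovery here, so the caller (or a top-level handler) must see it.
```

Catching `json.JSONDecodeError` here just to log it and return defaults would hide a corrupted file from whoever needs to fix it.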
Result Types and Explicit Error Handling
Exceptions create hidden control flow. A function that throws can exit at any point, and nothing in its signature tells you this. Result types make errors explicit.
from dataclasses import dataclass
from typing import TypeVar, Generic, Callable, Union

T = TypeVar('T')
E = TypeVar('E')
U = TypeVar('U')


@dataclass
class Ok(Generic[T]):
    value: T

    def is_ok(self) -> bool:
        return True

    def is_err(self) -> bool:
        return False


@dataclass
class Err(Generic[E]):
    error: E

    def is_ok(self) -> bool:
        return False

    def is_err(self) -> bool:
        return True


Result = Union[Ok[T], Err[E]]


def map_result(result: Result[T, E], f: Callable[[T], U]) -> Result[U, E]:
    """Transform the success value, pass through errors."""
    if isinstance(result, Ok):
        return Ok(f(result.value))
    return result


def flat_map_result(result: Result[T, E], f: Callable[[T], Result[U, E]]) -> Result[U, E]:
    """Chain operations that might fail."""
    if isinstance(result, Ok):
        return f(result.value)
    return result


# Usage
def parse_int(s: str) -> Result[int, str]:
    try:
        return Ok(int(s))
    except ValueError:
        return Err(f"Cannot parse '{s}' as integer")


def divide(a: int, b: int) -> Result[float, str]:
    if b == 0:
        return Err("Division by zero")
    return Ok(a / b)


# Composing fallible operations
def calculate(a_str: str, b_str: str) -> Result[float, str]:
    a_result = parse_int(a_str)
    if isinstance(a_result, Err):
        return a_result
    b_result = parse_int(b_str)
    if isinstance(b_result, Err):
        return b_result
    return divide(a_result.value, b_result.value)
Result types shine in functional pipelines and when you want static guarantees that errors are handled: from the compiler in languages such as Rust, or from a type checker such as mypy in Python. They’re verbose in languages without pattern matching, but the explicitness pays off in complex systems.
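The explicit `isinstance` checks in `calculate` can also be expressed by chaining. The sketch below repeats trimmed-down versions of `Ok`, `Err`, and `flat_map_result` so that it runs on its own. The nested-lambda style is one possible way to compose the steps, not the only one.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")
U = TypeVar("U")
E = TypeVar("E")


@dataclass
class Ok(Generic[T]):
    value: T


@dataclass
class Err(Generic[E]):
    error: E


Result = Union[Ok[T], Err[E]]


def flat_map_result(result: Result[T, E], f: Callable[[T], Result[U, E]]) -> Result[U, E]:
    """Run f on the success value; pass an Err through untouched."""
    if isinstance(result, Ok):
        return f(result.value)
    return result


def parse_int(s: str) -> Result[int, str]:
    try:
        return Ok(int(s))
    except ValueError:
        return Err(f"Cannot parse '{s}' as integer")


def divide(a: int, b: int) -> Result[float, str]:
    if b == 0:
        return Err("Division by zero")
    return Ok(a / b)


def calculate(a_str: str, b_str: str) -> Result[float, str]:
    # Each lambda runs only if the previous step succeeded.
    return flat_map_result(
        parse_int(a_str),
        lambda a: flat_map_result(parse_int(b_str), lambda b: divide(a, b)),
    )


print(calculate("10", "4"))   # Ok(value=2.5)
print(calculate("10", "0"))   # Err(error='Division by zero')
print(calculate("ten", "4"))  # Err(error="Cannot parse 'ten' as integer")
```

The first failure short-circuits the chain: when `parse_int(a_str)` returns an `Err`, the second string is never parsed.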
Error Context and Observability
Raw stack traces aren’t enough. You need context: what operation was attempted, with what parameters, in what state?
import traceback
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone
from typing import Optional, Any


@dataclass
class ErrorContext:
    operation: str
    correlation_id: str
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    details: dict = field(default_factory=dict)
    cause: Optional[Exception] = None

    def with_detail(self, key: str, value: Any) -> 'ErrorContext':
        """Return a copy with one more detail; the original is left unchanged."""
        return replace(self, details={**self.details, key: value})


class ContextualError(Exception):
    def __init__(self, message: str, context: ErrorContext):
        self.context = context
        super().__init__(message)

    def to_log_dict(self) -> dict:
        cause = self.context.cause
        return {
            'message': str(self),
            'operation': self.context.operation,
            'correlation_id': self.context.correlation_id,
            'timestamp': self.context.timestamp.isoformat(),
            'details': self.context.details,
            'cause': str(cause) if cause else None,
            # Format the cause's own traceback so this also works outside an except block
            'stack_trace': ''.join(
                traceback.format_exception(type(cause), cause, cause.__traceback__)
            ) if cause else None,
        }


# Usage with context enrichment.
# fetch_order, check_inventory, charge_customer and PaymentException are application-specific.
def process_order(order_id: str, correlation_id: str) -> None:
    ctx = ErrorContext(operation='process_order', correlation_id=correlation_id)
    ctx = ctx.with_detail('order_id', order_id)
    try:
        order = fetch_order(order_id)
        ctx = ctx.with_detail('customer_id', order.customer_id)
        check_inventory(order.items)
        ctx = ctx.with_detail('inventory_check', 'passed')
        charge_customer(order)
    except PaymentException as e:
        # Attach the cause while keeping the original timestamp and details
        ctx = replace(ctx, cause=e)
        raise ContextualError(f"Payment failed for order {order_id}", ctx) from e
Retry Strategies and Circuit Breakers
Retries are essential for transient failures but dangerous when misconfigured. Always use exponential backoff with jitter to prevent thundering herds.
import random
import time
from typing import TypeVar, Callable, Optional
from dataclasses import dataclass
from enum import Enum

T = TypeVar('T')


class CircuitOpenError(Exception):
    """Raised when the circuit breaker rejects a call."""


class CircuitState(Enum):
    CLOSED = 'closed'        # Normal operation
    OPEN = 'open'            # Failing, reject requests
    HALF_OPEN = 'half_open'  # Testing if service recovered


@dataclass
class RetryConfig:
    max_attempts: int = 3
    base_delay_seconds: float = 1.0
    max_delay_seconds: float = 60.0
    exponential_base: float = 2.0
    jitter_factor: float = 0.1


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.last_failure_time: Optional[float] = None
        self.state = CircuitState.CLOSED

    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # monotonic() is unaffected by wall-clock adjustments
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        return True  # HALF_OPEN: let trial requests through

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN


def retry_with_circuit_breaker(
    operation: Callable[[], T],
    config: RetryConfig,
    circuit: CircuitBreaker,
    retryable_exceptions: tuple = (Exception,)
) -> T:
    if config.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_exception: Optional[Exception] = None
    for attempt in range(config.max_attempts):
        # Re-check before every attempt: earlier failures may have opened the circuit
        if not circuit.can_execute():
            raise CircuitOpenError("Circuit breaker is open") from last_exception
        try:
            result = operation()
            circuit.record_success()
            return result
        except retryable_exceptions as e:
            last_exception = e
            circuit.record_failure()
            if attempt < config.max_attempts - 1:
                delay = min(
                    config.base_delay_seconds * (config.exponential_base ** attempt),
                    config.max_delay_seconds
                )
                jitter = delay * config.jitter_factor * random.random()
                time.sleep(delay + jitter)
    raise last_exception
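To make the backoff arithmetic in the retry loop concrete, the sketch below computes the sleep schedule without actually sleeping. `backoff_delays` is a hypothetical helper written for illustration. Its defaults mirror `RetryConfig`, except for `max_attempts`, which is raised to 5 so the doubling is visible.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    jitter_factor: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Delays to sleep between attempts: capped exponential plus random jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts - 1):  # no sleep after the final attempt
        delay = min(base_delay * exponential_base ** attempt, max_delay)
        # Jitter adds between 0% and jitter_factor * 100% of the delay
        delays.append(delay + delay * jitter_factor * rng.random())
    return delays


# With jitter disabled the schedule is deterministic.
print(backoff_delays(max_attempts=5, jitter_factor=0.0))  # [1.0, 2.0, 4.0, 8.0]
```

The cap matters for long retry sequences: with ten attempts and no jitter, the last delay is 60 seconds, not the uncapped 256.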
Designing Error-Resilient APIs
Consistent error responses make APIs predictable and debuggable:
from dataclasses import dataclass
from typing import Optional


@dataclass
class APIError:
    code: str
    message: str
    details: Optional[dict] = None
    field: Optional[str] = None


@dataclass
class APIErrorResponse:
    error: APIError
    request_id: str

    def to_dict(self) -> dict:
        response = {
            'error': {
                'code': self.error.code,
                'message': self.error.message,
            },
            'request_id': self.request_id
        }
        if self.error.details:
            response['error']['details'] = self.error.details
        if self.error.field:
            response['error']['field'] = self.error.field
        return response


# Map internal errors to API responses
def to_api_error(error: ApplicationError, request_id: str) -> tuple[int, dict]:
    if isinstance(error, ValidationError):
        return 400, APIErrorResponse(
            error=APIError(code=error.code, message=error.message, field=error.details.get('field')),
            request_id=request_id
        ).to_dict()
    elif isinstance(error, ResourceNotFoundError):
        return 404, APIErrorResponse(
            error=APIError(code=error.code, message=error.message),
            request_id=request_id
        ).to_dict()
    else:
        # Don't leak internal details for unexpected errors
        return 500, APIErrorResponse(
            error=APIError(code='INTERNAL_ERROR', message='An unexpected error occurred'),
            request_id=request_id
        ).to_dict()
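For reference, here is a standalone sketch of the JSON body a client would receive for a validation failure. `build_error_body` is a hypothetical helper that mirrors the shape produced by `to_dict` above, and the request ID is a made-up value.

```python
import json
from typing import Any, Dict, Optional


def build_error_body(
    code: str,
    message: str,
    request_id: str,
    field: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the same envelope as APIErrorResponse.to_dict (details omitted)."""
    body: Dict[str, Any] = {
        "error": {"code": code, "message": message},
        "request_id": request_id,
    }
    if field:
        # Optional keys are left out entirely rather than sent as null
        body["error"]["field"] = field
    return body


body = build_error_body("VALIDATION_ERROR", "Invalid email format", "req-123", field="email")
print(json.dumps(body, indent=2))
```

Clients can branch on the stable `code` value and quote `request_id` in support requests, while `message` stays free to change.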
Error handling isn’t glamorous work, but it’s what separates production-ready systems from prototypes. Invest the time upfront, and your future self—debugging at 2 AM—will thank you.