Design a Metrics and Monitoring System

Key Insights

  • A metrics system has five core components—collectors, time-series database, query layer, alerting engine, and dashboards—each with distinct scaling characteristics and failure modes you must design for independently.
  • The choice between push and pull collection models affects everything from service discovery to failure detection; pull-based systems like Prometheus excel at reliability, while push-based systems handle ephemeral workloads better.
  • Cardinality is the silent killer of metrics systems—a single high-cardinality label can explode storage costs and query latency by orders of magnitude, so enforce label governance from day one.

Why Metrics Matter

Observability rests on three pillars: metrics, logs, and traces. While logs tell you what happened and traces show you the path through your system, metrics answer the fundamental question: “Is my system healthy right now?”

Metrics are time-series data—numerical measurements collected at regular intervals. They’re compact, predictable in storage cost, and fast to query. When your pager goes off at 3 AM, you’re not grepping logs first. You’re looking at dashboards showing request rates, error percentages, and latency distributions.

Beyond incident response, metrics drive capacity planning, SLA reporting, and performance optimization. They’re the foundation of any serious production system. Let’s design one.

Core Architecture Components

A metrics system consists of five interconnected components:

Collectors/Agents run alongside your applications, gathering measurements and forwarding them to storage. These might be sidecars, embedded libraries, or standalone daemons.

Time-Series Database (TSDB) stores metrics efficiently, optimized for high write throughput and time-range queries. Examples include Prometheus, InfluxDB, and TimescaleDB.

Query Layer provides an interface for retrieving and aggregating data, typically with a specialized query language.

Alerting Engine evaluates rules against incoming data and triggers notifications when thresholds are breached.

Visualization Dashboard renders charts and graphs for human consumption—Grafana being the dominant player.

The data flows linearly: applications emit metrics → collectors aggregate and forward → TSDB stores → query layer serves both the alerting engine and dashboards.

Here’s the fundamental data model every metrics system shares:

type Metric struct {
    Name      string            // e.g., "http_requests_total"
    Labels    map[string]string // e.g., {"method": "GET", "status": "200"}
    Value     float64           // the measurement
    Timestamp int64             // Unix milliseconds
}

type TimeSeries struct {
    Name   string
    Labels map[string]string
    Points []DataPoint
}

type DataPoint struct {
    Timestamp int64
    Value     float64
}

Labels (or tags) are the key differentiator from simple metrics. They enable dimensional queries: “Show me the 99th percentile latency for POST requests to the /api/users endpoint in the us-east-1 region.” Without labels, you’d need a separate metric name for every combination.
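
A dimensional query is, at bottom, series selection by label predicates. Here is a minimal sketch of that selection step (the `match_series` helper and the series dicts are illustrative, not a real query engine):

```python
def match_series(series_list, name, label_filters):
    """Select series whose name matches and whose labels satisfy every filter."""
    return [
        s for s in series_list
        if s["name"] == name
        and all(s["labels"].get(k) == v for k, v in label_filters.items())
    ]

series = [
    {"name": "http_request_duration_seconds",
     "labels": {"method": "POST", "path": "/api/users", "region": "us-east-1"}},
    {"name": "http_request_duration_seconds",
     "labels": {"method": "GET", "path": "/api/users", "region": "us-east-1"}},
]

# "POST requests in us-east-1" -- one predicate per dimension
selected = match_series(series, "http_request_duration_seconds",
                        {"method": "POST", "region": "us-east-1"})
```

A real query layer adds an aggregation stage on top of this selection, but the label-matching core is exactly this shape.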

Metrics Collection Patterns

Two models dominate metrics collection: pull and push.

Pull-based (Prometheus-style): The metrics server scrapes HTTP endpoints exposed by your applications. Your app maintains current metric values in memory; the collector fetches them periodically.

Push-based (StatsD-style): Applications actively send metrics to a collector. The collector aggregates them before forwarding to storage.

Pull works well when you control the infrastructure and have reliable service discovery. It provides built-in health checking—if a scrape fails, you know the target is down. Push works better for serverless functions, batch jobs, and environments where targets can’t expose HTTP endpoints.
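
For contrast with the pull-based instrumentation shown next, here is what a minimal push-based emitter looks like. This sketches the StatsD text protocol (`name:value|type` over UDP); 8125 is the conventional StatsD port, and the metric names are illustrative:

```python
import socket

def format_statsd(name: str, value: float, metric_type: str = "c") -> str:
    """StatsD wire format: '<name>:<value>|<type>' (c=counter, ms=timing, g=gauge)."""
    return f"{name}:{value}|{metric_type}"

def push_metric(name: str, value: float, metric_type: str = "c",
                host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget one UDP datagram to the collector.
    No delivery guarantee -- the accepted trade-off for push metrics."""
    payload = format_statsd(name, value, metric_type).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(payload, (host, port))
        except OSError:
            pass  # never let metrics emission break the application

push_metric("jobs.batch_completed", 1)       # counter increment
push_metric("jobs.duration_ms", 342, "ms")   # timing sample
```

Note the failure-mode asymmetry: a pull scraper notices a dead target immediately, while a push emitter that dies simply goes silent.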

Here’s practical instrumentation for an HTTP handler using a pull-based approach:

package middleware

import (
    "net/http"
    "strconv"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "path", "status"},
    )
    
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency distribution",
            Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
        },
        []string{"method", "path"},
    )
)

func MetricsMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        
        // Wrap response writer to capture status code
        wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
        
        next.ServeHTTP(wrapped, r)
        
        duration := time.Since(start).Seconds()
        status := strconv.Itoa(wrapped.statusCode)
        
        // In production, prefer the route template (e.g. "/users/:id") over
        // r.URL.Path -- raw paths are an unbounded, cardinality-exploding label.
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, status).Inc()
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
    })
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

Notice the histogram buckets are predefined. This is critical—histograms in Prometheus are cumulative counters per bucket, not raw values. Choose buckets that match your SLO boundaries.

Time-Series Storage Design

Time-series databases face unique challenges. Write throughput is massive: a busy fleet produces hundreds of thousands of samples every second. Reads are almost always time-bounded ranges. And data has natural expiration; last month's per-second data rarely matters.

Modern TSDBs use variations of Log-Structured Merge trees (LSM) with time-based partitioning. Data arrives in memory buffers, gets periodically flushed to immutable blocks on disk, and those blocks are compacted over time.

Compression is essential. Time-series data compresses remarkably well because timestamps are sequential and values often change slowly. Gorilla compression (used by Prometheus) achieves 12x compression by storing deltas of deltas for timestamps and XOR of consecutive values.
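
The timestamp half of that scheme is easy to sketch. Store the first timestamp, then the first delta, then only deltas-of-deltas; for a regularly scraped series those are almost all zero, which the real format then bit-packs into almost nothing. (Illustrative sketch of the idea only, not Gorilla's actual bit encoding.)

```python
def delta_of_delta_encode(timestamps):
    """Encode timestamps as (first, first_delta, [deltas-of-deltas]).
    A regular scrape interval makes every delta-of-delta zero."""
    if len(timestamps) < 2:
        return (timestamps[0] if timestamps else None, None, [])
    first = timestamps[0]
    first_delta = timestamps[1] - first
    dods = []
    prev_delta = first_delta
    for i in range(2, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        dods.append(delta - prev_delta)
        prev_delta = delta
    return (first, first_delta, dods)

# A clean 15-second scrape interval collapses to three zeros
print(delta_of_delta_encode([1000, 1015, 1030, 1045, 1060]))
# (1000, 15, [0, 0, 0])
```

The value half works analogously: XOR each value with its predecessor, and slowly changing gauges produce long runs of zero bits.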

Retention policies with downsampling keep storage costs manageable:

from dataclasses import dataclass
from typing import List
import time

@dataclass
class DataPoint:
    timestamp: int  # Unix seconds
    value: float

@dataclass  
class RetentionPolicy:
    name: str
    duration_seconds: int
    resolution_seconds: int

class TimeSeriesBucket:
    def __init__(self, policies: List[RetentionPolicy]):
        self.policies = sorted(policies, key=lambda p: p.resolution_seconds)
        self.data: dict[str, List[DataPoint]] = {p.name: [] for p in policies}
    
    def insert(self, point: DataPoint):
        # Insert into highest resolution bucket
        self.data[self.policies[0].name].append(point)
    
    def downsample(self):
        """Roll up data from fine-grained to coarse-grained buckets."""
        for i, policy in enumerate(self.policies[:-1]):
            next_policy = self.policies[i + 1]
            source_data = self.data[policy.name]
            # Track windows already rolled up so repeated calls don't duplicate
            done = {p.timestamp for p in self.data[next_policy.name]}
            
            # Group points by target resolution window
            windows: dict[int, List[float]] = {}
            for point in source_data:
                window_start = (point.timestamp // next_policy.resolution_seconds) * next_policy.resolution_seconds
                if window_start in done:
                    continue
                windows.setdefault(window_start, []).append(point.value)
            
            # Aggregate each window (using average here; could be sum, max, etc.)
            for window_start, values in windows.items():
                avg_value = sum(values) / len(values)
                self.data[next_policy.name].append(
                    DataPoint(timestamp=window_start, value=avg_value)
                )
    
    def expire(self, current_time: int):
        """Remove data older than retention allows."""
        for policy in self.policies:
            cutoff = current_time - policy.duration_seconds
            self.data[policy.name] = [
                p for p in self.data[policy.name] 
                if p.timestamp >= cutoff
            ]

# Example: Keep 15s resolution for 24h, 1m for 7d, 5m for 30d
policies = [
    RetentionPolicy("raw", duration_seconds=86400, resolution_seconds=15),
    RetentionPolicy("1m", duration_seconds=604800, resolution_seconds=60),
    RetentionPolicy("5m", duration_seconds=2592000, resolution_seconds=300),
]
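
The window arithmetic at the heart of that rollup is worth checking in isolation: 15-second raw points grouped into 60-second windows and averaged. A standalone sketch of the same grouping logic:

```python
def rollup(points, resolution):
    """points: list of (timestamp, value) tuples.
    Returns {window_start: average} at the coarser resolution."""
    windows = {}
    for ts, value in points:
        window_start = (ts // resolution) * resolution
        windows.setdefault(window_start, []).append(value)
    return {w: sum(vs) / len(vs) for w, vs in windows.items()}

raw = [(0, 10.0), (15, 20.0), (30, 30.0), (45, 40.0), (60, 50.0)]
print(rollup(raw, 60))  # {0: 25.0, 60: 50.0}
```

One caveat the averaging hides: downsampled averages are fine for gauges, but counters should be rolled up with last-value or sum semantics, or rates computed before downsampling will disagree with rates computed after.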

Query and Aggregation Layer

Metrics queries follow predictable patterns: rates of change, percentile distributions, aggregations across label dimensions, and arithmetic combinations.

PromQL (Prometheus Query Language) has become the de facto standard. Understanding its core functions helps you design any query layer:

from typing import List, Optional
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float
    value: float

def rate(samples: List[Sample], range_seconds: float) -> Optional[float]:
    """
    Calculate per-second rate of increase for a counter.
    
    This handles counter resets (when a process restarts and 
    the counter goes back to zero).
    """
    if len(samples) < 2:
        return None
    
    # Sort by timestamp
    sorted_samples = sorted(samples, key=lambda s: s.timestamp)
    
    total_increase = 0.0
    for i in range(1, len(sorted_samples)):
        prev, curr = sorted_samples[i-1], sorted_samples[i]
        
        if curr.value >= prev.value:
            # Normal case: counter increased
            total_increase += curr.value - prev.value
        else:
            # Counter reset: assume it went to zero and back up
            total_increase += curr.value
    
    time_span = sorted_samples[-1].timestamp - sorted_samples[0].timestamp
    if time_span == 0:
        return None
    
    return total_increase / time_span

def histogram_quantile(quantile: float, bucket_counts: dict[float, int]) -> float:
    """
    Calculate quantile from histogram bucket counts.
    
    bucket_counts: {le_boundary: cumulative_count}
    e.g., {0.1: 50, 0.5: 150, 1.0: 180, float('inf'): 200}
    """
    sorted_buckets = sorted(bucket_counts.items())
    total = sorted_buckets[-1][1]
    target = quantile * total
    
    prev_bound, prev_count = 0.0, 0
    for bound, count in sorted_buckets:
        if count >= target:
            if bound == float('inf'):
                # Can't interpolate into the +Inf bucket; clamp to the
                # highest finite boundary instead
                return prev_bound
            # Linear interpolation within bucket
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    
    return sorted_buckets[-1][0]
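
Walking through the docstring's bucket example by hand for the median (q = 0.5): the target rank is 0.5 × 200 = 100 samples, which lands in the 0.5 bucket; interpolating between the 0.1 and 0.5 boundaries gives 0.1 + (0.5 − 0.1) × (100 − 50) / (150 − 50) = 0.3. The same arithmetic, spelled out:

```python
# Worked example: median from cumulative bucket counts
# buckets: {0.1: 50, 0.5: 150, 1.0: 180, inf: 200}
target = 0.5 * 200                  # rank of the median sample: 100
prev_bound, prev_count = 0.1, 50    # last bucket fully below the target
bound, count = 0.5, 150             # first bucket reaching the target
fraction = (target - prev_count) / (count - prev_count)   # 0.5
median = prev_bound + (bound - prev_bound) * fraction
print(round(median, 6))  # 0.3
```

The interpolation assumes samples are uniformly distributed within a bucket, which is why bucket boundaries should straddle your SLO thresholds: a quantile that falls exactly on a boundary is exact, one that falls mid-bucket is an estimate.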

Pre-computation matters at scale. Recording rules in Prometheus let you pre-calculate expensive queries and store the results as new metrics. If your dashboard shows the 99th percentile latency grouped by service, compute that every minute rather than on every page load.
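
A recording rule for exactly that dashboard query might look like this (metric and label names are illustrative; the `level:metric:operation` naming convention is the standard Prometheus one):

```yaml
groups:
  - name: latency_rollups
    interval: 60s
    rules:
      - record: service:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```

Dashboards then query the cheap pre-computed series instead of re-aggregating every bucket on every page load.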

Alerting System Design

Alerting converts metrics into action. A well-designed alerting system manages state transitions, avoids noise, and routes notifications intelligently.

# Alert rule configuration schema
groups:
  - name: api_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) 
          / sum(rate(http_requests_total[5m])) > 0.01          
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Error rate exceeds 1%"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

The rule above reads: if errors exceed 1% of requests continuously for five minutes, page the platform team. The engine that evaluates such rules is a small state machine:

from enum import Enum
from dataclasses import dataclass, field
from typing import Optional
import time

class AlertState(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"  
    FIRING = "firing"
    RESOLVED = "resolved"

@dataclass
class AlertRule:
    name: str
    expression: str  # Query to evaluate
    threshold: float
    for_duration: int  # Seconds condition must hold before firing
    labels: dict = field(default_factory=dict)

@dataclass
class AlertInstance:
    rule: AlertRule
    state: AlertState = AlertState.INACTIVE
    pending_since: Optional[float] = None
    firing_since: Optional[float] = None
    resolved_at: Optional[float] = None

class AlertEvaluator:
    def __init__(self):
        self.instances: dict[str, AlertInstance] = {}
    
    def evaluate(self, rule: AlertRule, current_value: float) -> AlertInstance:
        instance = self.instances.setdefault(
            rule.name, 
            AlertInstance(rule=rule)
        )
        now = time.time()
        threshold_exceeded = current_value > rule.threshold
        
        if instance.state == AlertState.INACTIVE:
            if threshold_exceeded:
                instance.state = AlertState.PENDING
                instance.pending_since = now
                
        elif instance.state == AlertState.PENDING:
            if not threshold_exceeded:
                instance.state = AlertState.INACTIVE
                instance.pending_since = None
            elif now - instance.pending_since >= rule.for_duration:
                instance.state = AlertState.FIRING
                instance.firing_since = now
                self._send_notification(instance, "firing")
                
        elif instance.state == AlertState.FIRING:
            if not threshold_exceeded:
                instance.state = AlertState.RESOLVED
                instance.resolved_at = now
                self._send_notification(instance, "resolved")
                
        return instance
    
    def _send_notification(self, instance: AlertInstance, status: str):
        # Route to appropriate channel based on labels
        print(f"Alert {instance.rule.name}: {status}")

The "for" duration is crucial: it prevents flapping. A brief spike shouldn't wake anyone up. The state machine ensures you get exactly one "firing" notification and one "resolved" notification per incident.
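
To see the anti-flapping behavior concretely, here is a standalone sketch (independent of the classes above) that replays a sampled value series against a threshold and only fires after the condition has held continuously for the hold duration:

```python
def firing_times(samples, threshold, for_duration):
    """samples: list of (timestamp, value). Returns the timestamps at which
    a 'firing' notification would be sent -- one per incident."""
    events = []
    pending_since = None
    firing = False
    for ts, value in samples:
        if value > threshold:
            if firing:
                continue                    # already notified this incident
            if pending_since is None:
                pending_since = ts          # breach begins: enter PENDING
            if ts - pending_since >= for_duration:
                firing = True               # held long enough: fire once
                events.append(ts)
        else:
            pending_since = None            # recovery resets the clock
            firing = False
    return events

# A 60s error-rate blip never fires; a sustained breach fires exactly once
samples = [(0, 0.02), (60, 0.005),                    # brief spike, recovers
           (120, 0.02), (180, 0.02), (240, 0.02),     # sustained breach...
           (300, 0.02), (360, 0.02), (420, 0.02)]
print(firing_times(samples, threshold=0.01, for_duration=300))  # [420]
```

Note that the hold window only resets on an actual recovery sample, so the evaluation interval must be shorter than the "for" duration or brief recoveries can be missed.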

Scaling Considerations and Trade-offs

Metrics systems scale horizontally through sharding and federation. Common strategies:

Shard by metric name hash: Distribute metrics across storage nodes based on consistent hashing of the metric name. Simple but creates hotspots if one metric has high cardinality.

Shard by label value: Route all metrics for a specific tenant, region, or service to the same shard. Enables better locality for queries.

Federation: Run independent metrics systems per cluster/region and aggregate at a global layer. Prometheus federation pulls selected metrics from downstream servers.

The biggest threat to any metrics system is cardinality explosion. Each unique combination of metric name and label values creates a new time series. If you add a user_id label to request metrics, you suddenly have millions of series instead of hundreds.

Enforce these rules:

  • Never use unbounded values as labels (user IDs, request IDs, timestamps)
  • Set cardinality limits per metric
  • Monitor your metrics system with metrics (meta, but essential)
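
The arithmetic behind the first rule is worth internalizing: series count is the product of the number of distinct values of every label on a metric. A back-of-envelope check (illustrative numbers):

```python
# Series count = product of distinct values per label
methods, paths, statuses = 5, 100, 10
bounded = methods * paths * statuses
print(bounded)             # 5000 series -- manageable

# Adding a single unbounded label multiplies everything by its cardinality
users = 1_000_000
print(bounded * users)     # 5000000000 -- five billion series
```

One careless label turns a metric that fits in memory into one that cannot fit in a data center, which is why cardinality limits belong in the ingestion path, not in a code-review checklist.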

The trade-off is always granularity versus cost. Per-second data for 90 days costs 90x more than per-minute data. Define retention based on actual query patterns—most teams never look at raw data older than a week.

Build your metrics system with these constraints in mind, and you’ll have observability that scales with your infrastructure rather than against it.
