Prometheus: Metrics Collection and Alerting
Key Insights
- Prometheus uses a pull-based model where the server scrapes metrics from instrumented targets, making it simpler to operate than push-based systems and providing built-in service discovery
- The four metric types (Counter, Gauge, Histogram, Summary) serve distinct purposes—use Counters for cumulative values, Gauges for snapshots, Histograms for latency distributions, and Summaries for precise client-side quantiles
- Alert fatigue is real—design alerting rules around symptoms users experience rather than component failures, and use recording rules to pre-aggregate expensive queries for dashboard performance
Introduction to Prometheus Architecture
Prometheus is an open-source monitoring system built specifically for dynamic cloud environments. Unlike traditional monitoring tools that rely on agents pushing metrics to a central server, Prometheus pulls metrics from HTTP endpoints exposed by your applications and infrastructure components.
The architecture consists of four main components. The Prometheus server scrapes and stores time-series data, executing queries and evaluating alerting rules. Exporters expose metrics from third-party systems like databases, message queues, and hardware. The Pushgateway handles metrics from short-lived jobs that don’t exist long enough to be scraped. Alertmanager receives alerts from Prometheus and handles deduplication, grouping, and routing to notification channels.
The time-series database stores metrics as streams of timestamped values identified by metric names and key-value labels. This dimensional data model enables powerful querying and aggregation across multiple label dimensions without requiring schema changes.
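Concretely, each scrape target exposes series in the Prometheus text exposition format, with labels inline. A sample of what a scraped /metrics endpoint might return (values are illustrative):

```text
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 1027
http_requests_total{method="POST",endpoint="/api/order",status="500"} 3
```

Each unique combination of metric name and label values is its own time series, which is why adding a label multiplies the number of series stored.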
Setting Up Prometheus
For production deployments, run Prometheus in Docker or Kubernetes. The official Docker image requires minimal configuration:
docker run -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
For Kubernetes, use the Prometheus Operator which provides custom resources for managing Prometheus deployments, service monitors, and alert rules.
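With the Operator, scrape targets are declared as ServiceMonitor resources instead of static config. A minimal sketch (names and labels here are illustrative, not prescribed):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-servers
  labels:
    team: platform
spec:
  selector:
    matchLabels:
      app: api        # scrape Services carrying this label
  endpoints:
    - port: metrics   # named port on the Service
      interval: 15s
```

The Operator watches these resources and regenerates the Prometheus scrape configuration automatically.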
The prometheus.yml configuration file defines global settings, scrape targets, and alerting rules:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production-us-east-1'
    environment: 'prod'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Scrape configurations
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'api-servers'
    static_configs:
      - targets:
          - 'api-1.example.com:8080'
          - 'api-2.example.com:8080'
          - 'api-3.example.com:8080'
    scrape_interval: 10s
    metrics_path: '/metrics'

  - job_name: 'postgres-exporter'
    static_configs:
      - targets: ['postgres-exporter:9187']

Note that storage retention is not set in prometheus.yml—it is configured with command-line flags when starting the server:

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB
Set retention policies based on your storage capacity and query patterns. Most teams keep 15-30 days of raw metrics and use recording rules or remote storage for long-term data.
Instrumenting Applications for Metrics
Prometheus provides client libraries for Go, Java, Python, Ruby, and other languages. Expose metrics on an HTTP endpoint (typically /metrics) that Prometheus scrapes.
Here’s a Go application instrumented with the Prometheus client:
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request latency distributions",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )
)

func instrumentHandler(endpoint string, handler http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        activeConnections.Inc()
        defer activeConnections.Dec()

        handler(w, r)

        duration := time.Since(start).Seconds()
        httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
        // Simplified: records every request as a 200. In production, wrap the
        // ResponseWriter to capture the real status code.
        httpRequestsTotal.WithLabelValues(r.Method, endpoint, "200").Inc()
    }
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/api/users", instrumentHandler("/api/users", handleUsers))
    http.ListenAndServe(":8080", nil)
}

func handleUsers(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("Users endpoint"))
}
For Python applications using Flask:
from flask import Flask
from prometheus_client import (
    Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
)

app = Flask(__name__)

# Define metrics
orders_total = Counter(
    'orders_total',
    'Total number of orders',
    ['product_type', 'status']
)
order_value = Histogram(
    'order_value_dollars',
    'Order value in dollars',
    buckets=[10, 50, 100, 500, 1000, 5000]
)
inventory_items = Gauge(
    'inventory_items',
    'Current inventory count',
    ['product_id']
)

@app.route('/api/order', methods=['POST'])
def create_order():
    # Business logic
    product_type = 'electronics'
    value = 299.99

    # Update metrics
    orders_total.labels(product_type=product_type, status='completed').inc()
    order_value.observe(value)
    inventory_items.labels(product_id='12345').dec()
    return {'status': 'success'}

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
Use Counters for cumulative values that only increase (requests, errors, sales). Use Gauges for values that go up and down (memory usage, queue depth, temperature). Use Histograms for distributions you want to aggregate across instances (request latency, response size). Use Summaries when you need precise client-side quantiles and don't need cross-instance aggregation.
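The semantics can be made concrete with a tiny pure-Python sketch of how a client library accumulates each type—illustrative only, not the actual prometheus_client internals:

```python
class Counter:
    """Monotonic: only inc(); rate() over scrapes yields per-second throughput."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        assert amount >= 0, "counters never decrease"
        self.value += amount

class Gauge:
    """Snapshot: may move in either direction."""
    def __init__(self):
        self.value = 0.0
    def set(self, v): self.value = v
    def inc(self, amount=1.0): self.value += amount
    def dec(self, amount=1.0): self.value -= amount

class Histogram:
    """Cumulative buckets: each observation increments every bucket whose
    upper bound (le) is >= the observed value."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets) + [float('inf')]
        self.counts = {le: 0 for le in self.buckets}
        self.total = 0.0
    def observe(self, v):
        self.total += v
        for le in self.buckets:
            if v <= le:
                self.counts[le] += 1

h = Histogram([0.1, 0.5, 1.0])
for latency in (0.05, 0.3, 0.7, 2.0):
    h.observe(latency)
print(h.counts[0.5])           # 2 observations were <= 0.5s
print(h.counts[float('inf')])  # 4 observations total
```

The cumulative-bucket layout is what lets PromQL sum buckets across instances before computing quantiles.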
Writing PromQL Queries
PromQL (Prometheus Query Language) retrieves and transforms time-series data. Master these common patterns:
# CPU usage percentage across all instances
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Request rate per second
rate(http_requests_total[5m])
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# 95th percentile latency
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
# Memory usage by pod
sum by(pod) (container_memory_working_set_bytes{namespace="production"})
# Top 5 endpoints by request count
topk(5, sum by(endpoint) (rate(http_requests_total[1h])))
# Aggregate across multiple labels
sum without(instance, pod) (up{job="api-servers"})
The rate() function calculates per-second rate over a time window—essential for counters. Always use range vectors (e.g., [5m]) with rate(). The histogram_quantile() function computes percentiles from histogram buckets.
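histogram_quantile() finds the bucket containing the target rank and linearly interpolates inside it. A simplified pure-Python version of that calculation (a sketch, not the Prometheus implementation):

```python
def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count) pairs, sorted by
    bound and ending with float('inf'). Returns the interpolated q-quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # cannot interpolate into the +Inf bucket
            # linear interpolation within the bucket holding the target rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 60 under 0.1s, 90 under 0.5s, all under 1s
print(histogram_quantile(0.95, [(0.1, 60), (0.5, 90), (1.0, 100), (float('inf'), 100)]))  # 0.75
```

This is also why bucket boundaries matter: the reported percentile can never be more precise than the width of the bucket it falls into.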
Configuring Alerting Rules
Define alerting rules in separate YAML files referenced from prometheus.yml:
# alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          component: api
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.endpoint }}"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"

      - alert: DiskSpaceRunningOut
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
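Before loading rule files, validate them with promtool, the CLI shipped alongside Prometheus—it catches both YAML and PromQL syntax errors:

```shell
promtool check rules alerts.yml
promtool check config prometheus.yml
```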
Configure Alertmanager to route alerts to appropriate channels:
# alertmanager.yml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#monitoring'
The for clause prevents flapping alerts by requiring conditions to persist before firing. Group related alerts to reduce notification noise.
Service Discovery and Scalability
Kubernetes service discovery automatically discovers pods and services:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use a custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use a custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add namespace and pod name as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
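The port-rewrite rule above is ordinary regex substitution: Prometheus joins the source labels with ';' and applies the regex. The same transformation can be sketched in Python to see what happens to __address__ (the addresses here are hypothetical):

```python
import re

# Prometheus concatenates source_labels with ';' before matching
address, port_annotation = '10.0.3.17:9100', '8080'
joined = f'{address};{port_annotation}'

# Equivalent of regex ([^:]+)(?::\d+)?;(\d+) with replacement $1:$2
rewritten = re.sub(r'^([^:]+)(?::\d+)?;(\d+)$', r'\1:\2', joined)
print(rewritten)  # 10.0.3.17:8080
```

The optional `(?::\d+)?` group drops any port already present on the pod address before appending the annotated one.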
Annotate your pods to enable scraping:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
Best Practices and Production Considerations
Avoid high-cardinality labels like user IDs or request IDs—they explode the number of time series. As a rule of thumb, keep each label to a small, bounded set of values (tens, not thousands) wherever possible.
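The danger is multiplicative: the series count for a metric is roughly the product of each label's distinct values. A quick back-of-envelope check with illustrative numbers:

```python
from math import prod

# distinct values per label on a hypothetical http_requests_total
safe = {'method': 4, 'endpoint': 30, 'status': 8}
print(prod(safe.values()))   # 960 series -- fine

# adding a user_id label with 50,000 distinct values
risky = {**safe, 'user_id': 50_000}
print(prod(risky.values()))  # 48,000,000 series -- will overwhelm the TSDB
```

One unbounded label turns a harmless metric into millions of series, which is exactly what prometheus_tsdb_head_series (mentioned below) helps you catch early.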
Use recording rules to pre-compute expensive queries:
# recording_rules.yml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by(job) (rate(http_requests_total[5m]))

      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by(job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Recording rules reduce dashboard load times and enable faster alerting on complex queries.
Monitor Prometheus itself by scraping its own /metrics endpoint. Watch prometheus_tsdb_head_series for cardinality issues and prometheus_rule_evaluation_failures_total for broken rules.
Set up remote storage for long-term retention using Thanos, Cortex, or cloud-managed solutions. The local TSDB works well for short-term data but doesn’t scale across multiple Prometheus instances.
Prometheus excels at monitoring dynamic infrastructure with its pull-based model and powerful query language. Instrument your applications early, keep cardinality low, and design alerts around user impact rather than component metrics.