Infrastructure Monitoring: Uptime and Performance
Key Insights
- Effective infrastructure monitoring requires tracking both availability metrics (uptime, error rates) and performance metrics (response time, throughput, resource utilization) to maintain reliable services
- Implement comprehensive health checks at multiple layers—from simple HTTP endpoints to deep database connectivity probes—and configure them appropriately for different use cases (liveness vs. readiness)
- Choose monitoring tools based on your infrastructure model (push vs. pull), set intelligent alert thresholds to prevent fatigue, and build dashboards using proven methodologies like RED (Rate, Errors, Duration) or USE (Utilization, Saturation, Errors)
Introduction to Infrastructure Monitoring
Infrastructure monitoring isn’t optional anymore. When your application goes down at 3 AM, monitoring is what tells you about it before your customers flood support channels. More importantly, good monitoring catches issues before they become outages.
Uptime monitoring answers a binary question: is your service available? Performance monitoring digs deeper: how well is it working? You need both. A service that’s “up” but responding in 30 seconds is effectively down for most users.
These metrics directly feed into your Service Level Agreements (SLAs) and Service Level Objectives (SLOs). If you’ve promised 99.9% uptime, you need monitoring to prove you’re delivering it. SLOs for response time, error rates, and throughput give your team concrete targets and early warning when you’re drifting off course.
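To make an SLO concrete, it helps to translate the target percentage into an error budget. Here is a minimal sketch; the helper name `allowed_downtime_minutes` is illustrative, not from any library:

```python
# Hypothetical error-budget helper: converts an availability SLO
# into the downtime it permits over a window.

def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of downtime an SLO permits over the given window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

# 99.9% over a 30-day month -> about 43.2 minutes of budget
print(round(allowed_downtime_minutes(99.9), 2))   # 43.2
print(round(allowed_downtime_minutes(99.99), 2))  # 4.32
```

Tracking how much of this budget an incident consumed is often more actionable than the raw uptime percentage.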
Key Metrics to Track
Start with these fundamental metrics:
Uptime percentage: The proportion of time your service is available. Calculate it over meaningful windows (daily, weekly, monthly). Remember that 99.9% uptime still allows about 43 minutes of downtime per month.
Response time: How long requests take to complete. Track percentiles (p50, p95, p99), not just averages. An average of 200ms means nothing if your p99 is 10 seconds.
Throughput: Requests per second your system handles. This helps you understand capacity and spot traffic patterns.
Error rates: Percentage of requests that fail. Break this down by error type (4xx vs 5xx) to distinguish client errors from server problems.
Resource utilization: CPU, memory, disk I/O, and network usage. These predict capacity issues before they impact users.
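The point about percentiles is easy to demonstrate: a distribution that is mostly fast with a slow tail has a harmless-looking mean but a terrible p99. A small sketch with made-up latency numbers:

```python
# Illustrative only: why percentiles beat averages for latency.
import statistics

# 98 fast requests plus 2 pathologically slow ones
latencies_ms = [200] * 98 + [10_000] * 2

def percentile(data, pct):
    """Nearest-rank percentile of a list of samples."""
    ranked = sorted(data)
    index = max(0, int(round(pct / 100 * len(ranked))) - 1)
    return ranked[index]

print(statistics.mean(latencies_ms))  # 396.0 -- looks acceptable
print(percentile(latencies_ms, 50))   # 200   -- median is fine
print(percentile(latencies_ms, 99))   # 10000 -- the tail users actually feel
```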
Here’s a Python script using psutil to collect system metrics:
import psutil
import time
import json


def collect_system_metrics():
    """Collect and return system metrics as a dictionary."""
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()
    disk = psutil.disk_usage('/')
    network = psutil.net_io_counters()

    metrics = {
        'timestamp': int(time.time()),
        'cpu_percent': cpu_percent,
        'memory_percent': memory.percent,
        'memory_available_mb': memory.available / (1024 * 1024),
        'disk_percent': disk.percent,
        'disk_free_gb': disk.free / (1024 * 1024 * 1024),
        'network_bytes_sent': network.bytes_sent,
        'network_bytes_recv': network.bytes_recv
    }
    return metrics


def monitor_loop(interval=60):
    """Continuously collect metrics at specified interval."""
    while True:
        metrics = collect_system_metrics()
        print(json.dumps(metrics, indent=2))
        # Send to your monitoring backend here
        # send_to_monitoring_system(metrics)
        time.sleep(interval)


if __name__ == '__main__':
    monitor_loop(interval=60)
Health Check Implementations
Health checks are your application’s way of saying “I’m okay” or “something’s wrong.” Implement multiple types:
Liveness probes: Is the application running? A failing liveness probe should trigger a restart.
Readiness probes: Is the application ready to serve traffic? Failing readiness removes the instance from load balancers without killing it.
Startup probes: Special case for slow-starting applications. Prevents premature liveness failures during initialization.
Here’s a robust Express.js health endpoint:
const express = require('express');
const { Pool } = require('pg');

const app = express();
const pool = new Pool({
  connectionString: process.env.DATABASE_URL
});

// Basic liveness check
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Comprehensive readiness check
app.get('/health/ready', async (req, res) => {
  const checks = {
    database: false,
    memory: false
  };

  // Check database connectivity
  try {
    const client = await pool.connect();
    await client.query('SELECT 1');
    client.release();
    checks.database = true;
  } catch (err) {
    console.error('Database health check failed:', err);
  }

  // Check memory usage
  const memUsage = process.memoryUsage();
  const memPercent = (memUsage.heapUsed / memUsage.heapTotal) * 100;
  checks.memory = memPercent < 90;

  const isHealthy = Object.values(checks).every(check => check === true);
  const status = isHealthy ? 200 : 503;

  res.status(status).json({
    status: isHealthy ? 'ready' : 'not ready',
    checks
  });
});

app.listen(3000);
Configure these in Kubernetes:
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: myapp
    image: myapp:latest
    livenessProbe:
      httpGet:
        path: /health/live
        port: 3000
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 2
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 3000
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 2
      failureThreshold: 2
    startupProbe:
      httpGet:
        path: /health/live
        port: 3000
      initialDelaySeconds: 0
      periodSeconds: 5
      failureThreshold: 30
Monitoring Tools and Integration
Prometheus + Grafana is the de facto standard for Kubernetes environments. Prometheus uses a pull model, scraping metrics from your applications. It’s open source and highly flexible.
Datadog and New Relic are commercial SaaS solutions with powerful features and less operational overhead. They use push models where agents send metrics to their cloud.
CloudWatch is the natural choice for AWS infrastructure, tightly integrated with all AWS services.
Choose based on your infrastructure, budget, and team expertise. For most teams, I recommend starting with Prometheus if you’re on Kubernetes, or CloudWatch if you’re AWS-native.
Here’s how to expose Prometheus metrics in Go:
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

func main() {
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
For AWS CloudWatch with Python:
import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')


def put_custom_metric(metric_name, value, unit='Count'):
    """Send a custom metric to CloudWatch."""
    cloudwatch.put_metric_data(
        Namespace='MyApplication',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                # Use a timezone-aware timestamp (datetime.utcnow() is deprecated)
                'Timestamp': datetime.now(timezone.utc),
                'Dimensions': [
                    {
                        'Name': 'Environment',
                        'Value': 'production'
                    }
                ]
            }
        ]
    )


# Example usage
put_custom_metric('OrdersProcessed', 150, 'Count')
put_custom_metric('ResponseTime', 0.245, 'Seconds')
Alerting Strategies
Alerts wake people up. Make them count. Bad alerts train your team to ignore monitoring, which defeats the entire purpose.
Set thresholds based on actual impact. Alert when users are affected or about to be affected, not on arbitrary metric values. If CPU hits 80%, that’s interesting. If response time degrades because of it, that’s an alert.
Prevent alert fatigue by:
- Using multiple thresholds (warning vs. critical)
- Requiring sustained conditions, not momentary spikes
- Grouping related alerts
- Implementing quiet periods for known maintenance
Here’s a Prometheus alerting rule (evaluated by Prometheus and routed through Alertmanager):
groups:
- name: performance_alerts
  interval: 30s
  rules:
  - alert: HighResponseTime
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High response time detected"
      description: "95th percentile response time is {{ $value }}s (threshold: 1.0s)"
  - alert: CriticalResponseTime
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 3.0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Critical response time detected"
      description: "95th percentile response time is {{ $value }}s (threshold: 3.0s)"
Dashboards and Visualization
Dashboards should answer questions, not just display data. Use the RED method for request-driven services:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Time per request
For resource monitoring, use the USE method:
- Utilization: Percentage of time the resource is busy
- Saturation: Queue depth or wait time
- Errors: Error count
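The RED numbers above are simple to derive from raw request records. A minimal sketch, where the record format and field names are illustrative assumptions rather than any standard:

```python
# Hypothetical batch of request records collected over a fixed window.
requests = [
    {"status": 200, "duration_s": 0.12},
    {"status": 200, "duration_s": 0.30},
    {"status": 500, "duration_s": 1.80},
    {"status": 200, "duration_s": 0.18},
]
window_s = 2.0  # seconds the records were collected over

# Rate: requests per second
rate = len(requests) / window_s
# Errors: failed (5xx) requests per second
errors = sum(r["status"] >= 500 for r in requests) / window_s
# Duration: mean time per request (real dashboards use percentiles)
duration = sum(r["duration_s"] for r in requests) / len(requests)

print(rate, errors, round(duration, 2))  # 2.0 0.5 0.6
```

In production these come from your metrics backend (e.g. the Prometheus counters and histograms shown earlier), not from in-process lists, but the arithmetic is the same.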
Keep dashboards focused. One dashboard per service or component. Put the most critical metrics at the top.
Here’s a basic Grafana dashboard configuration:
{
  "dashboard": {
    "title": "API Performance",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{endpoint}}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Response Time (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
Incident Response and Post-Mortems
When incidents happen, monitoring data is your primary investigative tool. Good monitoring tells you what broke and when. Great monitoring tells you why.
During incidents, focus on restoration first, investigation second. Use monitoring to identify what changed and what’s currently broken. Time-series data shows you the exact moment things went wrong.
After incidents, conduct blameless post-mortems. Review monitoring data to build a timeline. Identify gaps in monitoring that delayed detection or diagnosis. Add new metrics and alerts based on what you learned.
The goal isn’t perfect monitoring—it’s continuous improvement. Each incident teaches you what matters and what doesn’t. Adjust your monitoring accordingly.
Infrastructure monitoring is an ongoing practice, not a one-time setup. Start with the basics, instrument thoroughly, alert intelligently, and iterate based on real incidents. Your future self (and your on-call team) will thank you.