# Design a Real-Time Analytics Dashboard
## Key Insights
- Real-time dashboards require a layered architecture where each component handles a specific concern—ingestion, processing, storage, and delivery—with clear contracts between them.
- Pre-aggregation is non-negotiable at scale; querying raw events for every dashboard refresh will bankrupt your infrastructure budget and frustrate your users.
- WebSockets provide the best developer experience for bidirectional real-time updates, but Server-Sent Events are simpler and sufficient for most dashboard scenarios.
## Introduction & Use Cases
Real-time analytics dashboards power critical decision-making across industries. DevOps teams monitor application health, trading desks track market movements, and operations centers watch IoT sensor networks. The common thread: humans need to see what’s happening now, not five minutes ago.
The architectural challenges are substantial. You’re balancing competing concerns: sub-second latency, horizontal scalability, data accuracy, and cost efficiency. A naive implementation—polling a database every second—collapses under load. A sophisticated one requires careful orchestration of streaming infrastructure, time-series storage, and efficient data delivery mechanisms.
This article walks through a production-ready architecture, with code examples you can adapt to your stack.
## Data Ingestion Layer
Every real-time system starts with getting data in. Message queues decouple producers from consumers, absorb traffic spikes, and provide replay capability when things go wrong.
Kafka remains the industry standard for high-throughput event ingestion. Your partitioning strategy directly impacts downstream parallelism and ordering guarantees.
```python
# producer.py - Event ingestion with intelligent partitioning
from confluent_kafka import Producer, Consumer
import json

class MetricsProducer:
    def __init__(self, bootstrap_servers: str):
        self.producer = Producer({
            'bootstrap.servers': bootstrap_servers,
            'linger.ms': 5,  # Batch for 5ms to improve throughput
            'batch.size': 16384,
            'compression.type': 'lz4'
        })

    def send_metric(self, metric: dict):
        # Partition by source to maintain ordering per-source
        # while distributing load across partitions
        partition_key = metric['source_id']
        self.producer.produce(
            topic='metrics-raw',
            key=partition_key.encode('utf-8'),
            value=json.dumps(metric).encode('utf-8'),
            callback=self._delivery_callback
        )

    def _delivery_callback(self, err, msg):
        if err:
            # Log and route to a dead-letter queue for retry
            print(f"Delivery failed: {err}")

# Consumer with parallel processing
def create_consumer(group_id: str, topics: list) -> Consumer:
    consumer = Consumer({
        'bootstrap.servers': 'kafka:9092',
        'group.id': group_id,
        'auto.offset.reset': 'latest',
        'enable.auto.commit': False,  # Commit manually after processing
        'max.poll.interval.ms': 300000
    })
    consumer.subscribe(topics)
    return consumer
```
Key decisions here: partitioning by source_id keeps events from the same source ordered while still allowing parallel consumption across partitions. Disabling auto-commit lets you commit offsets only after processing succeeds—at-least-once delivery by default, and a prerequisite for building exactly-once pipelines. The linger.ms setting trades a few milliseconds of latency for significantly better throughput through batching.
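The per-source ordering guarantee holds because Kafka's default partitioner hashes the message key to a partition index, so a given key always lands on the same partition. A simplified Python sketch of that mapping (Kafka actually uses murmur2; md5 here is an illustrative stand-in):

```python
import hashlib

def partition_for_key(key: str, num_partitions: int) -> int:
    # Hash the key and take it modulo the partition count --
    # the same key always maps to the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event from "sensor-42" lands on one partition,
# preserving per-source ordering while other sources spread out.
```

The corollary: adding partitions later changes the key-to-partition mapping, which is why partition counts deserve up-front planning.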
## Stream Processing & Aggregation
Raw events are useless for dashboards. Nobody wants to see individual page views—they want requests per second, 95th percentile latency, error rates. Stream processing transforms the firehose into meaningful metrics.
Windowed aggregations are the core primitive. Tumbling windows provide non-overlapping time buckets (perfect for “events per minute”). Sliding windows give smoother visualizations at the cost of more computation.
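The two assigners differ only in how they map an event timestamp to window boundaries. A hypothetical Python sketch of that assignment logic (mirroring what window assigners do internally):

```python
def tumbling_window(ts: int, size: int) -> tuple:
    # Each event falls in exactly one window: align down to a boundary
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts: int, size: int, slide: int) -> list:
    # Each event falls in size/slide overlapping windows
    last_start = ts - (ts % slide)
    return [(s, s + size) for s in range(last_start, ts - size, -slide)]

# With 10s tumbling windows, t=65s lands in [60, 70);
# with a 60s window sliding every 10s, it lands in six windows.
```

The size/slide ratio is the computation multiplier: a 1-minute window sliding every 10 seconds means every event is aggregated six times.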
```java
// Flink stream processing job with windowed aggregations
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class MetricsAggregator {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing for fault tolerance
        env.enableCheckpointing(10000);

        // kafkaProperties and watermarkStrategy are configured elsewhere;
        // event-time windows never fire without timestamps and watermarks
        DataStream<MetricEvent> events = env
            .addSource(new FlinkKafkaConsumer<>(
                "metrics-raw",
                new MetricEventSchema(),
                kafkaProperties
            ))
            .assignTimestampsAndWatermarks(watermarkStrategy);

        // Tumbling window: aggregate every 10 seconds
        DataStream<AggregatedMetric> tenSecondAggregates = events
            .keyBy(event -> event.getMetricName() + ":" + event.getSourceId())
            .window(TumblingEventTimeWindows.of(Time.seconds(10)))
            .aggregate(new MetricAggregateFunction())
            .name("10-second-aggregates");

        // Sliding window for smoother visualization:
        // 1-minute window, sliding every 10 seconds
        DataStream<AggregatedMetric> slidingAggregates = events
            .keyBy(event -> event.getMetricName())
            .window(SlidingEventTimeWindows.of(
                Time.minutes(1),
                Time.seconds(10)
            ))
            .aggregate(new PercentileAggregateFunction())
            .name("sliding-percentiles");

        // Sink to TimescaleDB
        tenSecondAggregates.addSink(new TimescaleDBSink());
        slidingAggregates.addSink(new TimescaleDBSink());

        env.execute("Metrics Aggregation Pipeline");
    }
}

// Custom aggregate function for percentiles
public class PercentileAggregateFunction
        implements AggregateFunction<MetricEvent, TDigest, AggregatedMetric> {

    @Override
    public TDigest createAccumulator() {
        return TDigest.createDigest(100);  // Compression factor
    }

    @Override
    public TDigest add(MetricEvent event, TDigest accumulator) {
        accumulator.add(event.getValue());
        return accumulator;
    }

    @Override
    public TDigest merge(TDigest a, TDigest b) {
        // Required by the AggregateFunction interface;
        // t-digests merge cheaply, which is what makes them streamable
        a.add(b);
        return a;
    }

    @Override
    public AggregatedMetric getResult(TDigest accumulator) {
        return new AggregatedMetric(
            accumulator.quantile(0.50),  // p50
            accumulator.quantile(0.95),  // p95
            accumulator.quantile(0.99)   // p99
        );
    }
}
```
The T-Digest algorithm deserves special mention—it's a probabilistic data structure that approximates percentiles with tight error bounds (especially at the tails) in fixed memory, essential for streaming scenarios where you can't hold all values.
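Exact percentiles require buffering every observation in the window. If a t-digest library isn't available, even a crude bounded-memory substitute like reservoir sampling illustrates the trade—far weaker tail accuracy than t-digest, but the same fixed-memory principle:

```python
import random

class ReservoirQuantile:
    """Bounded-memory quantile estimate via reservoir sampling (Algorithm R).
    Illustrative only: t-digest is far more accurate at the tails."""

    def __init__(self, capacity: int = 1000, seed: int = 0):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, value: float) -> None:
        # Each of the `seen` values ends up in the sample
        # with equal probability capacity/seen
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(value)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = value

    def quantile(self, q: float) -> float:
        s = sorted(self.sample)
        return s[min(int(q * len(s)), len(s) - 1)]
```

Memory stays constant no matter how many events arrive—the property that makes this family of sketches viable inside a stream processor.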
## Storage Layer Design
Time-series databases are purpose-built for this workload. They optimize for append-heavy writes, time-range queries, and automatic data lifecycle management.
TimescaleDB offers the best balance of PostgreSQL compatibility and time-series performance. Continuous aggregates move computation off the read path into scheduled background refreshes, making dashboard reads nearly instantaneous.
```sql
-- TimescaleDB schema with continuous aggregates

-- Raw metrics table (write-optimized)
CREATE TABLE metrics (
    time        TIMESTAMPTZ NOT NULL,
    source_id   TEXT NOT NULL,
    metric_name TEXT NOT NULL,
    value       DOUBLE PRECISION,
    tags        JSONB
);

-- Convert to hypertable with 1-hour chunks
SELECT create_hypertable('metrics', 'time',
    chunk_time_interval => INTERVAL '1 hour');

-- Create indexes for common query patterns
CREATE INDEX idx_metrics_source_time
    ON metrics (source_id, time DESC);
CREATE INDEX idx_metrics_name_time
    ON metrics (metric_name, time DESC);

-- Continuous aggregate for 1-minute rollups.
-- Note: ordered-set aggregates like percentile_cont need TimescaleDB 2.7+;
-- on older versions, use the Toolkit's percentile_agg instead.
CREATE MATERIALIZED VIEW metrics_1m
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 minute', time) AS bucket,
    source_id,
    metric_name,
    COUNT(*) AS count,
    AVG(value) AS avg_value,
    MIN(value) AS min_value,
    MAX(value) AS max_value,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY value) AS p95
FROM metrics
GROUP BY bucket, source_id, metric_name;

-- Refresh policy: update every 30 seconds
SELECT add_continuous_aggregate_policy('metrics_1m',
    start_offset => INTERVAL '10 minutes',
    end_offset => INTERVAL '1 minute',
    schedule_interval => INTERVAL '30 seconds');

-- Retention policy: drop raw data after 7 days
SELECT add_retention_policy('metrics', INTERVAL '7 days');
-- Keep aggregates for 90 days
SELECT add_retention_policy('metrics_1m', INTERVAL '90 days');
```
This schema implements hot/warm/cold tiering implicitly. Recent raw data serves detailed drill-downs. Aggregates serve dashboard queries. Old data disappears automatically.
## Query & API Layer
Your API layer bridges storage and frontend. GraphQL subscriptions provide an elegant model for real-time updates—clients declare what they need, and the server pushes changes.
```typescript
// GraphQL schema and resolvers with Redis caching
import { PubSub } from 'graphql-subscriptions';
import Redis from 'ioredis';
import { db } from './db';  // pg Pool (or similar) configured elsewhere

const pubsub = new PubSub();
const redis = new Redis(process.env.REDIS_URL);

const typeDefs = `
  type MetricPoint {
    timestamp: String!
    value: Float!
    p95: Float
    p99: Float
  }

  type Query {
    metrics(
      name: String!
      sourceId: String
      from: String!
      to: String!
      resolution: String!
    ): [MetricPoint!]!
  }

  type Subscription {
    metricUpdated(name: String!, sourceId: String): MetricPoint!
  }
`;

const resolvers = {
  Query: {
    metrics: async (_, { name, sourceId, from, to, resolution }) => {
      const cacheKey = `metrics:${name}:${sourceId}:${resolution}:${from}:${to}`;

      // Check cache first (TTL based on resolution)
      const cached = await redis.get(cacheKey);
      if (cached) return JSON.parse(cached);

      // Pick the aggregate table by resolution; the ternary whitelist
      // keeps the interpolated table name safe from injection
      const table = resolution === '1m' ? 'metrics_1m' :
                    resolution === '1h' ? 'metrics_1h' : 'metrics';

      const result = await db.query(`
        SELECT bucket as timestamp, avg_value as value, p95, p99
        FROM ${table}
        WHERE metric_name = $1
          AND ($2::text IS NULL OR source_id = $2)
          AND bucket >= $3 AND bucket <= $4
        ORDER BY bucket
      `, [name, sourceId, from, to]);

      // Cache with TTL proportional to data freshness
      const ttl = resolution === '1m' ? 10 : 60;
      await redis.setex(cacheKey, ttl, JSON.stringify(result.rows));
      return result.rows;
    }
  },
  Subscription: {
    metricUpdated: {
      subscribe: (_, { name, sourceId }) => {
        const channel = sourceId
          ? `metric:${name}:${sourceId}`
          : `metric:${name}`;
        return pubsub.asyncIterator(channel);
      }
    }
  }
};

// Publisher called by the stream processor
export async function publishMetricUpdate(metric: MetricPoint) {
  await pubsub.publish(`metric:${metric.name}`, { metricUpdated: metric });
  await pubsub.publish(`metric:${metric.name}:${metric.sourceId}`, {
    metricUpdated: metric
  });
}
```
Cache TTLs should match your data freshness requirements. For a 10-second aggregation window, a 10-second cache TTL makes sense—you’re not hiding stale data, just preventing redundant queries.
## Frontend Architecture & Data Push
The frontend must efficiently handle continuous updates without overwhelming the browser. React’s reconciliation handles DOM updates well, but you need to manage WebSocket connections and state carefully.
```tsx
// React dashboard component with WebSocket updates
import { useEffect, useRef, useState } from 'react';
import { LineChart, Line, XAxis, YAxis, ResponsiveContainer } from 'recharts';

interface MetricPoint {
  timestamp: string;
  value: number;
}

function useMetricStream(metricName: string, maxPoints: number = 60) {
  const [data, setData] = useState<MetricPoint[]>([]);
  const wsRef = useRef<WebSocket | null>(null);

  useEffect(() => {
    let unmounted = false;

    const connect = () => {
      const ws = new WebSocket(
        `wss://api.example.com/metrics/stream?name=${metricName}`
      );
      ws.onmessage = (event) => {
        const point: MetricPoint = JSON.parse(event.data);
        // Append the new point, trim to a rolling window of maxPoints
        setData(prev => [...prev, point].slice(-maxPoints));
      };
      ws.onclose = () => {
        // Reconnect with handlers reattached; swap the fixed delay
        // for exponential backoff in production
        if (!unmounted) setTimeout(connect, 1000);
      };
      wsRef.current = ws;
    };

    connect();
    return () => {
      unmounted = true;
      wsRef.current?.close();
    };
  }, [metricName, maxPoints]);

  return data;
}

export function MetricChart({ metricName }: { metricName: string }) {
  const data = useMetricStream(metricName);

  return (
    <ResponsiveContainer width="100%" height={300}>
      <LineChart data={data}>
        <XAxis
          dataKey="timestamp"
          tickFormatter={(t) => new Date(t).toLocaleTimeString()}
        />
        <YAxis domain={['auto', 'auto']} />
        <Line
          type="monotone"
          dataKey="value"
          stroke="#8884d8"
          dot={false}
          isAnimationActive={false}  // Disable animation for performance
        />
      </LineChart>
    </ResponsiveContainer>
  );
}
```
Disabling chart animations is crucial for real-time updates—animations queue up and create visual chaos. The rolling window approach (slice(-maxPoints)) prevents memory leaks from unbounded data accumulation.
## Scaling & Reliability Considerations
Horizontal scaling follows the data flow. Add Kafka partitions and consumer instances together. Flink scales through parallelism configuration. TimescaleDB scales reads through replicas and writes through distributed hypertables.
Backpressure is your friend. When downstream systems can’t keep up, you want explicit signals, not silent data loss. Kafka’s consumer lag metrics and Flink’s backpressure indicators should trigger alerts before users notice degradation.
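Consumer lag is simply the gap between each partition's latest offset and the group's committed offset; the check behind those lag alerts reduces to (hypothetical helpers—in practice you'd fetch offsets via your Kafka admin client):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    # Per-partition lag: how many events the group still has to process.
    # A partition with no committed offset counts as fully behind.
    return {tp: end - committed.get(tp, 0) for tp, end in end_offsets.items()}

def should_alert(lag: dict, threshold: int = 100_000) -> bool:
    # Page before users notice: sustained lag means the pipeline
    # is falling behind real time
    return max(lag.values(), default=0) > threshold
```

Alert on sustained lag growth rather than a single spike—a traffic burst that drains within a minute is the queue doing its job.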
Graceful degradation means having fallbacks. If real-time aggregation fails, serve the last known good value with a staleness indicator. If WebSocket connections drop, fall back to polling. Users prefer slightly stale data over error screens.
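The WebSocket-to-polling fallback needs jittered backoff so thousands of clients reconnecting at once don't stampede the API. A minimal full-jitter sketch (the AWS-documented scheme: each delay is uniform over an exponentially growing, capped bound):

```python
import random

def backoff_delays(base: float = 1.0, cap: float = 30.0, attempts: int = 6,
                   rng=None):
    # Full jitter: delay = uniform(0, min(cap, base * 2**attempt)),
    # so simultaneous failures spread their retries out in time
    rng = rng or random.Random()
    for attempt in range(attempts):
        yield rng.uniform(0, min(cap, base * (2 ** attempt)))
```

The cap matters as much as the growth: without it, a long outage leaves clients sleeping for minutes after the service recovers.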
Finally, monitor your monitoring. Your analytics pipeline needs its own observability—separate from the system it’s monitoring. Otherwise, you’ll discover outages when someone asks why the dashboard is blank.
The architecture described here handles millions of events per second at sub-second latency. Start simple, measure everything, and add complexity only when metrics justify it.