Design a Chat Application: Real-Time Messaging System


Key Insights

  • WebSockets are essential for real-time chat, but the real complexity lies in horizontal scaling—use a pub/sub layer like Redis to coordinate messages across multiple server instances.
  • Choose your fanout strategy based on group size: fan-out on write works for small groups (under 500 members), while fan-out on read is necessary for large channels to avoid write amplification.
  • Message ordering and delivery guarantees require careful schema design—use composite keys with conversation ID and timestamp, and implement idempotency keys to handle network retries gracefully.

Introduction & Requirements Analysis

Building a chat application seems straightforward until you hit scale. What starts as a simple “send message, receive message” flow quickly becomes a distributed systems challenge involving real-time delivery, message persistence, presence tracking, and graceful degradation.

Let’s establish our requirements upfront.

Functional Requirements:

  • One-to-one direct messaging
  • Group chats (up to 500 members for standard groups, unlimited for broadcast channels)
  • Online/offline presence indicators
  • Typing indicators
  • Message history with pagination
  • Read receipts
  • Offline message sync

Non-Functional Requirements:

  • Sub-200ms message delivery latency
  • 99.9% availability
  • Message ordering within conversations
  • At-least-once delivery guarantee
  • Support for 10 million daily active users
  • 1 billion messages per day at peak

These numbers inform every architectural decision. At 1 billion messages daily, that’s roughly 12,000 messages per second sustained, with peaks potentially 3-5x higher.
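
Making that back-of-envelope arithmetic explicit (the 3x factor below is the low end of the stated peak range):

```python
# Back-of-envelope throughput estimate from the stated requirements
DAILY_MESSAGES = 1_000_000_000
SECONDS_PER_DAY = 86_400

sustained_rps = DAILY_MESSAGES / SECONDS_PER_DAY  # ~11,574 msg/s
peak_rps = sustained_rps * 3                      # ~34,722 msg/s at a 3x peak

print(f"sustained: {sustained_rps:,.0f} msg/s, peak: {peak_rps:,.0f} msg/s")
```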

High-Level Architecture

The architecture separates concerns into distinct services that can scale independently:

┌─────────────┐     ┌─────────────────┐     ┌──────────────────┐
│   Clients   │────▶│   API Gateway   │────▶│   Auth Service   │
└─────────────┘     └─────────────────┘     └──────────────────┘
       │                    │
       │              ┌─────┴─────┐
       ▼              ▼           ▼
┌─────────────┐  ┌─────────┐  ┌─────────────┐
│  WebSocket  │  │  Chat   │  │  Presence   │
│   Servers   │  │ Service │  │   Service   │
└─────────────┘  └─────────┘  └─────────────┘
       │              │              │
       └──────────────┼──────────────┘
              ┌───────────────┐
               │ Redis Pub/Sub │
              └───────────────┘
       ┌──────────────┼──────────────┐
       ▼              ▼              ▼
┌─────────────┐ ┌───────────┐ ┌─────────────┐
│  Cassandra  │ │   Redis   │ │    Kafka    │
│  (Messages) │ │  (Cache)  │ │   (Queue)   │
└─────────────┘ └───────────┘ └─────────────┘

  • API Gateway handles HTTP requests for non-real-time operations: fetching message history, managing groups, updating profiles.
  • WebSocket Servers maintain persistent connections for real-time message delivery.
  • Chat Service processes message sending, storage, and fanout logic.
  • Presence Service tracks online status and typing indicators.
  • Redis Pub/Sub coordinates messages across WebSocket server instances.

Real-Time Communication Layer

For real-time messaging, WebSockets are the clear choice. Server-Sent Events (SSE) only support server-to-client communication, requiring a separate HTTP channel for sending messages. Long polling wastes resources with constant connection cycling. WebSockets provide full-duplex communication with minimal overhead.

The challenge is horizontal scaling. When User A connects to Server 1 and User B connects to Server 2, how does A’s message reach B? You need a pub/sub coordination layer.

// WebSocket server with Redis pub/sub coordination
const WebSocket = require('ws');
const Redis = require('ioredis');

const publisher = new Redis(process.env.REDIS_URL);
const subscriber = new Redis(process.env.REDIS_URL);

// Map of userId -> WebSocket connection (local to this server).
// One connection per user here; multi-device support would need a Set per user.
const connections = new Map();

const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', async (ws, req) => {
  const userId = await authenticateConnection(req);
  connections.set(userId, ws);
  
  // Subscribe to user's personal channel
  await subscriber.subscribe(`user:${userId}`);
  
  ws.on('message', async (data) => {
    let message;
    try {
      message = JSON.parse(data);
    } catch {
      return; // ignore malformed frames rather than crashing the handler
    }

    if (message.type === 'chat') {
      await handleChatMessage(userId, message);
    } else if (message.type === 'typing') {
      await handleTypingIndicator(userId, message);
    }
  });
  
  ws.on('close', () => {
    connections.delete(userId);
    subscriber.unsubscribe(`user:${userId}`);
    updatePresence(userId, 'offline');
  });
  
  // Heartbeat for connection health
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; });
});

// Heartbeat interval to detect dead connections
setInterval(() => {
  wss.clients.forEach((ws) => {
    if (!ws.isAlive) return ws.terminate();
    ws.isAlive = false;
    ws.ping();
  });
}, 30000);

// Handle incoming pub/sub messages
subscriber.on('message', (channel, message) => {
  const userId = channel.split(':')[1];
  const ws = connections.get(userId);
  if (ws && ws.readyState === WebSocket.OPEN) {
    ws.send(message);
  }
});

async function handleChatMessage(senderId, message) {
  const { conversationId, content, recipientIds } = message;
  
  // Persist message first
  const savedMessage = await chatService.saveMessage({
    id: generateMessageId(),
    conversationId,
    senderId,
    content,
    timestamp: Date.now(),
    idempotencyKey: message.idempotencyKey
  });
  
  // Fan out to all recipients via Redis pub/sub
  for (const recipientId of recipientIds) {
    await publisher.publish(`user:${recipientId}`, JSON.stringify({
      type: 'new_message',
      message: savedMessage
    }));
  }
}

The key insight: each WebSocket server only knows about its local connections. Redis pub/sub broadcasts messages to all servers, and each server delivers to any locally-connected recipients.

Message Storage & Delivery

Message storage needs to handle high write throughput while supporting efficient reads by conversation. Cassandra excels here due to its write-optimized architecture and flexible partitioning.

-- Cassandra schema for messages
CREATE TABLE messages (
    conversation_id UUID,
    message_id TIMEUUID,
    sender_id UUID,
    content TEXT,
    content_type TEXT,  -- 'text', 'image', 'file'
    created_at TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Index for fetching a user's conversations, most recently active first.
-- Note: last_message_at is a clustering column, so bumping it on a new
-- message requires deleting the old row and inserting a fresh one.
CREATE TABLE user_conversations (
    user_id UUID,
    last_message_at TIMESTAMP,
    conversation_id UUID,
    last_message_preview TEXT,
    unread_count INT,
    PRIMARY KEY (user_id, last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC);

The messages table partitions by conversation_id, ensuring all messages in a conversation live on the same node for efficient range queries. The clustering order by message_id (a TIMEUUID) provides automatic chronological ordering.
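
This schema makes history pagination a keyset (cursor) query: fetch messages older than the last one seen. A minimal sketch of the pattern, with an in-memory list standing in for the Cassandra partition and integer ids standing in for TIMEUUIDs:

```python
# Keyset pagination, mirroring:
#   SELECT * FROM messages WHERE conversation_id = ? AND message_id < ? LIMIT ?
def fetch_page(messages, before_id=None, limit=50):
    # messages: one conversation's rows, sorted by id descending (newest first)
    page = [m for m in messages if before_id is None or m["id"] < before_id]
    page = page[:limit]
    # A full page means there may be older messages; hand back a cursor
    next_cursor = page[-1]["id"] if len(page) == limit else None
    return page, next_cursor

# Usage: walk backward through history two messages at a time
history = [{"id": i, "content": f"msg {i}"} for i in range(10, 0, -1)]
page1, cursor = fetch_page(history, limit=2)                    # ids 10, 9
page2, cursor = fetch_page(history, before_id=cursor, limit=2)  # ids 8, 7
```

Unlike offset pagination, the cursor stays stable while new messages arrive at the head of the conversation.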

For reliable delivery, use Kafka as a durable message queue:

# Message processing pipeline
from kafka import KafkaProducer, KafkaConsumer
import json

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    acks='all',  # Wait for all replicas
    retries=3
)

def send_message(message):
    # Publish to Kafka for reliable processing
    producer.send(
        'chat-messages',
        key=message['conversation_id'].encode(),  # Partition by conversation
        value=message
    )

# Consumer for message processing
consumer = KafkaConsumer(
    'chat-messages',
    bootstrap_servers=['kafka:9092'],
    group_id='message-processors',
    enable_auto_commit=False
)

for record in consumer:
    message = json.loads(record.value)
    
    # Idempotency check
    if not is_duplicate(message['idempotency_key']):
        save_to_cassandra(message)
        fanout_to_recipients(message)
        mark_processed(message['idempotency_key'])
    
    consumer.commit()

Kafka’s partitioning by conversation ID ensures message ordering within conversations while allowing parallel processing across different conversations.

Group Chat & Fanout Strategies

The fanout strategy dramatically impacts system performance. Fan-out on write turns every message into one write per member: a 500-member group costs 500 writes per message, and a 10,000-member channel costs 10,000. For large, active groups that write amplification becomes prohibitive, which is why the strategy switches on group size.

# Fanout strategy selection
async def fanout_message(message, group):
    member_count = len(group.members)
    
    if member_count <= 500:
        # Fan-out on write: push to each member's inbox
        await fanout_on_write(message, group.members)
    else:
        # Fan-out on read: store once, members pull
        await fanout_on_read(message, group.id)

async def fanout_on_write(message, members):
    """
    Best for small groups. Each member gets the message
    pushed to their personal queue immediately.
    Pros: Fast reads, simple client logic
    Cons: Write amplification, storage duplication
    """
    payload = json.dumps(message)  # serialize once, not per member
    pipeline = redis.pipeline()
    for member_id in members:
        pipeline.publish(f'user:{member_id}', payload)
        pipeline.lpush(f'inbox:{member_id}', payload)
    await pipeline.execute()

async def fanout_on_read(message, group_id):
    """
    Best for large channels. Message stored once,
    clients fetch from group timeline.
    Pros: Efficient writes, no duplication
    Cons: More complex reads, potential hotspots
    """
    await redis.lpush(f'group:{group_id}:messages', json.dumps(message))
    
    # Notify online members that new message exists
    online_members = await get_online_members(group_id)
    for member_id in online_members:
        await redis.publish(f'user:{member_id}', json.dumps({
            'type': 'group_update',
            'group_id': group_id
        }))

For large channels (think Slack’s #general with 10,000 members), fan-out on read is essential. The “thundering herd” problem—thousands of clients simultaneously fetching after a notification—requires rate limiting and caching at the edge.
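
One simple edge mitigation is a short-TTL cache in front of the group timeline, so a burst of fetches triggered by a single notification hits storage only once. A sketch with an in-memory cache standing in for Redis; `load_timeline` is a hypothetical storage read:

```python
import time

_cache = {}       # group_id -> (expires_at, timeline); Redis would play this role
CACHE_TTL = 2.0   # seconds; illustrative value

def fetch_timeline(group_id, load_timeline, now=time.monotonic):
    """Serve the group timeline from cache while fresh, else reload.
    A burst of reads within the TTL reaches storage only once."""
    entry = _cache.get(group_id)
    if entry and entry[0] > now():
        return entry[1]
    timeline = load_timeline(group_id)  # the expensive storage read
    _cache[group_id] = (now() + CACHE_TTL, timeline)
    return timeline
```

A per-key lock (or a Redis SETNX guard) around the reload path keeps the first cache miss itself from stampeding storage, and client-side jitter before fetching helps further.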

Presence & Typing Indicators

Presence tracking uses Redis with TTL-based expiration. Users send periodic heartbeats; absence of heartbeats means offline.

# Presence service
from redis.asyncio import Redis  # async client (redis-py >= 4.2), so calls can be awaited

redis = Redis()
PRESENCE_TTL = 60  # seconds
HEARTBEAT_INTERVAL = 30

async def handle_heartbeat(user_id: str):
    """Update user's online status with TTL"""
    await redis.setex(f'presence:{user_id}', PRESENCE_TTL, 'online')
    
async def get_presence(user_ids: list[str]) -> dict:
    """Batch fetch presence for multiple users"""
    pipeline = redis.pipeline()
    for user_id in user_ids:
        pipeline.get(f'presence:{user_id}')
    results = await pipeline.execute()
    
    return {
        user_id: 'online' if result else 'offline'
        for user_id, result in zip(user_ids, results)
    }

async def handle_typing(user_id: str, conversation_id: str):
    """Typing indicator with short TTL and debouncing"""
    key = f'typing:{conversation_id}:{user_id}'
    
    # Only publish if not already typing (debounce)
    if not await redis.exists(key):
        await notify_conversation_members(conversation_id, {
            'type': 'typing_start',
            'user_id': user_id
        })
    
    # Reset TTL
    await redis.setex(key, 3, '1')  # 3 second TTL

Typing indicators are ephemeral—they don’t need persistence or guaranteed delivery. A 3-second TTL automatically clears stale typing states without explicit “stopped typing” messages.

Scaling & Reliability Considerations

Sharding Strategy: Shard by conversation ID for message storage (keeps conversation data co-located) and by user ID for user data and presence (keeps user’s data together).
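
The routing itself can be a stable hash of the shard key. A sketch (shard counts are illustrative; production systems typically layer consistent hashing on top so resharding moves less data):

```python
import zlib

NUM_MESSAGE_SHARDS = 64  # illustrative
NUM_USER_SHARDS = 32     # illustrative

def shard_for_conversation(conversation_id: str) -> int:
    # Stable hash: the same conversation always routes to the same shard
    return zlib.crc32(conversation_id.encode()) % NUM_MESSAGE_SHARDS

def shard_for_user(user_id: str) -> int:
    return zlib.crc32(user_id.encode()) % NUM_USER_SHARDS
```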

Caching: Cache recent messages per conversation, group membership lists, and user profiles. Invalidate on writes using cache-aside pattern.
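
Cache-aside here means: read from cache, fall back to storage on a miss and populate, and delete (rather than update) the cached entry on writes. A sketch with a dict standing in for Redis:

```python
cache = {}  # stand-in for Redis

def get_profile(user_id, load_from_db):
    """Cache-aside read: cache first, then storage, then populate."""
    if user_id in cache:
        return cache[user_id]
    profile = load_from_db(user_id)
    cache[user_id] = profile
    return profile

def update_profile(user_id, profile, save_to_db):
    """Write path: persist, then invalidate so the next read repopulates."""
    save_to_db(user_id, profile)
    cache.pop(user_id, None)  # delete, don't update: avoids stale-write races
```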

Geographic Distribution: Deploy WebSocket servers in multiple regions. Route users to nearest region, but ensure messages can cross regions via the Kafka backbone.

Message Deduplication: Clients should send idempotency keys with each message. The server checks against a short-lived cache (Redis with 24-hour TTL) before processing.
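
In Redis this check is a single atomic SET with the NX and EX flags. A sketch of the same logic with an in-memory map standing in for Redis:

```python
import time

_seen = {}  # idempotency_key -> expiry timestamp; stand-in for Redis
DEDUP_TTL = 24 * 60 * 60  # 24-hour window

def claim_idempotency_key(key, now=time.time):
    """Return True the first time a key is seen within the TTL window.
    In Redis this is one atomic call: SET key 1 NX EX 86400."""
    expiry = _seen.get(key)
    if expiry and expiry > now():
        return False  # duplicate: this message was already processed
    _seen[key] = now() + DEDUP_TTL
    return True
```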

End-to-End Encryption: For E2E encryption, the server never sees plaintext. Clients exchange keys via the server, but encryption/decryption happens client-side. This complicates features like server-side search but is essential for privacy-focused applications.

The architecture outlined here handles the stated requirements while leaving room for optimization. Start with the simpler approaches (fan-out on write, single-region deployment) and add complexity only when metrics demand it. Premature optimization in distributed systems often creates more problems than it solves.
