Design a Chat Application: Real-Time Messaging System


Key Insights

  • WebSockets are essential for real-time chat, but the real complexity lies in horizontal scaling—use a pub/sub layer like Redis to coordinate messages across multiple server instances.
  • Choose your fanout strategy based on group size: fan-out on write works for small groups (under 500 members), while fan-out on read is necessary for large channels to avoid write amplification.
  • Message ordering and delivery guarantees require careful schema design—use composite keys with conversation ID and timestamp, and implement idempotency keys to handle network retries gracefully.

Introduction & Requirements Analysis

Building a chat application seems straightforward until you hit scale. What starts as a simple “send message, receive message” flow quickly becomes a distributed systems challenge involving real-time delivery, message persistence, presence tracking, and graceful degradation.

Let’s establish our requirements upfront.

Functional Requirements:

  • One-to-one direct messaging
  • Group chats (up to 500 members for standard groups, unlimited for broadcast channels)
  • Online/offline presence indicators
  • Typing indicators
  • Message history with pagination
  • Read receipts
  • Offline message sync

Non-Functional Requirements:

  • Sub-200ms message delivery latency
  • 99.9% availability
  • Message ordering within conversations
  • At-least-once delivery guarantee
  • Support for 10 million daily active users
  • 1 billion messages per day at peak

These numbers inform every architectural decision. At 1 billion messages daily, that’s roughly 12,000 messages per second sustained, with peaks potentially 3-5x higher.
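
Making that back-of-envelope arithmetic explicit (the 3x factor below is the low end of the stated peak range):

```python
# Back-of-envelope throughput estimate from the stated requirements
DAILY_MESSAGES = 1_000_000_000
SECONDS_PER_DAY = 86_400

sustained_rps = DAILY_MESSAGES / SECONDS_PER_DAY  # ~11,574 msg/s
peak_rps = sustained_rps * 3                      # ~34,722 msg/s at a 3x peak

print(f"sustained: {sustained_rps:,.0f} msg/s, peak: {peak_rps:,.0f} msg/s")
```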

High-Level Architecture

The architecture separates concerns into distinct services that can scale independently:

┌─────────────┐     ┌─────────────────┐     ┌──────────────────┐
│   Clients   │────▶│   API Gateway   │────▶│   Auth Service   │
└─────────────┘     └─────────────────┘     └──────────────────┘
       │                    │
       │              ┌─────┴─────┐
       ▼              ▼           ▼
┌─────────────┐  ┌─────────┐  ┌─────────────┐
│  WebSocket  │  │  Chat   │  │  Presence   │
│   Servers   │  │ Service │  │   Service   │
└─────────────┘  └─────────┘  └─────────────┘
       │              │              │
       └──────────────┼──────────────┘
              ┌───────────────┐
               │ Redis Pub/Sub │
              └───────────────┘
       ┌──────────────┼──────────────┐
       ▼              ▼              ▼
┌─────────────┐ ┌───────────┐ ┌─────────────┐
│  Cassandra  │ │   Redis   │ │    Kafka    │
│  (Messages) │ │  (Cache)  │ │   (Queue)   │
└─────────────┘ └───────────┘ └─────────────┘

  • API Gateway handles HTTP requests for non-real-time operations: fetching message history, managing groups, updating profiles.
  • WebSocket Servers maintain persistent connections for real-time message delivery.
  • Chat Service processes message sending, storage, and fanout logic.
  • Presence Service tracks online status and typing indicators.
  • Redis Pub/Sub coordinates messages across WebSocket server instances.

Real-Time Communication Layer

For real-time messaging, WebSockets are the clear choice. Server-Sent Events (SSE) only support server-to-client communication, requiring a separate HTTP channel for sending messages. Long polling wastes resources with constant connection cycling. WebSockets provide full-duplex communication with minimal overhead.

The challenge is horizontal scaling. When User A connects to Server 1 and User B connects to Server 2, how does A’s message reach B? You need a pub/sub coordination layer.

// WebSocket server with Redis pub/sub coordination
const WebSocket = require('ws');
const Redis = require('ioredis');

const publisher = new Redis(process.env.REDIS_URL);
const subscriber = new Redis(process.env.REDIS_URL);

// Map of userId -> WebSocket connection (local to this server).
// One connection per user here; multi-device support would need a Set per user.
const connections = new Map();

const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', async (ws, req) => {
  const userId = await authenticateConnection(req);
  connections.set(userId, ws);
  
  // Subscribe to user's personal channel
  await subscriber.subscribe(`user:${userId}`);
  
  ws.on('message', async (data) => {
    let message;
    try {
      message = JSON.parse(data);
    } catch {
      return; // ignore malformed frames rather than crashing the handler
    }

    if (message.type === 'chat') {
      await handleChatMessage(userId, message);
    } else if (message.type === 'typing') {
      await handleTypingIndicator(userId, message);
    }
  });
  
  ws.on('close', () => {
    connections.delete(userId);
    subscriber.unsubscribe(`user:${userId}`);
    updatePresence(userId, 'offline');
  });
  
  // Heartbeat for connection health
  ws.isAlive = true;
  ws.on('pong', () => { ws.isAlive = true; });
});

// Heartbeat interval to detect dead connections
setInterval(() => {
  wss.clients.forEach((ws) => {
    if (!ws.isAlive) return ws.terminate();
    ws.isAlive = false;
    ws.ping();
  });
}, 30000);

// Handle incoming pub/sub messages
subscriber.on('message', (channel, message) => {
  const userId = channel.split(':')[1];
  const ws = connections.get(userId);
  if (ws && ws.readyState === WebSocket.OPEN) {
    ws.send(message);
  }
});

async function handleChatMessage(senderId, message) {
  const { conversationId, content, recipientIds } = message;
  
  // Persist message first
  const savedMessage = await chatService.saveMessage({
    id: generateMessageId(),
    conversationId,
    senderId,
    content,
    timestamp: Date.now(),
    idempotencyKey: message.idempotencyKey
  });
  
  // Fan out to all recipients via Redis pub/sub
  for (const recipientId of recipientIds) {
    await publisher.publish(`user:${recipientId}`, JSON.stringify({
      type: 'new_message',
      message: savedMessage
    }));
  }
}

The key insight: each WebSocket server only knows about its local connections. Redis pub/sub broadcasts messages to all servers, and each server delivers to any locally-connected recipients.

Message Storage & Delivery

Message storage needs to handle high write throughput while supporting efficient reads by conversation. Cassandra excels here due to its write-optimized architecture and flexible partitioning.

-- Cassandra schema for messages
CREATE TABLE messages (
    conversation_id UUID,
    message_id TIMEUUID,
    sender_id UUID,
    content TEXT,
    content_type TEXT,  -- 'text', 'image', 'file'
    created_at TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- Index for fetching a user's conversations, most recently active first.
-- Note: last_message_at is a clustering column, so bumping it on a new
-- message requires deleting the old row and inserting a fresh one.
CREATE TABLE user_conversations (
    user_id UUID,
    last_message_at TIMESTAMP,
    conversation_id UUID,
    last_message_preview TEXT,
    unread_count INT,
    PRIMARY KEY (user_id, last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC);

The messages table partitions by conversation_id, ensuring all messages in a conversation live on the same node for efficient range queries. The clustering order by message_id (a TIMEUUID) provides automatic chronological ordering.
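
This schema makes history pagination a keyset (cursor) query: fetch messages older than the last one seen. A minimal sketch of the pattern, with an in-memory list standing in for the Cassandra partition and integer ids standing in for TIMEUUIDs:

```python
# Keyset pagination, mirroring:
#   SELECT * FROM messages WHERE conversation_id = ? AND message_id < ? LIMIT ?
def fetch_page(messages, before_id=None, limit=50):
    # messages: one conversation's rows, sorted by id descending (newest first)
    page = [m for m in messages if before_id is None or m["id"] < before_id]
    page = page[:limit]
    # A full page means there may be older messages; hand back a cursor
    next_cursor = page[-1]["id"] if len(page) == limit else None
    return page, next_cursor

# Usage: walk backward through history two messages at a time
history = [{"id": i, "content": f"msg {i}"} for i in range(10, 0, -1)]
page1, cursor = fetch_page(history, limit=2)                    # ids 10, 9
page2, cursor = fetch_page(history, before_id=cursor, limit=2)  # ids 8, 7
```

Unlike offset pagination, the cursor stays stable while new messages arrive at the head of the conversation.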

For reliable delivery, use Kafka as a durable message queue:

# Message processing pipeline
from kafka import KafkaProducer, KafkaConsumer
import json

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    acks='all',  # Wait for all replicas
    retries=3
)

def send_message(message):
    # Publish to Kafka for reliable processing
    producer.send(
        'chat-messages',
        key=message['conversation_id'].encode(),  # Partition by conversation
        value=message
    )

# Consumer for message processing
consumer = KafkaConsumer(
    'chat-messages',
    bootstrap_servers=['kafka:9092'],
    group_id='message-processors',
    enable_auto_commit=False
)

for record in consumer:
    message = json.loads(record.value)
    
    # Idempotency check
    if not is_duplicate(message['idempotency_key']):
        save_to_cassandra(message)
        fanout_to_recipients(message)
        mark_processed(message['idempotency_key'])
    
    consumer.commit()

Kafka’s partitioning by conversation ID ensures message ordering within conversations while allowing parallel processing across different conversations.

Group Chat & Fanout Strategies

The fanout strategy dramatically impacts system performance. Fan-out on write turns every message into one write per member: a 500-member group costs 500 writes per message, and a 10,000-member channel costs 10,000. For large, active groups that write amplification becomes prohibitive, which is why the strategy switches on group size.

# Fanout strategy selection
async def fanout_message(message, group):
    member_count = len(group.members)
    
    if member_count <= 500:
        # Fan-out on write: push to each member's inbox
        await fanout_on_write(message, group.members)
    else:
        # Fan-out on read: store once, members pull
        await fanout_on_read(message, group.id)

async def fanout_on_write(message, members):
    """
    Best for small groups. Each member gets the message
    pushed to their personal queue immediately.
    Pros: Fast reads, simple client logic
    Cons: Write amplification, storage duplication
    """
    payload = json.dumps(message)  # serialize once, not per member
    pipeline = redis.pipeline()
    for member_id in members:
        pipeline.publish(f'user:{member_id}', payload)
        pipeline.lpush(f'inbox:{member_id}', payload)
    await pipeline.execute()

async def fanout_on_read(message, group_id):
    """
    Best for large channels. Message stored once,
    clients fetch from group timeline.
    Pros: Efficient writes, no duplication
    Cons: More complex reads, potential hotspots
    """
    await redis.lpush(f'group:{group_id}:messages', json.dumps(message))
    
    # Notify online members that new message exists
    online_members = await get_online_members(group_id)
    for member_id in online_members:
        await redis.publish(f'user:{member_id}', json.dumps({
            'type': 'group_update',
            'group_id': group_id
        }))

For large channels (think Slack’s #general with 10,000 members), fan-out on read is essential. The “thundering herd” problem—thousands of clients simultaneously fetching after a notification—requires rate limiting and caching at the edge.
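
One simple edge mitigation is a short-TTL cache in front of the group timeline, so a burst of fetches triggered by a single notification hits storage only once. A sketch with an in-memory cache standing in for Redis; `load_timeline` is a hypothetical storage read:

```python
import time

_cache = {}       # group_id -> (expires_at, timeline); Redis would play this role
CACHE_TTL = 2.0   # seconds; illustrative value

def fetch_timeline(group_id, load_timeline, now=time.monotonic):
    """Serve the group timeline from cache while fresh, else reload.
    A burst of reads within the TTL reaches storage only once."""
    entry = _cache.get(group_id)
    if entry and entry[0] > now():
        return entry[1]
    timeline = load_timeline(group_id)  # the expensive storage read
    _cache[group_id] = (now() + CACHE_TTL, timeline)
    return timeline
```

A per-key lock (or a Redis SETNX guard) around the reload path keeps the first cache miss itself from stampeding storage, and client-side jitter before fetching helps further.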

Presence & Typing Indicators

Presence tracking uses Redis with TTL-based expiration. Users send periodic heartbeats; absence of heartbeats means offline.

# Presence service
from redis.asyncio import Redis  # async client (redis-py >= 4.2), so calls can be awaited

redis = Redis()
PRESENCE_TTL = 60  # seconds
HEARTBEAT_INTERVAL = 30

async def handle_heartbeat(user_id: str):
    """Update user's online status with TTL"""
    await redis.setex(f'presence:{user_id}', PRESENCE_TTL, 'online')
    
async def get_presence(user_ids: list[str]) -> dict:
    """Batch fetch presence for multiple users"""
    pipeline = redis.pipeline()
    for user_id in user_ids:
        pipeline.get(f'presence:{user_id}')
    results = await pipeline.execute()
    
    return {
        user_id: 'online' if result else 'offline'
        for user_id, result in zip(user_ids, results)
    }

async def handle_typing(user_id: str, conversation_id: str):
    """Typing indicator with short TTL and debouncing"""
    key = f'typing:{conversation_id}:{user_id}'
    
    # Only publish if not already typing (debounce)
    if not await redis.exists(key):
        await notify_conversation_members(conversation_id, {
            'type': 'typing_start',
            'user_id': user_id
        })
    
    # Reset TTL
    await redis.setex(key, 3, '1')  # 3 second TTL

Typing indicators are ephemeral—they don’t need persistence or guaranteed delivery. A 3-second TTL automatically clears stale typing states without explicit “stopped typing” messages.

Scaling & Reliability Considerations

Sharding Strategy: Shard by conversation ID for message storage (keeps conversation data co-located) and by user ID for user data and presence (keeps user’s data together).
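
The routing itself can be a stable hash of the shard key. A sketch (shard counts are illustrative; production systems typically layer consistent hashing on top so resharding moves less data):

```python
import zlib

NUM_MESSAGE_SHARDS = 64  # illustrative
NUM_USER_SHARDS = 32     # illustrative

def shard_for_conversation(conversation_id: str) -> int:
    # Stable hash: the same conversation always routes to the same shard
    return zlib.crc32(conversation_id.encode()) % NUM_MESSAGE_SHARDS

def shard_for_user(user_id: str) -> int:
    return zlib.crc32(user_id.encode()) % NUM_USER_SHARDS
```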

Caching: Cache recent messages per conversation, group membership lists, and user profiles. Invalidate on writes using cache-aside pattern.
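
Cache-aside here means: read from cache, fall back to storage on a miss and populate, and delete (rather than update) the cached entry on writes. A sketch with a dict standing in for Redis:

```python
cache = {}  # stand-in for Redis

def get_profile(user_id, load_from_db):
    """Cache-aside read: cache first, then storage, then populate."""
    if user_id in cache:
        return cache[user_id]
    profile = load_from_db(user_id)
    cache[user_id] = profile
    return profile

def update_profile(user_id, profile, save_to_db):
    """Write path: persist, then invalidate so the next read repopulates."""
    save_to_db(user_id, profile)
    cache.pop(user_id, None)  # delete, don't update: avoids stale-write races
```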

Geographic Distribution: Deploy WebSocket servers in multiple regions. Route users to nearest region, but ensure messages can cross regions via the Kafka backbone.

Message Deduplication: Clients should send idempotency keys with each message. The server checks against a short-lived cache (Redis with 24-hour TTL) before processing.
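
In Redis this check is a single atomic SET with the NX and EX flags. A sketch of the same logic with an in-memory map standing in for Redis:

```python
import time

_seen = {}  # idempotency_key -> expiry timestamp; stand-in for Redis
DEDUP_TTL = 24 * 60 * 60  # 24-hour window

def claim_idempotency_key(key, now=time.time):
    """Return True the first time a key is seen within the TTL window.
    In Redis this is one atomic call: SET key 1 NX EX 86400."""
    expiry = _seen.get(key)
    if expiry and expiry > now():
        return False  # duplicate: this message was already processed
    _seen[key] = now() + DEDUP_TTL
    return True
```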

End-to-End Encryption: For E2E encryption, the server never sees plaintext. Clients exchange keys via the server, but encryption/decryption happens client-side. This complicates features like server-side search but is essential for privacy-focused applications.

The architecture outlined here handles the stated requirements while leaving room for optimization. Start with the simpler approaches (fan-out on write, single-region deployment) and add complexity only when metrics demand it. Premature optimization in distributed systems often creates more problems than it solves.
