Design a File Storage System: Distributed File System
Key Insights
- Distributed file systems separate metadata management from data storage, allowing the metadata server to remain lightweight while data servers scale horizontally to petabytes
- Chunking files into large blocks (64-256MB) and replicating them across failure domains provides fault tolerance without sacrificing throughput for large sequential reads
- The write path requires careful coordination through a primary replica to maintain consistency, while reads can be served from any replica for better availability
Introduction & Requirements
A distributed file system stores files across multiple machines, presenting them as a unified namespace to clients. You need one when a single machine can’t handle your storage capacity, throughput requirements, or availability needs. Google built GFS to handle their crawl data. Hadoop created HDFS for MapReduce workloads. The principles remain consistent across implementations.
Let’s define what we’re building:
interface FileSystemRequirements {
functional: {
createFile(path: string, data: Buffer): Promise<FileHandle>;
readFile(path: string, offset: number, length: number): Promise<Buffer>;
deleteFile(path: string): Promise<void>;
listDirectory(path: string): Promise<FileMetadata[]>;
getFileInfo(path: string): Promise<FileMetadata>;
};
nonFunctional: {
storageCapacity: "petabytes";
fileSizeSupport: "gigabytes to terabytes";
consistencyModel: "strong consistency for metadata, eventual for data";
availabilityTarget: 99.9; // percentage
readLatency: "< 100ms for metadata, throughput-bound for data";
writeLatency: "< 500ms for small files";
replicationFactor: 3;
};
}
interface FileMetadata {
path: string;
size: number;
chunkIds: string[];
createdAt: Date;
modifiedAt: Date;
permissions: number;
version: number;
}
We’re optimizing for large files with sequential access patterns. Small files and random access patterns require different design decisions that we won’t cover here.
High-Level Architecture
The architecture separates concerns into three layers: clients, a metadata server (master), and data servers (chunk servers).
┌─────────────────────────────────────────────────────────────────┐
│ CLIENTS │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Client 1 │ │ Client 2 │ │ Client 3 │ │ Client N │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└───────┼─────────────┼─────────────┼─────────────┼───────────────┘
│ │ │ │
│ metadata │ │ │ metadata
│ requests │ │ │ requests
▼ │ │ ▼
┌───────────────────────────────────────────────────────────────┐
│ MASTER NODE │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Namespace │ │ Chunk Map │ │ Operation Log │ │
│ │ Tree │ │ (chunk→svr) │ │ + Checkpoints │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
│
│ chunk locations
▼
┌───────────────────────────────────────────────────────────────┐
│ CHUNK SERVERS │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Server 1 │ │ Server 2 │ │ Server 3 │ │ Server N │ │
│ │ [chunks] │ │ [chunks] │ │ [chunks] │ │ [chunks] │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└───────────────────────────────────────────────────────────────┘
class MasterNode {
private namespace: NamespaceTree;
private chunkMap: Map<string, ChunkLocation[]>;
private operationLog: OperationLog;
private chunkServers: Map<string, ChunkServerState>;
async getChunkLocations(fileId: string): Promise<ChunkLocation[]> {
const metadata = this.namespace.getFile(fileId);
return metadata.chunkIds.flatMap(id => this.chunkMap.get(id) || []);
}
}
class ChunkServer {
private serverId: string;
private chunks: Map<string, ChunkData>;
private masterClient: MasterClient;
async storeChunk(chunkId: string, data: Buffer): Promise<void> {
this.chunks.set(chunkId, { data, checksum: this.computeChecksum(data) }); // Map.set is synchronous
await this.masterClient.reportChunk(this.serverId, chunkId);
}
}
Clients contact the master for metadata operations, then communicate directly with chunk servers for data transfer. This keeps the master from becoming a bottleneck for large data transfers.
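This flow can be made concrete with a runnable miniature: in-memory stand-ins for the master and chunk servers (MiniMaster, MiniChunkServer, and readWholeFile are illustrative names, not part of the interfaces above). Note the single metadata round trip followed by direct server-to-client data transfers.

```typescript
type ChunkLoc = { chunkId: string; serverId: string };

class MiniMaster {
  private files = new Map<string, string[]>();    // fileId -> ordered chunkIds
  private chunkMap = new Map<string, string[]>(); // chunkId -> serverIds holding a replica

  addFile(fileId: string, chunks: { id: string; servers: string[] }[]): void {
    this.files.set(fileId, chunks.map(c => c.id));
    for (const c of chunks) this.chunkMap.set(c.id, c.servers);
  }

  // Metadata request: cheap, answered entirely from memory.
  getChunkLocations(fileId: string): ChunkLoc[] {
    return (this.files.get(fileId) ?? []).map(chunkId => ({
      chunkId,
      serverId: (this.chunkMap.get(chunkId) ?? [])[0], // any replica serves reads
    }));
  }
}

class MiniChunkServer {
  private chunks = new Map<string, string>();
  store(chunkId: string, data: string): void { this.chunks.set(chunkId, data); }
  read(chunkId: string): string { return this.chunks.get(chunkId) ?? ""; }
}

// One metadata call, then data flows directly from chunk servers.
function readWholeFile(
  master: MiniMaster,
  servers: Map<string, MiniChunkServer>,
  fileId: string
): string {
  return master
    .getChunkLocations(fileId)
    .map(loc => servers.get(loc.serverId)!.read(loc.chunkId))
    .join("");
}
```

The master never touches file bytes; it only hands out locations, which is what keeps it off the data path.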
File Chunking & Data Distribution
Files are split into fixed-size chunks. GFS uses 64MB; HDFS defaults to 128MB. Larger chunks reduce metadata overhead and favor sequential reads. Smaller chunks provide better parallelism for random access.
const CHUNK_SIZE = 64 * 1024 * 1024; // 64MB
function splitFileIntoChunks(data: Buffer): Chunk[] {
const chunks: Chunk[] = [];
let offset = 0;
while (offset < data.length) {
const chunkData = data.subarray(offset, offset + CHUNK_SIZE); // Buffer.slice is deprecated; subarray is the same zero-copy view
chunks.push({
id: generateChunkId(),
data: chunkData,
checksum: computeChecksum(chunkData),
index: chunks.length,
});
offset += CHUNK_SIZE;
}
return chunks;
}
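The checksum stored alongside each chunk lets a chunk server detect silent disk corruption before returning data. A minimal sketch, assuming SHA-256 (the text does not pin down the algorithm; `verifyChunk` is an illustrative helper):

```typescript
import { createHash } from "crypto";

function computeChecksum(data: Buffer): string {
  return createHash("sha256").update(data).digest("hex");
}

// Recompute the digest on read; a mismatch means this replica is corrupt
// and the client should fall back to another one.
function verifyChunk(data: Buffer, expected: string): Buffer {
  if (computeChecksum(data) !== expected) {
    throw new Error("checksum mismatch: chunk corrupt, try another replica");
  }
  return data;
}
```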
class ConsistentHashRing {
private ring: Map<number, string> = new Map();
private sortedKeys: number[] = [];
private virtualNodes: number = 150;
addServer(serverId: string): void {
for (let i = 0; i < this.virtualNodes; i++) {
const hash = this.hash(`${serverId}:${i}`);
this.ring.set(hash, serverId);
this.sortedKeys.push(hash);
}
this.sortedKeys.sort((a, b) => a - b);
}
getServersForChunk(chunkId: string, replicaCount: number): string[] {
const hash = this.hash(chunkId);
const servers: Set<string> = new Set();
let index = this.findStartIndex(hash);
while (servers.size < replicaCount && servers.size < new Set(this.ring.values()).size) {
const server = this.ring.get(this.sortedKeys[index % this.sortedKeys.length])!;
servers.add(server);
index++;
}
return Array.from(servers);
}
private hash(key: string): number {
// Use a proper hash function like MurmurHash3 in production
let hash = 0;
for (let i = 0; i < key.length; i++) {
hash = ((hash << 5) - hash) + key.charCodeAt(i);
hash = hash & hash;
}
return Math.abs(hash);
}
}
Rack-aware placement ensures replicas survive rack failures. The first replica goes to a random server, the second to a different rack, and the third to another server in the second rack. This balances fault tolerance with network locality.
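That placement policy can be sketched directly. `ServerInfo` and `placeReplicas` are illustrative; `pick` defaults to the first candidate so the sketch is deterministic, where production code would choose randomly among lightly loaded servers.

```typescript
interface ServerInfo { id: string; rack: string; }

function placeReplicas(
  servers: ServerInfo[],
  pick: (candidates: ServerInfo[]) => ServerInfo = cs => cs[0]
): ServerInfo[] {
  const first = pick(servers);                                      // replica 1: any server
  const second = pick(servers.filter(s => s.rack !== first.rack));  // replica 2: different rack
  const third = pick(servers.filter(                                // replica 3: another server
    s => s.rack === second.rack && s.id !== second.id               // in replica 2's rack
  ));
  return [first, second, third];
}
```

Two racks can fail entirely before all three replicas are lost, while the second and third replicas share a rack and keep the second copy of the write on a cheap intra-rack hop.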
Metadata Management
The master maintains all metadata in memory for fast lookups. This works because metadata is small—roughly 64 bytes per chunk. A system with 100 million chunks needs only ~6GB of RAM.
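A back-of-envelope check on those figures, using the constants from the text (GiB here means 2^30 bytes):

```typescript
const BYTES_PER_CHUNK = 64;        // approximate in-memory metadata per chunk
const CHUNKS = 100_000_000;        // 100 million chunks
const CHUNK_BYTES = 64 * 1024 * 1024;

const ramGiB = (BYTES_PER_CHUNK * CHUNKS) / 2 ** 30;   // ≈ 5.96, i.e. "~6GB"
const dataPB = (CHUNKS * CHUNK_BYTES) / 1e15;          // ≈ 6.7 PB addressed
```

So ~6GB of master RAM indexes several petabytes of chunk data, which is why keeping all metadata in memory is viable.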
interface NamespaceNode {
name: string;
isDirectory: boolean;
children?: Map<string, NamespaceNode>;
fileMetadata?: {
size: number;
chunkIds: string[];
version: number;
permissions: number;
};
}
class NamespaceTree {
private root: NamespaceNode = { name: "/", isDirectory: true, children: new Map() };
private operationLog: OperationLog;
async createFile(path: string, chunkIds: string[]): Promise<void> {
// Log operation before applying
await this.operationLog.append({
type: "CREATE_FILE",
path,
chunkIds,
timestamp: Date.now(),
});
const parts = path.split("/").filter(Boolean);
const fileName = parts.pop()!;
const parent = this.navigateToParent(parts);
parent.children!.set(fileName, {
name: fileName,
isDirectory: false,
fileMetadata: {
size: chunkIds.length * CHUNK_SIZE, // upper bound; the final chunk is usually partial
chunkIds,
version: 1,
permissions: 0o644,
},
});
}
async checkpoint(): Promise<void> {
const snapshot = this.serializeTree(this.root);
await this.writeCheckpoint(snapshot);
await this.operationLog.truncateBefore(Date.now());
}
}
class OperationLog {
private logFile: FileHandle;
async append(operation: Operation): Promise<void> {
const serialized = JSON.stringify(operation) + "\n";
await this.logFile.appendFile(serialized);
await this.logFile.sync(); // fsync for durability
}
async replay(fromCheckpoint: NamespaceTree): Promise<NamespaceTree> {
const operations = await this.readAllOperations();
for (const op of operations) {
await fromCheckpoint.applyOperation(op);
}
return fromCheckpoint;
}
}
Checkpoints happen periodically. Recovery loads the latest checkpoint and replays subsequent operations from the log.
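The recovery path reduces to a fold over the log. A minimal sketch, using a flat set of paths as a stand-in for the namespace tree (the `Op`, `recover`, and checkpoint shapes are illustrative):

```typescript
type Op = { type: "CREATE_FILE" | "DELETE_FILE"; path: string; timestamp: number };
type Namespace = Set<string>;

function recover(checkpoint: { state: string[]; asOf: number }, log: Op[]): Namespace {
  // 1. Load the latest checkpoint.
  const ns: Namespace = new Set(checkpoint.state);
  // 2. Replay only operations logged after the checkpoint was taken.
  for (const op of log) {
    if (op.timestamp <= checkpoint.asOf) continue; // already reflected in checkpoint
    if (op.type === "CREATE_FILE") ns.add(op.path);
    else ns.delete(op.path);
  }
  return ns;
}
```

Because every mutation is fsynced to the log before being applied, replay reconstructs exactly the state the master held at crash time.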
Consistency & Replication
Writes flow through a primary replica that orders mutations. The master grants leases to primaries, typically lasting 60 seconds with renewal.
class WriteCoordinator {
async writeChunk(chunkId: string, data: Buffer, servers: ChunkLocation[]): Promise<void> {
const [primary] = servers; // first location is the lease-holding primary; the rest are secondaries
// Phase 1: Push data to all replicas
await Promise.all(
servers.map(server => this.pushDataToCache(server, chunkId, data))
);
// Phase 2: Primary orders and commits
const serialNumber = await this.requestCommit(primary, chunkId);
// Phase 3: Primary forwards to secondaries
await this.waitForSecondaryAcks(primary, chunkId, serialNumber);
}
private async pushDataToCache(server: ChunkLocation, chunkId: string, data: Buffer): Promise<void> {
// Data pushed along chain: client → server1 → server2 → server3
// This maximizes network bandwidth utilization
await this.chunkServerClient.pushToCache(server, chunkId, data);
}
private async requestCommit(primary: ChunkLocation, chunkId: string): Promise<number> {
return await this.chunkServerClient.commit(primary, chunkId);
}
}
class ChunkServerReplicaManager {
private leases: Map<string, Lease> = new Map();
async handleWrite(chunkId: string, serialNumber: number): Promise<void> {
const lease = this.leases.get(chunkId);
if (!lease?.isPrimary) {
throw new Error("Not primary for this chunk");
}
// Apply write locally
await this.applyMutation(chunkId, serialNumber);
// Forward to secondaries
const acks = await Promise.all(
lease.secondaries.map(s => this.forwardToSecondary(s, chunkId, serialNumber))
);
if (acks.filter(Boolean).length < lease.secondaries.length) {
// Handle partial failure - mark replicas as stale
await this.reportStaleReplicas(chunkId, acks);
}
}
}
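The master side of lease management can be sketched as follows. The 60-second term comes from the text; everything else (`LeaseManager`, the injected clock) is illustrative, and a real master must also budget for clock skew before regranting an expired lease.

```typescript
const LEASE_TERM_MS = 60_000; // 60-second lease, renewable by the current primary

interface Lease { chunkId: string; primary: string; expiresAt: number; }

class LeaseManager {
  private leases = new Map<string, Lease>();

  // Returns the lease if granted/renewed, or null while another primary holds it.
  grant(chunkId: string, server: string, now: number): Lease | null {
    const existing = this.leases.get(chunkId);
    if (existing && existing.expiresAt > now && existing.primary !== server) {
      return null; // another server's lease is still live
    }
    const lease = { chunkId, primary: server, expiresAt: now + LEASE_TERM_MS };
    this.leases.set(chunkId, lease); // fresh grant, or renewal extending the term
    return lease;
  }
}
```

Leases bound how long a partitioned primary can keep ordering writes: once the term lapses without renewal, the master can safely appoint a new primary.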
Fault Tolerance & Recovery
Chunk servers send heartbeats every few seconds. Missing heartbeats trigger re-replication.
class HealthMonitor {
private serverLastSeen: Map<string, number> = new Map();
private readonly HEARTBEAT_INTERVAL = 3000;
private readonly FAILURE_THRESHOLD = 10000;
monitorServers(): void { // fire-and-forget; setInterval drives the periodic checks
setInterval(() => this.checkForFailures(), this.HEARTBEAT_INTERVAL);
}
handleHeartbeat(serverId: string, chunkReport: string[]): void {
this.serverLastSeen.set(serverId, Date.now());
this.updateChunkMap(serverId, chunkReport);
}
private async checkForFailures(): Promise<void> {
const now = Date.now();
for (const [serverId, lastSeen] of this.serverLastSeen) {
if (now - lastSeen > this.FAILURE_THRESHOLD) {
await this.handleServerFailure(serverId);
}
}
}
private async handleServerFailure(serverId: string): Promise<void> {
console.log(`Server ${serverId} marked as failed`);
const affectedChunks = this.getChunksOnServer(serverId);
for (const chunkId of affectedChunks) {
const remainingReplicas = this.getHealthyReplicas(chunkId);
if (remainingReplicas.length < this.targetReplicationFactor) {
await this.scheduleReReplication(chunkId, remainingReplicas);
}
}
this.serverLastSeen.delete(serverId);
}
private async scheduleReReplication(chunkId: string, sourceReplicas: string[]): Promise<void> {
const targetServer = this.selectReplicationTarget(chunkId);
const sourceServer = sourceReplicas[0];
await this.chunkServerClient.copyChunk(sourceServer, targetServer, chunkId);
}
}
Optimizations & Production Considerations
A production client needs retry logic, connection pooling, and caching:
class DistributedFileSystemClient {
private masterClient: MasterClient;
private chunkServerPool: ConnectionPool;
private metadataCache: LRUCache<string, FileMetadata>;
private readonly MAX_RETRIES = 3;
async readFile(path: string, offset: number, length: number): Promise<Buffer> {
const metadata = await this.getMetadataWithCache(path);
const chunks = this.calculateChunksNeeded(metadata, offset, length);
const chunkData = await Promise.all(
chunks.map(chunk => this.readChunkWithRetry(chunk))
);
return this.assembleData(chunkData, offset, length);
}
private async readChunkWithRetry(chunk: ChunkInfo): Promise<Buffer> {
let lastError: Error | null = null;
for (let attempt = 0; attempt < this.MAX_RETRIES; attempt++) {
const locations = await this.masterClient.getChunkLocations(chunk.id);
for (const location of locations) {
try {
const conn = await this.chunkServerPool.acquire(location.serverId);
const data = await conn.readChunk(chunk.id, chunk.offset, chunk.length);
this.chunkServerPool.release(conn);
return data;
} catch (error) {
lastError = error as Error;
continue; // Try next replica
}
}
await this.sleep(Math.pow(2, attempt) * 100); // Exponential backoff
}
throw new Error(`Failed to read chunk ${chunk.id}: ${lastError?.message}`);
}
private async getMetadataWithCache(path: string): Promise<FileMetadata> {
const cached = this.metadataCache.get(path);
if (cached) return cached;
const metadata = await this.masterClient.getFileInfo(path);
this.metadataCache.set(path, metadata);
return metadata;
}
}
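The `calculateChunksNeeded` helper above is assumed rather than shown; one plausible sketch maps a byte range onto the chunk indexes and per-chunk offsets that cover it (the 64MB default matches the figure used earlier):

```typescript
const DEFAULT_CHUNK = 64 * 1024 * 1024;

interface ChunkRead { index: number; offset: number; length: number; }

function calculateChunksNeeded(
  offset: number,
  length: number,
  chunkSize = DEFAULT_CHUNK
): ChunkRead[] {
  const reads: ChunkRead[] = [];
  let pos = offset;
  const end = offset + length;
  while (pos < end) {
    const index = Math.floor(pos / chunkSize);          // which chunk holds this byte
    const within = pos % chunkSize;                     // offset inside that chunk
    const take = Math.min(chunkSize - within, end - pos); // read to chunk boundary or range end
    reads.push({ index, offset: within, length: take });
    pos += take;
  }
  return reads;
}
```

Each entry can then be fetched in parallel, which is exactly what the `Promise.all` in `readFile` exploits.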
Real systems like HDFS add features we haven’t covered: erasure coding for storage efficiency, tiered storage for cost optimization, and federation for namespace scaling. Ceph takes a different approach with CRUSH maps for deterministic placement, eliminating the central metadata server bottleneck.
The fundamentals remain: separate metadata from data, chunk files for parallel access, replicate for durability, and coordinate writes through a primary. These principles scale from gigabytes to exabytes.