Service Registry: Dynamic Service Location
Key Insights
- Service registries solve the fundamental problem of dynamic service location in distributed systems where IP addresses and ports change constantly due to scaling, deployments, and failures.
- Client-side discovery offers better performance and resilience but increases client complexity, while server-side discovery centralizes logic but introduces a potential single point of failure.
- Health checking is non-negotiable—a registry with stale entries is worse than no registry at all because it actively routes traffic to dead instances.
Introduction to Service Registry
Hardcoded service URLs work until they don’t. The moment you scale beyond a single instance, deploy to containers, or implement any form of auto-scaling, static configuration becomes a liability. Your payment service can’t call http://orders-service:8080 when there are now three order service instances running on dynamically assigned ports.
A service registry solves this by maintaining a real-time directory of available service instances. Services register themselves when they start, deregister when they stop, and clients query the registry to discover where to send requests. It’s the phone book for your microservices architecture.
The pattern becomes essential in three scenarios: container orchestration where instances spin up and down constantly, cloud deployments with auto-scaling groups, and any environment where you need zero-downtime deployments. If you’re still editing configuration files to add new service instances, you’re doing it wrong.
Core Concepts and Architecture
A service registry operates on three fundamental operations: registration, discovery, and health verification.
Registration happens when a service instance starts. It announces itself to the registry with its network location, metadata (version, environment, capabilities), and health check configuration.
Discovery is the lookup process. When Service A needs to call Service B, it queries the registry for all healthy instances of Service B, then selects one using a load balancing strategy.
Health checking ensures the registry only returns instances that can actually handle requests. This is where most naive implementations fail—they register services but never verify they’re still alive.
Two registration patterns dominate:
Self-registration means the service instance is responsible for registering and deregistering itself. This keeps infrastructure simple but couples your application code to registry logic.
Third-party registration uses a separate component (often called a registrar) that watches for new instances and handles registration. This decouples services from registry awareness but adds operational complexity.
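The self-registration pattern can be sketched in a few lines. This is an illustrative sketch, not a library API: the `RegistryClient` interface and `SelfRegisteringService` class are hypothetical names, and the registry client is injected so the lifecycle wiring stays testable.

```typescript
// Sketch of self-registration: the service owns its registry lifecycle.
// The registry client is injected; all names here are illustrative.
interface RegistryClient {
  register(serviceName: string, host: string, port: number): Promise<string>;
  deregister(instanceId: string): Promise<void>;
}

class SelfRegisteringService {
  private instanceId: string | null = null;

  constructor(
    private registry: RegistryClient,
    private serviceName: string,
    private host: string,
    private port: number
  ) {}

  // Call on startup, after the HTTP server is actually listening,
  // so the registry never advertises an instance that can't serve yet.
  async start(): Promise<void> {
    this.instanceId = await this.registry.register(
      this.serviceName, this.host, this.port
    );
  }

  // Call on shutdown (e.g. from a SIGTERM handler) so the registry
  // stops routing traffic here before the process exits.
  async stop(): Promise<void> {
    if (this.instanceId) {
      await this.registry.deregister(this.instanceId);
      this.instanceId = null;
    }
  }

  get registered(): boolean {
    return this.instanceId !== null;
  }
}
```

The coupling the pattern introduces is visible here: the service must know the registry's client interface and call it at exactly the right lifecycle moments.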
// Core service registry interface
interface ServiceInstance {
id: string;
serviceName: string;
host: string;
port: number;
metadata: Record<string, string>;
healthCheckUrl: string;
registeredAt: Date;
lastHeartbeat: Date;
}
interface ServiceRegistry {
register(instance: Omit<ServiceInstance, 'id' | 'registeredAt' | 'lastHeartbeat'>): Promise<string>;
deregister(instanceId: string): Promise<void>;
discover(serviceName: string): Promise<ServiceInstance[]>;
heartbeat(instanceId: string): Promise<void>;
}
This interface captures the essential operations. The register method returns an instance ID that the service uses for subsequent heartbeats and eventual deregistration. The discover method returns all healthy instances, leaving load balancing decisions to the caller.
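To make the interface concrete, here is a minimal in-memory implementation, suitable for tests and local development only (the `ServiceInstance` type is re-declared so the snippet is self-contained). A production registry would add persistence, clustering, and TTL-based eviction on top of this skeleton.

```typescript
import { randomUUID } from 'crypto';

// Re-declared here so the snippet stands alone; matches the interface above.
interface ServiceInstance {
  id: string;
  serviceName: string;
  host: string;
  port: number;
  metadata: Record<string, string>;
  healthCheckUrl: string;
  registeredAt: Date;
  lastHeartbeat: Date;
}

// Minimal in-memory sketch of the ServiceRegistry interface.
class InMemoryRegistry {
  private instances = new Map<string, ServiceInstance>();

  async register(
    instance: Omit<ServiceInstance, 'id' | 'registeredAt' | 'lastHeartbeat'>
  ): Promise<string> {
    const id = randomUUID();
    const now = new Date();
    this.instances.set(id, { ...instance, id, registeredAt: now, lastHeartbeat: now });
    return id;
  }

  async deregister(instanceId: string): Promise<void> {
    this.instances.delete(instanceId);
  }

  async discover(serviceName: string): Promise<ServiceInstance[]> {
    return [...this.instances.values()].filter(i => i.serviceName === serviceName);
  }

  async heartbeat(instanceId: string): Promise<void> {
    const instance = this.instances.get(instanceId);
    if (!instance) throw new Error(`Unknown instance: ${instanceId}`);
    instance.lastHeartbeat = new Date();
  }
}
```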
Implementation Patterns
Client-Side Discovery
In client-side discovery, the calling service queries the registry directly and handles load balancing itself. Netflix’s Eureka popularized this approach. The client maintains a local cache of service locations, refreshes it periodically, and selects instances using round-robin, random, or weighted algorithms.
class DiscoveryClient {
private cache: Map<string, ServiceInstance[]> = new Map();
private registryUrl: string;
private refreshIntervalMs: number;
private roundRobinIndex: Map<string, number> = new Map();
constructor(registryUrl: string, refreshIntervalMs: number = 30000) {
this.registryUrl = registryUrl;
this.refreshIntervalMs = refreshIntervalMs;
this.startBackgroundRefresh();
}
private startBackgroundRefresh(): void {
setInterval(async () => {
for (const serviceName of this.cache.keys()) {
await this.refreshService(serviceName);
}
}, this.refreshIntervalMs);
}
private async refreshService(serviceName: string): Promise<void> {
try {
const response = await fetch(`${this.registryUrl}/services/${serviceName}`);
const instances: ServiceInstance[] = await response.json();
this.cache.set(serviceName, instances);
} catch (error) {
console.warn(`Failed to refresh ${serviceName}, using cached data`);
}
}
async getInstance(serviceName: string): Promise<ServiceInstance | null> {
if (!this.cache.has(serviceName)) {
await this.refreshService(serviceName);
}
const instances = this.cache.get(serviceName) || [];
if (instances.length === 0) return null;
// Round-robin selection
const currentIndex = this.roundRobinIndex.get(serviceName) || 0;
const instance = instances[currentIndex % instances.length];
this.roundRobinIndex.set(serviceName, currentIndex + 1);
return instance;
}
async getServiceUrl(serviceName: string): Promise<string | null> {
const instance = await this.getInstance(serviceName);
return instance ? `http://${instance.host}:${instance.port}` : null;
}
}
The cache is critical here. Without it, every service call would require a registry lookup, creating a bottleneck and single point of failure. The background refresh keeps the cache reasonably fresh while the service continues operating even if the registry becomes temporarily unavailable.
Server-Side Discovery
Server-side discovery places a load balancer or API gateway between clients and services. Clients call a fixed endpoint, and the load balancer queries the registry and routes requests. AWS ALB with ECS, Kubernetes Services, and traditional reverse proxies follow this pattern.
The trade-off is clear: client-side discovery distributes load balancing logic and eliminates a network hop, but every client needs discovery-aware code. Server-side discovery centralizes logic and keeps clients simple, but the load balancer becomes a critical dependency.
Choose client-side when you control all clients and need maximum performance. Choose server-side when you have diverse clients (mobile apps, third-party integrations) or want to keep services registry-agnostic.
Health Checking and Heartbeats
A registry that returns dead instances is actively harmful. Health checking prevents this through two mechanisms:
TTL-based registration requires services to send periodic heartbeats. Miss too many heartbeats, and you’re automatically deregistered. This is simple and works well for services that can reliably send heartbeats.
Active health checks have the registry (or a separate component) periodically probe service health endpoints. This catches cases where a service is running but not functioning correctly—the process is alive but the database connection is dead.
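An active checker on the registry side might look like the following sketch. The probe function is injected (in production it would be an HTTP GET with a timeout) so the marking logic is testable without a network; `ActiveHealthChecker` and `ProbedInstance` are illustrative names, not from any library.

```typescript
// Sketch of registry-side active health checking. The checker walks all
// registered instances, probes each health endpoint, and only marks an
// instance unhealthy after several consecutive failures, so one slow
// probe doesn't pull a working instance out of rotation.
interface ProbedInstance {
  id: string;
  healthCheckUrl: string;
  healthy: boolean;
  failureCount: number;
}

class ActiveHealthChecker {
  constructor(
    private instances: Map<string, ProbedInstance>,
    // Should resolve true when the endpoint answers 2xx within a timeout.
    private probe: (url: string) => Promise<boolean>,
    private failureThreshold: number = 3
  ) {}

  // Run one probe pass; in practice this is driven by a timer.
  async runOnce(): Promise<void> {
    for (const instance of this.instances.values()) {
      const ok = await this.probe(instance.healthCheckUrl).catch(() => false);
      if (ok) {
        instance.failureCount = 0;
        instance.healthy = true;
      } else {
        instance.failureCount++;
        if (instance.failureCount >= this.failureThreshold) {
          instance.healthy = false;
        }
      }
    }
  }
}
```

The consecutive-failure threshold is the knob to tune: lower values detect dead instances faster but misclassify instances under transient load.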
class HeartbeatManager {
private registryUrl: string;
private instanceId: string;
private intervalMs: number;
private intervalHandle: NodeJS.Timeout | null = null;
private consecutiveFailures: number = 0;
private maxFailures: number;
constructor(
registryUrl: string,
instanceId: string,
intervalMs: number = 10000,
maxFailures: number = 3
) {
this.registryUrl = registryUrl;
this.instanceId = instanceId;
this.intervalMs = intervalMs;
this.maxFailures = maxFailures;
}
start(): void {
this.intervalHandle = setInterval(() => this.sendHeartbeat(), this.intervalMs);
this.sendHeartbeat(); // Send immediately on start
}
stop(): void {
if (this.intervalHandle) {
clearInterval(this.intervalHandle);
this.intervalHandle = null;
}
}
private async sendHeartbeat(): Promise<void> {
try {
const response = await fetch(
`${this.registryUrl}/instances/${this.instanceId}/heartbeat`,
{ method: 'PUT' }
);
if (response.ok) {
this.consecutiveFailures = 0;
} else if (response.status === 404) {
console.error('Instance not found in registry, re-registering...');
await this.reregister();
}
} catch (error) {
this.consecutiveFailures++;
console.warn(`Heartbeat failed (${this.consecutiveFailures}/${this.maxFailures})`);
if (this.consecutiveFailures >= this.maxFailures) {
console.error('Max heartbeat failures reached, initiating graceful shutdown');
// process.emit('SIGTERM') would only invoke listeners in-process;
// process.kill delivers a real signal through the normal handler path.
process.kill(process.pid, 'SIGTERM');
}
}
}
private async reregister(): Promise<void> {
// Re-registration logic here
}
}
The failure handling is crucial. When heartbeats fail, the service should either attempt re-registration or shut down gracefully. Running without registry presence means no traffic will reach you anyway—better to restart and register fresh.
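The shutdown ordering matters: stop heartbeats, deregister explicitly, then drain in-flight requests. One way to sketch it, with the steps injected as callbacks so the sequence is testable (the function name and callback shape are illustrative, not a standard API):

```typescript
// Sketch of graceful shutdown ordering for a self-registering service.
async function gracefulShutdown(steps: {
  stopHeartbeats: () => void;
  deregister: () => Promise<void>;
  closeServer: () => Promise<void>;
}): Promise<void> {
  // Stop heartbeats first so the TTL clock starts running regardless
  // of whether explicit deregistration succeeds.
  steps.stopHeartbeats();
  try {
    // Explicit deregistration removes the instance immediately instead
    // of waiting out the TTL.
    await steps.deregister();
  } catch {
    // Best effort: if the registry is unreachable, TTL eviction covers us.
  }
  // Finish in-flight requests before the process exits.
  await steps.closeServer();
}
```

Typical wiring would call this from a `process.on('SIGTERM', ...)` handler, passing the `HeartbeatManager`'s `stop()`, a `DELETE` against the registry's instance endpoint, and the HTTP server's `close()`.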
Building a Lightweight Service Registry
Understanding the mechanics helps even if you’ll use Consul or Kubernetes in production. Here’s a minimal registry server:
import express from 'express';
import { v4 as uuidv4 } from 'uuid';
interface StoredInstance extends ServiceInstance {
ttlSeconds: number;
}
const app = express();
app.use(express.json());
const instances: Map<string, StoredInstance> = new Map();
const TTL_CHECK_INTERVAL = 5000;
// Evict expired instances
setInterval(() => {
const now = new Date();
for (const [id, instance] of instances) {
const expiresAt = new Date(instance.lastHeartbeat.getTime() + instance.ttlSeconds * 1000);
if (now > expiresAt) {
console.log(`Evicting stale instance: ${id}`);
instances.delete(id);
}
}
}, TTL_CHECK_INTERVAL);
app.post('/instances', (req, res) => {
const id = uuidv4();
const instance: StoredInstance = {
...req.body,
id, // set after the spread so a client-supplied id can't override it
registeredAt: new Date(),
lastHeartbeat: new Date(),
ttlSeconds: req.body.ttlSeconds || 30
};
instances.set(id, instance);
res.status(201).json({ id });
});
app.delete('/instances/:id', (req, res) => {
instances.delete(req.params.id);
res.status(204).send();
});
app.put('/instances/:id/heartbeat', (req, res) => {
const instance = instances.get(req.params.id);
if (!instance) {
return res.status(404).json({ error: 'Instance not found' });
}
instance.lastHeartbeat = new Date();
res.status(200).send();
});
app.get('/services/:name', (req, res) => {
const matching = Array.from(instances.values())
.filter(i => i.serviceName === req.params.name);
res.json(matching);
});
app.listen(8500, () => console.log('Registry running on :8500'));
This implementation handles the core workflow but lacks everything needed for production: persistence, clustering, authentication, and proper health checking. It’s educational, not operational.
Production Considerations
For production, use established tools:
Consul provides service discovery, health checking, and a distributed key-value store. It handles clustering natively and integrates with most service meshes.
service {
name = "order-service"
port = 8080
check {
http = "http://localhost:8080/health"
interval = "10s"
timeout = "2s"
}
meta {
version = "2.1.0"
env = "production"
}
}
Kubernetes DNS provides built-in service discovery. Services get DNS names automatically, and the kube-proxy handles load balancing. For most Kubernetes deployments, you don’t need an external registry.
etcd is a distributed key-value store that can function as a registry. It’s what Kubernetes uses internally and offers strong consistency guarantees.
High availability requires running multiple registry nodes with leader election or consensus protocols. A single registry instance is a single point of failure—exactly what we’re trying to eliminate.
Conclusion
Build a registry only if you’re learning or have extremely specific requirements. For everything else, use Consul, leverage Kubernetes-native discovery, or adopt a service mesh that handles discovery transparently.
The key decisions are: client-side vs. server-side discovery (based on client diversity and performance needs), TTL vs. active health checks (based on failure detection requirements), and self-registration vs. third-party registration (based on how much you want services to know about infrastructure).
Get health checking right. A registry that serves stale data actively harms your system. Start with aggressive TTLs and tune based on your actual failure patterns.