# Serialization: JSON, Protocol Buffers, MessagePack
## Key Insights
- JSON’s human-readability makes it ideal for debugging and REST APIs, but its verbosity can cost 2x or more in payload size compared to binary formats
- Protocol Buffers enforce schema discipline that catches breaking changes at compile time, making them essential for large-scale microservices architectures
- MessagePack offers the smoothest migration path from JSON: the same data model, typically 30-50% smaller payloads, and no schema files to maintain
## Introduction to Serialization
Serialization converts in-memory data structures into a format that can be transmitted over a network or stored on disk. Deserialization reverses the process. Every time you make an API call, write to a cache, or send a message to a queue, serialization is doing the heavy lifting.
The choice of serialization format affects three critical dimensions: human-readability (can you debug it with curl?), performance (how fast can you encode/decode, and how big is the payload?), and schema enforcement (will you catch breaking changes before production?).
Most developers default to JSON because it’s familiar. That’s often the wrong choice. Let’s examine when each format makes sense.
## JSON: The Universal Format
JSON won the web. Every programming language has native or near-native JSON support. Browsers parse it natively. You can read it in a text editor. These advantages are real and shouldn’t be dismissed.
```python
import json

# Python serialization
user = {
    "id": 12345,
    "name": "Alice Chen",
    "email": "alice@example.com",
    "roles": ["admin", "developer"],
    "metadata": {
        "created_at": "2024-01-15T10:30:00Z",
        "last_login": "2024-06-20T14:22:00Z"
    }
}

# Serialize to a UTF-8 encoded JSON byte string
json_bytes = json.dumps(user).encode('utf-8')
print(f"JSON size: {len(json_bytes)} bytes")

# Deserialize back
parsed = json.loads(json_bytes.decode('utf-8'))
```
```javascript
// JavaScript serialization
const user = {
  id: 12345,
  name: "Alice Chen",
  email: "alice@example.com",
  roles: ["admin", "developer"],
  metadata: {
    createdAt: "2024-01-15T10:30:00Z",
    lastLogin: "2024-06-20T14:22:00Z"
  }
};

const jsonString = JSON.stringify(user);
const parsed = JSON.parse(jsonString);
```
JSON’s problems emerge at scale. Field names repeat in every object—transmit a million user records and you’ve sent the string "email" a million times. There’s no schema validation; a typo in a field name silently becomes a new field. Numbers have no defined precision. Dates are just strings with no standard format.
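The silent-typo failure mode is easy to demonstrate. In this sketch (the field names are hypothetical), a misspelled key survives a round trip with no error at all:

```python
import json

# A typo in a field name is just a different key: nothing fails
update = {"id": 12345, "emial": "alice@newdomain.com"}  # note the typo

round_tripped = json.loads(json.dumps(update))
print("email" in round_tripped)  # False: the update silently missed its target
print(round_tripped["emial"])    # the typo became a real field
```

A schema-enforcing format would reject the unknown field at build time or at parse time; JSON happily stores it.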
For public APIs where debuggability matters more than bandwidth, JSON remains the right choice. For internal services processing millions of messages per second, you’re leaving performance on the table.
## Protocol Buffers: Schema-First Performance
Protocol Buffers (protobuf) take the opposite approach: define your schema first, then generate code. Google created protobuf to handle their internal RPC traffic, and it shows in the design priorities.
Start with a .proto file:
```protobuf
syntax = "proto3";

package users;

message User {
  int64 id = 1;
  string name = 2;
  string email = 3;
  repeated string roles = 4;
  Metadata metadata = 5;
}

message Metadata {
  string created_at = 1;
  string last_login = 2;
}
```
Generate Python code with the protobuf compiler:
```bash
protoc --python_out=. user.proto
```
Now use the generated classes:
```python
from user_pb2 import User

# Create and populate the message
user = User()
user.id = 12345
user.name = "Alice Chen"
user.email = "alice@example.com"
user.roles.extend(["admin", "developer"])
user.metadata.created_at = "2024-01-15T10:30:00Z"
user.metadata.last_login = "2024-06-20T14:22:00Z"

# Serialize to bytes
proto_bytes = user.SerializeToString()
print(f"Protobuf size: {len(proto_bytes)} bytes")

# Deserialize back
parsed_user = User()
parsed_user.ParseFromString(proto_bytes)
print(f"Parsed name: {parsed_user.name}")
```
The schema provides several guarantees. Field numbers (the = 1, = 2 assignments) enable backward compatibility—you can add new fields without breaking old clients. Strong typing catches errors at compile time. The binary format encodes field numbers instead of names, dramatically reducing payload size.
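To see why encoding field numbers shrinks the payload, you can hand-assemble the wire bytes for the `id` field with no protobuf library at all. This is a sketch of the documented protobuf wire format: a one-byte tag combining field number and wire type, followed by a base-128 varint:

```python
def encode_varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

# Field number 1 (id), wire type 0 (varint): the name "id" never hits the wire
tag = (1 << 3) | 0
field_bytes = bytes([tag]) + encode_varint(12345)
print(field_bytes.hex(), len(field_bytes))  # '08b960' 3
```

Three bytes carry the entire field; the JSON equivalent `"id":12345` needs ten, and repeats the key in every record.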
The trade-off is workflow complexity. You need the protobuf compiler in your build pipeline. Schema changes require regenerating code and coordinating deployments. You can’t just curl an endpoint and read the response.
For gRPC services and high-throughput internal APIs, this trade-off pays off. For a simple CRUD REST API, it’s overkill.
## MessagePack: Binary JSON
MessagePack occupies the middle ground. It uses JSON’s data model (maps, arrays, strings, numbers, booleans, null) but encodes them in a compact binary format. No schema required, no code generation, just smaller payloads and faster parsing.
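A quick way to see where the savings come from is to hand-assemble the MessagePack bytes for a tiny document using the spec's fixmap, fixstr, and positive-fixint markers. This is a pure-stdlib sketch; the real library produces the same bytes for you:

```python
import json

doc = {"id": 5}

packed = bytes([0x81])           # fixmap holding 1 key/value pair
packed += bytes([0xA2]) + b"id"  # fixstr of length 2
packed += bytes([0x05])          # positive fixint 5

json_bytes = json.dumps(doc, separators=(",", ":")).encode("utf-8")
print(len(packed), len(json_bytes))  # 5 vs 8 bytes
```

Small values, keys, and container headers each collapse to a single byte, which is where most of the size reduction over JSON comes from.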
```python
import json
import msgpack

user = {
    "id": 12345,
    "name": "Alice Chen",
    "email": "alice@example.com",
    "roles": ["admin", "developer"],
    "metadata": {
        "created_at": "2024-01-15T10:30:00Z",
        "last_login": "2024-06-20T14:22:00Z"
    }
}

# JSON serialization
json_bytes = json.dumps(user).encode('utf-8')
print(f"JSON size: {len(json_bytes)} bytes")

# MessagePack serialization - nearly identical API
msgpack_bytes = msgpack.packb(user)
print(f"MessagePack size: {len(msgpack_bytes)} bytes")

# Deserialization
json_parsed = json.loads(json_bytes)
msgpack_parsed = msgpack.unpackb(msgpack_bytes)

# Results are identical
assert json_parsed == msgpack_parsed
```
Running this code shows MessagePack at roughly 140 bytes versus JSON’s 200 bytes—a 30% reduction for this small example. The savings increase with larger payloads and more repetitive data.
MessagePack shines in caching layers (Redis, Memcached), real-time systems where latency matters, and anywhere you’re currently using JSON but want better performance without schema overhead.
## Performance Comparison
Theory is nice. Numbers are better. Here’s a benchmark comparing all three formats:
```python
import json
import time

import msgpack
from user_pb2 import User

def create_test_data(count):
    """Generate test data for benchmarking."""
    return [
        {
            "id": i,
            "name": f"User {i}",
            "email": f"user{i}@example.com",
            "roles": ["reader", "writer"],
            "metadata": {
                "created_at": "2024-01-15T10:30:00Z",
                "last_login": "2024-06-20T14:22:00Z"
            }
        }
        for i in range(count)
    ]

def benchmark_json(data, iterations=100):
    start = time.perf_counter()
    for _ in range(iterations):
        encoded = json.dumps(data).encode('utf-8')
    encode_time = (time.perf_counter() - start) / iterations

    start = time.perf_counter()
    for _ in range(iterations):
        json.loads(encoded)
    decode_time = (time.perf_counter() - start) / iterations

    return len(encoded), encode_time * 1000, decode_time * 1000

def benchmark_msgpack(data, iterations=100):
    start = time.perf_counter()
    for _ in range(iterations):
        encoded = msgpack.packb(data)
    encode_time = (time.perf_counter() - start) / iterations

    start = time.perf_counter()
    for _ in range(iterations):
        msgpack.unpackb(encoded)
    decode_time = (time.perf_counter() - start) / iterations

    return len(encoded), encode_time * 1000, decode_time * 1000

def benchmark_protobuf(data, iterations=100):
    # Convert the dicts to protobuf messages once, outside the timed loops
    users = []
    for item in data:
        user = User()
        user.id = item["id"]
        user.name = item["name"]
        user.email = item["email"]
        user.roles.extend(item["roles"])
        user.metadata.created_at = item["metadata"]["created_at"]
        user.metadata.last_login = item["metadata"]["last_login"]
        users.append(user)

    start = time.perf_counter()
    for _ in range(iterations):
        encoded = [u.SerializeToString() for u in users]
    encode_time = (time.perf_counter() - start) / iterations

    start = time.perf_counter()
    for _ in range(iterations):
        # Protobuf messages aren't self-delimiting, so decode each record
        # separately; in practice you'd wrap them in a repeated field
        for raw in encoded:
            User().ParseFromString(raw)
    decode_time = (time.perf_counter() - start) / iterations

    total_size = sum(len(raw) for raw in encoded)
    return total_size, encode_time * 1000, decode_time * 1000

# Run benchmarks
data = create_test_data(1000)
json_size, json_enc, json_dec = benchmark_json(data)
msgpack_size, msgpack_enc, msgpack_dec = benchmark_msgpack(data)
proto_size, proto_enc, proto_dec = benchmark_protobuf(data)

print(f"{'Format':<12} {'Size (KB)':<12} {'Encode (ms)':<14} {'Decode (ms)':<14}")
print("-" * 52)
print(f"{'JSON':<12} {json_size/1024:<12.1f} {json_enc:<14.2f} {json_dec:<14.2f}")
print(f"{'MessagePack':<12} {msgpack_size/1024:<12.1f} {msgpack_enc:<14.2f} {msgpack_dec:<14.2f}")
print(f"{'Protobuf':<12} {proto_size/1024:<12.1f} {proto_enc:<14.2f} {proto_dec:<14.2f}")
```
Typical results for 1000 user records:
| Format | Size (KB) | Encode (ms) | Decode (ms) |
|---|---|---|---|
| JSON | 165 | 12.5 | 8.2 |
| MessagePack | 108 | 4.1 | 3.8 |
| Protobuf | 72 | 2.8 | 1.9 |
Protobuf wins on size and speed. MessagePack provides meaningful improvements over JSON with minimal code changes. Your actual numbers will vary based on data shape, language, and library implementations.
## Choosing the Right Format
Use this decision framework:
**Choose JSON when:**
- Building public REST APIs
- Debuggability is more important than performance
- Working with browser clients
- Team is unfamiliar with alternatives
**Choose Protocol Buffers when:**
- Building gRPC services
- Schema evolution and backward compatibility are critical
- Processing millions of messages per second
- Working in a polyglot microservices environment
**Choose MessagePack when:**
- Optimizing existing JSON-based systems
- Building caching layers or real-time systems
- Reducing payload sizes without schema overhead
- Migrating gradually away from plain JSON
| Criteria | JSON | MessagePack | Protobuf |
|---|---|---|---|
| Human-readable | ✓ | ✗ | ✗ |
| Schema required | ✗ | ✗ | ✓ |
| Browser native | ✓ | ✗ | ✗ |
| Payload size | Large | Medium | Small |
| Parse speed | Slow | Fast | Fastest |
| Backward compat | Manual | Manual | Built-in |
## Conclusion
Don’t default to JSON out of habit. For internal services, MessagePack offers easy wins with minimal migration effort. For large-scale systems where schema discipline and performance are non-negotiable, Protocol Buffers justify the tooling overhead.
Match the format to your constraints. If you’re building a public API that developers will debug with curl, use JSON. If you’re building a high-throughput event pipeline, the extra setup for protobuf pays dividends. If you’re somewhere in between, MessagePack gives you most of the performance benefits without the schema ceremony.
The best serialization format is the one that fits your actual requirements—not the one you’ve always used.