Serialization: JSON, Protocol Buffers, MessagePack

Key Insights

  • JSON’s human-readability makes it ideal for debugging and REST APIs, but its verbosity costs you 2-10x in payload size compared to binary formats
  • Protocol Buffers enforce schema discipline that catches breaking changes at compile time, making them essential for large-scale microservices architectures
  • MessagePack offers the best migration path from JSON—same data model, payloads typically 20-50% smaller, and no schema files to maintain

Introduction to Serialization

Serialization converts in-memory data structures into a format that can be transmitted over a network or stored on disk. Deserialization reverses the process. Every time you make an API call, write to a cache, or send a message to a queue, serialization is doing the heavy lifting.

The choice of serialization format affects three critical dimensions: human-readability (can you debug it with curl?), performance (how fast can you encode/decode, and how big is the payload?), and schema enforcement (will you catch breaking changes before production?).

Most developers default to JSON because it’s familiar. That’s often the wrong choice. Let’s examine when each format makes sense.

JSON: The Universal Format

JSON won the web. Every programming language has native or near-native JSON support. Browsers parse it natively. You can read it in a text editor. These advantages are real and shouldn’t be dismissed.

import json

# Python serialization
user = {
    "id": 12345,
    "name": "Alice Chen",
    "email": "alice@example.com",
    "roles": ["admin", "developer"],
    "metadata": {
        "created_at": "2024-01-15T10:30:00Z",
        "last_login": "2024-06-20T14:22:00Z"
    }
}

# Serialize to a JSON string, then encode to UTF-8 bytes
json_bytes = json.dumps(user).encode('utf-8')
print(f"JSON size: {len(json_bytes)} bytes")

# Deserialize back
parsed = json.loads(json_bytes.decode('utf-8'))

The equivalent in JavaScript:

// JavaScript serialization
const user = {
  id: 12345,
  name: "Alice Chen",
  email: "alice@example.com",
  roles: ["admin", "developer"],
  metadata: {
    createdAt: "2024-01-15T10:30:00Z",
    lastLogin: "2024-06-20T14:22:00Z"
  }
};

const jsonString = JSON.stringify(user);
const parsed = JSON.parse(jsonString);

JSON’s problems emerge at scale. Field names repeat in every object—transmit a million user records and you’ve sent the string "email" a million times. There’s no schema validation; a typo in a field name silently becomes a new field. Numbers have no defined precision. Dates are just strings with no standard format.
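The key-repetition cost is easy to measure: serialize the same records with full field names, then again with single-letter aliases. (The aliasing here is only to isolate the overhead, not a recommendation.)

```python
import json

# 1,000 records with the same two field names repeated in each one
records = [{"email": f"user{i}@example.com", "verified": True}
           for i in range(1000)]
full = len(json.dumps(records).encode("utf-8"))

# The same data with one-letter keys; only the key strings change
aliased = [{"e": r["email"], "v": r["verified"]} for r in records]
short = len(json.dumps(aliased).encode("utf-8"))

print(f"Full keys: {full} bytes, aliased keys: {short} bytes")
print(f"Pure key overhead: {full - short} bytes")
```

The entire difference is bytes spent repeating the strings "email" and "verified" a thousand times each, which is exactly the overhead a schema-based format avoids by sending field numbers instead of names.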

For public APIs where debuggability matters more than bandwidth, JSON remains the right choice. For internal services processing millions of messages per second, you’re leaving performance on the table.

Protocol Buffers: Schema-First Performance

Protocol Buffers (protobuf) take the opposite approach: define your schema first, then generate code. Google created protobuf to handle their internal RPC traffic, and it shows in the design priorities.

Start with a .proto file:

syntax = "proto3";

package users;

message User {
  int64 id = 1;
  string name = 2;
  string email = 3;
  repeated string roles = 4;
  Metadata metadata = 5;
}

message Metadata {
  string created_at = 1;
  string last_login = 2;
}

Generate Python code with the protobuf compiler:

protoc --python_out=. user.proto

Now use the generated classes:

from user_pb2 import User, Metadata

# Create and populate the message
user = User()
user.id = 12345
user.name = "Alice Chen"
user.email = "alice@example.com"
user.roles.extend(["admin", "developer"])
user.metadata.created_at = "2024-01-15T10:30:00Z"
user.metadata.last_login = "2024-06-20T14:22:00Z"

# Serialize to bytes
proto_bytes = user.SerializeToString()
print(f"Protobuf size: {len(proto_bytes)} bytes")

# Deserialize back
parsed_user = User()
parsed_user.ParseFromString(proto_bytes)
print(f"Parsed name: {parsed_user.name}")

The schema provides several guarantees. Field numbers (the = 1, = 2 assignments) enable backward compatibility—you can add new fields without breaking old clients. Strong typing catches errors at compile time. The binary format encodes field numbers instead of names, dramatically reducing payload size.
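To make the size claim concrete, here is a hand-rolled sketch of the wire encoding (illustration only; real code should always use the generated classes). It encodes the first two fields of the User message above and shows that field names never appear in the output:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint (7 bits per byte)."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_field(number: int, value) -> bytes:
    """Encode one field: a tag byte (field_number << 3 | wire_type), then the value."""
    if isinstance(value, int):            # wire type 0: varint
        return encode_varint(number << 3) + encode_varint(value)
    data = value.encode("utf-8")          # wire type 2: length-delimited
    return encode_varint(number << 3 | 2) + encode_varint(len(data)) + data

# User { id = 12345; name = "Alice Chen" } encoded by hand
msg = encode_field(1, 12345) + encode_field(2, "Alice Chen")
print(len(msg), msg.hex())
```

Fifteen bytes total: one tag byte and a two-byte varint for id, then one tag byte, one length byte, and ten bytes of UTF-8 for name. The JSON encoding of just these two fields is more than twice as large, and the strings "id" and "name" are nowhere in the protobuf output.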

The trade-off is workflow complexity. You need the protobuf compiler in your build pipeline. Schema changes require regenerating code and coordinating deployments. You can’t just curl an endpoint and read the response.

For gRPC services and high-throughput internal APIs, this trade-off pays off. For a simple CRUD REST API, it’s overkill.

MessagePack: Binary JSON

MessagePack occupies the middle ground. It uses JSON’s data model (maps, arrays, strings, numbers, booleans, null) but encodes them in a compact binary format. No schema required, no code generation, just smaller payloads and faster parsing.

import json
import msgpack

user = {
    "id": 12345,
    "name": "Alice Chen",
    "email": "alice@example.com",
    "roles": ["admin", "developer"],
    "metadata": {
        "created_at": "2024-01-15T10:30:00Z",
        "last_login": "2024-06-20T14:22:00Z"
    }
}

# JSON serialization
json_bytes = json.dumps(user).encode('utf-8')
print(f"JSON size: {len(json_bytes)} bytes")

# MessagePack serialization - nearly identical API
msgpack_bytes = msgpack.packb(user)
print(f"MessagePack size: {len(msgpack_bytes)} bytes")

# Deserialization
json_parsed = json.loads(json_bytes)
msgpack_parsed = msgpack.unpackb(msgpack_bytes)

# Results are identical
assert json_parsed == msgpack_parsed

Running this code shows MessagePack at roughly 145 bytes versus JSON’s roughly 190—a reduction of about 25% for this small example. The savings increase with larger payloads and more repetitive data.

MessagePack shines in caching layers (Redis, Memcached), real-time systems where latency matters, and anywhere you’re currently using JSON but want better performance without schema overhead.
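The caching pattern is simple enough to sketch. The wrapper below is hypothetical (an in-memory dict stands in for a Redis or Memcached client), but the pack-on-write, unpack-on-read shape is the same:

```python
import msgpack

class MsgpackCache:
    """Minimal sketch of a cache wrapper that stores values as
    MessagePack bytes. Swap the dict for a real Redis client in practice."""

    def __init__(self):
        self._store = {}  # stand-in for redis.Redis()

    def set(self, key: str, value) -> None:
        # Pack once on write; the cache only ever holds compact bytes
        self._store[key] = msgpack.packb(value)

    def get(self, key: str):
        raw = self._store.get(key)
        return msgpack.unpackb(raw) if raw is not None else None

cache = MsgpackCache()
cache.set("user:12345", {"name": "Alice Chen", "roles": ["admin"]})
print(cache.get("user:12345"))
```

Because the stored bytes are smaller than the JSON equivalent, you also fit more entries in the same cache memory budget.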

Performance Comparison

Theory is nice. Numbers are better. Here’s a benchmark comparing all three formats:

import json
import msgpack
import time
from user_pb2 import User, Metadata

def create_test_data(count):
    """Generate test data for benchmarking."""
    return [
        {
            "id": i,
            "name": f"User {i}",
            "email": f"user{i}@example.com",
            "roles": ["reader", "writer"],
            "metadata": {
                "created_at": "2024-01-15T10:30:00Z",
                "last_login": "2024-06-20T14:22:00Z"
            }
        }
        for i in range(count)
    ]

def benchmark_json(data, iterations=100):
    start = time.perf_counter()
    for _ in range(iterations):
        encoded = json.dumps(data).encode('utf-8')
    encode_time = (time.perf_counter() - start) / iterations
    
    start = time.perf_counter()
    for _ in range(iterations):
        decoded = json.loads(encoded)
    decode_time = (time.perf_counter() - start) / iterations
    
    return len(encoded), encode_time * 1000, decode_time * 1000

def benchmark_msgpack(data, iterations=100):
    start = time.perf_counter()
    for _ in range(iterations):
        encoded = msgpack.packb(data)
    encode_time = (time.perf_counter() - start) / iterations
    
    start = time.perf_counter()
    for _ in range(iterations):
        decoded = msgpack.unpackb(encoded)
    decode_time = (time.perf_counter() - start) / iterations
    
    return len(encoded), encode_time * 1000, decode_time * 1000

def benchmark_protobuf(data, iterations=100):
    # Convert to protobuf messages
    users = []
    for item in data:
        user = User()
        user.id = item["id"]
        user.name = item["name"]
        user.email = item["email"]
        user.roles.extend(item["roles"])
        user.metadata.created_at = item["metadata"]["created_at"]
        user.metadata.last_login = item["metadata"]["last_login"]
        users.append(user)
    
    start = time.perf_counter()
    for _ in range(iterations):
        encoded = [u.SerializeToString() for u in users]
    encode_time = (time.perf_counter() - start) / iterations
    
    # Messages are kept separate: raw protobuf isn't self-delimiting, so a
    # plain concatenation can't be split apart again. In practice you'd wrap
    # them in a repeated field or length-prefix each message.
    start = time.perf_counter()
    for _ in range(iterations):
        for raw in encoded:
            parsed = User()
            parsed.ParseFromString(raw)
    decode_time = (time.perf_counter() - start) / iterations
    
    total_size = sum(len(raw) for raw in encoded)
    return total_size, encode_time * 1000, decode_time * 1000

# Run benchmarks
data = create_test_data(1000)
json_size, json_enc, json_dec = benchmark_json(data)
msgpack_size, msgpack_enc, msgpack_dec = benchmark_msgpack(data)
proto_size, proto_enc, proto_dec = benchmark_protobuf(data)

print(f"{'Format':<12} {'Size (KB)':<12} {'Encode (ms)':<14} {'Decode (ms)':<14}")
print("-" * 52)
print(f"{'JSON':<12} {json_size/1024:<12.1f} {json_enc:<14.2f} {json_dec:<14.2f}")
print(f"{'MessagePack':<12} {msgpack_size/1024:<12.1f} {msgpack_enc:<14.2f} {msgpack_dec:<14.2f}")
print(f"{'Protobuf':<12} {proto_size/1024:<12.1f} {proto_enc:<14.2f} {proto_dec:<14.2f}")

Typical results for 1000 user records:

Format        Size (KB)   Encode (ms)   Decode (ms)
JSON          165         12.5          8.2
MessagePack   108         4.1           3.8
Protobuf      72          2.8           1.9

Protobuf wins on size and speed. MessagePack provides meaningful improvements over JSON with minimal code changes. Your actual numbers will vary based on data shape, language, and library implementations.

Choosing the Right Format

Use this decision framework:

Choose JSON when:

  • Building public REST APIs
  • Debuggability is more important than performance
  • Working with browser clients
  • Team is unfamiliar with alternatives

Choose Protocol Buffers when:

  • Building gRPC services
  • Schema evolution and backward compatibility are critical
  • Processing millions of messages per second
  • Working in a polyglot microservices environment

Choose MessagePack when:

  • Optimizing existing JSON-based systems
  • Building caching layers or real-time systems
  • Want binary efficiency without schema overhead
  • Need a gradual migration path from JSON

Criteria          JSON     MessagePack   Protobuf
Human-readable    Yes      No            No
Schema required   No       No            Yes
Browser native    Yes      No            No
Payload size      Large    Medium        Small
Parse speed       Slow     Fast          Fastest
Backward compat   Manual   Manual       Built-in

Conclusion

Don’t default to JSON out of habit. For internal services, MessagePack offers easy wins with minimal migration effort. For large-scale systems where schema discipline and performance are non-negotiable, Protocol Buffers justify the tooling overhead.

Match the format to your constraints. If you’re building a public API that developers will debug with curl, use JSON. If you’re building a high-throughput event pipeline, the extra setup for protobuf pays dividends. If you’re somewhere in between, MessagePack gives you most of the performance benefits without the schema ceremony.

The best serialization format is the one that fits your actual requirements—not the one you’ve always used.
