Python - Bytes and Bytearray | Application Architect

Key Insights

Bytes are immutable sequences of integers (0-255) ideal for reading data, while bytearray provides mutability for building or modifying binary content in place
Always specify an encoding explicitly when converting between strings and bytes—implicit encoding assumptions cause subtle bugs across different systems
Use memoryview with bytearray for zero-copy slicing when processing large binary data to avoid memory overhead and improve performance

Introduction to Binary Data in Python

Binary data is everywhere in software engineering. Every file on disk, every network packet, every image and audio stream exists as raw bytes. Python’s text strings (str) handle human-readable text with Unicode support, but they’re the wrong tool for binary protocols, file formats, and low-level I/O.

Python provides two core types for binary data: bytes and bytearray. Understanding when and how to use each is essential for working with files, network sockets, cryptography, serialization, and any domain where you’re dealing with raw data rather than text.

The distinction is simple: text is an abstraction for human communication, while bytes are the actual data your computer processes. Conflating the two leads to encoding bugs, corrupted files, and mysterious failures across different systems.

Understanding Bytes Objects

The bytes type represents an immutable sequence of integers, each ranging from 0 to 255. Think of it as a tuple of small integers with specialized methods for binary operations.

You can create bytes objects in several ways:

# Literal syntax for ASCII-compatible data
data = b'hello world'
print(type(data))  # <class 'bytes'>
print(len(data))   # 11

# From a list of integers
raw = bytes([72, 101, 108, 108, 111])
print(raw)  # b'Hello'

# From a string with explicit encoding
text = "café"
encoded = text.encode('utf-8')
print(encoded)  # b'caf\xc3\xa9'
print(list(encoded))  # [99, 97, 102, 195, 169]

# Create zero-filled bytes
zeroes = bytes(10)
print(zeroes)  # b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
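
Hex strings are another common starting point, for example when copying magic numbers out of a file format description. A minimal sketch using the standard `bytes.fromhex()` and `bytes.hex()` methods; the PNG prefix is only a convenient sample value:

```python
# Build bytes from a hex string; spaces between byte pairs are ignored
magic = bytes.fromhex('89 50 4e 47')
print(magic)        # b'\x89PNG'

# Convert back to a hex string
print(magic.hex())  # 89504e47

# Values outside 0-255 are rejected when building from integers
try:
    bytes([256])
except ValueError as exc:
    print(f"Rejected: {exc}")
```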

When you access individual elements of a bytes object, you get integers—not single-byte strings:

data = b'ABC'
print(data[0])      # 65 (integer, not b'A')
print(data[0:1])    # b'A' (slice returns bytes)
print(chr(data[0])) # 'A' (convert to character)

# Iteration yields integers
for byte in data:
    print(f"{byte} -> {chr(byte)}")
# 65 -> A
# 66 -> B
# 67 -> C

This integer-based indexing catches many developers off guard. Remember: indexing returns an int, slicing returns bytes.
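
The rule matters most in comparisons, where mixing the two forms fails silently instead of raising an error. A short sketch of the pitfall and two ways around it (the variable names are only for illustration):

```python
data = b'ABC'

# An indexed element is an int, so comparing it with bytes is always False
wrong = data[0] == b'A'
print(wrong)            # False

# Option 1: compare a one-byte slice with bytes
by_slice = data[:1] == b'A'
print(by_slice)         # True

# Option 2: compare the int with an int
by_int = data[0] == ord('A')
print(by_int)           # True
```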

Working with Bytearray

The bytearray type is the mutable sibling of bytes. Use it when you need to modify binary data in place, build up data incrementally, or work with buffers that receive data over time.

# Create from an existing bytes object
buffer = bytearray(b'hello')
print(buffer)  # bytearray(b'hello')

# Modify in place
buffer[0] = 72  # Change 'h' to 'H'
print(buffer)  # bytearray(b'Hello')

# Append single bytes
buffer.append(33)  # ASCII for '!'
print(buffer)  # bytearray(b'Hello!')

# Extend with multiple bytes
buffer.extend(b' World')
print(buffer)  # bytearray(b'Hello! World')

# Insert at position
buffer.insert(6, 32)  # Insert space at index 6
print(buffer)  # bytearray(b'Hello!  World')

# Delete bytes
del buffer[6]
print(buffer)  # bytearray(b'Hello! World')

Bytearray supports the familiar list-style mutation methods, including append(), extend(), insert(), pop(), remove(), and reverse() (there is no sort()). This makes it ideal for building binary messages incrementally:

def build_packet(command: int, payload: bytes) -> bytes:
    packet = bytearray()
    packet.append(0x02)  # Start byte
    packet.append(command)
    packet.extend(len(payload).to_bytes(2, 'big'))  # Length as 2 bytes
    packet.extend(payload)
    packet.append(0x03)  # End byte
    return bytes(packet)  # Convert to immutable bytes for sending

message = build_packet(0x10, b'sensor_data')
print(message.hex())  # 0210000b73656e736f725f6461746103
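
The receiving side has to undo the same framing. The sketch below is one possible inverse of build_packet; the name parse_packet and its validation rules are assumptions made for illustration, not part of any real protocol:

```python
def parse_packet(packet: bytes) -> tuple[int, bytes]:
    """Hypothetical inverse of build_packet: returns (command, payload)."""
    # Smallest valid frame: start byte, command, 2 length bytes, end byte
    if len(packet) < 5:
        raise ValueError("Packet too short")
    if packet[0] != 0x02 or packet[-1] != 0x03:
        raise ValueError("Missing start or end byte")
    command = packet[1]
    length = int.from_bytes(packet[2:4], 'big')
    payload = packet[4:-1]
    if len(payload) != length:
        raise ValueError("Length field does not match payload size")
    return command, payload

frame = bytes.fromhex('0210000b73656e736f725f6461746103')
command, payload = parse_packet(frame)
print(hex(command), payload)  # 0x10 b'sensor_data'
```

Checking the length field against the actual payload size catches truncated frames before the payload is used.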

Common Operations and Methods

Both bytes and bytearray share many methods with strings, adapted for binary data:

data = b'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>'

# Searching
print(data.find(b'200'))     # 9
print(data.index(b'OK'))     # 13
print(b'html' in data)       # True

# Splitting
lines = data.split(b'\r\n')
print(lines[0])  # b'HTTP/1.1 200 OK'
print(lines[1])  # b'Content-Type: text/html'

# Replacing
modified = data.replace(b'200 OK', b'404 Not Found')
print(modified[:30])  # b'HTTP/1.1 404 Not Found\r\nConten'

# Checking content
print(data.startswith(b'HTTP'))  # True
print(data.endswith(b'>'))       # True
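
These methods combine naturally. As a rough sketch (not a full HTTP parser), partition() and split() can separate a response like the one above into status line, headers, and body; lowercasing the header names is a choice made for this example:

```python
response = b'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>'

# Separate the header block from the body at the first blank line
head, _, body = response.partition(b'\r\n\r\n')
status_line, *header_lines = head.split(b'\r\n')

# Split each "Name: value" line at the first colon
headers = {}
for line in header_lines:
    name, _, value = line.partition(b':')
    headers[name.strip().lower()] = value.strip()

print(status_line)               # b'HTTP/1.1 200 OK'
print(headers[b'content-type'])  # b'text/html'
print(body)                      # b'<html>'
```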

Here’s a practical example parsing a binary protocol header:

def parse_header(data: bytes) -> dict:
    """Parse a simple binary protocol header.
    
    Format: [version:1][flags:1][length:2][sequence:4]
    """
    if len(data) < 8:
        raise ValueError("Header too short")
    
    return {
        'version': data[0],
        'flags': data[1],
        'length': int.from_bytes(data[2:4], 'big'),
        'sequence': int.from_bytes(data[4:8], 'big'),
        'payload': data[8:]
    }

# Example packet
packet = bytes([
    0x01,              # version 1
    0x03,              # flags: 0b00000011
    0x00, 0x0C,        # length: 12
    0x00, 0x00, 0x01, 0x5E,  # sequence: 350
]) + b'Hello World!'

header = parse_header(packet)
print(f"Version: {header['version']}")   # Version: 1
print(f"Length: {header['length']}")     # Length: 12
print(f"Sequence: {header['sequence']}") # Sequence: 350
print(f"Payload: {header['payload']}")   # Payload: b'Hello World!'
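
The flags byte is usually interpreted bit by bit. The header format above does not define what each bit means, so the flag names below are purely hypothetical; the sketch only shows the mechanics of testing bits with enum.IntFlag or a plain mask:

```python
from enum import IntFlag

class HeaderFlags(IntFlag):
    # Hypothetical flag names, invented for this example
    COMPRESSED = 0x01
    ENCRYPTED = 0x02
    FRAGMENTED = 0x04

flags = HeaderFlags(0x03)               # flags byte from the example packet
print(HeaderFlags.COMPRESSED in flags)  # True
print(HeaderFlags.ENCRYPTED in flags)   # True
print(HeaderFlags.FRAGMENTED in flags)  # False

# The same test with a plain bitmask
print(bool(0x03 & 0x04))                # False
```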

Encoding and Decoding

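Converting between strings and bytes always goes through an encoding. str.encode() and bytes.decode() default to UTF-8, but other APIs (such as open() in text mode) may fall back to a platform-dependent default, so spell the encoding out rather than relying on it: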

# Encoding: str -> bytes
text = "Hello, 世界!"
utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16')

print(f"UTF-8:  {utf8_bytes}")   # b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
print(f"UTF-16: {utf16_bytes}")  # b'\xff\xfeH\x00e\x00...' (BOM first; little-endian on most platforms)

# Decoding: bytes -> str
decoded = utf8_bytes.decode('utf-8')
print(decoded)  # Hello, 世界!

Encoding errors are common when processing data from external sources. Handle them explicitly:

# Malformed UTF-8 data
bad_data = b'Hello \xff\xfe World'

# Default: raises exception
try:
    bad_data.decode('utf-8')
except UnicodeDecodeError as e:
    print(f"Error: {e}")

# Replace invalid bytes with replacement character
safe = bad_data.decode('utf-8', errors='replace')
print(safe)  # Hello �� World

# Ignore invalid bytes
ignored = bad_data.decode('utf-8', errors='ignore')
print(ignored)  # Hello  World

# Use surrogateescape for round-trip safety
escaped = bad_data.decode('utf-8', errors='surrogateescape')
restored = escaped.encode('utf-8', errors='surrogateescape')
print(restored == bad_data)  # True
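
A related failure appears when valid UTF-8 arrives in chunks, as it does from sockets or fixed-size file reads: a multi-byte character can be split across two chunks. One way to handle this, sketched here with the standard codecs module, is an incremental decoder; the chunk boundary is chosen deliberately to split a character:

```python
import codecs

data = "café".encode('utf-8')   # b'caf\xc3\xa9'
chunks = [data[:4], data[4:]]   # the split lands inside the 2-byte 'é'

# Decoding a chunk on its own fails on the truncated sequence
try:
    chunks[0].decode('utf-8')
except UnicodeDecodeError:
    print("chunk boundary split a character")

# An incremental decoder buffers the partial sequence until it completes
decoder = codecs.getincrementaldecoder('utf-8')()
text = ''.join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b'', final=True)  # flush; raises if data is still incomplete
print(text)  # café
```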

Practical Use Cases

Binary file I/O is the most common use case. Always open files in binary mode ('rb' or 'wb') when working with non-text data:

# Reading a binary file
with open('image.png', 'rb') as f:
    header = f.read(8)
    # PNG files start with these magic bytes
    if header == b'\x89PNG\r\n\x1a\n':
        print("Valid PNG file")

# Writing binary data
with open('output.bin', 'wb') as f:
    f.write(b'\x00\x01\x02\x03')
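
For files too large to read in one call, a common pattern is to read fixed-size chunks into a reusable bytearray with readinto(). The sketch below assumes a simple modulo-256 checksum as the per-chunk work; checksum_file and the temporary demo file are illustrative only:

```python
import os
import tempfile

def checksum_file(path: str, chunk_size: int = 64 * 1024) -> int:
    """Sum all bytes modulo 256, reusing one buffer for every read."""
    buffer = bytearray(chunk_size)
    view = memoryview(buffer)
    checksum = 0
    with open(path, 'rb') as f:
        while True:
            n = f.readinto(buffer)  # fills buffer in place, returns bytes read
            if not n:
                break
            # Only the first n bytes of the buffer hold fresh data
            checksum = (checksum + sum(view[:n])) % 256
    return checksum

# Demo on a throwaway file
payload = bytes(range(256)) * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    result = checksum_file(tmp.name)
    print(result == sum(payload) % 256)  # True
finally:
    os.remove(tmp.name)
```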

The struct module packs and unpacks binary data according to format strings:

import struct

# Pack Python values into bytes
# Format '>I f 10s': big-endian unsigned int, 32-bit float, 10-byte string (null-padded)
packed = struct.pack('>I f 10s', 42, 3.14, b'hello')
print(packed.hex())  # 0000002a4048f5c368656c6c6f0000000000

# Unpack bytes back to Python values
values = struct.unpack('>I f 10s', packed)
print(values)  # (42, 3.140000104904175, b'hello\x00\x00\x00\x00\x00')

# Network byte order example: IP header fields
def parse_ip_header(data: bytes) -> dict:
    fields = struct.unpack('!BBHHHBBH4s4s', data[:20])
    return {
        'version': fields[0] >> 4,
        'ihl': fields[0] & 0x0F,
        'total_length': fields[2],
        'ttl': fields[5],
        'protocol': fields[6],
        'src_ip': '.'.join(map(str, fields[8])),
        'dst_ip': '.'.join(map(str, fields[9]))
    }
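
When the same layout is packed or unpacked many times, a precompiled struct.Struct avoids re-parsing the format string, and pack_into()/unpack_from() work directly on a buffer at a given offset. The record layout below is a made-up example, not a real format:

```python
import struct

# Hypothetical record: big-endian uint16 id, uint32 timestamp, uint8 status
RECORD = struct.Struct('>HIB')
print(RECORD.size)  # 7

# Write two records into one preallocated buffer
buffer = bytearray(RECORD.size * 2)
RECORD.pack_into(buffer, 0, 1, 1_700_000_000, 0)
RECORD.pack_into(buffer, RECORD.size, 2, 1_700_000_060, 1)

# Read them back by offset, without slicing the buffer
records = [
    RECORD.unpack_from(buffer, offset)
    for offset in range(0, len(buffer), RECORD.size)
]
print(records)  # [(1, 1700000000, 0), (2, 1700000060, 1)]
```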

Performance Considerations

Immutability has costs. Concatenating bytes objects creates new objects each time, which becomes expensive in loops:

import time

def concat_bytes(n: int) -> bytes:
    result = b''
    for i in range(n):
        result += bytes([i % 256])
    return result

def append_bytearray(n: int) -> bytes:
    result = bytearray()
    for i in range(n):
        result.append(i % 256)
    return bytes(result)

# Benchmark
n = 100_000

start = time.perf_counter()
concat_bytes(n)
print(f"bytes concat: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
append_bytearray(n)
print(f"bytearray append: {time.perf_counter() - start:.3f}s")

# Example output from one run (absolute timings vary by machine and Python version):
# bytes concat: 0.847s
# bytearray append: 0.012s
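
When the pieces already exist as separate bytes objects, b''.join() is another way to avoid repeated copying: it sizes the result once and copies each piece a single time. A small sketch, with join_chunks as an illustrative name:

```python
def join_chunks(n: int) -> bytes:
    # Collect the pieces in a list, then copy everything once at the end
    parts = [bytes([i % 256]) for i in range(n)]
    return b''.join(parts)

result = join_chunks(1000)
print(len(result))  # 1000
print(result[:4])   # b'\x00\x01\x02\x03'
```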

For large data processing, memoryview provides zero-copy slicing:

# Without memoryview: each slice copies data
data = bytes(10_000_000)  # 10 MB

def process_chunks_copy(data: bytes, chunk_size: int):
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i+chunk_size]  # Creates new bytes object
        _ = sum(chunk)

# With memoryview: no copying
def process_chunks_view(data: bytes, chunk_size: int):
    view = memoryview(data)
    for i in range(0, len(data), chunk_size):
        chunk = view[i:i+chunk_size]  # Zero-copy slice
        _ = sum(chunk)

# The memoryview version creates only a small view object per chunk
# instead of allocating and copying chunk_size bytes each time
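
A memoryview over a bytearray is also writable, so a slice of the view can update the underlying buffer in place. A minimal sketch contrasting that with a plain slice, which is an independent copy:

```python
buffer = bytearray(b'0123456789')
view = memoryview(buffer)

# Slicing a memoryview gives another view onto the same memory
window = view[2:6]
window[:] = b'ABCD'   # same-length assignment writes through
print(buffer)         # bytearray(b'01ABCD6789')

# A plain slice of the bytearray is an independent copy
copy = buffer[2:6]
copy[:] = b'wxyz'
print(buffer)         # bytearray(b'01ABCD6789')

# Release the views before resizing the underlying bytearray
window.release()
view.release()
buffer.extend(b'!')
print(buffer)         # bytearray(b'01ABCD6789!')
```

While any view is still active, resizing the bytearray raises BufferError, which is why the views are released before extend().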

Use bytes for data you receive and pass around unchanged. Use bytearray when building or modifying data. Use memoryview when processing large buffers without needing copies. These three types together handle virtually every binary data scenario in Python efficiently.