NumPy: Structured Arrays Guide

Key Insights

Structured arrays let you store heterogeneous data (strings, integers, floats) in a single NumPy array with named fields, combining the memory efficiency of arrays with the convenience of labeled columns.
They’re ideal for memory-constrained environments, binary file I/O, and interfacing with C libraries—situations where pandas adds unnecessary overhead.
Understanding dtype definitions and memory layout is crucial; poorly sized string fields and alignment issues are the most common sources of bugs and wasted memory.

Introduction to Structured Arrays

NumPy’s structured arrays solve a fundamental limitation of regular arrays, which can hold only one data type. When you need to store records with mixed types, such as employee data with names, ages, and salaries, you’d typically reach for a pandas DataFrame. But DataFrames come with overhead that isn’t always justified.

Structured arrays occupy the middle ground. They give you named fields and heterogeneous data types while maintaining NumPy’s memory efficiency and C-level performance. Think of them as a lightweight, array-based alternative to DataFrames.

import numpy as np

# Regular arrays: one type only
ages = np.array([25, 30, 35])
salaries = np.array([50000.0, 65000.0, 80000.0])
# Names? You'd need a separate array or object dtype (slow)

# Structured array: mixed types, single array
employees = np.array([
    ('Alice', 25, 50000.0),
    ('Bob', 30, 65000.0),
    ('Carol', 35, 80000.0)
], dtype=[('name', 'U10'), ('age', 'i4'), ('salary', 'f8')])

print(employees['name'])   # ['Alice' 'Bob' 'Carol']
print(employees['salary']) # [50000. 65000. 80000.]
print(employees[0])        # ('Alice', 25, 50000.)

Use structured arrays when you need memory efficiency over convenience, when interfacing with binary file formats, or when passing data to C extensions. Use pandas when you need rich data manipulation, groupby operations, or time series functionality.

Defining Data Types (dtypes)

The dtype definition is where structured arrays either click or confuse. NumPy offers multiple syntaxes for the same result—pick one and stick with it.

# Method 1: List of tuples (most common, recommended)
dtype1 = [('name', 'U20'), ('age', 'i4'), ('salary', 'f8')]

# Method 2: Dictionary with 'names' and 'formats'
dtype2 = {'names': ['name', 'age', 'salary'],
          'formats': ['U20', 'i4', 'f8']}

# Method 3: Comma-separated string (quick but less readable)
dtype3 = 'U20, i4, f8'  # Fields named f0, f1, f2

# Method 4: np.dtype with explicit field definitions
# (np.str_ replaces np.unicode_, which was removed in NumPy 2.0)
dtype4 = np.dtype([('name', np.str_, 20),
                   ('age', np.int32),
                   ('salary', np.float64)])

# All create equivalent structures (except method 3's field names)
arr1 = np.zeros(3, dtype=dtype1)
arr2 = np.zeros(3, dtype=dtype2)
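
The equivalence claimed above can be checked directly. The following is a minimal sketch; the variable names (dtype_list, dtype_dict, dtype_str, same_layout) are illustrative and not part of the original examples.

```python
import numpy as np

# The list-of-tuples and dictionary forms from the text
dtype_list = np.dtype([('name', 'U20'), ('age', 'i4'), ('salary', 'f8')])
dtype_dict = np.dtype({'names': ['name', 'age', 'salary'],
                       'formats': ['U20', 'i4', 'f8']})
# The comma-separated form gets auto-generated field names
dtype_str = np.dtype('U20, i4, f8')

same_layout = dtype_list == dtype_dict
print(same_layout)          # True
print(dtype_list.names)     # ('name', 'age', 'salary')
print(dtype_str.names)      # ('f0', 'f1', 'f2')
print(dtype_list.itemsize)  # 92 = 20*4 + 4 + 8
```

Comparing dtype objects rather than arrays keeps the check independent of any data.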

Common type codes you’ll use:

Code	Type	NumPy equivalent / notes
`'i4'`	32-bit integer	`np.int32`
`'i8'`	64-bit integer	`np.int64`
`'f4'`	32-bit float	`np.float32`
`'f8'`	64-bit float	`np.float64`
`'U10'`	Unicode string (10 chars)	Fixed-width string
`'S10'`	Byte string (10 bytes)	ASCII only
`'?'`	Boolean	`np.bool_`

The U prefix creates Unicode strings; the number specifies maximum characters. This is a fixed allocation—'U10' always uses 40 bytes (4 bytes per character) regardless of actual string length.
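
A short sketch of what fixed allocation means in practice, using only dtype.itemsize and nbytes. The variable names are illustrative.

```python
import numpy as np

# Fixed-width storage: size depends on the declared width, not the content
u10 = np.dtype('U10')
s10 = np.dtype('S10')
print(u10.itemsize)  # 40 (10 characters * 4 bytes)
print(s10.itemsize)  # 10 (10 bytes)

short = np.array(['hi'], dtype='U10')
full = np.array(['abcdefghij'], dtype='U10')
print(short.nbytes, full.nbytes)  # 40 40
```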

Creating Structured Arrays

Once you have a dtype, creating arrays is straightforward. Choose your method based on your data source.

# Define our employee dtype
emp_dtype = [('name', 'U15'), ('age', 'i4'), ('salary', 'f8'), ('active', '?')]

# From scratch: preallocate with zeros
employees = np.zeros(5, dtype=emp_dtype)
employees[0] = ('Alice', 28, 72000.0, True)
employees[1] = ('Bob', 34, 85000.0, True)

# From list of tuples: most common approach
employees = np.array([
    ('Alice', 28, 72000.0, True),
    ('Bob', 34, 85000.0, True),
    ('Carol', 45, 120000.0, True),
    ('David', 29, 68000.0, False),
    ('Eve', 31, 91000.0, True)
], dtype=emp_dtype)

# From separate arrays: use np.rec.fromarrays
names = np.array(['Alice', 'Bob', 'Carol'])
ages = np.array([28, 34, 45])
salaries = np.array([72000.0, 85000.0, 120000.0])
active = np.array([True, True, True])

employees = np.rec.fromarrays(
    [names, ages, salaries, active],
    dtype=emp_dtype
)
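
If a plain ndarray is preferred over a record array, one alternative is to preallocate and fill each field in turn. This is a sketch under that assumption; the columns dictionary and the staff name are hypothetical.

```python
import numpy as np

emp_dtype = [('name', 'U15'), ('age', 'i4'), ('salary', 'f8'), ('active', '?')]
columns = {
    'name': np.array(['Alice', 'Bob', 'Carol']),
    'age': np.array([28, 34, 45]),
    'salary': np.array([72000.0, 85000.0, 120000.0]),
    'active': np.array([True, True, True]),
}

# Preallocate, then copy each column into its field
staff = np.zeros(3, dtype=emp_dtype)
for field, values in columns.items():
    staff[field] = values

print(type(staff).__name__)  # ndarray
print(staff['name'])         # ['Alice' 'Bob' 'Carol']
```

Each assignment casts the source column to the field's declared type, so the ages end up as 32-bit integers regardless of the input array's integer width.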

The np.recarray variant provides attribute-style access, which some find more readable:

# Record array: access fields as attributes
employees = np.rec.array([
    ('Alice', 28, 72000.0, True),
    ('Bob', 34, 85000.0, True),
], dtype=emp_dtype)

print(employees.name)    # ['Alice' 'Bob'] - attribute access
print(employees['name']) # ['Alice' 'Bob'] - still works
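
An existing structured array can also be reinterpreted as a record array with view, which avoids rebuilding it. A minimal sketch with illustrative names (plain, rec):

```python
import numpy as np

emp_dtype = [('name', 'U15'), ('age', 'i4'), ('salary', 'f8'), ('active', '?')]
plain = np.array([
    ('Alice', 28, 72000.0, True),
    ('Bob', 34, 85000.0, True),
], dtype=emp_dtype)

# Reinterpret the same buffer as a record array (no data is copied)
rec = plain.view(np.recarray)
print(rec.age)  # [28 34]

# Writes through the recarray are visible in the original array
rec.age[0] = 29
print(plain['age'])                  # [29 34]
print(np.shares_memory(plain, rec))  # True
```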

Accessing and Modifying Data

Structured arrays support both field-based (column) and index-based (row) access. This dual nature makes them flexible for different operations.

employees = np.array([
    ('Alice', 28, 72000.0, True),
    ('Bob', 34, 85000.0, True),
    ('Carol', 45, 120000.0, True),
    ('David', 29, 68000.0, False),
    ('Eve', 31, 91000.0, True)
], dtype=[('name', 'U15'), ('age', 'i4'), ('salary', 'f8'), ('active', '?')])

# Field access (returns a view, not a copy)
all_salaries = employees['salary']
print(all_salaries)  # [72000. 85000. 120000. 68000. 91000.]

# Row access
first_employee = employees[0]
print(first_employee)  # ('Alice', 28, 72000., True)

# Slicing works as expected
senior_staff = employees[2:5]

# Boolean masking: find high earners
high_earners = employees[employees['salary'] > 80000]
print(high_earners['name'])  # ['Bob' 'Carol' 'Eve']

# Combined conditions
active_high_earners = employees[
    (employees['salary'] > 80000) & employees['active']  # 'active' is already boolean
]

# Modify a field for all records
employees['salary'] *= 1.05  # 5% raise for everyone

# Modify specific records
employees['salary'][employees['name'] == 'David'] = 75000.0

# Update entire record
employees[0] = ('Alice Smith', 29, 78000.0, True)

One critical detail: field access returns a view, not a copy. Modifications affect the original array:

salaries = employees['salary']
salaries[0] = 999999.0
print(employees[0]['salary'])  # 999999.0 - original modified!
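
To keep the original untouched, take an explicit copy. The sketch below contrasts a field view, a field copy, and a boolean-masked selection (which NumPy returns as a copy); the variable names are illustrative.

```python
import numpy as np

employees = np.array([
    ('Alice', 28, 72000.0),
    ('Bob', 34, 85000.0),
], dtype=[('name', 'U15'), ('age', 'i4'), ('salary', 'f8')])

salary_view = employees['salary']          # shares memory with employees
salary_copy = employees['salary'].copy()   # independent buffer
masked = employees[employees['age'] > 30]  # boolean masking also copies

# Neither write below reaches the original array
salary_copy[0] = 0.0
masked['salary'] = 1.0

print(np.shares_memory(employees, salary_view))  # True
print(np.shares_memory(employees, salary_copy))  # False
print(employees['salary'])                       # [72000. 85000.]
```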

Advanced Features

Structured arrays support nesting, letting you create complex hierarchical data structures.

# Nested dtype: person with embedded address
address_dtype = [('street', 'U30'), ('city', 'U20'), ('zip', 'U10')]
person_dtype = [
    ('name', 'U20'),
    ('age', 'i4'),
    ('address', address_dtype)  # Nested structure
]

people = np.zeros(2, dtype=person_dtype)
people[0] = ('Alice', 28, ('123 Main St', 'Boston', '02101'))
people[1] = ('Bob', 34, ('456 Oak Ave', 'Seattle', '98101'))

# Access nested fields
print(people['address']['city'])  # ['Boston' 'Seattle']
print(people[0]['address']['street'])  # '123 Main St'
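
Nested fields are stored inline and chained field access yields views, so writes through the nested path reach the original array. A sketch reusing the dtypes above; the replacement city and the inner variable are just for illustration.

```python
import numpy as np

address_dtype = [('street', 'U30'), ('city', 'U20'), ('zip', 'U10')]
person_dtype = [('name', 'U20'), ('age', 'i4'), ('address', address_dtype)]

people = np.zeros(2, dtype=person_dtype)
people[0] = ('Alice', 28, ('123 Main St', 'Boston', '02101'))
people[1] = ('Bob', 34, ('456 Oak Ave', 'Seattle', '98101'))

# Chained field access yields views, so nested writes reach the original
people['address']['city'][0] = 'Cambridge'
print(people['address']['city'])  # ['Cambridge' 'Seattle']

# The nested record is stored inline: its size is part of each item
inner = np.dtype(address_dtype).itemsize
print(inner)                  # 240 = (30 + 20 + 10) * 4
print(people.dtype.itemsize)  # 324 = 20*4 + 4 + 240
```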

Memory alignment matters for performance and C interoperability. By default, NumPy packs fields back to back with no padding; pass align=True to pad fields the way a C compiler would:

# Default layout is packed (no padding)
dt_packed = np.dtype([('x', 'i1'), ('y', 'f8')])  # 1-byte int, 8-byte float
print(dt_packed.itemsize)  # 9 bytes

# Request C-struct-style alignment (padding inserted before 'y')
dt_aligned = np.dtype([('x', 'i1'), ('y', 'f8')], align=True)
print(dt_aligned.itemsize)  # 16 bytes on typical 64-bit platforms
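
To see where each field actually lands, inspect dtype.fields, which maps every name to its dtype and byte offset. The sketch below assumes a typical 64-bit platform where an 8-byte float has 8-byte alignment; the names packed and aligned are illustrative.

```python
import numpy as np

fields = [('x', 'i1'), ('y', 'f8')]
packed = np.dtype(fields)               # align defaults to False
aligned = np.dtype(fields, align=True)  # C-struct-style padding

# dtype.fields maps each name to (dtype, byte offset)
print(packed.fields['y'][1], packed.itemsize)    # 1 9
print(aligned.fields['y'][1], aligned.itemsize)  # 8 16 (platform-dependent)
print(aligned.isalignedstruct)                   # True
```

When matching a C struct, comparing these offsets against offsetof values from the C side is a quick sanity check.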

File I/O and Interoperability

Structured arrays excel at binary file operations—reading and writing is fast and preserves exact memory layout.

# Create sample data
sensor_dtype = [('timestamp', 'f8'), ('temperature', 'f4'), ('humidity', 'f4')]
readings = np.array([
    (1609459200.0, 22.5, 45.0),
    (1609459260.0, 22.7, 44.8),
    (1609459320.0, 22.6, 45.2),
], dtype=sensor_dtype)

# Save to binary file
readings.tofile('sensor_data.bin')

# Load from binary file (must specify dtype)
loaded = np.fromfile('sensor_data.bin', dtype=sensor_dtype)
print(np.array_equal(readings, loaded))  # True

# CSV import with genfromtxt
# Assuming CSV: name,age,salary
data = np.genfromtxt(
    'employees.csv',
    delimiter=',',
    dtype=[('name', 'U20'), ('age', 'i4'), ('salary', 'f8')],
    skip_header=1
)
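
When the reader cannot be trusted to know the dtype, NumPy's own .npy format is an alternative, since its header records dtype and shape. The sketch below compares the two approaches in a temporary directory; the file names are hypothetical.

```python
import os
import tempfile

import numpy as np

sensor_dtype = [('timestamp', 'f8'), ('temperature', 'f4'), ('humidity', 'f4')]
readings = np.array([
    (1609459200.0, 22.5, 45.0),
    (1609459260.0, 22.7, 44.8),
], dtype=sensor_dtype)

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, 'readings.bin')
    npy_path = os.path.join(tmp, 'readings.npy')

    # Raw dump: bytes only, so the reader must supply the dtype
    readings.tofile(raw_path)
    raw_size = os.path.getsize(raw_path)
    from_raw = np.fromfile(raw_path, dtype=sensor_dtype)

    # .npy: the header records dtype and shape, so none is passed on load
    np.save(npy_path, readings)
    from_npy = np.load(npy_path)

print(raw_size)                      # 32 = 2 records * 16 bytes
print(from_npy.dtype.names)          # ('timestamp', 'temperature', 'humidity')
print((from_raw == from_npy).all())  # True
```

The raw file is the better fit when another program defines the record layout; .npy is more convenient when both ends are NumPy.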

Converting to and from pandas is straightforward:

import pandas as pd

# Structured array to DataFrame
df = pd.DataFrame(employees)

# DataFrame to structured array
records = df.to_records(index=False)
# Note: to_records returns a recarray; string columns come back as object dtype, not fixed-width 'U'
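
If the fixed-width layout matters after a round trip through pandas, one option is to copy the columns back into a preallocated array. This is a sketch, not the only approach; the restored name is illustrative, and the exact dtype that to_records reports may vary with the pandas version.

```python
import numpy as np
import pandas as pd

emp_dtype = [('name', 'U15'), ('age', 'i4'), ('salary', 'f8')]
employees = np.array([
    ('Alice', 28, 72000.0),
    ('Bob', 34, 85000.0),
], dtype=emp_dtype)

df = pd.DataFrame(employees)
records = df.to_records(index=False)
print(records.dtype)  # inspect: the name field is usually no longer 'U15'

# Rebuild the fixed-width layout by copying column by column
restored = np.zeros(len(df), dtype=emp_dtype)
for field in restored.dtype.names:
    restored[field] = df[field].to_numpy()

print(restored.dtype == np.dtype(emp_dtype))  # True
print(restored['name'])                       # ['Alice' 'Bob']
```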

Best Practices and When to Use

Choose structured arrays when:

Memory is constrained and you need compact storage
You’re doing binary file I/O with fixed-format records
You’re interfacing with C code or external libraries
Your data manipulation needs are simple (filtering, field access)

Choose pandas when:

You need groupby, merge, pivot, or time series operations
You’re doing exploratory data analysis
You need flexible string handling
Memory isn’t a primary concern

# Memory comparison
n = 100000
emp_dtype = [('name', 'U15'), ('age', 'i4'), ('salary', 'f8')]

# Structured array
struct_arr = np.zeros(n, dtype=emp_dtype)
struct_size = struct_arr.nbytes

# Pandas DataFrame
df = pd.DataFrame({
    'name': [''] * n,
    'age': np.zeros(n, dtype='i4'),
    'salary': np.zeros(n, dtype='f8')
})
df_size = df.memory_usage(deep=True).sum()

print(f"Structured array: {struct_size / 1024:.1f} KB")
print(f"DataFrame: {df_size / 1024:.1f} KB")
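# Structured array: about 7031 KB (72 bytes per record x 100,000 records).
# The DataFrame figure depends on the pandas version and how strings are stored.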
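
The structured array's footprint can be worked out from the dtype alone, since nbytes is simply itemsize times the number of records. The sketch below also shows how much of that is the string field; the ascii_dtype variant is an illustrative alternative that assumes ASCII-only names.

```python
import numpy as np

n = 100_000
emp_dtype = np.dtype([('name', 'U15'), ('age', 'i4'), ('salary', 'f8')])

# Per-record size: 15 chars * 4 bytes + 4 + 8
print(emp_dtype.itemsize)  # 72
struct_arr = np.zeros(n, dtype=emp_dtype)
print(struct_arr.nbytes)   # 7200000 bytes, about 7031 KB

# Same fields with an ASCII byte-string name: 15 + 4 + 8
ascii_dtype = np.dtype([('name', 'S15'), ('age', 'i4'), ('salary', 'f8')])
print(ascii_dtype.itemsize)  # 27
```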

Common pitfalls to avoid:

Undersized string fields: 'U5' truncates “Alexander” to “Alexa” silently
Forgetting dtype on load: np.fromfile() defaults to float64, so the bytes are silently misinterpreted
Assuming copies: Single-field access returns views; use .copy() when needed
Ignoring alignment: NumPy’s default packed layout may not match a padded C struct, and unaligned fields can be slower on some architectures
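
The truncation pitfall is easy to reproduce, and one way to sidestep it is to derive the field width from the data before allocating. A minimal sketch; the names list and the width variable are illustrative.

```python
import numpy as np

# Pitfall: an undersized field truncates without any warning
names = ['Alexander', 'Bo']
too_small = np.array(names, dtype='U5')
print(too_small)  # ['Alexa' 'Bo']

# One way to avoid it: derive the width from the data before allocating
width = max(len(s) for s in names)
sized = np.array(names, dtype=f'U{width}')
print(sized.dtype.itemsize)  # 36 = 9 characters * 4 bytes
print(sized[0])              # Alexander
```

Sizing from the data only works when all values are known up front; for streaming input, a generous fixed width with an explicit length check is the safer choice.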

Structured arrays aren’t glamorous, but they’re a powerful tool when you need the performance characteristics of arrays with the organizational benefits of named fields. Master the dtype syntax, understand the memory implications, and you’ll have another effective tool in your NumPy arsenal.