NumPy - Structured Arrays (Record Arrays)

Key Insights

• Structured arrays store heterogeneous data types in a single NumPy array, similar to database tables or DataFrames, while keeping NumPy’s compact memory layout and vectorized operations
• Fields are accessed with dictionary-style syntax such as arr['field'], which makes structured arrays convenient for tabular data without pandas overhead
• Record arrays (np.recarray) are an ndarray subclass that adds attribute-style (dot) access to fields, at a slight performance cost for the convenience

Understanding Structured Arrays

Structured arrays in NumPy let you define custom data types with named fields, each having its own data type. This is particularly useful when working with heterogeneous data that would traditionally require separate arrays or a full DataFrame library.

import numpy as np

# Define a structured array type
dt = np.dtype([('name', 'U20'), ('age', 'i4'), ('salary', 'f8')])

# Create an array with this structure
employees = np.array([
    ('Alice Johnson', 32, 75000.50),
    ('Bob Smith', 28, 62000.00),
    ('Carol White', 45, 95000.75)
], dtype=dt)

print(employees)
# [('Alice Johnson', 32, 75000.5) ('Bob Smith', 28, 62000.)
#  ('Carol White', 45, 95000.75)]

print(employees['name'])
# ['Alice Johnson' 'Bob Smith' 'Carol White']

print(employees['salary'])
# [75000.5 62000.  95000.75]

The dtype specification is a list of tuples, where each tuple defines one field as (field_name, data_type). Common data type codes include 'i4' (32-bit integer), 'f8' (64-bit float), and 'U20' (Unicode string of up to 20 characters).
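
To check how NumPy laid out a structured dtype, the dtype object itself can be inspected. The sketch below rebuilds the dt from the example above and reads its names, fields, and itemsize attributes. The byte count in the comment assumes the default packed (unaligned) layout.

```python
import numpy as np

dt = np.dtype([('name', 'U20'), ('age', 'i4'), ('salary', 'f8')])

# Field names, in declaration order
print(dt.names)
# ('name', 'age', 'salary')

# Each field maps to (dtype, byte offset) within one record
for field_name in dt.names:
    field_dtype, offset = dt.fields[field_name][:2]
    print(field_name, field_dtype, offset)

# Bytes per record: 20 characters * 4 bytes (UCS-4) + 4 + 8
print(dt.itemsize)
# 92
```

Strings are the dominant cost here: a 'U20' field always reserves 80 bytes per record, regardless of how long the stored value is.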

Creating Structured Arrays

NumPy provides multiple ways to create structured arrays depending on your data source and requirements.

# Method 1: Using zeros with a structured dtype
dt = np.dtype([('x', 'f4'), ('y', 'f4'), ('label', 'U10')])
points = np.zeros(5, dtype=dt)
points['x'] = [1.0, 2.5, 3.2, 4.8, 5.1]
points['y'] = [2.3, 1.8, 4.5, 3.2, 2.9]
points['label'] = ['A', 'B', 'C', 'D', 'E']

# Method 2: From existing arrays
names = np.array(['John', 'Jane', 'Joe'])
ages = np.array([25, 30, 35])
scores = np.array([88.5, 92.3, 85.7])

# Using np.rec.fromarrays (creates a record array)
students = np.rec.fromarrays(
    [names, ages, scores],
    names='name,age,score'
)

# Method 3: Using dictionary-style dtype
dt = {
    'names': ['product', 'quantity', 'price'],
    'formats': ['U30', 'i4', 'f8']
}
inventory = np.array([
    ('Laptop', 15, 899.99),
    ('Mouse', 50, 24.99),
    ('Keyboard', 30, 79.99)
], dtype=dt)

print(inventory['product'])
# ['Laptop' 'Mouse' 'Keyboard']
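
Two further construction routes sit alongside the methods above: a comma-separated format string, for which NumPy assigns the default field names f0, f1, and so on, and building from a list of dictionaries by converting each one to a tuple first. The rows data and the to_structured helper below are hypothetical and only illustrate the pattern.

```python
import numpy as np

# A comma-separated format string creates fields named f0, f1, f2
quick = np.zeros(2, dtype='U10, i4, f8')
print(quick.dtype.names)
# ('f0', 'f1', 'f2')


def to_structured(rows, dtype):
    """Hypothetical helper: convert a list of dicts to a structured array."""
    dtype = np.dtype(dtype)
    # Structured arrays are filled from tuples, one per record,
    # with values in the same order as the dtype's fields
    records = [tuple(row[name] for name in dtype.names) for row in rows]
    return np.array(records, dtype=dtype)


rows = [
    {'product': 'Laptop', 'quantity': 15, 'price': 899.99},
    {'product': 'Mouse', 'quantity': 50, 'price': 24.99},
]
dt = np.dtype([('product', 'U30'), ('quantity', 'i4'), ('price', 'f8')])
inventory = to_structured(rows, dt)
print(inventory['quantity'])
# [15 50]
```

The tuple conversion matters: NumPy treats a tuple as one record, whereas a nested list is interpreted as an extra array dimension.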

Record Arrays vs Structured Arrays

Record arrays (np.recarray) are a specialized ndarray subclass that enables attribute-style access to fields, making code more readable at a small performance cost.

# Create a record array
dt = np.dtype([('id', 'i4'), ('temperature', 'f8'), ('humidity', 'f8')])
sensors = np.rec.array([
    (1, 22.5, 65.2),
    (2, 23.1, 62.8),
    (3, 21.9, 68.5)
], dtype=dt)

# Attribute-style access (only works with record arrays)
print(sensors.temperature)
# [22.5 23.1 21.9]

# Still works with bracket notation
print(sensors['humidity'])
# [65.2 62.8 68.5]

# Convert structured array to record array
regular_array = np.array([(1, 2.5), (2, 3.7)], dtype=[('a', 'i4'), ('b', 'f8')])
rec_array = regular_array.view(np.recarray)
print(rec_array.a)
# [1 2]
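
Two details about record arrays are easy to miss: a view shares memory with the original array, and attribute access resolves regular ndarray attributes before field names. The sketch below illustrates both. The field name size is chosen deliberately to collide with ndarray.size, and the variable names are illustrative.

```python
import numpy as np

plain = np.array([(1, 2.5), (2, 3.7)], dtype=[('a', 'i4'), ('b', 'f8')])
rec = plain.view(np.recarray)

# The view shares memory, so writes through one are visible in the other
rec.a[0] = 99
print(plain['a'])
# [99  2]

# A field whose name matches an ndarray attribute is shadowed by that attribute
boxes = np.array([(10,), (20,)], dtype=[('size', 'i4')]).view(np.recarray)
print(boxes.size)      # number of records, not the field
# 2
print(boxes['size'])   # bracket access always reaches the field
# [10 20]
```

For that reason, bracket notation is the safer choice whenever field names come from external data.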

Advanced Field Operations

Structured arrays support sophisticated field manipulation, including accessing multiple fields simultaneously and working with nested structures.

# Multiple field access
dt = np.dtype([('name', 'U20'), ('age', 'i4'), ('salary', 'f8'), ('dept', 'U10')])
staff = np.array([
    ('Alice', 32, 75000, 'IT'),
    ('Bob', 28, 62000, 'HR'),
    ('Carol', 45, 95000, 'IT')
], dtype=dt)

# Access multiple fields
print(staff[['name', 'salary']])
# [('Alice', 75000.) ('Bob', 62000.) ('Carol', 95000.)]

# Nested structures
dt_nested = np.dtype([
    ('name', 'U20'),
    ('address', [('street', 'U30'), ('city', 'U20'), ('zip', 'U10')])
])

people = np.array([
    ('John', ('123 Main St', 'Boston', '02101')),
    ('Jane', ('456 Oak Ave', 'Seattle', '98101'))
], dtype=dt_nested)

print(people['address']['city'])
# ['Boston' 'Seattle']
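
When several numeric fields need to go into ordinary vectorized math, numpy.lib.recfunctions.structured_to_unstructured converts them to a plain 2-D array. A minimal sketch, using a reduced version of the staff dtype for illustration:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype([('name', 'U20'), ('age', 'i4'), ('salary', 'f8')])
staff = np.array([
    ('Alice', 32, 75000),
    ('Bob', 28, 62000),
    ('Carol', 45, 95000)
], dtype=dt)

# Select the numeric fields, then convert to a regular (rows, columns) array;
# the i4 and f8 fields are promoted to a common float64 dtype
numeric = rfn.structured_to_unstructured(staff[['age', 'salary']])
print(numeric.shape)
# (3, 2)

# Column-wise means: age first, then salary
print(numeric.mean(axis=0))
```

The string field is left out on purpose, since text and numbers have no useful common dtype for arithmetic.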

Practical Use Cases

Structured arrays excel in scenarios requiring efficient memory usage with heterogeneous data types.

# Time series data with metadata
dt = np.dtype([
    ('timestamp', 'datetime64[s]'),
    ('sensor_id', 'i4'),
    ('value', 'f8'),
    ('quality', 'U10')
])

readings = np.array([
    (np.datetime64('2024-01-15T10:00:00'), 1, 23.5, 'good'),
    (np.datetime64('2024-01-15T10:01:00'), 1, 23.7, 'good'),
    (np.datetime64('2024-01-15T10:02:00'), 1, 24.1, 'suspect'),
    (np.datetime64('2024-01-15T10:03:00'), 2, 22.9, 'good')
], dtype=dt)

# Filter by quality
good_readings = readings[readings['quality'] == 'good']
print(f"Good readings: {len(good_readings)}")

# Calculate statistics
print(f"Average value: {good_readings['value'].mean():.2f}")

# Binary file I/O with structured arrays
dt_binary = np.dtype([('id', 'i4'), ('x', 'f8'), ('y', 'f8')])
data = np.array([(1, 1.5, 2.3), (2, 3.7, 4.2)], dtype=dt_binary)

# Write to binary file
data.tofile('coordinates.bin')

# Read back
loaded = np.fromfile('coordinates.bin', dtype=dt_binary)
print(loaded)
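
Note that tofile writes only the raw bytes, so the reader has to pass the same dtype (and byte order) to fromfile. Where that is a concern, the .npy format stores the dtype in the file header. A minimal sketch, using a temporary directory and a hypothetical file name:

```python
import os
import tempfile

import numpy as np

dt_binary = np.dtype([('id', 'i4'), ('x', 'f8'), ('y', 'f8')])
data = np.array([(1, 1.5, 2.3), (2, 3.7, 4.2)], dtype=dt_binary)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'coordinates.npy')  # hypothetical file name

    # np.save records the structured dtype alongside the data
    np.save(path, data)

    # No dtype argument is needed when reading back
    loaded = np.load(path)

print(loaded.dtype.names)
# ('id', 'x', 'y')
print(loaded['x'])
# [1.5 3.7]
```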

Performance Considerations

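Structured arrays keep each record’s fields together in memory, which suits record-oriented access and binary I/O. With the default packed layout they occupy the same number of bytes as equivalent separate arrays, and because each field is a strided view into the records, arithmetic on a single field is typically no faster than on a contiguous array. The comparison below measures both.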

import time

# Compare memory usage: structured array vs separate arrays
n = 1000000

# Structured array
dt = np.dtype([('a', 'i4'), ('b', 'f8'), ('c', 'i4')])
structured = np.zeros(n, dtype=dt)

# Separate arrays
a = np.zeros(n, dtype='i4')
b = np.zeros(n, dtype='f8')
c = np.zeros(n, dtype='i4')

print(f"Structured array size: {structured.nbytes / 1024 / 1024:.2f} MB")
print(f"Separate arrays size: {(a.nbytes + b.nbytes + c.nbytes) / 1024 / 1024:.2f} MB")

# Performance test: field access
structured['a'] = np.arange(n)
structured['b'] = np.random.random(n)

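# perf_counter is a high-resolution clock intended for timing short intervals
start = time.perf_counter()
result = structured['a'] * structured['b']
structured_time = time.perf_counter() - start

a = np.arange(n)
b = np.random.random(n)

start = time.perf_counter()
result = a * b
separate_time = time.perf_counter() - start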

print(f"Structured array operation: {structured_time:.4f}s")
print(f"Separate arrays operation: {separate_time:.4f}s")
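
Two layout details help interpret these measurements. A field of a structured array is a strided view into the interleaved records rather than a contiguous block, and the align=True option of np.dtype adds padding that can change the record size. A short sketch, using the same field layout on a smaller array (the variable names are illustrative):

```python
import numpy as np

packed = np.dtype([('a', 'i4'), ('b', 'f8'), ('c', 'i4')])
aligned = np.dtype([('a', 'i4'), ('b', 'f8'), ('c', 'i4')], align=True)

# Packed layout: 4 + 8 + 4 bytes, the same total as three separate arrays
print(packed.itemsize)

# Aligned layout: padding may be inserted so each field sits on its
# natural boundary (the exact size is platform-dependent)
print(aligned.itemsize)

structured = np.zeros(1000, dtype=packed)
field = structured['a']

# The field view steps over a whole record between consecutive values
print(field.strides)
print(field.flags['C_CONTIGUOUS'])

# Copy the field into its own contiguous buffer before heavy numeric work
contiguous = np.ascontiguousarray(field)
print(contiguous.strides)
```

If one field dominates the numeric workload, copying it out once and operating on the contiguous copy is a reasonable trade of memory for speed.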

Sorting and Indexing

Structured arrays support sorting on single or multiple fields with NumPy’s standard sorting functions.

dt = np.dtype([('name', 'U20'), ('dept', 'U10'), ('salary', 'f8')])
employees = np.array([
    ('Alice', 'IT', 75000),
    ('Bob', 'HR', 62000),
    ('Carol', 'IT', 95000),
    ('David', 'HR', 58000)
], dtype=dt)

# Sort by single field
sorted_by_salary = np.sort(employees, order='salary')
print(sorted_by_salary)

# Sort by multiple fields (department, then salary)
sorted_multi = np.sort(employees, order=['dept', 'salary'])
print(sorted_multi)

# Boolean indexing
high_earners = employees[employees['salary'] > 70000]
print(high_earners['name'])

# Boolean mask on a string field
it_dept = employees[employees['dept'] == 'IT']
print(f"IT Department average salary: {it_dept['salary'].mean():.2f}")
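
np.sort with order only sorts ascending, and there is no built-in group-by. Both can be approximated with index arrays and masks. The sketch below repeats the employees array so it runs on its own. The per-department loop is one simple approach, not the only one.

```python
import numpy as np

dt = np.dtype([('name', 'U20'), ('dept', 'U10'), ('salary', 'f8')])
employees = np.array([
    ('Alice', 'IT', 75000),
    ('Bob', 'HR', 62000),
    ('Carol', 'IT', 95000),
    ('David', 'HR', 58000)
], dtype=dt)

# Descending sort: argsort the field, then reverse the index array
descending = employees[np.argsort(employees['salary'])[::-1]]
print(descending['name'])
# ['Carol' 'Alice' 'Bob' 'David']

# Per-department mean salary using one boolean mask per unique value
dept_means = {}
for dept in np.unique(employees['dept']):
    mask = employees['dept'] == dept
    dept_means[str(dept)] = float(employees['salary'][mask].mean())
print(dept_means)
# {'HR': 60000.0, 'IT': 85000.0}
```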

Structured arrays bridge the gap between raw NumPy arrays and full-featured DataFrames, offering a lightweight solution for heterogeneous data when pandas would be overkill. They’re particularly valuable in scientific computing, data acquisition systems, and scenarios requiring precise memory control with mixed data types.