How to Save and Load Arrays in NumPy
Key Insights
- Use `.npy` for single arrays and `.npz` for multiple arrays—both are binary formats optimized for speed and storage efficiency
- Text formats like `savetxt` sacrifice performance for human readability and interoperability with non-Python tools
- Memory mapping with `mmap_mode` is essential when working with arrays larger than available RAM
Persisting NumPy arrays to disk is a fundamental operation in data science and scientific computing workflows. Whether you’re checkpointing intermediate results in a data pipeline, saving trained model weights, or sharing datasets with colleagues, you need reliable methods to serialize and deserialize array data.
NumPy provides several built-in approaches, each with distinct trade-offs. Binary formats offer speed and compact storage but require NumPy to read. Text formats provide human readability and broad compatibility at the cost of performance. Understanding when to use each approach will save you from debugging mysterious dtype mismatches and waiting on unnecessarily slow I/O operations.
Saving and Loading Single Arrays with .npy Format
The .npy format is NumPy’s native binary format for single arrays. It stores the array data along with metadata about dtype, shape, and byte order, enabling perfect reconstruction on load.
Use np.save() to write and np.load() to read:
```python
import numpy as np

# Create a sample 2D array
data = np.array([
    [1.5, 2.3, 3.7],
    [4.1, 5.9, 6.2],
    [7.8, 8.4, 9.0]
], dtype=np.float64)

# Save to disk
np.save('matrix.npy', data)

# Load it back
loaded_data = np.load('matrix.npy')

# Verify integrity
print(f"Shape preserved: {data.shape == loaded_data.shape}")
print(f"Dtype preserved: {data.dtype == loaded_data.dtype}")
print(f"Data identical: {np.array_equal(data, loaded_data)}")
```
The .npy extension is conventional but not enforced—NumPy will save to whatever filename you provide. However, sticking with .npy makes file purposes obvious and helps tooling recognize the format.
One detail worth noting: np.save() automatically appends .npy if you omit the extension, but np.load() does not. This asymmetry can cause confusion:
```python
# This creates 'matrix.npy'
np.save('matrix', data)

# This fails - file not found
# np.load('matrix')  # Wrong!

# This works
loaded = np.load('matrix.npy')
```
Be explicit with extensions to avoid this gotcha.
Saving Multiple Arrays with .npz Format
When you need to persist multiple related arrays, .npz files bundle them into a single archive. Think of it as a zip file containing multiple .npy files, accessible by name.
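You can confirm the zip structure directly with Python's standard `zipfile` module. This is a minimal sketch (the filename `bundle.npz` and the array names are illustrative):

```python
import zipfile

import numpy as np

# Save two small arrays into an .npz archive
np.savez('bundle.npz', a=np.arange(3), b=np.zeros((2, 2)))

# Each named array is stored as its own .npy member inside the zip
with zipfile.ZipFile('bundle.npz') as zf:
    print(sorted(zf.namelist()))  # ['a.npy', 'b.npy']
```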
np.savez() creates an uncompressed archive, while np.savez_compressed() applies zip compression:
```python
import numpy as np

# Multiple arrays representing a dataset
features = np.random.randn(1000, 50).astype(np.float32)
labels = np.random.randint(0, 10, size=1000).astype(np.int32)
metadata = np.array(['train', 'v1.0', '2024-01-15'], dtype='U20')

# Save as uncompressed archive
np.savez('dataset.npz',
         features=features,
         labels=labels,
         metadata=metadata)

# Or save with compression (slower write, smaller file)
np.savez_compressed('dataset_compressed.npz',
                    features=features,
                    labels=labels,
                    metadata=metadata)

# Load and access by key
with np.load('dataset.npz') as archive:
    loaded_features = archive['features']
    loaded_labels = archive['labels']
    loaded_metadata = archive['metadata']
    print(f"Features shape: {loaded_features.shape}")
    print(f"Labels dtype: {loaded_labels.dtype}")
    print(f"Available keys: {list(archive.keys())}")
```
The context manager (with statement) ensures the archive closes properly. You can also load without it, but the archive object stays open until garbage collected:
```python
# Also valid, but less clean
archive = np.load('dataset.npz')
features = archive['features']
archive.close()  # Don't forget this
```
Compression ratios vary dramatically based on data characteristics. Arrays with repeated values or smooth gradients compress well. Random floating-point data barely compresses at all. Profile both approaches with your actual data before committing to one.
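As a rough illustration, this sketch compares on-disk sizes for highly repetitive data versus random floats (the filenames are illustrative, and exact ratios depend on your data and zlib version):

```python
import os

import numpy as np

# Repetitive data compresses well; random floats barely compress
repetitive = np.zeros((1000, 100), dtype=np.float64)
random_data = np.random.randn(1000, 100)

np.savez('repetitive.npz', data=repetitive)
np.savez_compressed('repetitive_c.npz', data=repetitive)
np.savez('random.npz', data=random_data)
np.savez_compressed('random_c.npz', data=random_data)

for name in ['repetitive', 'random']:
    raw = os.path.getsize(f'{name}.npz')
    comp = os.path.getsize(f'{name}_c.npz')
    print(f"{name}: {raw} -> {comp} bytes ({raw / comp:.1f}x smaller)")
```

On a typical run the zero-filled array shrinks by orders of magnitude while the random array stays nearly the same size.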
Text-Based Storage with savetxt and loadtxt
Binary formats are efficient but opaque. When you need to inspect data manually, share with non-Python tools, or produce CSV files for downstream systems, text-based storage becomes necessary.
np.savetxt() and np.loadtxt() handle delimited text files:
```python
import numpy as np

# Create sample data
measurements = np.array([
    [1.0, 23.456, 78.9],
    [2.0, 34.567, 89.0],
    [3.0, 45.678, 90.1],
    [4.0, 56.789, 91.2]
])

# Save as CSV with header
np.savetxt('measurements.csv',
           measurements,
           delimiter=',',
           header='id,temperature,humidity',
           comments='',  # Suppress '#' before header
           fmt=['%.0f', '%.3f', '%.1f'])  # Custom formatting per column

# Load with explicit dtype
loaded = np.loadtxt('measurements.csv',
                    delimiter=',',
                    skiprows=1,  # Skip header
                    dtype=np.float64)
print(loaded)
```
The fmt parameter controls number formatting. Use %.Nf for N decimal places, %d for integers, or %s for strings. You can provide a single format for all columns or a list for per-column control.
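For instance, a single format string applies to every column. A minimal sketch (the filename is illustrative):

```python
import numpy as np

arr = np.array([[1.23456, 2.34567],
                [3.45678, 4.56789]])

# One format string rounds every column to two decimal places
np.savetxt('uniform.txt', arr, fmt='%.2f')

print(open('uniform.txt').read())
# 1.23 2.35
# 3.46 4.57
```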
For more complex CSV files with mixed types or missing values, consider np.genfromtxt():
```python
# Handle missing values and mixed types
data = np.genfromtxt('data_with_gaps.csv',
                     delimiter=',',
                     skip_header=1,
                     missing_values='NA',
                     filling_values=np.nan,
                     dtype=np.float64)
```
Text formats have significant drawbacks: they’re 2-10x larger than binary, slower to read/write, and can introduce floating-point representation errors. A float64 value might not survive a round-trip through text with full precision unless you use enough decimal places (typically 17 for float64).
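To illustrate the precision point, a float64 survives a text round-trip only when the format carries enough significant digits; `%.17g` is sufficient. A minimal sketch:

```python
import numpy as np

value = np.array([0.1 + 0.2])  # 0.30000000000000004 in float64

# Too few digits: the round-trip loses precision
np.savetxt('lossy.txt', value, fmt='%.6f')
lossy = np.loadtxt('lossy.txt')
print(lossy == value[0])  # False

# 17 significant digits: the value survives exactly
np.savetxt('exact.txt', value, fmt='%.17g')
exact = np.loadtxt('exact.txt')
print(exact == value[0])  # True
```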
Handling Common Issues
Real-world usage surfaces several pitfalls that trip up even experienced developers.
Pickle Security Concerns
By default, np.load() refuses to load files containing pickled objects:
```python
# This fails with allow_pickle=False (default in recent NumPy)
# np.load('old_file_with_objects.npy')

# Explicitly allow if you trust the source
data = np.load('trusted_file.npy', allow_pickle=True)
```
Pickle can execute arbitrary code during deserialization. Never use allow_pickle=True on files from untrusted sources. If you’re loading your own files and getting pickle errors, it usually means the array contained Python objects (dtype=object) when saved.
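To see where object arrays come from, this sketch builds one deliberately (a ragged array of Python lists, with an illustrative filename) and shows that loading it back requires opting in:

```python
import numpy as np

# A ragged structure forces dtype=object, which np.save pickles
ragged = np.empty(2, dtype=object)
ragged[0] = [1, 2, 3]
ragged[1] = [4, 5]
np.save('ragged.npy', ragged)

# The default load refuses pickled data
try:
    np.load('ragged.npy')
except ValueError as e:
    print(f"Refused: {e}")

# Opt in only for files you created yourself
trusted = np.load('ragged.npy', allow_pickle=True)
print(trusted[1])  # [4, 5]
```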
Dtype Mismatches
Text loading requires careful dtype specification:
```python
# Integer data saved as text
integers = np.array([1, 2, 3, 4, 5])
np.savetxt('integers.txt', integers, fmt='%d')

# Loading without dtype gives float64
loaded_float = np.loadtxt('integers.txt')
print(loaded_float.dtype)  # float64

# Specify dtype explicitly
loaded_int = np.loadtxt('integers.txt', dtype=np.int32)
print(loaded_int.dtype)  # int32
```
Memory Mapping Large Files
When arrays exceed available RAM, memory mapping lets you work with them without loading everything:
```python
import numpy as np

# Create a large array and save it
large_array = np.random.randn(10000, 10000).astype(np.float32)
np.save('large_matrix.npy', large_array)
del large_array  # Free memory

# Memory-map instead of loading
mmap_array = np.load('large_matrix.npy', mmap_mode='r')

# Access slices without loading full array into RAM
subset = mmap_array[1000:2000, 500:600]
print(f"Subset shape: {subset.shape}")

# mmap_mode options:
# 'r'  - read-only
# 'r+' - read-write (changes written to disk)
# 'w+' - create/overwrite, read-write
# 'c'  - copy-on-write (changes in memory only)
```
Memory mapping is particularly valuable in data pipelines where you only need array slices, not the entire dataset.
Choosing the Right Format
| Format | File Size | Read/Write Speed | Human Readable | Cross-Platform |
|---|---|---|---|---|
| `.npy` | Compact | Fast | No | NumPy required |
| `.npz` | Compact | Fast | No | NumPy required |
| `.npz` compressed | Smallest | Slower | No | NumPy required |
| Text (CSV) | Large | Slow | Yes | Universal |
Use .npy when: You’re saving single arrays for later use in Python, checkpointing intermediate computations, or prioritizing I/O speed.
Use .npz when: You have multiple related arrays that belong together logically, like features and labels in a dataset.
Use .npz compressed when: Storage space matters more than I/O speed, or you’re archiving data for long-term storage.
Use text formats when: You need to share data with non-Python tools, require human inspection, or must produce standard CSV output for downstream systems.
For production pipelines processing large volumes, binary formats are almost always correct. The performance difference compounds quickly—a 10x slowdown on each I/O operation adds up across thousands of files.
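A quick size comparison illustrates one half of that cost. Timings vary by machine, so this sketch (with illustrative filenames) compares bytes on disk only; the text file carries roughly 25 characters per value where the binary file stores 8 bytes:

```python
import os

import numpy as np

data = np.random.randn(500, 100)

np.save('bench.npy', data)     # 8 bytes per float64, plus a small header
np.savetxt('bench.txt', data)  # default '%.18e' format, ~25 chars per value

binary = os.path.getsize('bench.npy')
text = os.path.getsize('bench.txt')
print(f"Binary: {binary} bytes, text: {text} bytes ({text / binary:.1f}x larger)")
```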
Conclusion
NumPy’s persistence utilities cover the spectrum from fast binary formats to portable text files. For most Python-centric workflows, .npy and .npz provide the best combination of speed, storage efficiency, and ease of use. Reserve text formats for interoperability requirements.
Key practices for production use: always verify array integrity after loading critical data, use memory mapping for arrays approaching RAM limits, and never enable pickle loading on untrusted files. When in doubt about format choice, start with .npy—it’s fast, simple, and handles the common case well.