How to Save and Load Arrays in NumPy

Key Insights

  • Use .npy for single arrays and .npz for multiple arrays—both are binary formats optimized for speed and storage efficiency
  • Text formats like savetxt sacrifice performance for human readability and interoperability with non-Python tools
  • Memory mapping with mmap_mode is essential when working with arrays larger than available RAM

Persisting NumPy arrays to disk is a fundamental operation in data science and scientific computing workflows. Whether you’re checkpointing intermediate results in a data pipeline, saving trained model weights, or sharing datasets with colleagues, you need reliable methods to serialize and deserialize array data.

NumPy provides several built-in approaches, each with distinct trade-offs. Binary formats offer speed and compact storage but require NumPy to read. Text formats provide human readability and broad compatibility at the cost of performance. Understanding when to use each approach will save you from debugging mysterious dtype mismatches and waiting on unnecessarily slow I/O operations.

Saving and Loading Single Arrays with .npy Format

The .npy format is NumPy’s native binary format for single arrays. It stores the array data along with metadata about dtype, shape, and byte order, enabling perfect reconstruction on load.

Use np.save() to write and np.load() to read:

import numpy as np

# Create a sample 2D array
data = np.array([
    [1.5, 2.3, 3.7],
    [4.1, 5.9, 6.2],
    [7.8, 8.4, 9.0]
], dtype=np.float64)

# Save to disk
np.save('matrix.npy', data)

# Load it back
loaded_data = np.load('matrix.npy')

# Verify integrity
print(f"Shape preserved: {data.shape == loaded_data.shape}")
print(f"Dtype preserved: {data.dtype == loaded_data.dtype}")
print(f"Data identical: {np.array_equal(data, loaded_data)}")

The .npy extension is conventional but not enforced—NumPy will save to whatever filename you provide. However, sticking with .npy makes a file's purpose obvious and helps tooling recognize the format.

One detail worth noting: np.save() automatically appends .npy if you omit the extension, but np.load() does not. This asymmetry can cause confusion:

# This creates 'matrix.npy'
np.save('matrix', data)

# This fails - file not found
# np.load('matrix')  # Wrong!

# This works
loaded = np.load('matrix.npy')

Be explicit with extensions to avoid this gotcha.

Saving Multiple Arrays with .npz Format

When you need to persist multiple related arrays, .npz files bundle them into a single archive. Think of it as a zip file containing multiple .npy files, accessible by name.
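
Because an .npz file really is a zip archive, you can inspect its members with Python's standard zipfile module. A quick sketch (the file name pair.npz is illustrative):

```python
import zipfile

import numpy as np

# Bundle two arrays into an archive
np.savez('pair.npz', a=np.arange(3), b=np.zeros(2))

# Inspect the archive with the standard zipfile module:
# each array is stored as its own .npy member
with zipfile.ZipFile('pair.npz') as zf:
    print(zf.namelist())  # ['a.npy', 'b.npy']
```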

np.savez() creates an uncompressed archive, while np.savez_compressed() applies zip compression:

import numpy as np

# Multiple arrays representing a dataset
features = np.random.randn(1000, 50).astype(np.float32)
labels = np.random.randint(0, 10, size=1000).astype(np.int32)
metadata = np.array(['train', 'v1.0', '2024-01-15'], dtype='U20')

# Save as uncompressed archive
np.savez('dataset.npz', 
         features=features, 
         labels=labels, 
         metadata=metadata)

# Or save with compression (slower write, smaller file)
np.savez_compressed('dataset_compressed.npz',
                    features=features,
                    labels=labels,
                    metadata=metadata)

# Load and access by key
with np.load('dataset.npz') as archive:
    loaded_features = archive['features']
    loaded_labels = archive['labels']
    loaded_metadata = archive['metadata']
    
    print(f"Features shape: {loaded_features.shape}")
    print(f"Labels dtype: {loaded_labels.dtype}")
    print(f"Available keys: {list(archive.keys())}")

The context manager (with statement) ensures the archive closes properly. You can also load without it, but the archive object stays open until garbage collected:

# Also valid, but less clean
archive = np.load('dataset.npz')
features = archive['features']
archive.close()  # Don't forget this

Compression ratios vary dramatically based on data characteristics. Arrays with repeated values or smooth gradients compress well. Random floating-point data barely compresses at all. Profile both approaches with your actual data before committing to one.
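
One way to profile this is to write the same array both ways and compare sizes on disk. A minimal sketch using two illustrative extremes, a highly repetitive array and pure random noise:

```python
import os
import tempfile

import numpy as np

# Two illustrative extremes: highly repetitive data vs pure noise
repetitive = np.tile(np.arange(100, dtype=np.float32), 10_000)
noise = np.random.randn(1_000_000).astype(np.float32)

tmpdir = tempfile.mkdtemp()
ratios = {}
for name, arr in [('repetitive', repetitive), ('noise', noise)]:
    raw = os.path.join(tmpdir, f'{name}.npz')
    packed = os.path.join(tmpdir, f'{name}_c.npz')
    np.savez(raw, data=arr)
    np.savez_compressed(packed, data=arr)
    ratios[name] = os.path.getsize(packed) / os.path.getsize(raw)
    print(f'{name}: compressed file is {ratios[name]:.0%} of uncompressed')
```

On typical runs the repetitive array shrinks to a small fraction of its original size while the noise barely shrinks at all, which is exactly why profiling with your real data matters.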

Text-Based Storage with savetxt and loadtxt

Binary formats are efficient but opaque. When you need to inspect data manually, share with non-Python tools, or produce CSV files for downstream systems, text-based storage becomes necessary.

np.savetxt() and np.loadtxt() handle delimited text files:

import numpy as np

# Create sample data
measurements = np.array([
    [1.0, 23.456, 78.9],
    [2.0, 34.567, 89.0],
    [3.0, 45.678, 90.1],
    [4.0, 56.789, 91.2]
])

# Save as CSV with header
np.savetxt('measurements.csv', 
           measurements,
           delimiter=',',
           header='id,temperature,humidity',
           comments='',  # Suppress '#' before header
           fmt=['%.0f', '%.3f', '%.1f'])  # Custom formatting per column

# Load with explicit dtype
loaded = np.loadtxt('measurements.csv',
                    delimiter=',',
                    skiprows=1,  # Skip header
                    dtype=np.float64)

print(loaded)

The fmt parameter controls number formatting. Use %.Nf for N decimal places, %d for integers, or %s for strings. You can provide a single format for all columns or a list for per-column control.

For more complex CSV files with mixed types or missing values, consider np.genfromtxt():

# Handle missing values and mixed types
data = np.genfromtxt('data_with_gaps.csv',
                     delimiter=',',
                     skip_header=1,
                     missing_values='NA',
                     filling_values=np.nan,
                     dtype=np.float64)

Text formats have significant drawbacks: they’re 2-10x larger than binary, slower to read/write, and can introduce floating-point representation errors. A float64 value might not survive a round-trip through text with full precision unless you use enough decimal places (typically 17 for float64).
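
You can see the precision issue directly by round-tripping a value with too few decimal places, then with 17 significant digits (file names are illustrative):

```python
import numpy as np

value = np.array([0.1 + 0.2])  # 0.30000000000000004 in float64

# Too few decimal places: the round-trip is lossy
np.savetxt('lossy.txt', value, fmt='%.8f')
lossy = np.loadtxt('lossy.txt', ndmin=1)
print(value[0] == lossy[0])  # False

# 17 significant digits round-trip any float64 exactly
np.savetxt('exact.txt', value, fmt='%.17g')
exact = np.loadtxt('exact.txt', ndmin=1)
print(value[0] == exact[0])  # True
```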

Handling Common Issues

Real-world usage surfaces several pitfalls that trip up even experienced developers.

Pickle Security Concerns

By default, np.load() refuses to load files containing pickled objects:

# This fails with allow_pickle=False (the default since NumPy 1.16.3)
# np.load('old_file_with_objects.npy')

# Explicitly allow if you trust the source
data = np.load('trusted_file.npy', allow_pickle=True)

Pickle can execute arbitrary code during deserialization. Never use allow_pickle=True on files from untrusted sources. If you’re loading your own files and getting pickle errors, it usually means the array contained Python objects (dtype=object) when saved.
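
A short sketch reproducing the situation: saving an object-dtype array succeeds, but loading it back requires the explicit opt-in (the file name is illustrative):

```python
import numpy as np

# Ragged nested lists force dtype=object, which NumPy can only
# serialize through pickle
ragged = np.array([[1, 2], [3, 4, 5]], dtype=object)
np.save('ragged.npy', ragged)

# The default load refuses object arrays
try:
    np.load('ragged.npy')
except ValueError as err:
    print(f'Refused: {err}')

# Loading works only with the explicit opt-in
loaded = np.load('ragged.npy', allow_pickle=True)
print(loaded[1])  # [3, 4, 5]
```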

Dtype Mismatches

Text loading requires careful dtype specification:

# Integer data saved as text
integers = np.array([1, 2, 3, 4, 5])
np.savetxt('integers.txt', integers, fmt='%d')

# Loading without dtype gives float64
loaded_float = np.loadtxt('integers.txt')
print(loaded_float.dtype)  # float64

# Specify dtype explicitly
loaded_int = np.loadtxt('integers.txt', dtype=np.int32)
print(loaded_int.dtype)  # int32

Memory Mapping Large Files

When arrays exceed available RAM, memory mapping lets you work with them without loading everything:

import numpy as np

# Create a large array and save it
large_array = np.random.randn(10000, 10000).astype(np.float32)
np.save('large_matrix.npy', large_array)
del large_array  # Free memory

# Memory-map instead of loading
mmap_array = np.load('large_matrix.npy', mmap_mode='r')

# Access slices without loading full array into RAM
subset = mmap_array[1000:2000, 500:600]
print(f"Subset shape: {subset.shape}")

# mmap_mode options:
# 'r'  - read-only
# 'r+' - read-write (changes written to disk)
# 'w+' - create/overwrite, read-write
# 'c'  - copy-on-write (changes in memory only)

Memory mapping is particularly valuable in data pipelines where you only need array slices, not the entire dataset.

Choosing the Right Format

Format            File Size   Read/Write Speed   Human Readable   Portability
.npy              Compact     Fast               No               NumPy required
.npz              Compact     Fast               No               NumPy required
.npz compressed   Smallest    Slower             No               NumPy required
Text (CSV)        Large       Slow               Yes              Universal

Use .npy when: You’re saving single arrays for later use in Python, checkpointing intermediate computations, or prioritizing I/O speed.

Use .npz when: You have multiple related arrays that belong together logically, like features and labels in a dataset.

Use .npz compressed when: Storage space matters more than I/O speed, or you’re archiving data for long-term storage.

Use text formats when: You need to share data with non-Python tools, require human inspection, or must produce standard CSV output for downstream systems.

For production pipelines processing large volumes, binary formats are almost always correct. The performance difference compounds quickly—a 10x slowdown on each I/O operation adds up across thousands of files.
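
A rough comparison can be sketched with time.perf_counter(); exact ratios are machine-dependent, but savetxt is consistently the slower path (file names are illustrative):

```python
import time

import numpy as np

data = np.random.randn(2000, 100)

# Time the binary write
start = time.perf_counter()
np.save('bench.npy', data)
binary_time = time.perf_counter() - start

# Time the text write of the same array
start = time.perf_counter()
np.savetxt('bench.txt', data)
text_time = time.perf_counter() - start

print(f'savetxt took {text_time / binary_time:.1f}x longer than save')
```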

Conclusion

NumPy’s persistence utilities cover the spectrum from fast binary formats to portable text files. For most Python-centric workflows, .npy and .npz provide the best combination of speed, storage efficiency, and ease of use. Reserve text formats for interoperability requirements.

Key practices for production use: always verify array integrity after loading critical data, use memory mapping for arrays approaching RAM limits, and never enable pickle loading on untrusted files. When in doubt about format choice, start with .npy—it’s fast, simple, and handles the common case well.
