NumPy - Load Array from File (np.load)
Key Insights
- np.load() handles .npy and .npz formats natively: .npy stores a single array in binary form, while .npz bundles multiple arrays in a single (optionally compressed) archive
- Binary .npy files load 10-100x faster than text formats and preserve exact numerical precision, making them ideal for production data pipelines and machine learning workflows
- Memory-mapped loading with mmap_mode enables working with arrays larger than RAM by reading data on demand instead of loading entire files into memory
Understanding NumPy File Formats
NumPy provides native binary formats optimized for array storage. The .npy format stores a single array with metadata describing shape, dtype, and byte order. The .npz format bundles multiple arrays into a compressed ZIP archive.
import numpy as np
# Create and save a single array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
np.save('data.npy', data)
# Load the array
loaded_data = np.load('data.npy')
print(loaded_data)
# [[1 2 3]
# [4 5 6]
# [7 8 9]]
print(f"Shape: {loaded_data.shape}, Dtype: {loaded_data.dtype}")
# Shape: (3, 3), Dtype: int64
The binary format preserves exact data types, including float32, int16, and structured arrays. Text formats like CSV typically require explicit dtype specification on load and can lose precision during conversion.
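As a quick sketch of dtype preservation, a structured array round-trips through .npy with its field names and types intact (the file name here is illustrative):

```python
import numpy as np

# Structured array with named, mixed-type fields
records = np.array(
    [(1, 2.5, 'alpha'), (2, 3.5, 'beta')],
    dtype=[('id', 'i4'), ('score', 'f8'), ('tag', 'U8')],
)
np.save('records.npy', records)

loaded = np.load('records.npy')
print(loaded.dtype.names)  # ('id', 'score', 'tag')
print(loaded['score'])     # [2.5 3.5]
```

No pickle is involved because the array contains no Python objects, so the default allow_pickle=False is sufficient.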
Loading Multiple Arrays from NPZ Files
The .npz format stores multiple named arrays efficiently. Use np.savez() for uncompressed archives or np.savez_compressed() for automatic compression.
# Save multiple arrays
features = np.random.rand(1000, 50)
labels = np.random.randint(0, 10, size=1000)
metadata = np.array(['experiment_001', '2024-01-15'])
np.savez_compressed('dataset.npz',
                    features=features,
                    labels=labels,
                    metadata=metadata)
# Load and access arrays
archive = np.load('dataset.npz')
# Access by name
print(archive['features'].shape) # (1000, 50)
print(archive['labels'].shape) # (1000,)
# List available arrays
print(archive.files) # ['features', 'labels', 'metadata']
# Extract to variables
loaded_features = archive['features']
loaded_labels = archive['labels']
# Close the archive when done
archive.close()
The archive object behaves like a dictionary, but it holds an open file handle, so close it explicitly or use a context manager to release the handle properly.
# Better: use context manager
with np.load('dataset.npz') as archive:
    features = archive['features']
    labels = archive['labels']
# Archive closes automatically at the end of the block
Memory-Mapped Loading for Large Files
Memory mapping loads array metadata immediately but reads actual data only when accessed. This enables working with datasets larger than available RAM.
# Create a large array and save it
large_array = np.random.rand(10000, 10000) # ~800 MB
np.save('large_data.npy', large_array)
# Load with memory mapping
mmap_array = np.load('large_data.npy', mmap_mode='r')
print(f"Array shape: {mmap_array.shape}") # Instant - no data loaded
print(f"Array dtype: {mmap_array.dtype}") # Instant
# Data loads only when accessed
subset = mmap_array[0:100, 0:100] # Loads only this slice
mean_value = mmap_array[:, 0].mean()  # Materializes only the first column in memory
Memory-map modes control read/write permissions:
# Read-only (default for safety)
data_r = np.load('data.npy', mmap_mode='r')
# Read-write (modify file directly)
data_rw = np.load('data.npy', mmap_mode='r+')
data_rw[0, 0] = 999 # Writes directly to file
data_rw.flush()  # Explicitly write changes to disk
del data_rw
# Copy-on-write (modifications don't affect file)
data_c = np.load('data.npy', mmap_mode='c')
data_c[0, 0] = 999 # Only modifies memory, not file
Handling Pickled Arrays and Security
NumPy files can contain pickled Python objects, which pose security risks when loading untrusted data. The allow_pickle parameter controls this behavior.
# Create array with Python objects
obj_array = np.array([{'key': 'value'}, {'foo': 'bar'}], dtype=object)
np.save('objects.npy', obj_array)
# Loading requires allow_pickle=True (default in older versions)
try:
    loaded = np.load('objects.npy', allow_pickle=True)
    print(loaded)
except ValueError as e:
    print(f"Error: {e}")
For security-critical applications, disable pickle loading:
# Reject files containing pickled objects
try:
    data = np.load('untrusted.npy', allow_pickle=False)
except ValueError:
    print("File contains pickled data - rejected for security")
Only load pickled arrays from trusted sources. Malicious pickle data can execute arbitrary code during deserialization.
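One way to vet a file before loading is to read just the .npy header with numpy.lib.format. A minimal sketch, assuming a version 1.0 header (what np.save writes for ordinary arrays), that rejects object dtypes, the ones that require pickle:

```python
import numpy as np

def inspect_npy_header(filepath):
    """Read shape and dtype from a .npy header without loading the data.
    Assumes a version 1.0 header, which np.save writes for plain arrays."""
    with open(filepath, 'rb') as f:
        np.lib.format.read_magic(f)  # validates the magic string
        shape, fortran_order, dtype = np.lib.format.read_array_header_1_0(f)
    if dtype.hasobject:
        raise ValueError("object dtype would require pickle - rejected")
    return shape, dtype

np.save('check.npy', np.zeros((4, 5), dtype=np.float32))
print(inspect_npy_header('check.npy'))  # ((4, 5), dtype('float32'))
```

The header check reads only a few bytes, so it is cheap even for multi-gigabyte files.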
Error Handling and Validation
Implement robust error handling when loading arrays in production systems:
def load_array_safe(filepath, expected_shape=None, expected_dtype=None):
    """Load a NumPy array with shape and dtype validation."""
    try:
        data = np.load(filepath)
    except FileNotFoundError:
        raise FileNotFoundError(f"Array file not found: {filepath}")
    except Exception as e:
        raise RuntimeError(f"Failed to load array: {e}")
    # Validate shape (outside the try block so these errors aren't re-wrapped)
    if expected_shape is not None and data.shape != expected_shape:
        raise ValueError(
            f"Shape mismatch: expected {expected_shape}, got {data.shape}"
        )
    # Validate dtype
    if expected_dtype is not None and data.dtype != expected_dtype:
        raise ValueError(
            f"Dtype mismatch: expected {expected_dtype}, got {data.dtype}"
        )
    return data

# Usage
try:
    data = load_array_safe('data.npy',
                           expected_shape=(1000, 50),
                           expected_dtype=np.float32)
except (FileNotFoundError, ValueError, RuntimeError) as e:
    print(f"Loading failed: {e}")
Performance Comparison with Text Formats
Binary formats significantly outperform text-based alternatives:
import time
# Create test data
test_data = np.random.rand(10000, 100)
# Save and load as NPY
start = time.time()
np.save('test.npy', test_data)
npy_save_time = time.time() - start
start = time.time()
loaded_npy = np.load('test.npy')
npy_load_time = time.time() - start
# Save and load as CSV
start = time.time()
np.savetxt('test.csv', test_data, delimiter=',')
csv_save_time = time.time() - start
start = time.time()
loaded_csv = np.loadtxt('test.csv', delimiter=',')
csv_load_time = time.time() - start
print(f"NPY save: {npy_save_time:.4f}s, load: {npy_load_time:.4f}s")
print(f"CSV save: {csv_save_time:.4f}s, load: {csv_load_time:.4f}s")
print(f"Speedup: {csv_load_time/npy_load_time:.1f}x faster")
Binary formats also preserve precision exactly:
# Demonstrate precision preservation
original = np.array([1.23456789012345], dtype=np.float64)
# NPY preserves exact precision
np.save('precise.npy', original)
loaded_npy = np.load('precise.npy')
print(f"NPY: {loaded_npy[0]:.15f}") # 1.234567890123450
# CSV keeps precision only if the format string has enough digits
# (savetxt's default '%.18e' does; a coarser format does not)
np.savetxt('precise.csv', original, fmt='%.6f')
loaded_csv = np.loadtxt('precise.csv')
print(f"CSV: {loaded_csv:.15f}")  # 1.234568000000000
Loading Arrays in Data Pipelines
Integrate np.load() into machine learning pipelines efficiently:
class DataLoader:
    """Efficient data loader for training pipelines."""

    def __init__(self, data_path, batch_size=32):
        self.data = np.load(data_path, mmap_mode='r')
        self.batch_size = batch_size
        self.n_samples = self.data.shape[0]

    def __len__(self):
        return (self.n_samples + self.batch_size - 1) // self.batch_size

    def get_batch(self, idx):
        start = idx * self.batch_size
        end = min(start + self.batch_size, self.n_samples)
        return self.data[start:end].copy()  # copy() materializes just this slice

    def iterate_batches(self):
        for i in range(len(self)):
            yield self.get_batch(i)

# Usage
loader = DataLoader('training_data.npy', batch_size=64)
for batch in loader.iterate_batches():
    # Process batch
    print(f"Processing batch shape: {batch.shape}")
This pattern minimizes memory usage while maintaining fast access to training data. The memory-mapped array loads only requested batches, enabling training on datasets that exceed available RAM.
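The same idea works for streaming statistics. A hedged sketch (file name and sizes are illustrative) that computes per-feature means one batch at a time, so peak memory stays at one batch regardless of file size:

```python
import numpy as np

# Small stand-in file; the pattern is identical for arrays far larger than RAM
np.save('training_data.npy', np.random.rand(1000, 8).astype(np.float32))

data = np.load('training_data.npy', mmap_mode='r')
batch_size = 256
totals = np.zeros(data.shape[1], dtype=np.float64)
for start in range(0, data.shape[0], batch_size):
    batch = np.asarray(data[start:start + batch_size])  # materialize one batch
    totals += batch.sum(axis=0)
means = totals / data.shape[0]
print(means.shape)  # (8,)
```

Accumulating in float64 avoids compounding rounding error when summing many float32 batches.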