NumPy - Load Array from File (np.load)

Key Insights

  • NumPy’s np.load() handles .npy and .npz formats natively, with .npy storing single arrays in binary format and .npz supporting multiple compressed arrays in a single archive
  • Binary .npy files load 10-100x faster than text formats and preserve exact numerical precision, making them ideal for production data pipelines and machine learning workflows
  • Memory-mapped loading with mmap_mode enables working with arrays larger than RAM by loading data on-demand rather than loading entire files into memory

Understanding NumPy File Formats

NumPy provides native binary formats optimized for array storage. The .npy format stores a single array with metadata describing shape, dtype, and byte order. The .npz format bundles multiple arrays into a compressed ZIP archive.

import numpy as np

# Create and save a single array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
np.save('data.npy', data)

# Load the array
loaded_data = np.load('data.npy')
print(loaded_data)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]

print(f"Shape: {loaded_data.shape}, Dtype: {loaded_data.dtype}")
# Shape: (3, 3), Dtype: int64 (platform-dependent; may be int32 on Windows)

The binary format preserves exact data types, from basic types like float32 and int16 up to structured arrays with named fields. Text formats like CSV often require explicit dtype specification on load and can lose precision during conversion.
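
For instance, a structured array round-trips with its field names and per-field dtypes intact (the filename here is illustrative):

```python
import numpy as np

# Structured array: one record per row, mixed field types
records = np.array(
    [('alice', 29, 1.71), ('bob', 34, 1.82)],
    dtype=[('name', 'U10'), ('age', 'i4'), ('height', 'f4')]
)
np.save('records.npy', records)

loaded = np.load('records.npy')
print(loaded.dtype)    # field names and per-field dtypes preserved
print(loaded['name'])  # ['alice' 'bob']
```

Because no object fields are involved, this works without allow_pickle.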

Loading Multiple Arrays from NPZ Files

The .npz format stores multiple named arrays efficiently. Use np.savez() for uncompressed archives or np.savez_compressed() for automatic compression.

# Save multiple arrays
features = np.random.rand(1000, 50)
labels = np.random.randint(0, 10, size=1000)
metadata = np.array(['experiment_001', '2024-01-15'])

np.savez_compressed('dataset.npz', 
                    features=features, 
                    labels=labels, 
                    metadata=metadata)

# Load and access arrays
archive = np.load('dataset.npz')

# Access by name
print(archive['features'].shape)  # (1000, 50)
print(archive['labels'].shape)    # (1000,)

# List available arrays
print(archive.files)  # ['features', 'labels', 'metadata']

# Extract to variables
loaded_features = archive['features']
loaded_labels = archive['labels']

# Close the archive when done
archive.close()

The archive object behaves like a dictionary but requires explicit closing or context manager usage to release file handles properly.

# Better: use context manager
with np.load('dataset.npz') as archive:
    features = archive['features']
    labels = archive['labels']
    # Archive automatically closes after this block
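
Because archive entries are read lazily on access, copy out everything you need before the archive closes. One way, sketched with an assumed demo file, is a dict comprehension over archive.files:

```python
import numpy as np

np.savez('demo.npz', a=np.arange(3), b=np.ones((2, 2)))

# Materialize every named array into a plain dict before the file closes
with np.load('demo.npz') as archive:
    arrays = {name: archive[name] for name in archive.files}

# The dict holds real ndarrays, safe to use after the handle is released
print(arrays['a'])        # [0 1 2]
print(arrays['b'].shape)  # (2, 2)
```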

Memory-Mapped Loading for Large Files

Memory mapping loads array metadata immediately but reads actual data only when accessed. This enables working with datasets larger than available RAM.

# Create a large array and save it
large_array = np.random.rand(10000, 10000)  # ~800 MB
np.save('large_data.npy', large_array)

# Load with memory mapping
mmap_array = np.load('large_data.npy', mmap_mode='r')

print(f"Array shape: {mmap_array.shape}")  # Instant - no data loaded
print(f"Array dtype: {mmap_array.dtype}")  # Instant

# Data loads only when accessed
subset = mmap_array[0:100, 0:100]     # Reads only the pages covering this slice
mean_value = mmap_array[:, 0].mean()  # Reads one element per row, on demand

Memory-map modes control read/write permissions:

# Read-only (default for safety)
data_r = np.load('data.npy', mmap_mode='r')

# Read-write (modify file directly)
data_rw = np.load('data.npy', mmap_mode='r+')
data_rw[0, 0] = 999  # Writes through to the file
data_rw.flush()      # Explicitly flush changes to disk
del data_rw          # Release the file handle

# Copy-on-write (modifications don't affect file)
data_c = np.load('data.npy', mmap_mode='c')
data_c[0, 0] = 999  # Only modifies memory, not file

Handling Pickled Arrays and Security

NumPy files can contain pickled Python objects, which pose security risks when loading untrusted data. The allow_pickle parameter controls this behavior.

# Create array with Python objects
obj_array = np.array([{'key': 'value'}, {'foo': 'bar'}], dtype=object)
np.save('objects.npy', obj_array)

# Loading requires allow_pickle=True (the default is False since NumPy 1.16.3)
try:
    loaded = np.load('objects.npy', allow_pickle=True)
    print(loaded)
except ValueError as e:
    print(f"Error: {e}")

For security-critical applications, disable pickle loading:

# Reject files containing pickled objects
try:
    data = np.load('untrusted.npy', allow_pickle=False)
except ValueError:
    print("File contains pickled data - rejected for security")

Only load pickled arrays from trusted sources. Malicious pickle data can execute arbitrary code during deserialization.
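
One defensive option, sketched here with the documented helpers in numpy.lib.format, is to inspect a file's header (shape, dtype, and whether it contains object fields) before loading any data:

```python
import numpy as np
from numpy.lib import format as npy_format

def inspect_npy_header(filepath):
    """Read shape and dtype from a .npy header without touching the data."""
    with open(filepath, 'rb') as f:
        major, minor = npy_format.read_magic(f)
        # np.save writes format version 1.0 for ordinary arrays;
        # fall back to the 2.0 reader for larger headers
        if (major, minor) == (1, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
        else:
            shape, fortran_order, dtype = npy_format.read_array_header_2_0(f)
    return shape, dtype

np.save('check.npy', np.zeros((5, 3), dtype=np.float32))
shape, dtype = inspect_npy_header('check.npy')
print(shape, dtype, dtype.hasobject)  # (5, 3) float32 False
```

If dtype.hasobject is True, the file contains pickled objects and can be rejected before np.load ever runs.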

Error Handling and Validation

Implement robust error handling when loading arrays in production systems:

def load_array_safe(filepath, expected_shape=None, expected_dtype=None):
    """Load a NumPy array and validate its shape and dtype."""
    try:
        data = np.load(filepath)
    except FileNotFoundError:
        raise FileNotFoundError(f"Array file not found: {filepath}")
    except Exception as e:
        raise RuntimeError(f"Failed to load array: {e}") from e

    # Validate outside the try block so these ValueErrors are not
    # swallowed and re-raised as RuntimeError
    if expected_shape and data.shape != expected_shape:
        raise ValueError(
            f"Shape mismatch: expected {expected_shape}, "
            f"got {data.shape}"
        )

    if expected_dtype and data.dtype != expected_dtype:
        raise ValueError(
            f"Dtype mismatch: expected {expected_dtype}, "
            f"got {data.dtype}"
        )

    return data

# Usage
try:
    data = load_array_safe('data.npy', 
                          expected_shape=(1000, 50), 
                          expected_dtype=np.float32)
except (FileNotFoundError, ValueError, RuntimeError) as e:
    print(f"Loading failed: {e}")

Performance Comparison with Text Formats

Binary formats significantly outperform text-based alternatives:

import time

# Create test data
test_data = np.random.rand(10000, 100)

# Save and load as NPY
start = time.time()
np.save('test.npy', test_data)
npy_save_time = time.time() - start

start = time.time()
loaded_npy = np.load('test.npy')
npy_load_time = time.time() - start

# Save and load as CSV
start = time.time()
np.savetxt('test.csv', test_data, delimiter=',')
csv_save_time = time.time() - start

start = time.time()
loaded_csv = np.loadtxt('test.csv', delimiter=',')
csv_load_time = time.time() - start

print(f"NPY save: {npy_save_time:.4f}s, load: {npy_load_time:.4f}s")
print(f"CSV save: {csv_save_time:.4f}s, load: {csv_load_time:.4f}s")
print(f"Speedup: {csv_load_time/npy_load_time:.1f}x faster")

Binary formats also preserve precision exactly:

# Demonstrate precision preservation
original = np.array([1.23456789012345], dtype=np.float64)

# NPY preserves exact precision
np.save('precise.npy', original)
loaded_npy = np.load('precise.npy')
print(f"NPY: {loaded_npy[0]:.15f}")  # 1.234567890123450

# Text precision depends on the format string
np.savetxt('precise.csv', original, fmt='%.6f')  # truncates to 6 decimals
loaded_csv = np.loadtxt('precise.csv')
print(f"CSV: {loaded_csv:.15f}")  # 1.234568000000000
# (savetxt's default fmt='%.18e' does round-trip float64, but common
#  fixed-point formats like '%.6f' silently lose precision)

Loading Arrays in Data Pipelines

Integrate np.load() into machine learning pipelines efficiently:

class DataLoader:
    """Efficient data loader for training pipelines."""
    
    def __init__(self, data_path, batch_size=32):
        self.data = np.load(data_path, mmap_mode='r')
        self.batch_size = batch_size
        self.n_samples = self.data.shape[0]
    
    def __len__(self):
        return (self.n_samples + self.batch_size - 1) // self.batch_size
    
    def get_batch(self, idx):
        start = idx * self.batch_size
        end = min(start + self.batch_size, self.n_samples)
        return self.data[start:end].copy()
    
    def iterate_batches(self):
        for i in range(len(self)):
            yield self.get_batch(i)

# Usage
loader = DataLoader('training_data.npy', batch_size=64)
for batch in loader.iterate_batches():
    # Process batch
    print(f"Processing batch shape: {batch.shape}")

This pattern minimizes memory usage while maintaining fast access to training data. The memory-mapped array loads only requested batches, enabling training on datasets that exceed available RAM.
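
As a sketch of one possible extension (the function and filenames are illustrative, not part of the class above), per-epoch shuffling can be layered on top while keeping reads reasonably disk-friendly:

```python
import numpy as np

def shuffled_batches(data, batch_size, rng=None):
    """Yield batches in a random order; works on memory-mapped arrays."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(data.shape[0])
    for start in range(0, data.shape[0], batch_size):
        # Sorted indices within each batch keep disk reads more sequential;
        # fancy indexing copies the selected rows into RAM
        idx = np.sort(order[start:start + batch_size])
        yield data[idx]

np.save('demo_train.npy', np.random.rand(10, 4))
data = np.load('demo_train.npy', mmap_mode='r')
for batch in shuffled_batches(data, batch_size=4):
    print(batch.shape)  # (4, 4), (4, 4), then (2, 4)
```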
