Pandas - Convert DataFrame to NumPy Array
Pandas provides two primary methods for converting DataFrames to NumPy arrays: `values` and `to_numpy()`. While `values` has been the traditional approach, `to_numpy()` is now the recommended method.
Key Insights
- Use `df.values` or `df.to_numpy()` to convert DataFrames to NumPy arrays, with `to_numpy()` being the recommended approach for better dtype handling and explicit control
- Converting to NumPy arrays eliminates column names and index information, creating a pure numerical matrix suitable for machine learning algorithms and mathematical operations
- Different conversion methods handle mixed dtypes differently—understand when Pandas will upcast to object dtype versus maintaining homogeneous types for optimal performance
Basic Conversion Methods
The example below compares the two methods on a small DataFrame.
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
})

# Using to_numpy() (recommended)
array1 = df.to_numpy()
print(array1)
print(f"Type: {type(array1)}")
print(f"Shape: {array1.shape}")

# Using values (legacy approach)
array2 = df.values
print(f"\nArrays equal: {np.array_equal(array1, array2)}")
```
Output:

```
[[ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]
 [ 4  8 12]]
Type: <class 'numpy.ndarray'>
Shape: (4, 3)

Arrays equal: True
```
The resulting array is a 2D NumPy array where each row represents a DataFrame row and columns maintain their original order.
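That element-by-element correspondence can be checked directly; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
arr = df.to_numpy()

# arr[i, j] corresponds to the DataFrame's row i, column j (positional)
assert arr[1, 0] == df.iloc[1, 0]  # value 2, from column 'A'
assert arr[0, 1] == df.iloc[0, 1]  # value 3, from column 'B'
```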
Handling Data Types
When converting DataFrames with homogeneous data types, the resulting NumPy array maintains that dtype. However, mixed dtypes require special consideration.
```python
# Homogeneous dtype
df_int = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
array_int = df_int.to_numpy()
print(f"Homogeneous dtype: {array_int.dtype}")  # int64

# Mixed dtypes - results in object dtype
df_mixed = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['x', 'y', 'z'],
    'C': [1.5, 2.5, 3.5]
})
array_mixed = df_mixed.to_numpy()
print(f"Mixed dtype: {array_mixed.dtype}")  # object

# Specify dtype explicitly
array_float = df_mixed[['A', 'C']].to_numpy(dtype='float64')
print(f"Specified dtype: {array_float.dtype}")  # float64
```
When a DataFrame contains mixed dtypes, the conversion falls back to object dtype, which is far less efficient for numerical operations. Extract homogeneous columns first or specify the desired dtype explicitly.
Converting Specific Columns
Often you need only numerical columns for machine learning pipelines. Select specific columns before conversion to maintain type consistency.
```python
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'salary': [50000, 60000, 75000, 80000],
    'active': [True, True, False, True]
})

# Convert only numerical columns
numerical_cols = ['age', 'salary']
X = df[numerical_cols].to_numpy()
print(f"Shape: {X.shape}, dtype: {X.dtype}")

# Using select_dtypes for automatic selection
numerical_data = df.select_dtypes(include=[np.number]).to_numpy()
print(f"Auto-selected shape: {numerical_data.shape}")

# Convert boolean column to integers
bool_array = df['active'].to_numpy(dtype=int)
print(f"Boolean as int: {bool_array}")
```
Handling Missing Values
Missing values (NaN) require careful handling during conversion, as they affect downstream numerical operations.
```python
df_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
})

# Direct conversion preserves NaN
array_with_nan = df_nan.to_numpy()
print("Array with NaN:")
print(array_with_nan)
print(f"dtype: {array_with_nan.dtype}")

# Fill NaN before conversion
array_filled = df_nan.fillna(0).to_numpy()
print("\nArray with NaN filled:")
print(array_filled)

# Drop rows with NaN
array_dropped = df_nan.dropna().to_numpy()
print(f"\nShape after dropping NaN: {array_dropped.shape}")

# Use interpolation
array_interpolated = df_nan.interpolate().to_numpy()
print("\nInterpolated array:")
print(array_interpolated)
```
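Because NaN forces the array to float dtype, NumPy's standard NaN utilities apply after conversion; a small sketch:

```python
import pandas as pd
import numpy as np

df_nan = pd.DataFrame({'A': [1.0, np.nan, 3.0]})
arr = df_nan.to_numpy()

# np.isnan works because the converted array has float dtype
print(np.isnan(arr).any())  # True

# NaN-aware reductions skip missing entries
print(np.nanmean(arr))  # 2.0
```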
Converting Series to Array
Single-column conversions (Series to array) follow similar patterns but produce 1D arrays.
```python
series = pd.Series([10, 20, 30, 40, 50])

# Convert Series to 1D array
array_1d = series.to_numpy()
print(f"1D array shape: {array_1d.shape}")  # (5,)

# Convert to 2D array (column vector)
array_2d = series.to_numpy().reshape(-1, 1)
print(f"2D array shape: {array_2d.shape}")  # (5, 1)

# Alternative: use values
array_values = series.values
print(f"Arrays equal: {np.array_equal(array_1d, array_values)}")
```
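One related subtlety: when selecting from a DataFrame, bracket style determines dimensionality. A short illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Single brackets select a Series, which converts to a 1D array
print(df['A'].to_numpy().shape)    # (3,)

# Double brackets select a one-column DataFrame, which converts to 2D
print(df[['A']].to_numpy().shape)  # (3, 1)
```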
Performance Considerations
For large DataFrames, conversion performance matters. `to_numpy()` is optimized, and checking the resulting memory layout helps when downstream code expects a particular order.
```python
import time

# Create large DataFrame
large_df = pd.DataFrame(
    np.random.randn(100000, 50),
    columns=[f'col_{i}' for i in range(50)]
)

# Benchmark to_numpy()
start = time.time()
array1 = large_df.to_numpy()
time1 = time.time() - start

# Benchmark values
start = time.time()
array2 = large_df.values
time2 = time.time() - start

print(f"to_numpy() time: {time1:.4f}s")
print(f"values time: {time2:.4f}s")

# Check memory layout
print(f"C-contiguous: {array1.flags['C_CONTIGUOUS']}")
print(f"F-contiguous: {array1.flags['F_CONTIGUOUS']}")
```
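If a downstream routine requires a specific layout, NumPy can convert explicitly; a brief sketch (these calls avoid copying when the array already has the requested order):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10))
arr = df.to_numpy()

# Force C-contiguous (row-major) layout
c_arr = np.ascontiguousarray(arr)
print(c_arr.flags['C_CONTIGUOUS'])  # True

# Force Fortran-contiguous (column-major) layout
f_arr = np.asfortranarray(arr)
print(f_arr.flags['F_CONTIGUOUS'])  # True
```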
Copy vs View Behavior
Understanding whether conversion creates a copy or view affects memory usage and data integrity.
```python
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# to_numpy() may return a view when the DataFrame holds a single dtype,
# but this is not guaranteed
array = df.to_numpy()

# Modify array
array[0, 0] = 999

# Check if DataFrame changed
print("DataFrame after array modification:")
print(df)  # May or may not change depending on internal structure

# Force a copy
array_copy = df.to_numpy(copy=True)
array_copy[0, 0] = 888
print("\nDataFrame after copy modification:")
print(df)  # Unaffected by the copy's modification
```
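Rather than mutating data to find out, `np.shares_memory` reports whether two arrays overlap in memory; a small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

maybe_view = df.to_numpy()            # may or may not share memory with df
forced_copy = df.to_numpy(copy=True)  # guaranteed independent copy

# Whether the default conversion shared memory depends on pandas internals
print(np.shares_memory(maybe_view, df.values))

# copy=True always yields an independent buffer
print(np.shares_memory(forced_copy, df.values))  # False
```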
Integration with Scikit-learn
Converting DataFrames to NumPy arrays fits naturally into scikit-learn workflows; recent scikit-learn versions also accept DataFrames directly, but converting explicitly gives you control over dtypes and memory.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'feature1': np.random.randn(1000),
    'feature2': np.random.randn(1000),
    'feature3': np.random.randn(1000),
    'target': np.random.randint(0, 2, 1000)
})

# Separate features and target
X = df[['feature1', 'feature2', 'feature3']].to_numpy()
y = df['target'].to_numpy()

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")
print(f"Scaled data type: {X_train_scaled.dtype}")
```
Preserving Index and Column Information
If you need to maintain metadata while working with arrays, store index and column information separately.
```python
df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    index=['row1', 'row2', 'row3']
)

# Convert to array
array = df.to_numpy()

# Store metadata
columns = df.columns.tolist()
index = df.index.tolist()

# Perform array operations
result_array = array * 2

# Reconstruct DataFrame
result_df = pd.DataFrame(result_array, columns=columns, index=index)
print(result_df)
```
This approach enables efficient numerical computation while preserving the ability to reconstruct labeled DataFrames when needed.
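That round trip can be wrapped in a small helper; `apply_as_array` below is a hypothetical name for illustration, not a pandas API:

```python
import pandas as pd

def apply_as_array(df, func):
    """Apply an array function to the DataFrame's data, preserving labels."""
    result = func(df.to_numpy())
    return pd.DataFrame(result, index=df.index, columns=df.columns)

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['r1', 'r2', 'r3'])
doubled = apply_as_array(df, lambda a: a * 2)
print(doubled.loc['r2', 'B'])  # 10
```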