How to Convert DataFrame to NumPy Array in Pandas
Key Insights
- Use .to_numpy() instead of the legacy .values attribute—it’s the modern, explicit approach with better control over dtype conversion and NaN handling
- Mixed data types in DataFrames will upcast to object dtype when converted, killing NumPy’s performance benefits—convert homogeneous column subsets separately
- The copy parameter matters for large datasets: setting copy=False can save memory but creates arrays that share memory with the original DataFrame
Converting a pandas DataFrame to a NumPy array is one of those operations you’ll reach for constantly. Machine learning libraries like scikit-learn expect NumPy arrays. Mathematical operations run faster on raw arrays. Some visualization tools want arrays, not DataFrames. Whatever your reason, pandas gives you several ways to make this conversion—but only one you should actually use in new code.
Let’s set up our environment and walk through the options.
import pandas as pd
import numpy as np
# Sample DataFrame for examples
df = pd.DataFrame({
'temperature': [72.5, 68.3, 75.1, 69.8, 71.2],
'humidity': [45, 52, 38, 61, 55],
'pressure': [1013.2, 1015.8, 1012.1, 1014.5, 1013.9]
})
Using the .values Attribute (Legacy Method)
You’ll see .values everywhere in older codebases and Stack Overflow answers. It works, it’s concise, and it returns the underlying NumPy array:
array = df.values
print(array)
# [[72.5 45. 1013.2]
# [68.3 52. 1015.8]
# [75.1 38. 1012.1]
# [69.8 61. 1014.5]
# [71.2 55. 1013.9]]
print(type(array))
# <class 'numpy.ndarray'>
Simple enough. But here’s the problem: .values is ambiguous about what it returns. For most DataFrames, you get an ndarray. But for DataFrames with extension types (like DatetimeTZDtype or SparseDtype), you might get something else entirely—an ExtensionArray that isn’t actually a NumPy array.
The pandas documentation now explicitly recommends against using .values for this reason. It’s not technically deprecated with a warning, but it’s considered legacy. Don’t use it in new code.
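To see the ambiguity concretely, here is a minimal sketch using a timezone-aware datetime Series (an extension type). The exact behavior can vary by pandas version, but on recent releases .values silently converts to UTC and drops the timezone, while .to_numpy() preserves it by falling back to object dtype:

```python
import pandas as pd
import numpy as np

# A tz-aware datetime column uses an extension type (DatetimeTZDtype)
ser = pd.Series(pd.date_range("2024-01-01", periods=3, tz="US/Eastern"))

# .values converts to UTC and discards the timezone
print(ser.values.dtype)      # datetime64[ns], tz info gone

# .to_numpy() keeps the timezone by returning an object-dtype
# array of tz-aware Timestamps
print(ser.to_numpy().dtype)  # object
```

Neither result is wrong, but only .to_numpy() makes the trade-off visible and tunable (you can pass dtype="datetime64[ns]" if you want the UTC ndarray explicitly).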
Using .to_numpy() Method (Recommended)
Introduced in pandas 0.24, .to_numpy() is the explicit, unambiguous way to get a NumPy array from your DataFrame:
array = df.to_numpy()
print(array)
# [[72.5 45. 1013.2]
# [68.3 52. 1015.8]
# [75.1 38. 1012.1]
# [69.8 61. 1014.5]
# [71.2 55. 1013.9]]
The real power comes from its parameters. You can specify the output dtype:
# Force float32 for memory efficiency
array_f32 = df.to_numpy(dtype=np.float32)
print(array_f32.dtype)
# float32
# Force integers (careful with floats!)
array_int = df[['humidity']].to_numpy(dtype=np.int64)
print(array_int)
# [[45]
# [52]
# [38]
# [61]
# [55]]
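The "careful with floats" warning deserves a concrete illustration. A minimal sketch, using the same sample data as above: casting a float column to an integer dtype truncates the decimals rather than rounding them.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'temperature': [72.5, 68.3, 75.1, 69.8, 71.2],
    'humidity': [45, 52, 38, 61, 55],
})

# Casting floats to int truncates toward zero; it does not round
temps_int = df[['temperature']].to_numpy(dtype=np.int64)
print(temps_int.ravel())
# [72 68 75 69 71]   (69.8 became 69, not 70)

# Round first if rounding is what you actually want
# (note: .round() rounds half to even, so 72.5 -> 72)
temps_rounded = df[['temperature']].round().to_numpy(dtype=np.int64)
print(temps_rounded.ravel())
# [72 68 75 70 71]
```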
Handling missing values is where .to_numpy() really shines. NumPy doesn’t have a native concept of “missing” for numeric types—it uses np.nan, which is a float. The na_value parameter lets you specify what to substitute:
df_with_nan = pd.DataFrame({
'A': [1.0, 2.0, np.nan, 4.0],
'B': [5.0, np.nan, 7.0, 8.0]
})
# Default behavior: NaN stays as NaN
array_default = df_with_nan.to_numpy()
print(array_default)
# [[ 1. 5.]
# [ 2. nan]
# [nan 7.]
# [ 4. 8.]]
# Replace NaN with a sentinel value
array_filled = df_with_nan.to_numpy(dtype=np.float64, na_value=-999.0)
print(array_filled)
# [[ 1. 5.]
# [ 2. -999.]
# [-999. 7.]
# [ 4. 8.]]
This is cleaner than chaining .fillna(-999).to_numpy() and makes your intent explicit.
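The na_value parameter pays off even more with pandas' nullable dtypes, where missing values are pd.NA rather than np.nan. A short sketch:

```python
import pandas as pd
import numpy as np

# Nullable integer columns hold pd.NA, which has no NumPy equivalent
ser = pd.Series([10, 20, pd.NA], dtype="Int64")

# Without na_value, conversion must fall back to object dtype
print(ser.to_numpy().dtype)  # object

# na_value lets you land directly in a real integer array
arr = ser.to_numpy(dtype="int64", na_value=-1)
print(arr, arr.dtype)
# [10 20 -1] int64
```

Without na_value there is simply no int64 slot for a missing value, so this is the only one-step path from a nullable integer column to a true integer ndarray.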
Converting Specific Columns or Rows
You rarely need the entire DataFrame as an array. More often, you want specific features for a model or a subset of rows for batch processing.
For columns, select them first, then convert:
# Single column returns 1D array
temp_array = df['temperature'].to_numpy()
print(temp_array.shape)
# (5,)
# Multiple columns return 2D array
features = df[['temperature', 'humidity']].to_numpy()
print(features.shape)
# (5, 2)
# Using .loc for label-based selection
subset = df.loc[:, 'temperature':'humidity'].to_numpy()
print(subset.shape)
# (5, 2)
For rows, use .iloc or boolean indexing:
# First 3 rows
first_batch = df.iloc[0:3].to_numpy()
print(first_batch)
# [[72.5 45. 1013.2]
# [68.3 52. 1015.8]
# [75.1 38. 1012.1]]
# Conditional selection
high_temp = df[df['temperature'] > 70].to_numpy()
print(high_temp)
# [[72.5 45. 1013.2]
# [75.1 38. 1012.1]
# [71.2 55. 1013.9]]
This pattern is essential for machine learning workflows where you need to separate features from targets:
X = df[['humidity', 'pressure']].to_numpy()
y = df['temperature'].to_numpy()
print(f"Features shape: {X.shape}") # (5, 2)
print(f"Target shape: {y.shape}") # (5,)
Handling Mixed Data Types
Here’s where things get tricky. DataFrames happily hold mixed types—strings, integers, floats, datetimes—in different columns. NumPy arrays are homogeneous. Something has to give.
When you convert a DataFrame with mixed types, NumPy upcasts everything to a common type that can hold all values. Usually, that’s object:
mixed_df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'salary': [50000.0, 60000.0, 70000.0]
})
mixed_array = mixed_df.to_numpy()
print(mixed_array.dtype)
# object
print(mixed_array)
# [['Alice' 25 50000.0]
# ['Bob' 30 60000.0]
# ['Charlie' 35 70000.0]]
This is bad. An object dtype array is essentially a Python list with array syntax. You lose all the performance benefits of NumPy’s contiguous memory layout and vectorized operations.
The solution: convert homogeneous subsets separately:
# Numeric columns only
numeric_cols = mixed_df.select_dtypes(include=[np.number]).columns
numeric_array = mixed_df[numeric_cols].to_numpy()
print(numeric_array.dtype)
# float64
# String columns separately
string_cols = mixed_df.select_dtypes(include=['object']).columns
string_array = mixed_df[string_cols].to_numpy()
print(string_array.dtype)
# object (unavoidable for strings)
For machine learning, you’ll typically encode categorical variables first anyway:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
mixed_df['name_encoded'] = le.fit_transform(mixed_df['name'])
# Now select only numeric columns
features = mixed_df[['name_encoded', 'age', 'salary']].to_numpy()
print(features.dtype)
# float64
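If you'd rather not pull in scikit-learn just for the encoding step, pandas' own pd.factorize produces the same kind of integer codes. A minimal sketch on the same sample data:

```python
import pandas as pd
import numpy as np

mixed_df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000.0, 60000.0, 70000.0]
})

# pd.factorize assigns codes in order of first appearance
codes, uniques = pd.factorize(mixed_df['name'])
print(codes)    # [0 1 2]
print(uniques)  # Index(['Alice', 'Bob', 'Charlie'], dtype='object')

mixed_df['name_encoded'] = codes
features = mixed_df[['name_encoded', 'age', 'salary']].to_numpy()
print(features.dtype)  # float64
```

One design difference worth knowing: LabelEncoder sorts labels alphabetically before assigning codes, while factorize numbers them in order of appearance, so the two can produce different codes for the same column.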
Memory and Performance Considerations
The copy parameter controls whether .to_numpy() returns a copy of the data or a view into the DataFrame’s underlying memory:
df = pd.DataFrame({
'A': [1.0, 2.0, 3.0],
'B': [4.0, 5.0, 6.0]
})
# Default: may or may not copy depending on internal structure
array_default = df.to_numpy()
# Explicit copy
array_copy = df.to_numpy(copy=True)
# Request no copy (may still copy if necessary)
array_view = df.to_numpy(copy=False)
You can check whether memory is shared using NumPy’s utility:
print(np.shares_memory(df.to_numpy(copy=False), df['A'].to_numpy()))
# True (typically, for contiguous float data)
print(np.shares_memory(df.to_numpy(copy=True), df['A'].to_numpy()))
# False (forced copy)
Why does this matter? For large DataFrames, copying means doubling memory usage temporarily. If you’re working with a 10GB DataFrame, that copy could push you into swap or crash your process.
But views are dangerous too. Modifying the array modifies the DataFrame:
array_view = df.to_numpy(copy=False)
array_view[0, 0] = 999.0
print(df.iloc[0, 0])
# 999.0 (DataFrame was modified!)
My recommendation: use copy=True unless you’re certain you won’t modify the array and you need the memory savings. Debugging shared-memory mutations is painful.
Note that copy=False doesn’t guarantee a view—it just requests one. If the DataFrame’s internal structure requires a copy (like non-contiguous memory or type conversion), pandas will copy anyway. It’s a hint, not a command.
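Here is a minimal sketch of that hint-not-command behavior. When columns have different numeric dtypes, the conversion has to allocate a fresh upcast array, so copy=False is silently ignored:

```python
import pandas as pd
import numpy as np

# int64 + float64 columns cannot share one homogeneous buffer,
# so to_numpy(copy=False) must allocate a new float64 array anyway
df_mixed = pd.DataFrame({'a': [1, 2, 3], 'b': [1.5, 2.5, 3.5]})
arr = df_mixed.to_numpy(copy=False)

print(arr.dtype)  # float64 (the common upcast)
print(np.shares_memory(arr, df_mixed['b'].to_numpy()))
# False: a copy happened despite copy=False
```

This is why np.shares_memory is worth a quick check before you rely on view semantics for either memory savings or write-through behavior.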
Conclusion
For new code, always use .to_numpy(). It’s explicit about what it returns, gives you control over dtype conversion and NaN handling, and follows current pandas best practices.
Quick reference:
# Basic conversion
array = df.to_numpy()
# With specific dtype
array = df.to_numpy(dtype=np.float32)
# Handling NaN values
array = df.to_numpy(na_value=0.0)
# Specific columns
array = df[['col1', 'col2']].to_numpy()
# Avoid memory copy for large data (use carefully)
array = df.to_numpy(copy=False)
If you’re going the other direction—converting a NumPy array back to a DataFrame—check out pd.DataFrame(array, columns=['col1', 'col2']). The process is straightforward, but column naming and index handling have their own quirks worth understanding.
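For completeness, a brief sketch of that reverse trip. Two quirks to watch: the rebuilt DataFrame gets a fresh default index, and every column inherits the array's single dtype, so integer columns come back as floats until you cast them:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'temperature': [72.5, 68.3, 75.1],
    'humidity': [45, 52, 38],   # int64 column
})
arr = df.to_numpy()  # upcasts everything to float64

# Rebuild with the original column names
df_back = pd.DataFrame(arr, columns=df.columns)
print(df_back.dtypes)
# temperature    float64
# humidity       float64  (was int64!)

# Restore the integer dtype explicitly
df_back = df_back.astype({'humidity': 'int64'})
print(df_back['humidity'].dtype)  # int64
```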