Pandas - Create DataFrame from NumPy Array
Key Insights
• Creating DataFrames from NumPy arrays requires understanding dimensionality: 1D arrays become single columns, while 2D arrays map rows and columns directly to DataFrame structure
• Plain NumPy arrays carry no column names or row labels, so provide them explicitly; otherwise pandas falls back to default integer labels
• Performance considerations matter at scale: pre-allocating NumPy arrays before DataFrame conversion is significantly faster than iteratively building DataFrames for large datasets
Basic DataFrame Creation from 2D Arrays
The most common scenario involves converting a 2D NumPy array into a DataFrame. The array’s first dimension represents rows, and the second represents columns.
import numpy as np
import pandas as pd
# Create a 2D NumPy array
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
# Convert to DataFrame
df = pd.DataFrame(data)
print(df)
Output:
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
Without explicit labels, pandas assigns default integer indices for both rows and columns. For production code, always specify meaningful column names:
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
Setting Custom Row Indices
Row indices provide meaningful labels for accessing data. Set them during DataFrame creation with the index parameter, or afterward by assigning to df.index:
# Set index during creation
df = pd.DataFrame(data,
                  columns=['Revenue', 'Cost', 'Profit'],
                  index=['Q1', 'Q2', 'Q3'])
print(df)
Output:
    Revenue  Cost  Profit
Q1        1     2       3
Q2        4     5       6
Q3        7     8       9
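When the labels are only known after the DataFrame exists, they can be attached later. The following is a minimal sketch that reuses the quarterly labels from above; the variable name q2_cost is illustrative only.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Start with default integer row labels
df = pd.DataFrame(data, columns=['Revenue', 'Cost', 'Profit'])

# Assign row labels afterward; the length must match the number of rows
df.index = ['Q1', 'Q2', 'Q3']

# Label-based access now works through .loc
q2_cost = df.loc['Q2', 'Cost']
print(q2_cost)
```

Assigning a list whose length differs from the row count raises an error, so this is best done right after construction while the shape is still obvious.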
For time-series data, use datetime indices:
dates = pd.date_range('2024-01-01', periods=3)
df = pd.DataFrame(data,
                  columns=['Revenue', 'Cost', 'Profit'],
                  index=dates)
print(df)
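A datetime index mainly pays off when selecting by date. As a rough illustration, assuming the same 3x3 array and daily dates as above, rows can be picked by timestamp or sliced by date strings:

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
dates = pd.date_range('2024-01-01', periods=3)  # daily frequency by default
df = pd.DataFrame(data, columns=['Revenue', 'Cost', 'Profit'], index=dates)

# Select a single day by its timestamp; the result is a Series
jan_2 = df.loc[pd.Timestamp('2024-01-02')]

# Slice a date range with strings; both endpoints are included
first_two = df.loc['2024-01-01':'2024-01-02']
print(first_two)
```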
Converting 1D Arrays to DataFrames
One-dimensional NumPy arrays create single-column DataFrames. The array length determines the number of rows:
# 1D array
arr_1d = np.array([10, 20, 30, 40, 50])
# Creates a single column
df = pd.DataFrame(arr_1d, columns=['Values'])
print(df)
Output:
   Values
0      10
1      20
2      30
3      40
4      50
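If the 1D array should instead become a single row, one option is to reshape it to two dimensions first. This sketch contrasts both orientations; the column names a through e are placeholders.

```python
import numpy as np
import pandas as pd

arr_1d = np.array([10, 20, 30, 40, 50])

# Default behavior: one column, five rows
df_col = pd.DataFrame(arr_1d, columns=['Values'])

# Reshape to (1, 5) to get one row with five columns instead
df_row = pd.DataFrame(arr_1d.reshape(1, -1),
                      columns=['a', 'b', 'c', 'd', 'e'])

print(df_col.shape)  # (5, 1)
print(df_row.shape)  # (1, 5)
```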
To create multiple columns from separate 1D arrays, stack them horizontally:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr3 = np.array([7, 8, 9])
# Stack arrays column-wise
combined = np.column_stack([arr1, arr2, arr3])
df = pd.DataFrame(combined, columns=['X', 'Y', 'Z'])
print(df)
Output:
   X  Y  Z
0  1  4  7
1  2  5  8
2  3  6  9
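Stacking works well when all arrays share a dtype. When they do not, np.column_stack upcasts everything to one common type, and passing a dict of arrays to pd.DataFrame may be a better fit. A small sketch with made-up ids and scores arrays:

```python
import numpy as np
import pandas as pd

ids = np.array([1, 2, 3])            # integers
scores = np.array([0.5, 1.5, 2.5])   # floats

# column_stack builds one 2D array, so the integers are upcast to float
stacked = pd.DataFrame(np.column_stack([ids, scores]),
                       columns=['id', 'score'])

# A dict of 1D arrays keeps each column's own dtype
by_dict = pd.DataFrame({'id': ids, 'score': scores})

print(stacked.dtypes)
print(by_dict.dtypes)
```

The dict form also names the columns in the same step, which removes the need for a separate columns argument.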
Handling Different Data Types
A NumPy array has a single dtype for all of its elements, and pandas carries that dtype over to every column during conversion:
# Homogeneous float data
float_data = np.array([[1.5, 2.7, 3.9],
                       [4.1, 5.3, 6.8]])
df_float = pd.DataFrame(float_data, columns=['A', 'B', 'C'])
print(df_float.dtypes)
Output:
A    float64
B    float64
C    float64
dtype: object
For structured arrays with named fields:
# Structured array with different types
structured = np.array([(1, 'Alice', 25.5),
                       (2, 'Bob', 30.2),
                       (3, 'Charlie', 28.7)],
                      dtype=[('id', 'i4'), ('name', 'U10'), ('score', 'f4')])
df = pd.DataFrame(structured)
print(df)
print("\nData types:")
print(df.dtypes)
Output:
id name score
0 1 Alice 25.5
1 2 Bob 30.2
2 3 Charlie 28.7
Data types:
id int32
name object
score float32
dtype: object
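When the array's dtype is not the one you want in the DataFrame, it can be changed at or after conversion. The sketch below shows two common options, the dtype argument of the constructor and astype; the array int_data is made up for illustration.

```python
import numpy as np
import pandas as pd

int_data = np.array([[1, 2], [3, 4]])

# Cast every column while constructing the DataFrame
df32 = pd.DataFrame(int_data, columns=['A', 'B'], dtype=np.float32)

# Or cast selected columns afterward with astype
df_mixed = pd.DataFrame(int_data, columns=['A', 'B']).astype({'B': 'float64'})

print(df32.dtypes)
print(df_mixed.dtypes)
```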
Working with Random Data and Array Generators
NumPy’s random module combined with DataFrame creation is useful for testing and simulations:
# Generate random data
np.random.seed(42)
random_data = np.random.randn(5, 4) # 5 rows, 4 columns
df = pd.DataFrame(random_data,
                  columns=['Feature1', 'Feature2', 'Feature3', 'Feature4'],
                  index=[f'Sample_{i}' for i in range(5)])
print(df.round(3))
Output:
          Feature1  Feature2  Feature3  Feature4
Sample_0     0.497    -0.138     0.648     1.523
Sample_1    -0.234    -0.234     1.579     0.767
Sample_2    -0.469     0.543    -0.463    -0.466
Sample_3     0.242    -1.913    -1.725    -0.562
Sample_4    -1.013     0.314    -0.908    -1.412
For integer ranges:
# Create sequential data
sequential = np.arange(20).reshape(5, 4)
df = pd.DataFrame(sequential, columns=list('ABCD'))
print(df)
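NumPy also offers a newer Generator-based random API, which its documentation recommends for new code. As a hedged alternative to np.random.seed, the same kind of test DataFrame could be built as follows; the seed 42 is arbitrary, and the values differ from those produced by the legacy np.random.randn call.

```python
import numpy as np
import pandas as pd

# Generator-based API; the seed value 42 is arbitrary
rng = np.random.default_rng(42)
random_data = rng.standard_normal((5, 4))  # 5 rows, 4 columns

df = pd.DataFrame(random_data,
                  columns=[f'Feature{i}' for i in range(1, 5)],
                  index=[f'Sample_{i}' for i in range(5)])

# A fresh generator with the same seed reproduces the same values
same = np.random.default_rng(42).standard_normal((5, 4))
print(np.array_equal(random_data, same))
```

Keeping the generator in a local variable avoids touching NumPy's global random state, which makes test data easier to reproduce in isolation.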
Transposing Arrays Before Conversion
Sometimes your array orientation doesn’t match your desired DataFrame structure. Transpose before conversion:
# Array where rows should be columns
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])
# Without transpose: 2 rows, 4 columns
df1 = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])
print("Original:\n", df1)
# With transpose: 4 rows, 2 columns
df2 = pd.DataFrame(arr.T, columns=['Series1', 'Series2'])
print("\nTransposed:\n", df2)
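A mismatch between the array's orientation and the supplied labels surfaces as a ValueError at construction time. The sketch below shows one way to detect that situation and transpose only when it helps; the names list and the mismatch_detected flag are illustrative, not part of any pandas API.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])
names = ['Series1', 'Series2']

try:
    # The array has 4 columns, but only 2 names are supplied
    pd.DataFrame(arr, columns=names)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Transpose when the label count matches the first axis instead of the second
if arr.shape[1] != len(names) and arr.shape[0] == len(names):
    arr = arr.T

df = pd.DataFrame(arr, columns=names)
print(df.shape)  # (4, 2)
```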
Performance Considerations for Large Arrays
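When working with large datasets, pre-allocate a NumPy array, fill it, and convert it once instead of building the DataFrame iteratively. Each pd.concat call copies every existing row, so growing a DataFrame row by row costs quadratic time overall: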
import time
# Inefficient: grow the DataFrame one row at a time
start = time.perf_counter()
df_slow = pd.DataFrame()
for i in range(10000):
    df_slow = pd.concat([df_slow, pd.DataFrame([[i, i*2, i*3]])], ignore_index=True)
slow_time = time.perf_counter() - start
# Efficient: pre-allocate the array, fill it, convert once
start = time.perf_counter()
data = np.zeros((10000, 3))
for i in range(10000):
    data[i] = [i, i*2, i*3]
df_fast = pd.DataFrame(data, columns=['A', 'B', 'C'])
fast_time = time.perf_counter() - start
print(f"Slow method: {slow_time:.4f}s")
print(f"Fast method: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x")
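When each row follows a formula, as in this benchmark, the Python loop can often be dropped altogether. The following sketch builds the same values with array arithmetic; note that it produces integer columns, whereas np.zeros yields floats.

```python
import numpy as np
import pandas as pd

n = 10000
i = np.arange(n)

# Build all three columns with array arithmetic, no Python loop
data = np.column_stack([i, i * 2, i * 3])
df_vec = pd.DataFrame(data, columns=['A', 'B', 'C'])

print(df_vec.tail(1))
```

Timings depend on hardware and library versions, so it is worth measuring this variant against the pre-allocated loop on your own data.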
Converting Back to NumPy Arrays
Extract NumPy arrays from DataFrames using .values or .to_numpy():
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B'])
# Extract as NumPy array
arr = df.to_numpy()
print(type(arr)) # <class 'numpy.ndarray'>
print(arr)
# Extract specific columns
arr_subset = df[['A']].to_numpy()
print(arr_subset.shape) # (2, 1)
The .to_numpy() method is preferred over .values: the pandas documentation recommends it, and it accepts dtype, copy, and na_value arguments that control the resulting array.
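Because a NumPy array holds a single dtype, converting a DataFrame with mixed column types upcasts the result. This short sketch, using a made-up two-column DataFrame, shows the upcast along with the dtype and copy arguments of to_numpy():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.array([1, 2]), 'B': np.array([1.5, 2.5])})

# Mixed int/float columns are upcast to one common dtype
arr = df.to_numpy()
print(arr.dtype)  # float64

# Request a specific dtype, and force a copy so edits do not touch df
arr32 = df.to_numpy(dtype=np.float32, copy=True)
arr32[0, 0] = 99
print(df.loc[0, 'A'])  # still 1
```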
Practical Example: Image Data Processing
A real-world scenario involves processing image data stored as NumPy arrays:
# Simulate image data (height, width, channels)
image_array = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)
# Flatten to 2D for analysis (pixels x channels)
flattened = image_array.reshape(-1, 3)
# Create DataFrame for pixel analysis
df_pixels = pd.DataFrame(flattened, columns=['Red', 'Green', 'Blue'])
# Calculate statistics
print("Channel statistics:")
print(df_pixels.describe())
# Find pixels with high red values
high_red = df_pixels[df_pixels['Red'] > 200]
print(f"\nPixels with Red > 200: {len(high_red)}")
This pattern applies to any multidimensional scientific data where pandas’ analytical capabilities complement NumPy’s computational efficiency.
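Flattening loses the explicit pixel coordinates, but they can be recovered from the row position. The sketch below assumes the same simulated 100x100 RGB image and shows both the row lookup for a given pixel and the reverse reshape; the height, width, r, and c names are illustrative.

```python
import numpy as np
import pandas as pd

# Simulated image, as in the example above
height, width = 100, 100
image_array = np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8)

df_pixels = pd.DataFrame(image_array.reshape(-1, 3),
                         columns=['Red', 'Green', 'Blue'])

# Row-major flattening means pixel (r, c) lands at row r * width + c
r, c = 10, 20
row = df_pixels.iloc[r * width + c]

# Reverse the reshape to recover the original image
restored = df_pixels.to_numpy().reshape(height, width, 3)
print(np.array_equal(restored, image_array))
```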