Pandas - Create DataFrame from NumPy Array

Key Insights

• Creating DataFrames from NumPy arrays requires understanding dimensionality: 1D arrays become single columns, while 2D arrays map rows and columns directly to the DataFrame structure
• NumPy arrays carry no metadata about column names or row indices, so labels must be passed explicitly; otherwise pandas falls back to default integer labels
• Performance matters at scale: filling a pre-allocated NumPy array and converting it once is typically much faster than building a DataFrame iteratively

Basic DataFrame Creation from 2D Arrays

The most common scenario involves converting a 2D NumPy array into a DataFrame. The array’s first dimension represents rows, and the second represents columns.

import numpy as np
import pandas as pd

# Create a 2D NumPy array
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Convert to DataFrame
df = pd.DataFrame(data)
print(df)

Output:

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

Without explicit labels, pandas assigns default integer indices for both rows and columns. For production code, always specify meaningful column names:

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)

Output:

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
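The number of column labels has to match the array's second dimension. The following is a minimal sketch of checking that up front and of what happens when the lengths disagree; the `labels` and `mismatch_error` names are only illustrative.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# The label count must equal the size of the second dimension
n_rows, n_cols = data.shape
labels = ['A', 'B', 'C']
if len(labels) != n_cols:
    raise ValueError(f"expected {n_cols} labels, got {len(labels)}")

df = pd.DataFrame(data, columns=labels)

# Passing too few labels is rejected when the DataFrame is constructed
mismatch_error = None
try:
    pd.DataFrame(data, columns=['A', 'B'])
except ValueError as exc:
    mismatch_error = str(exc)
    print("Rejected:", mismatch_error)
```

Checking `data.shape` before conversion tends to give a clearer error message than the one pandas raises on its own.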

Setting Custom Row Indices

Row indices provide meaningful labels for accessing data. Set them during DataFrame creation with the index parameter, or afterward by assigning to the DataFrame's index attribute:

# Set index during creation
df = pd.DataFrame(data, 
                  columns=['Revenue', 'Cost', 'Profit'],
                  index=['Q1', 'Q2', 'Q3'])
print(df)

Output:

    Revenue  Cost  Profit
Q1        1     2       3
Q2        4     5       6
Q3        7     8       9

For time-series data, use datetime indices:

dates = pd.date_range('2024-01-01', periods=3)
df = pd.DataFrame(data, 
                  columns=['Revenue', 'Cost', 'Profit'],
                  index=dates)
print(df)
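Row labels can also be attached after the DataFrame already exists. As a short sketch, the example below assigns labels through the `index` attribute and, alternatively, promotes an existing column with `set_index`; the choice of `Revenue` as the promoted column is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
df = pd.DataFrame(data, columns=['Revenue', 'Cost', 'Profit'])

# Option 1: assign labels directly (length must match the row count)
df.index = ['Q1', 'Q2', 'Q3']

# Option 2: promote an existing column to the index (returns a new DataFrame)
df_by_revenue = df.set_index('Revenue')

print(df.loc['Q2', 'Profit'])     # label-based lookup now works
print(df_by_revenue.index.name)
```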

Converting 1D Arrays to DataFrames

One-dimensional NumPy arrays create single-column DataFrames. The array length determines the number of rows:

# 1D array
arr_1d = np.array([10, 20, 30, 40, 50])

# Creates a single column
df = pd.DataFrame(arr_1d, columns=['Values'])
print(df)

Output:

   Values
0      10
1      20
2      30
3      40
4      50

To create multiple columns from separate 1D arrays, stack them horizontally:

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr3 = np.array([7, 8, 9])

# Stack arrays column-wise
combined = np.column_stack([arr1, arr2, arr3])
df = pd.DataFrame(combined, columns=['X', 'Y', 'Z'])
print(df)

Output:

   X  Y  Z
0  1  4  7
1  2  5  8
2  3  6  9
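One caveat with stacking is that the result is a single 2D array with one shared dtype. When the 1D arrays have different types, passing them as a dict is one way to keep each column's dtype. The sketch below compares both approaches; the `ids` and `scores` arrays are made up for illustration.

```python
import numpy as np
import pandas as pd

ids = np.array([1, 2, 3], dtype=np.int64)
scores = np.array([0.5, 1.5, 2.5], dtype=np.float64)

# column_stack builds one 2D array, so both columns end up as float64
stacked = pd.DataFrame(np.column_stack([ids, scores]), columns=['id', 'score'])

# A dict keeps each 1D array as its own column with its own dtype
separate = pd.DataFrame({'id': ids, 'score': scores})

print(stacked.dtypes)
print(separate.dtypes)
```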

Handling Different Data Types

NumPy arrays can contain various data types. Pandas preserves these types during conversion:

# Floating-point data (a 2D array has a single dtype for all elements)
float_data = np.array([[1.5, 2.7, 3.9],
                       [4.1, 5.3, 6.8]])

df_float = pd.DataFrame(float_data, columns=['A', 'B', 'C'])
print(df_float.dtypes)

Output:

A    float64
B    float64
C    float64
dtype: object
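Because a 2D array has a single dtype, every column starts out with that type. If individual columns should use a different type, they can be cast after conversion, or the whole frame can be cast at construction. This is a small sketch with hypothetical column names `count` and `ratio`.

```python
import numpy as np
import pandas as pd

float_data = np.array([[1.0, 2.7],
                       [4.0, 5.3]])

df = pd.DataFrame(float_data, columns=['count', 'ratio'])

# Cast a single column after conversion; other columns are left untouched
df = df.astype({'count': 'int64'})

# Cast every column at construction time with the dtype argument
df32 = pd.DataFrame(float_data, columns=['count', 'ratio'], dtype='float32')

print(df.dtypes)
print(df32.dtypes)
```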

For structured arrays with named fields:

# Structured array with different types
structured = np.array([(1, 'Alice', 25.5),
                       (2, 'Bob', 30.2),
                       (3, 'Charlie', 28.7)],
                      dtype=[('id', 'i4'), ('name', 'U10'), ('score', 'f4')])

df = pd.DataFrame(structured)
print(df)
print("\nData types:")
print(df.dtypes)

Output:

   id     name  score
0   1    Alice   25.5
1   2      Bob   30.2
2   3  Charlie   28.7

Data types:
id         int32
name      object
score    float32
dtype: object
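Structured arrays can also be converted with `pd.DataFrame.from_records`, which accepts a field name to use as the row index. The sketch below assumes the `id` field is a sensible index for this data.

```python
import numpy as np
import pandas as pd

structured = np.array([(1, 'Alice', 25.5),
                       (2, 'Bob', 30.2),
                       (3, 'Charlie', 28.7)],
                      dtype=[('id', 'i4'), ('name', 'U10'), ('score', 'f4')])

# Use the 'id' field as the row index instead of a regular column
df = pd.DataFrame.from_records(structured, index='id')

print(df)
print(df.loc[2, 'name'])
```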

Working with Random Data and Array Generators

NumPy’s random module combined with DataFrame creation is useful for testing and simulations:

# Generate random data
np.random.seed(42)
random_data = np.random.randn(5, 4)  # 5 rows, 4 columns

df = pd.DataFrame(random_data, 
                  columns=['Feature1', 'Feature2', 'Feature3', 'Feature4'],
                  index=[f'Sample_{i}' for i in range(5)])
print(df.round(3))

Output:

          Feature1  Feature2  Feature3  Feature4
Sample_0     0.497    -0.138     0.648     1.523
Sample_1    -0.234    -0.234     1.579     0.767
Sample_2    -0.469     0.543    -0.463    -0.466
Sample_3     0.242    -1.913    -1.725    -0.562
Sample_4    -1.013     0.314    -0.908    -1.412

For integer ranges:

# Create sequential data
sequential = np.arange(20).reshape(5, 4)
df = pd.DataFrame(sequential, columns=list('ABCD'))
print(df)
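For new code, NumPy's documentation recommends the `Generator` API (`np.random.default_rng`) over the legacy global seed. The sketch below mirrors the random-data example with that API. The values it produces differ from those of `np.random.seed(42)` with `randn`, since the two use different generators.

```python
import numpy as np
import pandas as pd

# A local generator avoids mutating NumPy's global random state
rng = np.random.default_rng(42)
random_data = rng.standard_normal((5, 4))  # 5 rows, 4 columns

df = pd.DataFrame(random_data,
                  columns=[f'Feature{i}' for i in range(1, 5)],
                  index=[f'Sample_{i}' for i in range(5)])

# Re-creating the generator with the same seed reproduces the same values
again = np.random.default_rng(42).standard_normal((5, 4))
print(np.array_equal(df.to_numpy(), again))
```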

Transposing Arrays Before Conversion

Sometimes your array orientation doesn’t match your desired DataFrame structure. Transpose before conversion:

# Array where rows should be columns
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])

# Without transpose: 2 rows, 4 columns
df1 = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])
print("Original:\n", df1)

# With transpose: 4 rows, 2 columns
df2 = pd.DataFrame(arr.T, columns=['Series1', 'Series2'])
print("\nTransposed:\n", df2)
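When the orientation of incoming arrays is not known in advance, the label count can decide whether a transpose is needed. The helper below is a hypothetical sketch (`frame_with_columns` is not a pandas function); for square arrays it assumes the array is already row-oriented.

```python
import numpy as np
import pandas as pd

def frame_with_columns(arr, labels):
    """Build a DataFrame, transposing if the labels match the first axis."""
    if arr.ndim != 2:
        raise ValueError("expected a 2D array")
    if arr.shape[1] == len(labels):
        return pd.DataFrame(arr, columns=labels)    # already rows x columns
    if arr.shape[0] == len(labels):
        return pd.DataFrame(arr.T, columns=labels)  # each row is one series
    raise ValueError(f"{len(labels)} labels do not fit shape {arr.shape}")

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])

df = frame_with_columns(arr, ['Series1', 'Series2'])
print(df.shape)
```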

Performance Considerations for Large Arrays

When working with large datasets, pre-allocating NumPy arrays and converting once is more efficient than building DataFrames iteratively:

import time

# Inefficient: Building DataFrame row by row
start = time.perf_counter()  # perf_counter is a monotonic, high-resolution timer
df_slow = pd.DataFrame()
for i in range(10000):
    df_slow = pd.concat([df_slow, pd.DataFrame([[i, i*2, i*3]])], ignore_index=True)
slow_time = time.perf_counter() - start

# Efficient: Pre-allocate array, convert once
start = time.perf_counter()
data = np.zeros((10000, 3))
for i in range(10000):
    data[i] = [i, i*2, i*3]
df_fast = pd.DataFrame(data, columns=['A', 'B', 'C'])
fast_time = time.perf_counter() - start

print(f"Slow method: {slow_time:.4f}s")
print(f"Fast method: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x")
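When the values can be computed with array operations, the Python loop can be dropped as well. The following sketch builds the same three columns in a vectorized way. No timing figures are given here, since they depend on the machine and library versions.

```python
import numpy as np
import pandas as pd

# Compute every row at once instead of filling the array in a loop
i = np.arange(10000)
data = np.column_stack([i, i * 2, i * 3])

df_vec = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df_vec.tail(3))
```

Note that this version keeps an integer dtype, whereas the `np.zeros` version stores floats.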

Converting Back to NumPy Arrays

Extract NumPy arrays from DataFrames using .values or .to_numpy():

df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B'])

# Extract as NumPy array
arr = df.to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>
print(arr)

# Extract specific columns
arr_subset = df[['A']].to_numpy()
print(arr_subset.shape)  # (2, 1)

The .to_numpy() method is preferred over .values as it provides more consistent behavior across different DataFrame types and is the recommended approach in modern pandas.
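`.to_numpy()` also accepts `dtype` and `copy` arguments. The sketch below shows how columns of different types are combined into a common dtype, and how `copy=True` guards against modifying the DataFrame through the returned array. The column values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.array([1, 2]), 'B': np.array([0.5, 1.5])})

# Integer and float columns are combined into a common float64 array
arr = df.to_numpy()
print(arr.dtype)

# Request a specific dtype for the result
arr32 = df.to_numpy(dtype='float32')

# copy=True guarantees the array does not share memory with the DataFrame
col = df['A'].to_numpy(copy=True)
col[0] = 99
print(df.loc[0, 'A'])  # the DataFrame is unchanged
```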

Practical Example: Image Data Processing

A real-world scenario involves processing image data stored as NumPy arrays:

# Simulate image data (height, width, channels)
image_array = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

# Flatten to 2D for analysis (pixels x channels)
flattened = image_array.reshape(-1, 3)

# Create DataFrame for pixel analysis
df_pixels = pd.DataFrame(flattened, columns=['Red', 'Green', 'Blue'])

# Calculate statistics
print("Channel statistics:")
print(df_pixels.describe())

# Find pixels with high red values
high_red = df_pixels[df_pixels['Red'] > 200]
print(f"\nPixels with Red > 200: {len(high_red)}")

This pattern applies to any multidimensional scientific data where pandas’ analytical capabilities complement NumPy’s computational efficiency.
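Flattening discards the pixel coordinates, but they can be recovered from the row position because `reshape` uses row-major order by default. The sketch below adds hypothetical `row` and `col` columns and reshapes the channel columns back into the original image shape.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
image_array = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)
h, w, c = image_array.shape

df_pixels = pd.DataFrame(image_array.reshape(-1, c),
                         columns=['Red', 'Green', 'Blue'])

# Flattened row i corresponds to pixel (i // w, i % w) in row-major order
df_pixels['row'], df_pixels['col'] = np.divmod(np.arange(h * w), w)

# Round trip: the channel columns reshape back into the original image
restored = df_pixels[['Red', 'Green', 'Blue']].to_numpy().reshape(h, w, c)
print(np.array_equal(restored, image_array))
```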
