Pandas - Create DataFrame from NumPy Array


Key Insights

• Creating DataFrames from NumPy arrays requires understanding dimensionality: 1D arrays become single columns, while 2D arrays map rows and columns directly to DataFrame structure
• Column and index labels must be explicitly provided when converting from NumPy arrays, as arrays contain no metadata about column names or row indices
• Performance considerations matter at scale: pre-allocating NumPy arrays before DataFrame conversion is significantly faster than iteratively building DataFrames for large datasets

Basic DataFrame Creation from 2D Arrays

The most common scenario involves converting a 2D NumPy array into a DataFrame. The array’s first dimension represents rows, and the second represents columns.

import numpy as np
import pandas as pd

# Create a 2D NumPy array
data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Convert to DataFrame
df = pd.DataFrame(data)
print(df)

Output:

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9

Without explicit labels, pandas assigns default integer indices for both rows and columns. For production code, always specify meaningful column names:

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)

Output:

   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
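The number of column labels has to match the array's second dimension. The short sketch below checks the shape first and shows what happens when the counts disagree; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# The shape tells you how many column labels are needed
n_rows, n_cols = data.shape
print(n_rows, n_cols)  # 3 3

# Passing two labels for three columns raises a ValueError
try:
    pd.DataFrame(data, columns=['A', 'B'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)  # ValueError
```

Checking data.shape before building the label list is a cheap way to catch this early.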

Setting Custom Row Indices

Row indices provide meaningful labels for accessing data. Set them during DataFrame creation with the index parameter, or afterward by assigning to df.index:

# Set index during creation
df = pd.DataFrame(data, 
                  columns=['Revenue', 'Cost', 'Profit'],
                  index=['Q1', 'Q2', 'Q3'])
print(df)

Output:

    Revenue  Cost  Profit
Q1        1     2       3
Q2        4     5       6
Q3        7     8       9

For time-series data, use datetime indices:

dates = pd.date_range('2024-01-01', periods=3)
df = pd.DataFrame(data, 
                  columns=['Revenue', 'Cost', 'Profit'],
                  index=dates)
print(df)
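As a small sketch of why labeled indices are useful, the example below looks up a value by date and then replaces the index after creation. It reuses the quarterly sample values from this section; nothing here is specific to a real dataset.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

dates = pd.date_range('2024-01-01', periods=3)  # daily frequency by default
df = pd.DataFrame(data,
                  columns=['Revenue', 'Cost', 'Profit'],
                  index=dates)

# Label-based lookup with a Timestamp
cost_jan_2 = df.loc[pd.Timestamp('2024-01-02'), 'Cost']
print(cost_jan_2)  # 5

# Replace the index after creation by assigning to df.index
df.index = ['Q1', 'Q2', 'Q3']
profit_q3 = df.loc['Q3', 'Profit']
print(profit_q3)  # 9
```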

Converting 1D Arrays to DataFrames

One-dimensional NumPy arrays create single-column DataFrames. The array length determines the number of rows:

# 1D array
arr_1d = np.array([10, 20, 30, 40, 50])

# Creates a single column
df = pd.DataFrame(arr_1d, columns=['Values'])
print(df)

Output:

   Values
0      10
1      20
2      30
3      40
4      50

To create multiple columns from separate 1D arrays, stack them horizontally:

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr3 = np.array([7, 8, 9])

# Stack arrays column-wise
combined = np.column_stack([arr1, arr2, arr3])
df = pd.DataFrame(combined, columns=['X', 'Y', 'Z'])
print(df)

Output:

   X  Y  Z
0  1  4  7
1  2  5  8
2  3  6  9
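One caveat with stacking: np.column_stack builds a single homogeneous array, so arrays of different dtypes are upcast to a common type. If each column should keep its own dtype, passing a dict of 1D arrays is one alternative. A minimal sketch, with made-up column names:

```python
import numpy as np
import pandas as pd

ids = np.array([1, 2, 3], dtype=np.int64)
scores = np.array([0.5, 1.5, 2.5], dtype=np.float64)

# Stacking first: the integers are upcast to float64
stacked = pd.DataFrame(np.column_stack([ids, scores]), columns=['id', 'score'])
print(stacked['id'].dtype)    # float64

# Dict of arrays: each column keeps its original dtype
by_column = pd.DataFrame({'id': ids, 'score': scores})
print(by_column['id'].dtype)  # int64
```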

Handling Different Data Types

A NumPy array has a single dtype, and pandas carries that dtype over to every column during conversion:

# Floating-point data: every column becomes float64
float_data = np.array([[1.5, 2.7, 3.9],
                       [4.1, 5.3, 6.8]])

df_float = pd.DataFrame(float_data, columns=['A', 'B', 'C'])
print(df_float.dtypes)

Output:

A    float64
B    float64
C    float64
dtype: object
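The same holds for narrower types, and the dtype parameter of pd.DataFrame can cast during construction. The following sketch uses small illustrative arrays to show both, plus NumPy's own upcasting when integers and floats are mixed in one array.

```python
import numpy as np
import pandas as pd

# A narrow integer dtype survives the conversion
small_ints = np.array([[1, 2], [3, 4]], dtype=np.int8)
df_kept = pd.DataFrame(small_ints, columns=['A', 'B'])
print(df_kept.dtypes.tolist())

# Cast while constructing by passing dtype
df_cast = pd.DataFrame(small_ints, columns=['A', 'B'], dtype='float32')
print(df_cast.dtypes.tolist())

# NumPy upcasts mixed ints and floats before pandas ever sees them
mixed = np.array([[1, 2.5], [3, 4]])
print(mixed.dtype)  # float64
```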

For structured arrays with named fields:

# Structured array with different types
structured = np.array([(1, 'Alice', 25.5),
                       (2, 'Bob', 30.2),
                       (3, 'Charlie', 28.7)],
                      dtype=[('id', 'i4'), ('name', 'U10'), ('score', 'f4')])

df = pd.DataFrame(structured)
print(df)
print("\nData types:")
print(df.dtypes)

Output:

   id     name  score
0   1    Alice   25.5
1   2      Bob   30.2
2   3  Charlie   28.7

Data types:
id         int32
name      object
score    float32
dtype: object
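If one of the named fields should serve as the row index, pd.DataFrame.from_records is one option, since it accepts a field name through its index parameter. A short sketch using the same structured array:

```python
import numpy as np
import pandas as pd

structured = np.array([(1, 'Alice', 25.5),
                       (2, 'Bob', 30.2),
                       (3, 'Charlie', 28.7)],
                      dtype=[('id', 'i4'), ('name', 'U10'), ('score', 'f4')])

# Use the 'id' field as the index instead of a regular column
df = pd.DataFrame.from_records(structured, index='id')

print(df.index.tolist())
print(df.columns.tolist())
print(df.loc[2, 'name'])
```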

Working with Random Data and Array Generators

NumPy’s random module combined with DataFrame creation is useful for testing and simulations:

# Generate random data
np.random.seed(42)
random_data = np.random.randn(5, 4)  # 5 rows, 4 columns

df = pd.DataFrame(random_data, 
                  columns=['Feature1', 'Feature2', 'Feature3', 'Feature4'],
                  index=[f'Sample_{i}' for i in range(5)])
print(df.round(3))

Output:

          Feature1  Feature2  Feature3  Feature4
Sample_0     0.497    -0.138     0.648     1.523
Sample_1    -0.234    -0.234     1.579     0.767
Sample_2    -0.469     0.543    -0.463    -0.466
Sample_3     0.242    -1.913    -1.725    -0.562
Sample_4    -1.013     0.314    -0.908    -1.412
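NumPy's documentation recommends the Generator API over the legacy np.random.seed and np.random.randn functions for new code. A sketch of the same idea with np.random.default_rng follows; note that it draws from a different stream, so the values will not match the legacy output for the same seed.

```python
import numpy as np
import pandas as pd

# A Generator instance keeps the random state local instead of global
rng = np.random.default_rng(42)
random_data = rng.standard_normal((5, 4))  # 5 rows, 4 columns

df = pd.DataFrame(random_data,
                  columns=[f'Feature{i}' for i in range(1, 5)],
                  index=[f'Sample_{i}' for i in range(5)])
print(df.round(3))

# Re-creating the generator with the same seed reproduces the data
repeat = np.random.default_rng(42).standard_normal((5, 4))
print(np.array_equal(random_data, repeat))  # True
```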

For integer ranges:

# Create sequential data
sequential = np.arange(20).reshape(5, 4)
df = pd.DataFrame(sequential, columns=list('ABCD'))
print(df)

Transposing Arrays Before Conversion

Sometimes your array orientation doesn’t match your desired DataFrame structure. Transpose before conversion:

# Array where rows should be columns
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])

# Without transpose: 2 rows, 4 columns
df1 = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])
print("Original:\n", df1)

# With transpose: 4 rows, 2 columns
df2 = pd.DataFrame(arr.T, columns=['Series1', 'Series2'])
print("\nTransposed:\n", df2)
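A quick way to confirm the orientation is to compare shapes before and after transposing. In this sketch each row of the original array ends up as one column of the DataFrame:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])

df2 = pd.DataFrame(arr.T, columns=['Series1', 'Series2'])

print(arr.shape, arr.T.shape, df2.shape)  # (2, 4) (4, 2) (4, 2)

# The first row of arr is now the first column
print(df2['Series1'].tolist())  # [1, 2, 3, 4]
```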

Performance Considerations for Large Arrays

When working with large datasets, pre-allocating NumPy arrays and converting once is more efficient than building DataFrames iteratively:

import time

# Inefficient: growing a DataFrame row by row (each concat copies all existing rows)
start = time.perf_counter()
df_slow = pd.DataFrame([[0, 0, 0]])  # start from the first row so concat never sees an empty frame
for i in range(1, 10000):
    df_slow = pd.concat([df_slow, pd.DataFrame([[i, i*2, i*3]])], ignore_index=True)
slow_time = time.perf_counter() - start

# Efficient: pre-allocate the array, fill it, convert once
start = time.perf_counter()
data = np.zeros((10000, 3))  # float64 by default; pass dtype=np.int64 to keep integers
for i in range(10000):
    data[i] = [i, i*2, i*3]
df_fast = pd.DataFrame(data, columns=['A', 'B', 'C'])
fast_time = time.perf_counter() - start

print(f"Slow method: {slow_time:.4f}s")
print(f"Fast method: {fast_time:.4f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x")
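When the values can be computed with array arithmetic, the Python loop can be dropped entirely. The sketch below builds the same table from np.arange; it is one possible variant, not a benchmark, and timings will depend on your machine.

```python
import numpy as np
import pandas as pd

n = 10_000
i = np.arange(n)

# Compute each column with vectorized arithmetic, then stack column-wise
data = np.column_stack([i, i * 2, i * 3])
df_vectorized = pd.DataFrame(data, columns=['A', 'B', 'C'])

print(df_vectorized.shape)              # (10000, 3)
print(df_vectorized.iloc[-1].tolist())  # [9999, 19998, 29997]
```

Because the array is built from integers, the columns stay integer-typed rather than becoming float64 as with np.zeros.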

Converting Back to NumPy Arrays

Extract NumPy arrays from DataFrames using .values or .to_numpy():

df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['A', 'B'])

# Extract as NumPy array
arr = df.to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>
print(arr)

# Extract specific columns
arr_subset = df[['A']].to_numpy()
print(arr_subset.shape)  # (2, 1)

The .to_numpy() method is preferred over .values in modern pandas: it behaves consistently for extension dtypes such as nullable integers and categoricals, and it accepts dtype, copy, and na_value arguments to control the result.
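Since a NumPy array has one dtype, the result of .to_numpy() depends on the mix of column types. A brief sketch with illustrative column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'count': np.array([1, 2], dtype=np.int64),
                   'ratio': np.array([0.5, 1.5])})

# int64 and float64 columns resolve to a common float64 array
mixed = df.to_numpy()
print(mixed.dtype)      # float64

# Request a specific dtype explicitly
as_f32 = df.to_numpy(dtype='float32')
print(as_f32.dtype)     # float32

# Adding a text column forces the generic object dtype
df['label'] = ['a', 'b']
with_text = df.to_numpy()
print(with_text.dtype)  # object
```

Selecting only the numeric columns before calling .to_numpy() avoids the object fallback.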

Practical Example: Image Data Processing

A real-world scenario involves processing image data stored as NumPy arrays:

# Simulate image data (height, width, channels)
image_array = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

# Flatten to 2D for analysis (pixels x channels)
flattened = image_array.reshape(-1, 3)

# Create DataFrame for pixel analysis
df_pixels = pd.DataFrame(flattened, columns=['Red', 'Green', 'Blue'])

# Calculate statistics
print("Channel statistics:")
print(df_pixels.describe())

# Find pixels with high red values
high_red = df_pixels[df_pixels['Red'] > 200]
print(f"\nPixels with Red > 200: {len(high_red)}")

This pattern applies to any multidimensional scientific data where pandas’ analytical capabilities complement NumPy’s computational efficiency.
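Because reshape(-1, 3) keeps pixels in row-major order, the DataFrame can be turned back into the original image shape after analysis. A minimal round-trip sketch, assuming the same simulated 100x100 RGB image (generated here with the Generator API):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
image_array = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)
height, width, _ = image_array.shape

df_pixels = pd.DataFrame(image_array.reshape(-1, 3),
                         columns=['Red', 'Green', 'Blue'])

# DataFrame row k corresponds to the pixel at (k // width, k % width)
k = 250
row, col = divmod(k, width)
same_pixel = np.array_equal(df_pixels.iloc[k].to_numpy(), image_array[row, col])
print(row, col, same_pixel)  # 2 50 True

# Restore the original (height, width, channels) layout
restored = df_pixels.to_numpy().reshape(image_array.shape)
print(np.array_equal(restored, image_array))  # True
```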
