Pandas - Create DataFrame from Dictionary
Key Insights
• Creating DataFrames from dictionaries is the most common pandas initialization pattern, with different dictionary structures producing different DataFrame orientations
• Dictionary keys become column names by default, but you can transpose or use from_dict() with orient='index' to make keys become row labels instead
• Understanding how pandas handles nested dictionaries, lists of varying lengths, and missing values prevents common data structure errors during DataFrame creation
Basic Dictionary to DataFrame Conversion
The simplest way to create a DataFrame from a dictionary uses column names as keys and lists as values. Each key-value pair becomes a column in the resulting DataFrame.
import pandas as pd

data = {
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [999.99, 29.99, 79.99, 299.99],
    'stock': [15, 150, 87, 42]
}
df = pd.DataFrame(data)
print(df)
    product   price  stock
0    Laptop  999.99     15
1     Mouse   29.99    150
2  Keyboard   79.99     87
3   Monitor  299.99     42
This approach requires all lists to have the same length. If they don’t, pandas raises a ValueError. The index is automatically generated as a range starting from 0.
Dictionary with Scalar Values
When dictionary values are scalar (single values rather than lists), pandas broadcasts each value across all rows. You must specify the index explicitly.
defaults = {
    'status': 'active',
    'category': 'electronics',
    'tax_rate': 0.08
}
df = pd.DataFrame(defaults, index=range(3))
print(df)
   status     category  tax_rate
0  active  electronics      0.08
1  active  electronics      0.08
2  active  electronics      0.08
This pattern is useful for creating template DataFrames or initializing default values before populating with actual data.
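The index requirement can be checked directly. The sketch below is illustrative only (the variable names are made up for this example): it shows that an all-scalar dictionary without an index is rejected, while a dictionary that mixes one list with scalars needs no explicit index because the list sets the row count.

```python
import pandas as pd

# All-scalar dictionary without an index: pandas cannot infer the row count
try:
    pd.DataFrame({'status': 'active', 'tax_rate': 0.08})
    scalar_only_failed = False
except ValueError:
    scalar_only_failed = True

# Mixing a list with scalars: the list sets the length, scalars are broadcast
mixed = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard'],
    'status': 'active',
    'tax_rate': 0.08,
})

print(scalar_only_failed)
print(mixed)
```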
Using from_dict() with Orient Parameter
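The from_dict() method provides more control over how dictionaries are interpreted. The orient parameter tells pandas whether the dictionary keys are column names (orient='columns', the default) or row labels (orient='index'). The columns argument is only accepted together with orient='index'.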
# Orient='columns' (default behavior)
data_cols = {
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}
df1 = pd.DataFrame.from_dict(data_cols, orient='columns')

# Orient='index' - keys become row labels
data_rows = {
    'row1': [1, 4],
    'row2': [2, 5],
    'row3': [3, 6]
}
df2 = pd.DataFrame.from_dict(data_rows, orient='index', columns=['A', 'B'])

print("Columns orientation:")
print(df1)
print("\nIndex orientation:")
print(df2)
Columns orientation:
   A  B
0  1  4
1  2  5
2  3  6

Index orientation:
      A  B
row1  1  4
row2  2  5
row3  3  6
The orient='index' approach is particularly valuable when working with configuration data or when dictionary keys represent meaningful row identifiers.
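The Key Insights also mention transposing as an alternative. As a rough sketch (variable names are illustrative), the snippet below builds the same row-labelled frame both ways. For data of a single type, like these integers, the two routes should agree; transposing a frame with mixed column types generally leaves object-typed columns, so from_dict() is usually the safer choice there.

```python
import pandas as pd

data_rows = {
    'row1': [1, 4],
    'row2': [2, 5],
    'row3': [3, 6],
}

# Option 1: build with keys as columns, then transpose and name the columns
via_transpose = pd.DataFrame(data_rows).T
via_transpose.columns = ['A', 'B']

# Option 2: let from_dict() treat the keys as row labels directly
via_from_dict = pd.DataFrame.from_dict(data_rows, orient='index', columns=['A', 'B'])

print(via_transpose)
print(via_from_dict)
```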
Nested Dictionaries for Multi-Level Structures
Nested dictionaries create DataFrames where outer keys become columns and inner keys become row labels. Pandas aligns the inner keys across all columns, so the resulting index is the union of every inner key.
sales_data = {
    'Q1': {'Jan': 15000, 'Feb': 18000, 'Mar': 21000},
    'Q2': {'Apr': 19000, 'May': 22000, 'Jun': 25000},
    'Q3': {'Jul': 23000, 'Aug': 26000, 'Sep': 28000}
}
df = pd.DataFrame(sales_data)
print(df)
          Q1       Q2       Q3
Jan  15000.0      NaN      NaN
Feb  18000.0      NaN      NaN
Mar  21000.0      NaN      NaN
Apr      NaN  19000.0      NaN
May      NaN  22000.0      NaN
Jun      NaN  25000.0      NaN
Jul      NaN      NaN  23000.0
Aug      NaN      NaN  26000.0
Sep      NaN      NaN  28000.0
Notice that pandas fills missing combinations with NaN, and that the integers are displayed as floats because each column is upcast to float64 in order to hold NaN. To avoid this sparse structure, ensure all inner dictionaries share the same keys, or restructure the data so that every inner dictionary describes the same set of rows.
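As an illustration of the shared-keys advice, the sketch below re-keys the same figures by position within the quarter. The inner key names (month_1 and so on) are hypothetical labels chosen for this example. Because every inner dictionary now has the same keys, no NaN appears and the columns stay integer-typed.

```python
import pandas as pd

# Same sales figures, but every quarter uses identical inner keys
sales_by_position = {
    'Q1': {'month_1': 15000, 'month_2': 18000, 'month_3': 21000},
    'Q2': {'month_1': 19000, 'month_2': 22000, 'month_3': 25000},
    'Q3': {'month_1': 23000, 'month_2': 26000, 'month_3': 28000},
}

# Outer keys become columns, shared inner keys become the index
dense = pd.DataFrame(sales_by_position)

# Alternative layout: outer keys as row labels, inner keys as columns
by_quarter = pd.DataFrame.from_dict(sales_by_position, orient='index')

print(dense)
print(by_quarter)
```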
Handling Missing Values and Irregular Data
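A dictionary of plain lists with different lengths is not padded automatically. Pandas raises a ValueError, because it has no labels to tell it which rows the missing values belong to.
irregular_data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30],  # Missing one value
    'city': ['NYC', 'LA', 'Chicago', 'Boston']  # Extra value
}
try:
    df = pd.DataFrame(irregular_data)
except ValueError as e:
    print(f"DataFrame creation failed: {e}")
The exact wording of the error depends on the pandas version. Pandas only fills gaps with NaN when it has labels to align on, as with nested dictionaries, Series values, or a list of dictionaries.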
For production code, validate data consistency before DataFrame creation:
def validate_dict_lengths(data):
    lengths = {k: len(v) if isinstance(v, list) else 1
               for k, v in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Inconsistent lengths: {lengths}")
    return True

# Use before creating DataFrame
try:
    validate_dict_lengths(irregular_data)
except ValueError as e:
    print(f"Data validation failed: {e}")
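If padding the shorter columns with NaN is what you actually want, one option is to wrap each list in a pd.Series first, so that pandas aligns the values on their positional index. This is a sketch of that workaround, not the only approach; it assumes the missing entries always belong at the end of each list.

```python
import pandas as pd

irregular_data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30],
    'city': ['NYC', 'LA', 'Chicago', 'Boston'],
}

# Each Series carries its own 0..n-1 index; pandas takes the union of the
# indexes and fills positions a Series does not cover with NaN
padded = pd.DataFrame({key: pd.Series(values) for key, values in irregular_data.items()})

print(padded)
```

Note that the age column becomes float64 here, since integer columns cannot hold NaN by default.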
List of Dictionaries Pattern
When each dictionary represents a complete row, use a list of dictionaries. This is common when processing JSON API responses or database query results.
records = [
    {'id': 1, 'name': 'Product A', 'price': 29.99, 'category': 'Electronics'},
    {'id': 2, 'name': 'Product B', 'price': 49.99, 'category': 'Home'},
    {'id': 3, 'name': 'Product C', 'price': 19.99}  # Missing 'category'
]
df = pd.DataFrame(records)
print(df)
   id       name  price     category
0   1  Product A  29.99  Electronics
1   2  Product B  49.99         Home
2   3  Product C  19.99          NaN
This pattern automatically handles missing keys by inserting NaN values, making it robust for real-world data ingestion.
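A common follow-up is to fix the column order and replace the NaN placeholders with a default. The sketch below does both; the 'Unknown' default is an arbitrary choice for illustration.

```python
import pandas as pd

records = [
    {'id': 1, 'name': 'Product A', 'price': 29.99, 'category': 'Electronics'},
    {'id': 2, 'name': 'Product B', 'price': 49.99, 'category': 'Home'},
    {'id': 3, 'name': 'Product C', 'price': 19.99},  # Missing 'category'
]

# columns= selects and orders the keys to keep from each record
df = pd.DataFrame(records, columns=['id', 'name', 'category', 'price'])

# Replace the missing category with an explicit default value
df['category'] = df['category'].fillna('Unknown')

print(df)
```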
Custom Index and Column Names
Control DataFrame structure by explicitly setting index and columns during creation.
data = {
    'revenue': [100000, 120000, 115000],
    'costs': [70000, 80000, 75000]
}
df = pd.DataFrame(
    data,
    index=['2022', '2023', '2024'],
    columns=['revenue', 'costs']
)

# Add computed column
df['profit'] = df['revenue'] - df['costs']
print(df)
      revenue  costs  profit
2022   100000  70000   30000
2023   120000  80000   40000
2024   115000  75000   40000
Setting meaningful indices improves data readability and enables more intuitive data selection operations.
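When the data is a dictionary, the columns argument acts as a selector rather than a rename. The sketch below illustrates this with a hypothetical 'margin' name that is not a key in the dictionary: keys that are not listed are dropped, and listed names that are not keys become all-NaN columns.

```python
import pandas as pd

data = {
    'revenue': [100000, 120000, 115000],
    'costs': [70000, 80000, 75000],
}

# 'costs' is omitted from columns, and 'margin' (hypothetical) is not in data
subset = pd.DataFrame(
    data,
    index=['2022', '2023', '2024'],
    columns=['revenue', 'margin'],
)

print(subset)

# Label-based selection using the custom index
print(subset.loc['2023', 'revenue'])
```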
Dictionary Comprehensions for Dynamic Creation
Generate DataFrames dynamically using dictionary comprehensions, useful for creating test data or transforming existing structures.
import numpy as np

# Create sample data with a comprehension (values are random, so output differs between runs)
data = {
    f'metric_{i}': np.random.randint(0, 100, 5)
    for i in range(3)
}
df = pd.DataFrame(data)
print(df)
   metric_0  metric_1  metric_2
0        67        45        23
1        89        12        56
2        34        78        90
3        56        23        45
4        12        67        34
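For test data it often helps if the random values are reproducible. One way to do that, sketched here with a hypothetical helper named make_metrics, is to draw from a seeded NumPy generator so that the same seed always yields the same DataFrame.

```python
import numpy as np
import pandas as pd

def make_metrics(seed):
    """Build a 5x3 frame of random integers in [0, 100) from a seeded generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        f'metric_{i}': rng.integers(0, 100, size=5)
        for i in range(3)
    })

first = make_metrics(42)
second = make_metrics(42)

# Same seed, same data
print(first.equals(second))
```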
Combine with conditions for filtered data generation:
# Create DataFrame from filtered dictionary
raw_data = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9], 'd': [10, 11, 12]}
filtered = {k: v for k, v in raw_data.items() if k in ['a', 'c']}
df = pd.DataFrame(filtered)
print(df)
   a  c
0  1  7
1  2  8
2  3  9
Performance Considerations
For large datasets, build the complete dictionary first and construct the DataFrame once, rather than growing a DataFrame inside a loop. How the dictionary itself is assembled matters much less: a loop and a comprehension do the same work, so time both before assuming one is faster.
import time

columns = ['A', 'B', 'C']

# Build the dictionary with an explicit loop
start = time.perf_counter()
data = {}
for col in columns:
    data[col] = list(range(100000))
df1 = pd.DataFrame(data)
print(f"Loop method: {time.perf_counter() - start:.4f}s")

# Build the dictionary with a comprehension
start = time.perf_counter()
data = {col: list(range(100000)) for col in columns}
df2 = pd.DataFrame(data)
print(f"Comprehension method: {time.perf_counter() - start:.4f}s")
For extremely large datasets, consider using pd.DataFrame.from_records() with an iterator to reduce memory overhead, or initialize with numpy arrays directly before converting to DataFrame.
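A minimal sketch of the NumPy route: the dictionary values are arrays with explicit dtypes instead of Python lists, so pandas does not have to infer a type for each element. The column names, sizes, and dtypes here are placeholders.

```python
import numpy as np
import pandas as pd

n_rows = 100_000

# Dictionary of NumPy arrays; each array already has a fixed dtype
arrays = {
    'A': np.arange(n_rows, dtype=np.int64),
    'B': np.zeros(n_rows, dtype=np.float64),
    'C': np.ones(n_rows, dtype=np.int32),
}
df = pd.DataFrame(arrays)

# The array dtypes carry over to the columns
print(df.dtypes)
```

Choosing a narrower dtype such as int32 where the value range allows it is one simple way to reduce memory use for wide or long frames.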