Pandas - Concatenate DataFrames (concat)
Key Insights
- `pd.concat()` combines DataFrames along rows or columns, with fine-grained control over index handling, join types, and hierarchical labeling through the `keys` parameter
- Understanding axis orientation (0 for rows, 1 for columns) and join strategies (inner vs. outer) prevents data loss and unexpected results when merging datasets with mismatched columns or indexes
- Performance matters: concatenating large datasets benefits from pre-allocation strategies and avoiding repeated concatenation in loops—build a list and concatenate once
Basic Concatenation Along Rows
The default behavior of pd.concat() stacks DataFrames vertically, appending rows from multiple DataFrames into a single structure. This is the most common use case when combining datasets with identical or similar column structures.
import pandas as pd
df1 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [100, 150, 200],
    'region': ['North', 'South', 'East']
})
df2 = pd.DataFrame({
    'product': ['D', 'E', 'F'],
    'sales': [175, 225, 190],
    'region': ['West', 'North', 'South']
})
result = pd.concat([df1, df2])
print(result)
Output:
product sales region
0 A 100 North
1 B 150 South
2 C 200 East
0 D 175 West
1 E 225 North
2 F 190 South
Notice the index values repeat. Use ignore_index=True to create a continuous integer index:
result = pd.concat([df1, df2], ignore_index=True)
print(result)
Output:
product sales region
0 A 100 North
1 B 150 South
2 C 200 East
3 D 175 West
4 E 225 North
5 F 190 South
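A related use of this pattern: since `DataFrame.append` was removed in pandas 2.0, wrapping a new row in a one-row DataFrame and concatenating is the usual replacement. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 150]})

# A new row as a one-row DataFrame, then a single concat
# (this replaces the removed df.append method)
new_row = pd.DataFrame({'product': ['C'], 'sales': [200]})
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```

As in the example above, `ignore_index=True` keeps the integer index continuous.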
Concatenation Along Columns
Set axis=1 to combine DataFrames horizontally, joining them side-by-side. This operation aligns DataFrames based on their index values.
df_sales = pd.DataFrame({
    'Q1': [100, 150, 200],
    'Q2': [110, 160, 210]
}, index=['ProductA', 'ProductB', 'ProductC'])
df_costs = pd.DataFrame({
    'Q1_cost': [80, 120, 150],
    'Q2_cost': [85, 125, 155]
}, index=['ProductA', 'ProductB', 'ProductC'])
result = pd.concat([df_sales, df_costs], axis=1)
print(result)
Output:
Q1 Q2 Q1_cost Q2_cost
ProductA 100 110 80 85
ProductB 150 160 120 125
ProductC 200 210 150 155
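One variation worth knowing: passing `keys` together with `axis=1` labels the columns rather than the rows, producing a column MultiIndex so the source columns can keep the same names. A short sketch with smaller versions of the frames above:

```python
import pandas as pd

df_sales = pd.DataFrame({'Q1': [100, 150], 'Q2': [110, 160]},
                        index=['ProductA', 'ProductB'])
df_costs = pd.DataFrame({'Q1': [80, 120], 'Q2': [85, 125]},
                        index=['ProductA', 'ProductB'])

# keys= with axis=1 builds a hierarchical column index
result = pd.concat([df_sales, df_costs], axis=1, keys=['sales', 'costs'])
print(result['sales'])          # all sales columns at once
print(result[('costs', 'Q1')])  # one specific column
```

This avoids renaming columns like `Q1_cost` by hand when both frames share quarter labels.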
Handling Mismatched Columns with Join Strategies
When DataFrames have different columns, pd.concat() performs an outer join by default, including all columns and filling missing values with NaN. Use the join parameter to control this behavior.
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
df2 = pd.DataFrame({
    'B': [7, 8, 9],
    'C': [10, 11, 12]
})
# Outer join (default)
outer_result = pd.concat([df1, df2], ignore_index=True)
print("Outer join:")
print(outer_result)
# Inner join - only common columns
inner_result = pd.concat([df1, df2], join='inner', ignore_index=True)
print("\nInner join:")
print(inner_result)
Output:
Outer join:
A B C
0 1.0 4 NaN
1 2.0 5 NaN
2 3.0 6 NaN
3 NaN 7 10.0
4 NaN 8 11.0
5 NaN 9 12.0
Inner join:
B
0 4
1 5
2 6
3 7
4 8
5 9
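If you need more control than `join='inner'` offers (for example, keeping some but not all shared columns), one alternative sketch is to compute the common columns yourself with `Index.intersection` and select before concatenating:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'B': [7, 8, 9], 'C': [10, 11, 12]})

# Intersect the column Index objects, then concat only those columns;
# equivalent to join='inner' here, but the column list can be edited first
common = df1.columns.intersection(df2.columns)
result = pd.concat([df1[common], df2[common]], ignore_index=True)
print(result)
```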
Using Keys for Hierarchical Indexing
The keys parameter creates a MultiIndex, labeling which DataFrame each row originated from. This is invaluable for tracking data sources after concatenation.
df_2022 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [1000, 1500, 2000]
})
df_2023 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [1200, 1800, 2300]
})
result = pd.concat([df_2022, df_2023], keys=['2022', '2023'])
print(result)
Output:
product revenue
2022 0 A 1000
1 B 1500
2 C 2000
2023 0 A 1200
1 B 1800
2 C 2300
Access data by year using the MultiIndex:
print(result.loc['2022'])
print(result.loc['2023', 'revenue'].sum()) # 5300
Name the index levels for clarity:
result = pd.concat([df_2022, df_2023], keys=['2022', '2023'], names=['year', 'row'])
print(result)
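Named levels pay off when selecting: `DataFrame.xs()` can pick out a cross-section by level name instead of position. A small sketch reusing the frames above:

```python
import pandas as pd

df_2022 = pd.DataFrame({'product': ['A', 'B'], 'revenue': [1000, 1500]})
df_2023 = pd.DataFrame({'product': ['A', 'B'], 'revenue': [1200, 1800]})

result = pd.concat([df_2022, df_2023], keys=['2022', '2023'],
                   names=['year', 'row'])

# With a named level, xs() selects by level name rather than position
print(result.xs('2023', level='year'))
```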
Column-wise Concatenation with Mismatched Indexes
When concatenating along columns with different indexes, pd.concat() performs an outer join on the index by default.
df1 = pd.DataFrame({
    'A': [1, 2, 3]
}, index=[0, 1, 2])
df2 = pd.DataFrame({
    'B': [4, 5, 6]
}, index=[1, 2, 3])
result = pd.concat([df1, df2], axis=1)
print(result)
Output:
A B
0 1.0 NaN
1 2.0 4.0
2 3.0 5.0
3 NaN 6.0
Use join='inner' to keep only matching indexes:
result = pd.concat([df1, df2], axis=1, join='inner')
print(result)
Output:
A B
1 2 4
2 3 5
Verifying Index Uniqueness with verify_integrity
Prevent duplicate index values in the result by setting verify_integrity=True. This raises an error if the concatenation would create duplicate index entries.
df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[1, 2])
try:
    result = pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
    print(f"Error: {e}")
This catches data integrity issues early in your pipeline.
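One way to recover when the check fails is to fall back to a fresh integer index. A sketch (note `ignore_index=True` discards the original labels, so this is only appropriate when they carry no meaning):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[1, 2])

try:
    result = pd.concat([df1, df2], verify_integrity=True)
except ValueError:
    # Duplicate labels detected; rebuild with a clean integer index
    result = pd.concat([df1, df2], ignore_index=True)
print(result.index.tolist())
```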
Performance Considerations
Concatenating DataFrames repeatedly in a loop is inefficient due to repeated memory allocation. Build a list and concatenate once:
import time
# Bad: repeated concatenation
start = time.time()
result = pd.DataFrame()
for i in range(100):
    df = pd.DataFrame({'col': range(1000)})
    result = pd.concat([result, df], ignore_index=True)
print(f"Repeated concat: {time.time() - start:.3f}s")
# Good: collect then concatenate
start = time.time()
dfs = []
for i in range(100):
    df = pd.DataFrame({'col': range(1000)})
    dfs.append(df)
result = pd.concat(dfs, ignore_index=True)
print(f"Single concat: {time.time() - start:.3f}s")
The single concatenation approach is orders of magnitude faster for large datasets.
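The collect-then-concatenate pattern often collapses into a single expression; `pd.concat` accepts any iterable of objects, including a generator. A sketch where `load_chunk` is a stand-in for whatever produces each piece (a file read, a query, an API call):

```python
import pandas as pd

def load_chunk(i):
    # Placeholder data source; substitute your real loader here
    return pd.DataFrame({'col': range(1000)})

# One-shot concatenation from a generator expression
result = pd.concat((load_chunk(i) for i in range(100)), ignore_index=True)
print(len(result))
```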
Practical Example: Combining Multi-Source Data
A real-world scenario: merging sales data from multiple regional databases with different column structures.
north_sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'product_id': [101, 102],
    'revenue': [5000, 7500],
    'units': [50, 75]
})
south_sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'product_id': [103, 104],
    'revenue': [6000, 8000],
    'discount_applied': [True, False]
})
# Combine with region labels
combined = pd.concat(
    [north_sales, south_sales],
    keys=['North', 'South'],
    names=['region', 'record_id']
)
# Reset index to make region a column
combined = combined.reset_index(level=0)
print(combined)
Output:
region date product_id revenue units discount_applied
0 North 2024-01-01 101 5000 50.0 NaN
1 North 2024-01-02 102 7500 75.0 NaN
0 South 2024-01-01 103 6000 NaN True
1 South 2024-01-02 104 8000 NaN False
Now analyze by region:
print(combined.groupby('region')['revenue'].sum())
This pattern handles heterogeneous data sources while preserving data lineage through the region identifier.
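After an outer concat like this, the region-specific columns carry NaN for the other region's rows. A small cleanup sketch, assuming 0 units and no discount are sensible defaults for this data (that choice depends on your domain):

```python
import pandas as pd

north_sales = pd.DataFrame({'date': ['2024-01-01'], 'product_id': [101],
                            'revenue': [5000], 'units': [50]})
south_sales = pd.DataFrame({'date': ['2024-01-01'], 'product_id': [103],
                            'revenue': [6000], 'discount_applied': [True]})

combined = pd.concat([north_sales, south_sales], keys=['North', 'South'],
                     names=['region', 'record_id']).reset_index(level=0)

# Fill the columns each region lacks (the defaults are an assumption here)
combined['units'] = combined['units'].fillna(0)
combined['discount_applied'] = combined['discount_applied'].fillna(False)
print(combined)
```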