How to Concatenate DataFrames in Pandas
Key Insights
- Use pd.concat() for stacking DataFrames with similar structures, reserving merge() and join() for database-style operations based on keys or indices.
- Always concatenate multiple DataFrames in a single pd.concat() call using a list—never concatenate repeatedly in a loop, as this causes quadratic memory allocation.
- The ignore_index=True parameter prevents duplicate index values and is almost always what you want when stacking rows vertically.
Introduction to DataFrame Concatenation
Concatenation in Pandas means combining two or more DataFrames into a single DataFrame. Unlike merging, which combines data based on shared keys (similar to SQL joins), concatenation simply glues DataFrames together along an axis—either stacking rows vertically or placing columns side-by-side horizontally.
The distinction matters. Use pd.concat() when you have DataFrames with the same structure that need combining—like monthly sales reports or data from multiple CSV files. Use merge() or join() when you need to combine data based on matching values in specific columns.
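To make the contrast concrete, here is a minimal sketch (the sales/prices frames are made-up sample data): concat glues the frames together along an axis, while merge matches rows on a shared key.

```python
import pandas as pd

# Made-up sample data for illustration
sales = pd.DataFrame({'product': ['Widget', 'Gadget'], 'units': [3, 5]})
prices = pd.DataFrame({'product': ['Widget', 'Gadget'], 'price': [25.99, 49.99]})

# concat stacks the frames: 4 rows, NaN where columns differ
stacked = pd.concat([sales, prices], ignore_index=True)

# merge matches rows on the shared key: 2 rows, columns combined
joined = sales.merge(prices, on='product')

print(stacked.shape)  # (4, 3)
print(joined.shape)   # (2, 3)
```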
Let’s set up our working environment:
import pandas as pd
# Sample DataFrames for demonstration
df1 = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'price': [25.99, 49.99, 12.50],
    'quantity': [100, 50, 200]
})
df2 = pd.DataFrame({
    'product': ['Bolt', 'Nut', 'Washer'],
    'price': [0.99, 0.50, 0.25],
    'quantity': [1000, 2000, 5000]
})
print(df1)
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
Basic Vertical Concatenation (Stacking Rows)
The most common concatenation operation stacks DataFrames on top of each other. This is the default behavior of pd.concat():
combined = pd.concat([df1, df2])
print(combined)
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
# 0 Bolt 0.99 1000
# 1 Nut 0.50 2000
# 2 Washer 0.25 5000
Notice the index values. Both DataFrames had indices 0, 1, 2, and Pandas preserved them. This creates duplicate index values, which can cause problems when you try to select rows by index.
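To see why duplicate labels bite, a short sketch: label-based selection now returns multiple rows, and passing verify_integrity=True makes pd.concat refuse overlapping indexes outright.

```python
import pandas as pd

df1 = pd.DataFrame({'product': ['Widget', 'Gadget', 'Sprocket']})
df2 = pd.DataFrame({'product': ['Bolt', 'Nut', 'Washer']})
combined = pd.concat([df1, df2])

# .loc[0] matches every row labeled 0 -- two rows, not one
print(combined.loc[0])

# verify_integrity=True raises ValueError instead of silently
# producing a duplicated index
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as exc:
    print('concat refused:', exc)
```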
The axis parameter controls this behavior explicitly. axis=0 is the default and means “concatenate along the row axis” (stack vertically):
# Explicit axis parameter - same result as above
combined = pd.concat([df1, df2], axis=0)
You can concatenate more than two DataFrames at once:
df3 = pd.DataFrame({
    'product': ['Gear', 'Cog'],
    'price': [8.99, 5.50],
    'quantity': [300, 400]
})
all_products = pd.concat([df1, df2, df3])
print(len(all_products)) # 8 rows total
Horizontal Concatenation (Adding Columns)
Setting axis=1 concatenates DataFrames side-by-side, adding columns:
df_info = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'price': [25.99, 49.99, 12.50]
})
df_stock = pd.DataFrame({
    'quantity': [100, 50, 200],
    'warehouse': ['A', 'B', 'A']
})
combined = pd.concat([df_info, df_stock], axis=1)
print(combined)
# product price quantity warehouse
# 0 Widget 25.99 100 A
# 1 Gadget 49.99 50 B
# 2 Sprocket 12.50 200 A
Horizontal concatenation aligns rows by index. If indices don’t match, you get NaN values:
df_a = pd.DataFrame({'col_a': [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({'col_b': [4, 5, 6]}, index=[1, 2, 3])
combined = pd.concat([df_a, df_b], axis=1)
print(combined)
# col_a col_b
# 0 1.0 NaN
# 1 2.0 4.0
# 2 3.0 5.0
# 3 NaN 6.0
This behavior catches people off guard. If you expect a simple side-by-side join and your indices don’t align, you’ll get unexpected NaN values.
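If positional, row-by-row pairing is what you actually wanted, one common workaround is to reset both indices before concatenating (redefining df_a and df_b from above for a self-contained sketch):

```python
import pandas as pd

df_a = pd.DataFrame({'col_a': [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({'col_b': [4, 5, 6]}, index=[1, 2, 3])

# drop=True discards the old labels so both frames align positionally
aligned = pd.concat(
    [df_a.reset_index(drop=True), df_b.reset_index(drop=True)],
    axis=1,
)
print(aligned)
#    col_a  col_b
# 0      1      4
# 1      2      5
# 2      3      6
```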
Handling Index Behavior
The ignore_index parameter solves the duplicate index problem from vertical concatenation:
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
# 3 Bolt 0.99 1000
# 4 Nut 0.50 2000
# 5 Washer 0.25 5000
Now you have a clean sequential index. This is almost always what you want when stacking DataFrames vertically.
The keys parameter creates a hierarchical (multi-level) index, useful when you need to track which original DataFrame each row came from:
combined = pd.concat([df1, df2], keys=['premium', 'budget'])
print(combined)
# product price quantity
# premium 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
# budget 0 Bolt 0.99 1000
# 1 Nut 0.50 2000
# 2 Washer 0.25 5000
# Access specific group
print(combined.loc['premium'])
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
This pattern works well when concatenating data from different sources, time periods, or categories where you need to preserve that grouping information.
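A related convenience, sketched below with two tiny sample frames: the names parameter labels the new index level, and reset_index promotes the grouping label back to an ordinary column.

```python
import pandas as pd

df1 = pd.DataFrame({'product': ['Widget'], 'price': [25.99]})
df2 = pd.DataFrame({'product': ['Bolt'], 'price': [0.99]})

# names= labels the index levels (None keeps the inner level unnamed)
combined = pd.concat([df1, df2], keys=['premium', 'budget'],
                     names=['tier', None])

# Turn the grouping label into a regular column
flat = combined.reset_index(level='tier')
print(flat['tier'].tolist())  # ['premium', 'budget']
```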
Managing Mismatched Columns/Rows
Real-world data rarely aligns perfectly. When concatenating DataFrames with different columns, the join parameter controls behavior:
df_full = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'price': [25.99, 49.99],
    'quantity': [100, 50],
    'supplier': ['Acme', 'Beta']
})
df_partial = pd.DataFrame({
    'product': ['Bolt', 'Nut'],
    'price': [0.99, 0.50],
    'quantity': [1000, 2000]
})
# Default: outer join - keeps all columns, fills missing with NaN
combined_outer = pd.concat([df_full, df_partial], ignore_index=True)
print(combined_outer)
# product price quantity supplier
# 0 Widget 25.99 100 Acme
# 1 Gadget 49.99 50 Beta
# 2 Bolt 0.99 1000 NaN
# 3 Nut 0.50 2000 NaN
# Inner join - keeps only common columns
combined_inner = pd.concat([df_full, df_partial], join='inner', ignore_index=True)
print(combined_inner)
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Bolt 0.99 1000
# 3 Nut 0.50 2000
The outer join (default) preserves all data but introduces NaN values. The inner join drops columns that don’t exist in all DataFrames. Choose based on whether missing data is acceptable in your use case.
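If you take the outer join, a common follow-up is to fill the resulting NaN values explicitly rather than leaving them behind (a sketch with frames shaped like df_full and df_partial above):

```python
import pandas as pd

df_full = pd.DataFrame({'product': ['Widget', 'Gadget'],
                        'supplier': ['Acme', 'Beta']})
df_partial = pd.DataFrame({'product': ['Bolt', 'Nut']})

combined = pd.concat([df_full, df_partial], ignore_index=True)

# Make the missing suppliers explicit instead of keeping NaN
combined['supplier'] = combined['supplier'].fillna('unknown')
print(combined['supplier'].tolist())
# ['Acme', 'Beta', 'unknown', 'unknown']
```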
For horizontal concatenation with mismatched indices, the same logic applies:
df_x = pd.DataFrame({'a': [1, 2]}, index=['row1', 'row2'])
df_y = pd.DataFrame({'b': [3, 4]}, index=['row2', 'row3'])
# Outer join keeps all rows
print(pd.concat([df_x, df_y], axis=1, join='outer'))
# a b
# row1 1.0 NaN
# row2 2.0 3.0
# row3 NaN 4.0
# Inner join keeps only matching rows
print(pd.concat([df_x, df_y], axis=1, join='inner'))
# a b
# row2 2 3
Performance Tips and Best Practices
The biggest performance mistake with concatenation is doing it inside a loop:
# BAD - Don't do this
result = pd.DataFrame()
for filename in file_list:
    df = pd.read_csv(filename)
    result = pd.concat([result, df])  # Creates new DataFrame each iteration
This pattern has O(n²) memory complexity. Each concatenation creates a new DataFrame and copies all existing data. With 100 files of 10,000 rows each, you’re copying millions of rows repeatedly.
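To put a number on that, a back-of-the-envelope count of rows written by each approach, using the 100 files × 10,000 rows figure:

```python
n_files, rows_each = 100, 10_000

# Loop version: iteration k writes a brand-new result of
# k * rows_each rows, so earlier rows are rewritten over and over.
rows_copied_loop = sum(k * rows_each for k in range(1, n_files + 1))

# Single pd.concat call: every row is written exactly once.
rows_copied_once = n_files * rows_each

print(rows_copied_loop)  # 50500000 -- roughly 50x the data volume
print(rows_copied_once)  # 1000000
```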
The correct approach collects DataFrames in a list, then concatenates once:
# GOOD - Do this instead
dataframes = []
for filename in file_list:
    df = pd.read_csv(filename)
    dataframes.append(df)
result = pd.concat(dataframes, ignore_index=True)
Or more pythonically with a list comprehension:
# BETTER - Clean and efficient
dataframes = [pd.read_csv(f) for f in file_list]
result = pd.concat(dataframes, ignore_index=True)
This approach has O(n) memory complexity. You allocate memory for the final DataFrame once.
For very large concatenations in older pandas versions, passing copy=False can avoid some unnecessary data copying. Note that under copy-on-write (opt-in in pandas 2.x, the default in pandas 3.0) the keyword has no effect, so treat it as a legacy optimization:
result = pd.concat(dataframes, ignore_index=True, copy=False)
Quick Reference Summary
| Parameter | Default | Purpose |
|---|---|---|
| objs | Required | List of DataFrames to concatenate |
| axis | 0 | 0 for vertical (rows), 1 for horizontal (columns) |
| join | 'outer' | 'outer' keeps all, 'inner' keeps only common |
| ignore_index | False | True resets index to 0, 1, 2… |
| keys | None | Creates hierarchical index from labels |
| copy | True | False avoids unnecessary copying |
Common Use Cases:
- Combining CSV files: pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
- Stacking monthly data: pd.concat([jan_df, feb_df, mar_df], keys=['Jan', 'Feb', 'Mar'])
- Adding calculated columns: pd.concat([df, new_columns_df], axis=1)
- Combining with source tracking: pd.concat(dfs, keys=source_names)
Concatenation is straightforward once you understand the axis and join parameters. Remember: collect first, concatenate once, and let ignore_index=True handle your index cleanup.