How to Concatenate DataFrames in Pandas
Key Insights
- Use pd.concat() for stacking DataFrames with similar structures, reserving merge() and join() for database-style operations based on keys or indices.
- Always concatenate multiple DataFrames in a single pd.concat() call using a list—never concatenate repeatedly in a loop, as this causes quadratic memory allocation.
- The ignore_index=True parameter prevents duplicate index values and is almost always what you want when stacking rows vertically.
Introduction to DataFrame Concatenation
Concatenation in Pandas means combining two or more DataFrames into a single DataFrame. Unlike merging, which combines data based on shared keys (similar to SQL joins), concatenation simply glues DataFrames together along an axis—either stacking rows vertically or placing columns side-by-side horizontally.
The distinction matters. Use pd.concat() when you have DataFrames with the same structure that need combining—like monthly sales reports or data from multiple CSV files. Use merge() or join() when you need to combine data based on matching values in specific columns.
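To make the contrast concrete, here is a minimal sketch (the sales/prices frames are made-up sample data): concat glues the frames together along an axis, while merge matches rows on a shared key.

```python
import pandas as pd

# Made-up sample data for illustration
sales = pd.DataFrame({'product': ['Widget', 'Gadget'], 'units': [3, 5]})
prices = pd.DataFrame({'product': ['Widget', 'Gadget'], 'price': [25.99, 49.99]})

# concat stacks the frames: 4 rows, NaN where columns differ
stacked = pd.concat([sales, prices], ignore_index=True)

# merge matches rows on the shared key: 2 rows, columns combined
joined = sales.merge(prices, on='product')

print(stacked.shape)  # (4, 3)
print(joined.shape)   # (2, 3)
```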
Let’s set up our working environment:
import pandas as pd
# Sample DataFrames for demonstration
df1 = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'price': [25.99, 49.99, 12.50],
    'quantity': [100, 50, 200]
})
df2 = pd.DataFrame({
    'product': ['Bolt', 'Nut', 'Washer'],
    'price': [0.99, 0.50, 0.25],
    'quantity': [1000, 2000, 5000]
})
print(df1)
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
Basic Vertical Concatenation (Stacking Rows)
The most common concatenation operation stacks DataFrames on top of each other. This is the default behavior of pd.concat():
combined = pd.concat([df1, df2])
print(combined)
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
# 0 Bolt 0.99 1000
# 1 Nut 0.50 2000
# 2 Washer 0.25 5000
Notice the index values. Both DataFrames had indices 0, 1, 2, and Pandas preserved them. This creates duplicate index values, which can cause problems when you try to select rows by index.
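To see why duplicate labels bite, a short sketch: label-based selection now returns multiple rows, and passing verify_integrity=True makes pd.concat refuse overlapping indexes outright.

```python
import pandas as pd

df1 = pd.DataFrame({'product': ['Widget', 'Gadget', 'Sprocket']})
df2 = pd.DataFrame({'product': ['Bolt', 'Nut', 'Washer']})
combined = pd.concat([df1, df2])

# .loc[0] matches every row labeled 0 -- two rows, not one
print(combined.loc[0])

# verify_integrity=True raises ValueError instead of silently
# producing a duplicated index
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError as exc:
    print('concat refused:', exc)
```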
The axis parameter controls this behavior explicitly. axis=0 is the default and means “concatenate along the row axis” (stack vertically):
# Explicit axis parameter - same result as above
combined = pd.concat([df1, df2], axis=0)
You can concatenate more than two DataFrames at once:
df3 = pd.DataFrame({
    'product': ['Gear', 'Cog'],
    'price': [8.99, 5.50],
    'quantity': [300, 400]
})
all_products = pd.concat([df1, df2, df3])
print(len(all_products)) # 8 rows total
Horizontal Concatenation (Adding Columns)
Setting axis=1 concatenates DataFrames side-by-side, adding columns:
df_info = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'price': [25.99, 49.99, 12.50]
})
df_stock = pd.DataFrame({
    'quantity': [100, 50, 200],
    'warehouse': ['A', 'B', 'A']
})
combined = pd.concat([df_info, df_stock], axis=1)
print(combined)
# product price quantity warehouse
# 0 Widget 25.99 100 A
# 1 Gadget 49.99 50 B
# 2 Sprocket 12.50 200 A
Horizontal concatenation aligns rows by index. If indices don’t match, you get NaN values:
df_a = pd.DataFrame({'col_a': [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({'col_b': [4, 5, 6]}, index=[1, 2, 3])
combined = pd.concat([df_a, df_b], axis=1)
print(combined)
# col_a col_b
# 0 1.0 NaN
# 1 2.0 4.0
# 2 3.0 5.0
# 3 NaN 6.0
This behavior catches people off guard. If you expect a simple side-by-side join and your indices don’t align, you’ll get unexpected NaN values.
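If positional, row-by-row pairing is what you actually wanted, one common workaround is to reset both indices before concatenating (redefining df_a and df_b from above for a self-contained sketch):

```python
import pandas as pd

df_a = pd.DataFrame({'col_a': [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({'col_b': [4, 5, 6]}, index=[1, 2, 3])

# drop=True discards the old labels so both frames align positionally
aligned = pd.concat(
    [df_a.reset_index(drop=True), df_b.reset_index(drop=True)],
    axis=1,
)
print(aligned)
#    col_a  col_b
# 0      1      4
# 1      2      5
# 2      3      6
```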
Handling Index Behavior
The ignore_index parameter solves the duplicate index problem from vertical concatenation:
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
# 3 Bolt 0.99 1000
# 4 Nut 0.50 2000
# 5 Washer 0.25 5000
Now you have a clean sequential index. This is almost always what you want when stacking DataFrames vertically.
The keys parameter creates a hierarchical (multi-level) index, useful when you need to track which original DataFrame each row came from:
combined = pd.concat([df1, df2], keys=['premium', 'budget'])
print(combined)
# product price quantity
# premium 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
# budget 0 Bolt 0.99 1000
# 1 Nut 0.50 2000
# 2 Washer 0.25 5000
# Access specific group
print(combined.loc['premium'])
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Sprocket 12.50 200
This pattern works well when concatenating data from different sources, time periods, or categories where you need to preserve that grouping information.
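A related convenience, sketched below with two tiny sample frames: the names parameter labels the new index level, and reset_index promotes the grouping label back to an ordinary column.

```python
import pandas as pd

df1 = pd.DataFrame({'product': ['Widget'], 'price': [25.99]})
df2 = pd.DataFrame({'product': ['Bolt'], 'price': [0.99]})

# names= labels the index levels (None keeps the inner level unnamed)
combined = pd.concat([df1, df2], keys=['premium', 'budget'],
                     names=['tier', None])

# Turn the grouping label into a regular column
flat = combined.reset_index(level='tier')
print(flat['tier'].tolist())  # ['premium', 'budget']
```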
Managing Mismatched Columns/Rows
Real-world data rarely aligns perfectly. When concatenating DataFrames with different columns, the join parameter controls behavior:
df_full = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'price': [25.99, 49.99],
    'quantity': [100, 50],
    'supplier': ['Acme', 'Beta']
})
df_partial = pd.DataFrame({
    'product': ['Bolt', 'Nut'],
    'price': [0.99, 0.50],
    'quantity': [1000, 2000]
})
# Default: outer join - keeps all columns, fills missing with NaN
combined_outer = pd.concat([df_full, df_partial], ignore_index=True)
print(combined_outer)
# product price quantity supplier
# 0 Widget 25.99 100 Acme
# 1 Gadget 49.99 50 Beta
# 2 Bolt 0.99 1000 NaN
# 3 Nut 0.50 2000 NaN
# Inner join - keeps only common columns
combined_inner = pd.concat([df_full, df_partial], join='inner', ignore_index=True)
print(combined_inner)
# product price quantity
# 0 Widget 25.99 100
# 1 Gadget 49.99 50
# 2 Bolt 0.99 1000
# 3 Nut 0.50 2000
The outer join (default) preserves all data but introduces NaN values. The inner join drops columns that don’t exist in all DataFrames. Choose based on whether missing data is acceptable in your use case.
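If you take the outer join, a common follow-up is to fill the resulting NaN values explicitly rather than leaving them behind (a sketch with frames shaped like df_full and df_partial above):

```python
import pandas as pd

df_full = pd.DataFrame({'product': ['Widget', 'Gadget'],
                        'supplier': ['Acme', 'Beta']})
df_partial = pd.DataFrame({'product': ['Bolt', 'Nut']})

combined = pd.concat([df_full, df_partial], ignore_index=True)

# Make the missing suppliers explicit instead of keeping NaN
combined['supplier'] = combined['supplier'].fillna('unknown')
print(combined['supplier'].tolist())
# ['Acme', 'Beta', 'unknown', 'unknown']
```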
For horizontal concatenation with mismatched indices, the same logic applies:
df_x = pd.DataFrame({'a': [1, 2]}, index=['row1', 'row2'])
df_y = pd.DataFrame({'b': [3, 4]}, index=['row2', 'row3'])
# Outer join keeps all rows
print(pd.concat([df_x, df_y], axis=1, join='outer'))
# a b
# row1 1.0 NaN
# row2 2.0 3.0
# row3 NaN 4.0
# Inner join keeps only matching rows
print(pd.concat([df_x, df_y], axis=1, join='inner'))
# a b
# row2 2 3
Performance Tips and Best Practices
The biggest performance mistake with concatenation is doing it inside a loop:
# BAD - Don't do this
result = pd.DataFrame()
for filename in file_list:
    df = pd.read_csv(filename)
    result = pd.concat([result, df])  # Creates new DataFrame each iteration
This pattern has O(n²) memory complexity. Each concatenation creates a new DataFrame and copies all existing data. With 100 files of 10,000 rows each, you’re copying millions of rows repeatedly.
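To put a number on that, a back-of-the-envelope count of rows written by each approach, using the 100 files × 10,000 rows figure:

```python
n_files, rows_each = 100, 10_000

# Loop version: iteration k writes a brand-new result of
# k * rows_each rows, so earlier rows are rewritten over and over.
rows_copied_loop = sum(k * rows_each for k in range(1, n_files + 1))

# Single pd.concat call: every row is written exactly once.
rows_copied_once = n_files * rows_each

print(rows_copied_loop)  # 50500000 -- roughly 50x the data volume
print(rows_copied_once)  # 1000000
```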
The correct approach collects DataFrames in a list, then concatenates once:
# GOOD - Do this instead
dataframes = []
for filename in file_list:
    df = pd.read_csv(filename)
    dataframes.append(df)
result = pd.concat(dataframes, ignore_index=True)
Or more pythonically with a list comprehension:
# BETTER - Clean and efficient
dataframes = [pd.read_csv(f) for f in file_list]
result = pd.concat(dataframes, ignore_index=True)
This approach has O(n) memory complexity. You allocate memory for the final DataFrame once.
For very large concatenations in older pandas versions, passing copy=False can avoid some unnecessary data copying. Note that under copy-on-write (opt-in in pandas 2.x, the default in pandas 3.0) the keyword has no effect, so treat it as a legacy optimization:
result = pd.concat(dataframes, ignore_index=True, copy=False)
Quick Reference Summary
| Parameter | Default | Purpose |
|---|---|---|
| objs | Required | List of DataFrames to concatenate |
| axis | 0 | 0 for vertical (rows), 1 for horizontal (columns) |
| join | 'outer' | 'outer' keeps all, 'inner' keeps only common |
| ignore_index | False | True resets index to 0, 1, 2… |
| keys | None | Creates hierarchical index from labels |
| copy | True | False avoids unnecessary copying |
Common Use Cases:
- Combining CSV files: pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
- Stacking monthly data: pd.concat([jan_df, feb_df, mar_df], keys=['Jan', 'Feb', 'Mar'])
- Adding calculated columns: pd.concat([df, new_columns_df], axis=1)
- Combining with source tracking: pd.concat(dfs, keys=source_names)
Concatenation is straightforward once you understand the axis and join parameters. Remember: collect first, concatenate once, and let ignore_index=True handle your index cleanup.