Pandas - Concatenate DataFrames (concat)

Key Insights

  • pd.concat() combines DataFrames along rows or columns with fine-grained control over index handling, join types, and hierarchical labeling through the keys parameter
  • Understanding axis orientation (0 for rows, 1 for columns) and join strategies (inner vs outer) prevents data loss and unexpected results when merging datasets with mismatched columns or indexes
  • Performance matters: repeated concatenation in a loop re-copies data on every iteration; build a list of pieces and call pd.concat() once

Basic Concatenation Along Rows

The default behavior of pd.concat() stacks DataFrames vertically, appending rows from multiple DataFrames into a single structure. This is the most common use case when combining datasets with identical or similar column structures.

import pandas as pd

df1 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'sales': [100, 150, 200],
    'region': ['North', 'South', 'East']
})

df2 = pd.DataFrame({
    'product': ['D', 'E', 'F'],
    'sales': [175, 225, 190],
    'region': ['West', 'North', 'South']
})

result = pd.concat([df1, df2])
print(result)

Output:

  product  sales region
0       A    100  North
1       B    150  South
2       C    200   East
0       D    175   West
1       E    225  North
2       F    190  South

Notice the index values repeat. Use ignore_index=True to create a continuous integer index:

result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

  product  sales region
0       A    100  North
1       B    150  South
2       C    200   East
3       D    175   West
4       E    225  North
5       F    190  South

Concatenation Along Columns

Set axis=1 to combine DataFrames horizontally, joining them side-by-side. This operation aligns DataFrames based on their index values.

df_sales = pd.DataFrame({
    'Q1': [100, 150, 200],
    'Q2': [110, 160, 210]
}, index=['ProductA', 'ProductB', 'ProductC'])

df_costs = pd.DataFrame({
    'Q1_cost': [80, 120, 150],
    'Q2_cost': [85, 125, 155]
}, index=['ProductA', 'ProductB', 'ProductC'])

result = pd.concat([df_sales, df_costs], axis=1)
print(result)

Output:

           Q1   Q2  Q1_cost  Q2_cost
ProductA  100  110       80       85
ProductB  150  160      120      125
ProductC  200  210      150      155

Handling Mismatched Columns with Join Strategies

When DataFrames have different columns, pd.concat() performs an outer join by default, including all columns and filling missing values with NaN. Use the join parameter to control this behavior.

df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

df2 = pd.DataFrame({
    'B': [7, 8, 9],
    'C': [10, 11, 12]
})

# Outer join (default)
outer_result = pd.concat([df1, df2], ignore_index=True)
print("Outer join:")
print(outer_result)

# Inner join - only common columns
inner_result = pd.concat([df1, df2], join='inner', ignore_index=True)
print("\nInner join:")
print(inner_result)

Output:

Outer join:
     A  B     C
0  1.0  4   NaN
1  2.0  5   NaN
2  3.0  6   NaN
3  NaN  7  10.0
4  NaN  8  11.0
5  NaN  9  12.0

Inner join:
   B
0  4
1  5
2  6
3  7
4  8
5  9
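If you want to keep one specific DataFrame's column layout rather than the intersection or the full union, a reindex on columns after the concat does the job. A minimal sketch, reusing the frames above:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'B': [7, 8, 9], 'C': [10, 11, 12]})

# Keep exactly df1's columns; values df2 lacks become NaN, its extra columns are dropped
result = pd.concat([df1, df2], ignore_index=True).reindex(columns=df1.columns)
print(result)
```

This gives column A with NaN for df2's rows and discards column C entirely, which is often closer to what a pipeline expects than either join mode.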

Using Keys for Hierarchical Indexing

The keys parameter creates a MultiIndex, labeling which DataFrame each row originated from. This is invaluable for tracking data sources after concatenation.

df_2022 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [1000, 1500, 2000]
})

df_2023 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [1200, 1800, 2300]
})

result = pd.concat([df_2022, df_2023], keys=['2022', '2023'])
print(result)

Output:

        product  revenue
2022 0        A     1000
     1        B     1500
     2        C     2000
2023 0        A     1200
     1        B     1800
     2        C     2300

Access data by year using the MultiIndex:

print(result.loc['2022'])
print(result.loc['2023', 'revenue'].sum())  # 5300

Name the index levels for clarity:

result = pd.concat([df_2022, df_2023], keys=['2022', '2023'], names=['year', 'row'])
print(result)
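Named levels pay off when aggregating: groupby can reference the level by name instead of by position. Continuing the same example:

```python
import pandas as pd

df_2022 = pd.DataFrame({'product': ['A', 'B', 'C'], 'revenue': [1000, 1500, 2000]})
df_2023 = pd.DataFrame({'product': ['A', 'B', 'C'], 'revenue': [1200, 1800, 2300]})

result = pd.concat([df_2022, df_2023], keys=['2022', '2023'], names=['year', 'row'])

# Aggregate per year using the named index level
totals = result.groupby(level='year')['revenue'].sum()
print(totals)  # 2022 -> 4500, 2023 -> 5300
```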

Column-wise Concatenation with Mismatched Indexes

When concatenating along columns with different indexes, pd.concat() performs an outer join on the index by default.

df1 = pd.DataFrame({
    'A': [1, 2, 3]
}, index=[0, 1, 2])

df2 = pd.DataFrame({
    'B': [4, 5, 6]
}, index=[1, 2, 3])

result = pd.concat([df1, df2], axis=1)
print(result)

Output:

     A    B
0  1.0  NaN
1  2.0  4.0
2  3.0  5.0
3  NaN  6.0

Use join='inner' to keep only matching indexes:

result = pd.concat([df1, df2], axis=1, join='inner')
print(result)

Output:

   A  B
1  2  4
2  3  5
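There is no join mode that aligns to one particular frame's index, but you can get that effect by reindexing the other frame first. A sketch using the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

# Align everything to df1's index: rows df2 lacks become NaN, its extras are dropped
result = pd.concat([df1, df2.reindex(df1.index)], axis=1)
print(result)
```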

Verifying Index Uniqueness with verify_integrity

Prevent duplicate index values in the result by setting verify_integrity=True. This raises an error if the concatenation would create duplicate index entries.

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[1, 2])

try:
    result = pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
    print(f"Error: {e}")

This catches data integrity issues early in your pipeline.
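When the overlap is expected rather than a bug, one simple resolution is to discard the original indexes with ignore_index=True instead of raising:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[1, 2])

# Overlapping index values are expected here, so drop them and renumber
result = pd.concat([df1, df2], ignore_index=True)
print(result)
```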

Performance Considerations

Concatenating DataFrames repeatedly in a loop is inefficient due to repeated memory allocation. Build a list and concatenate once:

import time

# Bad: repeated concatenation
start = time.time()
result = pd.DataFrame()
for i in range(100):
    df = pd.DataFrame({'col': range(1000)})
    result = pd.concat([result, df], ignore_index=True)
print(f"Repeated concat: {time.time() - start:.3f}s")

# Good: collect then concatenate
start = time.time()
dfs = []
for i in range(100):
    df = pd.DataFrame({'col': range(1000)})
    dfs.append(df)
result = pd.concat(dfs, ignore_index=True)
print(f"Single concat: {time.time() - start:.3f}s")

The single concatenation approach is substantially faster, and the gap widens as the number of pieces grows: repeated concatenation re-copies all accumulated data on every iteration, making the loop quadratic in total size.
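Since pd.concat() accepts any iterable of DataFrames, the collect-then-concatenate pattern can also be written as a generator expression, avoiding the intermediate list entirely:

```python
import pandas as pd

# pd.concat accepts any iterable, so a generator expression works too
result = pd.concat(
    (pd.DataFrame({'col': range(1000)}) for _ in range(100)),
    ignore_index=True,
)
print(len(result))  # 100000
```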

Practical Example: Combining Multi-Source Data

A real-world scenario: combining sales data from multiple regional databases with different column structures.

north_sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'product_id': [101, 102],
    'revenue': [5000, 7500],
    'units': [50, 75]
})

south_sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'product_id': [103, 104],
    'revenue': [6000, 8000],
    'discount_applied': [True, False]
})

# Combine with region labels
combined = pd.concat(
    [north_sales, south_sales],
    keys=['North', 'South'],
    names=['region', 'record_id']
)

# Reset index to make region a column
combined = combined.reset_index(level=0)
print(combined)

Output:

  region        date  product_id  revenue  units  discount_applied
0  North  2024-01-01         101     5000   50.0               NaN
1  North  2024-01-02         102     7500   75.0               NaN
0  South  2024-01-01         103     6000    NaN              True
1  South  2024-01-02         104     8000    NaN             False

Now analyze by region:

print(combined.groupby('region')['revenue'].sum())

This pattern handles heterogeneous data sources while preserving data lineage through the region identifier.
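The outer join leaves NaN in the region-specific columns. One way to tidy those gaps, assuming zero units and "no discount" are sensible defaults for the missing fields:

```python
import pandas as pd

north_sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'product_id': [101, 102],
    'revenue': [5000, 7500],
    'units': [50, 75]
})
south_sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'product_id': [103, 104],
    'revenue': [6000, 8000],
    'discount_applied': [True, False]
})

combined = pd.concat([north_sales, south_sales], keys=['North', 'South'],
                     names=['region', 'record_id']).reset_index(level=0)

# Fill the source-specific gaps with explicit defaults (assumed, not from the source data)
combined['units'] = combined['units'].fillna(0)
combined['discount_applied'] = combined['discount_applied'].fillna(False)

totals = combined.groupby('region')['revenue'].sum()
print(totals)  # North -> 12500, South -> 14000
```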
