Pandas - Concatenate Along Rows vs Columns


Key Insights

• `pd.concat()` uses the `axis` parameter to control concatenation direction: `axis=0` stacks DataFrames vertically (along rows), while `axis=1` joins them horizontally (along columns)
• Row concatenation preserves column structure and extends the dataset downward, while column concatenation preserves row indices and extends the dataset sideways; choosing the wrong axis is a common source of data pipeline bugs
• Index handling through the `ignore_index`, `keys`, and `join` parameters determines how duplicate indices are managed and how mismatched data structures are aligned during concatenation

Understanding Axis Direction in Concatenation

The axis parameter in pd.concat() determines the direction of concatenation. This isn’t just a technical detail—it fundamentally changes how your data combines and which dimension grows.

import pandas as pd
import numpy as np

# Create sample DataFrames
df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['row1', 'row2', 'row3'])

df2 = pd.DataFrame({
    'A': [7, 8, 9],
    'B': [10, 11, 12]
}, index=['row4', 'row5', 'row6'])

# Concatenate along rows (axis=0, default)
result_rows = pd.concat([df1, df2], axis=0)
print("Row concatenation (axis=0):")
print(result_rows)
print(f"Shape: {result_rows.shape}\n")

# Concatenate along columns (axis=1)
result_cols = pd.concat([df1, df2], axis=1)
print("Column concatenation (axis=1):")
print(result_cols)
print(f"Shape: {result_cols.shape}")

Output:

Row concatenation (axis=0):
      A   B
row1  1   4
row2  2   5
row3  3   6
row4  7  10
row5  8  11
row6  9  12
Shape: (6, 2)

Column concatenation (axis=1):
        A    B    A     B
row1  1.0  4.0  NaN   NaN
row2  2.0  5.0  NaN   NaN
row3  3.0  6.0  NaN   NaN
row4  NaN  NaN  7.0  10.0
row5  NaN  NaN  8.0  11.0
row6  NaN  NaN  9.0  12.0
Shape: (6, 4)

Notice that column concatenation produces NaN values because the row indices don’t overlap. This behavior is critical for understanding how pandas handles mismatched indices.
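If the two frames actually describe the same logical rows and only their labels differ, one common fix (a sketch reusing `df1` and `df2` from above) is to drop both indices so pandas aligns by position instead of by label:

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
}, index=['row1', 'row2', 'row3'])

df2 = pd.DataFrame({
    'A': [7, 8, 9],
    'B': [10, 11, 12]
}, index=['row4', 'row5', 'row6'])

# reset_index(drop=True) gives both frames the same 0..2 index,
# so axis=1 aligns rows positionally with no NaN padding
side_by_side = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)],
    axis=1
)
print(side_by_side.shape)  # (3, 4)
```

Only do this when you are sure the rows correspond positionally; it silently discards the original labels.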

Row Concatenation: Stacking Data Vertically

Row concatenation (axis=0) is the default behavior and most common use case. You’re appending one dataset below another, like stacking papers on a desk.

# Realistic scenario: combining monthly sales data
jan_sales = pd.DataFrame({
    'product_id': [101, 102, 103],
    'units_sold': [50, 75, 30],
    'revenue': [5000, 7500, 3000]
})

feb_sales = pd.DataFrame({
    'product_id': [101, 102, 104],
    'units_sold': [60, 80, 45],
    'revenue': [6000, 8000, 4500]
})

# Combine monthly data
quarterly_sales = pd.concat([jan_sales, feb_sales], axis=0)
print(quarterly_sales)

Output:

   product_id  units_sold  revenue
0         101          50     5000
1         102          75     7500
2         103          30     3000
0         101          60     6000
1         102          80     8000
2         104          45     4500

The duplicate indices (0, 1, 2) appear because each DataFrame had its own index. Use ignore_index=True to create a continuous index:

quarterly_sales = pd.concat([jan_sales, feb_sales], axis=0, ignore_index=True)
print(quarterly_sales)

Output:

   product_id  units_sold  revenue
0         101          50     5000
1         102          75     7500
2         103          30     3000
3         101          60     6000
4         102          80     8000
5         104          45     4500
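If overlapping indices would indicate a bug in your pipeline rather than expected behavior, `verify_integrity=True` makes `pd.concat()` raise instead of silently keeping duplicates. A minimal sketch:

```python
import pandas as pd

jan = pd.DataFrame({'value': [1]}, index=[0])
feb = pd.DataFrame({'value': [2]}, index=[0])

# verify_integrity=True turns silent duplicate indices into a loud error
try:
    pd.concat([jan, feb], verify_integrity=True)
    overlap_detected = False
except ValueError as err:
    overlap_detected = True
    print(err)
```

This check adds overhead on large indices, so reserve it for places where duplicates genuinely must not occur.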

Column Concatenation: Joining Data Horizontally

Column concatenation (axis=1) joins DataFrames side-by-side, matching on index values. This is useful when you have different attributes for the same entities.

# Customer demographics
demographics = pd.DataFrame({
    'age': [25, 35, 45],
    'city': ['NYC', 'LA', 'Chicago']
}, index=['cust_001', 'cust_002', 'cust_003'])

# Customer purchase behavior
purchases = pd.DataFrame({
    'total_spent': [1500, 2500, 800],
    'num_orders': [5, 10, 3]
}, index=['cust_001', 'cust_002', 'cust_003'])

# Combine customer data
customer_profile = pd.concat([demographics, purchases], axis=1)
print(customer_profile)

Output:

          age     city  total_spent  num_orders
cust_001   25      NYC         1500           5
cust_002   35       LA         2500          10
cust_003   45  Chicago          800           3

When indices match perfectly, column concatenation works seamlessly. The real complexity emerges with mismatched indices.

Handling Mismatched Indices and Columns

The join parameter controls how pandas handles mismatched indices during concatenation. It accepts 'outer' (default) or 'inner'.

# DataFrames with partial overlap
df_a = pd.DataFrame({
    'metric1': [10, 20, 30]
}, index=['A', 'B', 'C'])

df_b = pd.DataFrame({
    'metric2': [40, 50, 60]
}, index=['B', 'C', 'D'])

# Outer join (default): includes all indices
outer_result = pd.concat([df_a, df_b], axis=1, join='outer')
print("Outer join:")
print(outer_result)

# Inner join: only matching indices
inner_result = pd.concat([df_a, df_b], axis=1, join='inner')
print("\nInner join:")
print(inner_result)

Output:

Outer join:
   metric1  metric2
A     10.0      NaN
B     20.0     40.0
C     30.0     50.0
D      NaN     60.0

Inner join:
   metric1  metric2
B       20       40
C       30       50
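One side effect visible in the outer result above: the integer columns were upcast to float because NaN has no integer representation. If preserving integer dtypes matters, a sketch using pandas' nullable `Int64` type:

```python
import pandas as pd

df_a = pd.DataFrame({'metric1': [10, 20, 30]}, index=['A', 'B', 'C'])
df_b = pd.DataFrame({'metric2': [40, 50, 60]}, index=['B', 'C', 'D'])

# Nullable Int64 columns keep integer dtype; missing values
# appear as <NA> instead of forcing a float upcast
outer = pd.concat(
    [df_a.astype('Int64'), df_b.astype('Int64')],
    axis=1, join='outer'
)
print(outer['metric1'].dtype)  # Int64
```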

For row concatenation with different columns, the same logic applies:

df1 = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

df2 = pd.DataFrame({
    'B': [5, 6],
    'C': [7, 8]
})

# Outer join: all columns
print(pd.concat([df1, df2], axis=0, join='outer'))

# Inner join: only common columns
print(pd.concat([df1, df2], axis=0, join='inner'))
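When every piece should conform to one known schema regardless of which columns it happens to carry, reindexing before concatenation is one option. A sketch; the `schema` list here is an assumption for illustration:

```python
import pandas as pd

schema = ['A', 'B', 'C']  # hypothetical target column order

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

# reindex inserts missing columns as NaN and enforces column order
aligned = pd.concat(
    [df.reindex(columns=schema) for df in (df1, df2)],
    axis=0, ignore_index=True
)
print(list(aligned.columns))  # ['A', 'B', 'C']
```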

Using Keys for Multi-Level Indexing

The keys parameter adds a hierarchical index level, making it easy to track which original DataFrame each row or column came from.

# Combine data with source tracking
q1_data = pd.DataFrame({
    'revenue': [100, 150, 200]
}, index=['Jan', 'Feb', 'Mar'])

q2_data = pd.DataFrame({
    'revenue': [180, 220, 250]
}, index=['Apr', 'May', 'Jun'])

# Add quarter labels
yearly = pd.concat(
    [q1_data, q2_data],
    keys=['Q1', 'Q2'],
    names=['Quarter', 'Month']
)

print(yearly)
print("\nAccess Q1 data:")
print(yearly.loc['Q1'])

Output:

              revenue
Quarter Month        
Q1      Jan       100
        Feb       150
        Mar       200
Q2      Apr       180
        May       220
        Jun       250

Access Q1 data:
       revenue
Month         
Jan        100
Feb        150
Mar        200
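The outer level added by `keys` is also handy for aggregation. A small sketch reusing the quarterly data above:

```python
import pandas as pd

q1_data = pd.DataFrame({'revenue': [100, 150, 200]},
                       index=['Jan', 'Feb', 'Mar'])
q2_data = pd.DataFrame({'revenue': [180, 220, 250]},
                       index=['Apr', 'May', 'Jun'])

yearly = pd.concat([q1_data, q2_data], keys=['Q1', 'Q2'],
                   names=['Quarter', 'Month'])

# groupby(level=...) aggregates over the index level added by keys
totals = yearly.groupby(level='Quarter').sum()
print(totals.loc['Q1', 'revenue'])  # 450
```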

For column concatenation, keys create multi-level column headers:

actual = pd.DataFrame({'sales': [100, 200]}, index=['Store1', 'Store2'])
forecast = pd.DataFrame({'sales': [95, 210]}, index=['Store1', 'Store2'])

comparison = pd.concat([actual, forecast], axis=1, keys=['Actual', 'Forecast'])
print(comparison)

Output:

        Actual Forecast
         sales    sales
Store1     100       95
Store2     200      210
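Multi-level columns are addressed with tuples, which makes comparisons across the two sources straightforward. A sketch reusing the frames above:

```python
import pandas as pd

actual = pd.DataFrame({'sales': [100, 200]}, index=['Store1', 'Store2'])
forecast = pd.DataFrame({'sales': [95, 210]}, index=['Store1', 'Store2'])
comparison = pd.concat([actual, forecast], axis=1,
                       keys=['Actual', 'Forecast'])

# A (level0, level1) tuple selects one column from the hierarchy
error = comparison[('Actual', 'sales')] - comparison[('Forecast', 'sales')]
print(error.tolist())  # [5, -10]
```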

Performance Considerations and Best Practices

Concatenating DataFrames one at a time in a loop is inefficient because each pd.concat call copies every row accumulated so far, making the whole loop quadratic. Collect the pieces in a list first, then concatenate once:

import time

# Inefficient: concatenating in loop
start = time.time()
result = pd.DataFrame()
for i in range(100):
    df = pd.DataFrame({'col': [i]})
    result = pd.concat([result, df], ignore_index=True)
print(f"Loop concatenation: {time.time() - start:.4f}s")

# Efficient: collect then concatenate
start = time.time()
dfs = []
for i in range(100):
    df = pd.DataFrame({'col': [i]})
    dfs.append(df)
result = pd.concat(dfs, ignore_index=True)
print(f"Batch concatenation: {time.time() - start:.4f}s")
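Since `pd.concat()` accepts any iterable of frames, the collect-then-concatenate pattern often collapses into a single expression. A sketch:

```python
import pandas as pd

# A generator expression feeds pd.concat directly: still a single
# concatenation, so the quadratic copying of the loop version is avoided
result = pd.concat(
    (pd.DataFrame({'col': [i]}) for i in range(100)),
    ignore_index=True,
)
print(result.shape)  # (100, 1)
```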

When working with large datasets, verify alignment before concatenating:

def safe_concat(dfs, axis=0, **kwargs):
    """Concatenate with validation."""
    if axis == 0:
        # Compare column sets; frozenset ignores column order,
        # so only genuine mismatches are flagged
        cols = [frozenset(df.columns) for df in dfs]
        if len(set(cols)) > 1:
            print("Warning: Column mismatch detected")
            for i, col_set in enumerate(cols):
                print(f"  DataFrame {i}: {sorted(col_set)}")

    return pd.concat(dfs, axis=axis, **kwargs)

# Usage
df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [3], 'C': [4]})
result = safe_concat([df1, df2], axis=0)

Understanding axis direction, index alignment, and join behavior transforms pd.concat() from a source of bugs into a precise data manipulation tool. Choose axis=0 when extending your dataset with more observations, axis=1 when adding features to existing observations, and always verify index alignment for your specific use case.
