Pandas - Append DataFrames

Key Insights

  • The pd.concat() function is the recommended method for appending DataFrames, offering flexibility for both row-wise and column-wise concatenation with control over index handling and duplicate management.
  • The DataFrame.append() method, deprecated in pandas 1.4 and removed in pandas 2.0, should be replaced with pd.concat() or list-accumulation patterns that avoid quadratic time complexity in loops.
  • Performance optimization requires understanding when to use ignore_index=True, how to handle memory efficiently with large datasets, and when to accumulate frames in a list or construct the DataFrame directly rather than concatenating inside a loop.

Understanding DataFrame Concatenation

Appending DataFrames is a fundamental operation in data manipulation workflows. The primary method is pd.concat(), which concatenates pandas objects along a particular axis with optional set logic along other axes.

import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 92, 78]
})

df2 = pd.DataFrame({
    'id': [4, 5],
    'name': ['David', 'Eve'],
    'score': [88, 95]
})

# Basic concatenation
result = pd.concat([df1, df2])
print(result)

Output:

   id     name  score
0   1    Alice     85
1   2      Bob     92
2   3  Charlie     78
0   4    David     88
1   5      Eve     95

Notice the index values repeat. This is often undesirable and can be addressed with the ignore_index parameter.

Controlling Index Behavior

When appending DataFrames, managing the resulting index is critical for subsequent operations. The ignore_index=True parameter creates a new sequential integer index.

result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

   id     name  score
0   1    Alice     85
1   2      Bob     92
2   3  Charlie     78
3   4    David     88
4   5      Eve     95

To preserve the original index while guarding against duplicate labels:

result = pd.concat([df1, df2], verify_integrity=True)

This raises a ValueError if there are duplicate index values, ensuring data integrity when index preservation matters.
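As a quick illustration (df_a and df_b here are throwaway frames, not the ones defined above), overlapping index labels cause the check to fail:

```python
import pandas as pd

# Two frames whose indexes overlap at label 2
df_a = pd.DataFrame({'id': [1, 2, 3]}, index=[0, 1, 2])
df_b = pd.DataFrame({'id': [4, 5]}, index=[2, 3])

try:
    pd.concat([df_a, df_b], verify_integrity=True)
except ValueError as exc:
    print(f"Concatenation rejected: {exc}")
```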

Handling Column Mismatches

Real-world scenarios often involve DataFrames with different column sets. The join parameter controls how to handle this situation.

df3 = pd.DataFrame({
    'id': [6, 7],
    'name': ['Frank', 'Grace'],
    'department': ['Engineering', 'Sales']
})

# Outer join (default) - keeps all columns
result_outer = pd.concat([df1, df3], ignore_index=True)
print(result_outer)

Output:

   id     name  score   department
0   1    Alice   85.0          NaN
1   2      Bob   92.0          NaN
2   3  Charlie   78.0          NaN
3   6    Frank    NaN  Engineering
4   7    Grace    NaN        Sales

For inner join, keeping only common columns:

result_inner = pd.concat([df1, df3], join='inner', ignore_index=True)
print(result_inner)

Output:

   id     name
0   1    Alice
1   2      Bob
2   3  Charlie
3   6    Frank
4   7    Grace

Performance Considerations in Loops

A common anti-pattern is growing a DataFrame inside a loop: each concatenation copies every row accumulated so far, so the total work is quadratic in the number of iterations.

Anti-pattern:

# Avoid this - O(n²) complexity
result = pd.DataFrame()
for i in range(1000):
    temp_df = pd.DataFrame({'value': [i]})
    result = pd.concat([result, temp_df], ignore_index=True)

Optimized approach:

# Collect DataFrames in a list first
dataframes = []
for i in range(1000):
    temp_df = pd.DataFrame({'value': [i]})
    dataframes.append(temp_df)

# Single concatenation operation
result = pd.concat(dataframes, ignore_index=True)

For simple data generation, direct construction is even more efficient:

# Most efficient for simple cases
result = pd.DataFrame({'value': range(1000)})
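The gap is easy to measure. A rough timing sketch (absolute numbers vary by machine; only the relative difference matters):

```python
import time
import pandas as pd

def loop_concat(n):
    # O(n^2): every iteration copies all rows accumulated so far
    result = pd.DataFrame()
    for i in range(n):
        result = pd.concat([result, pd.DataFrame({'value': [i]})],
                           ignore_index=True)
    return result

def list_concat(n):
    # O(n): one copy at the end
    frames = [pd.DataFrame({'value': [i]}) for i in range(n)]
    return pd.concat(frames, ignore_index=True)

n = 500
start = time.perf_counter()
loop_concat(n)
middle = time.perf_counter()
list_concat(n)
end = time.perf_counter()
print(f"loop-concat: {middle - start:.3f}s, list-concat: {end - middle:.3f}s")
```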

Appending with MultiIndex

When working with hierarchical indexes, pd.concat() provides the keys parameter to create a MultiIndex.

sales_q1 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [10000, 15000, 12000]
})

sales_q2 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [11000, 16000, 13000]
})

result = pd.concat(
    [sales_q1, sales_q2],
    keys=['Q1', 'Q2'],
    names=['Quarter', 'Index']
)
print(result)

Output:

              product  revenue
Quarter Index                 
Q1      0           A    10000
        1           B    15000
        2           C    12000
Q2      0           A    11000
        1           B    16000
        2           C    13000

This structure enables powerful grouping and filtering operations:

# Access Q1 data
q1_data = result.loc['Q1']

# Calculate quarterly totals
quarterly_totals = result.groupby('Quarter')['revenue'].sum()
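A couple more operations the hierarchical index enables, sketched here by rebuilding the quarterly frames so the snippet is self-contained:

```python
import pandas as pd

sales_q1 = pd.DataFrame({'product': ['A', 'B', 'C'],
                         'revenue': [10000, 15000, 12000]})
sales_q2 = pd.DataFrame({'product': ['A', 'B', 'C'],
                         'revenue': [11000, 16000, 13000]})
result = pd.concat([sales_q1, sales_q2],
                   keys=['Q1', 'Q2'], names=['Quarter', 'Index'])

# Cross-section: the first row of every quarter
first_rows = result.xs(0, level='Index')

# Pivot quarters into columns to compare positions side by side
by_quarter = result['revenue'].unstack('Quarter')
print(by_quarter)
```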

Memory-Efficient Concatenation

For large datasets, memory efficiency becomes paramount. Using categorical data types and chunked processing can significantly reduce memory footprint.

# Efficient concatenation with categorical conversion
df_list = []
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    # Convert string columns to categorical
    for col in chunk.select_dtypes(include=['object']).columns:
        chunk[col] = chunk[col].astype('category')
    df_list.append(chunk)

result = pd.concat(df_list, ignore_index=True)
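The categorical conversion pays off quickly for low-cardinality string columns. A quick check of the savings, using synthetic data for illustration:

```python
import pandas as pd

# Repetitive string column, typical of log fields or status codes
colors = pd.Series(['red', 'green', 'blue'] * 10_000)

as_object = colors.memory_usage(deep=True)
as_category = colors.astype('category').memory_usage(deep=True)
print(f"object: {as_object:,} bytes, category: {as_category:,} bytes")
```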

For extremely large datasets, passing copy=False can avoid some intermediate copies:

# Avoid unnecessary copying where possible
result = pd.concat([df1, df2], copy=False)

Note that copy=False does not guarantee zero copying; it is a hint to avoid copies when possible. Under copy-on-write (the default behavior in pandas 3.0), the keyword no longer has any effect.

Vertical vs Horizontal Concatenation

While row-wise concatenation (axis=0) is most common for appending, column-wise concatenation (axis=1) is useful for adding features.

features = pd.DataFrame({
    'id': [1, 2, 3],
    'feature1': [10, 20, 30]
})

labels = pd.DataFrame({
    'label': ['positive', 'negative', 'positive']
})

# Column-wise concatenation
combined = pd.concat([features, labels], axis=1)
print(combined)

Output:

   id  feature1     label
0   1        10  positive
1   2        20  negative
2   3        30  positive

For column concatenation with index alignment:

features = pd.DataFrame({
    'feature1': [10, 20, 30]
}, index=[1, 2, 3])

labels = pd.DataFrame({
    'label': ['positive', 'negative', 'positive', 'negative']
}, index=[1, 2, 3, 4])

# Outer join on index
combined = pd.concat([features, labels], axis=1, join='outer')
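With these mismatched indexes, the outer join yields four rows with feature1 missing at label 4, while join='inner' keeps only the three shared labels. A self-contained sketch:

```python
import pandas as pd

features = pd.DataFrame({'feature1': [10, 20, 30]}, index=[1, 2, 3])
labels = pd.DataFrame(
    {'label': ['positive', 'negative', 'positive', 'negative']},
    index=[1, 2, 3, 4]
)

outer = pd.concat([features, labels], axis=1, join='outer')
inner = pd.concat([features, labels], axis=1, join='inner')

# The outer result carries a NaN where feature1 has no row for label 4
print(outer)
print(inner)
```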

Practical Use Case: Log Aggregation

A real-world scenario demonstrating efficient DataFrame appending for log file processing:

import glob
from datetime import datetime

def process_log_files(pattern):
    """Aggregate multiple log files into a single DataFrame."""
    dataframes = []
    
    for filepath in glob.glob(pattern):
        # Read and parse log file
        df = pd.read_csv(
            filepath,
            parse_dates=['timestamp'],
            dtype={'user_id': 'category', 'action': 'category'}
        )
        
        # Add source file metadata
        df['source_file'] = filepath
        dataframes.append(df)
    
    # Single concatenation operation
    if not dataframes:
        return pd.DataFrame()
    
    combined = pd.concat(dataframes, ignore_index=True)
    
    # Sort by timestamp for chronological analysis
    combined.sort_values('timestamp', inplace=True)
    
    return combined

# Usage
logs = process_log_files('logs/*.csv')
daily_stats = logs.groupby(logs['timestamp'].dt.date).size()

This pattern efficiently processes multiple files, maintains data types, and provides a clean aggregated dataset for analysis.

Handling Duplicate Data

When appending DataFrames, duplicates may occur. Address this with the drop_duplicates() method post-concatenation:

result = pd.concat([df1, df2, df1], ignore_index=True)
result_clean = result.drop_duplicates(subset=['id'], keep='first')

For stricter guarantees, remember that verify_integrity only detects duplicate index labels; duplicates in the data itself require custom validation logic to ensure quality during the concatenation process.
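One way to sketch that validation (concat_unique is an illustrative helper, not a pandas API):

```python
import pandas as pd

def concat_unique(frames, subset):
    """Concatenate frames and fail loudly if key columns contain duplicates."""
    combined = pd.concat(frames, ignore_index=True)
    duplicated = combined.duplicated(subset=subset)
    if duplicated.any():
        raise ValueError(
            f"{int(duplicated.sum())} duplicate row(s) on {subset}"
        )
    return combined

df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'David']})

clean = concat_unique([df1, df2], subset=['id'])   # succeeds
# concat_unique([df1, df1], subset=['id'])          # would raise ValueError
```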
