Pandas - Append DataFrames
Appending DataFrames is a fundamental operation in data manipulation workflows. The primary method is `pd.concat()`, which concatenates pandas objects along a particular axis with optional set logic along the other axes.
Key Insights
- The `pd.concat()` function is the recommended method for appending DataFrames, offering flexibility for both row-wise and column-wise concatenation with control over index handling and duplicate management.
- The deprecated `DataFrame.append()` method should be avoided in favor of `pd.concat()` or list-accumulation patterns that prevent quadratic time complexity in loops.
- Performance optimization requires understanding when to use `ignore_index=True`, how to handle memory efficiently with large datasets, and when to preallocate space or use alternative approaches like list comprehension.
Understanding DataFrame Concatenation
The examples in this section build up from the simplest case: stacking two DataFrames row-wise with `pd.concat()`.
import pandas as pd
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 92, 78]
})

df2 = pd.DataFrame({
    'id': [4, 5],
    'name': ['David', 'Eve'],
    'score': [88, 95]
})
# Basic concatenation
result = pd.concat([df1, df2])
print(result)
Output:
id name score
0 1 Alice 85
1 2 Bob 92
2 3 Charlie 78
0 4 David 88
1 5 Eve 95
Notice the index values repeat. This is often undesirable and can be addressed with the ignore_index parameter.
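To see concretely why the repeated labels matter, here is a small sketch: with duplicate labels, a label-based lookup returns every matching row instead of a single row, which silently changes the shape of downstream results.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [4, 5]})

result = pd.concat([df1, df2])

# Label 0 now appears twice, so .loc[0] returns a
# two-row DataFrame rather than a single Series.
rows_at_zero = result.loc[0]
print(len(rows_at_zero))  # 2
```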
Controlling Index Behavior
When appending DataFrames, managing the resulting index is critical for subsequent operations. The ignore_index=True parameter creates a new sequential integer index.
result = pd.concat([df1, df2], ignore_index=True)
print(result)
Output:
id name score
0 1 Alice 85
1 2 Bob 92
2 3 Charlie 78
3 4 David 88
4 5 Eve 95
For maintaining the original index with verification:
result = pd.concat([df1, df2], verify_integrity=True)
This raises a ValueError if there are duplicate index values, ensuring data integrity when index preservation matters.
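A short sketch of both outcomes: overlapping labels trigger the ValueError, while disjoint indexes pass the check unchanged.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'id': [4, 5]}, index=[0, 1])  # overlaps df1's index

raised = False
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError:
    raised = True  # labels 0 and 1 appear in both inputs

# Non-overlapping indexes pass the integrity check
df3 = pd.DataFrame({'id': [4, 5]}, index=[3, 4])
ok = pd.concat([df1, df3], verify_integrity=True)
print(raised, len(ok))  # True 5
```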
Handling Column Mismatches
Real-world scenarios often involve DataFrames with different column sets. The join parameter controls how to handle this situation.
df3 = pd.DataFrame({
    'id': [6, 7],
    'name': ['Frank', 'Grace'],
    'department': ['Engineering', 'Sales']
})
# Outer join (default) - keeps all columns
result_outer = pd.concat([df1, df3], ignore_index=True)
print(result_outer)
Output:
id name score department
0 1 Alice 85.0 NaN
1 2 Bob 92.0 NaN
2 3 Charlie 78.0 NaN
3 6 Frank NaN Engineering
4 7 Grace NaN Sales
For inner join, keeping only common columns:
result_inner = pd.concat([df1, df3], join='inner', ignore_index=True)
print(result_inner)
Output:
id name
0 1 Alice
1 2 Bob
2 3 Charlie
3 6 Frank
4 7 Grace
Performance Considerations in Loops
A common anti-pattern involves repeatedly appending DataFrames in loops, which creates quadratic time complexity due to repeated memory allocation.
Anti-pattern:
# Avoid this - O(n²) complexity
result = pd.DataFrame()
for i in range(1000):
    temp_df = pd.DataFrame({'value': [i]})
    result = pd.concat([result, temp_df], ignore_index=True)
Optimized approach:
# Collect DataFrames in a list first
dataframes = []
for i in range(1000):
    temp_df = pd.DataFrame({'value': [i]})
    dataframes.append(temp_df)

# Single concatenation operation
result = pd.concat(dataframes, ignore_index=True)
For simple data generation, direct construction is even more efficient:
# Most efficient for simple cases
result = pd.DataFrame({'value': range(1000)})
Appending with MultiIndex
When working with hierarchical indexes, pd.concat() provides the keys parameter to create a MultiIndex.
sales_q1 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [10000, 15000, 12000]
})

sales_q2 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [11000, 16000, 13000]
})
result = pd.concat(
    [sales_q1, sales_q2],
    keys=['Q1', 'Q2'],
    names=['Quarter', 'Index']
)
print(result)
Output:
product revenue
Quarter Index
Q1 0 A 10000
1 B 15000
2 C 12000
Q2 0 A 11000
1 B 16000
2 C 13000
This structure enables powerful grouping and filtering operations:
# Access Q1 data
q1_data = result.loc['Q1']
# Calculate quarterly totals
quarterly_totals = result.groupby('Quarter')['revenue'].sum()
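As a sketch of what the keyed index enables, `DataFrame.xs` can select one quarter without manual tuple indexing, and grouping by the named level gives per-quarter aggregates:

```python
import pandas as pd

sales_q1 = pd.DataFrame({'product': ['A', 'B', 'C'],
                         'revenue': [10000, 15000, 12000]})
sales_q2 = pd.DataFrame({'product': ['A', 'B', 'C'],
                         'revenue': [11000, 16000, 13000]})

result = pd.concat([sales_q1, sales_q2],
                   keys=['Q1', 'Q2'], names=['Quarter', 'Index'])

# Cross-section: all Q2 rows, with the Quarter level dropped
q2 = result.xs('Q2', level='Quarter')

# Per-quarter revenue totals via the named index level
totals = result.groupby(level='Quarter')['revenue'].sum()
print(totals['Q1'], totals['Q2'])  # 37000 40000
```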
Memory-Efficient Concatenation
For large datasets, memory efficiency becomes paramount. Using categorical data types and chunked processing can significantly reduce memory footprint.
# Efficient concatenation with categorical conversion
df_list = []
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    # Convert string columns to categorical
    for col in chunk.select_dtypes(include=['object']).columns:
        chunk[col] = chunk[col].astype('category')
    df_list.append(chunk)

result = pd.concat(df_list, ignore_index=True)
For extremely large datasets, consider using copy=False when appropriate:
# Avoid unnecessary copying
result = pd.concat([df1, df2], copy=False)
Note that `copy=False` doesn't guarantee that no copying occurs; it is a hint to avoid copies when possible.
Vertical vs Horizontal Concatenation
While row-wise concatenation (axis=0) is most common for appending, column-wise concatenation (axis=1) is useful for adding features.
features = pd.DataFrame({
    'id': [1, 2, 3],
    'feature1': [10, 20, 30]
})

labels = pd.DataFrame({
    'label': ['positive', 'negative', 'positive']
})
# Column-wise concatenation
combined = pd.concat([features, labels], axis=1)
print(combined)
Output:
id feature1 label
0 1 10 positive
1 2 20 negative
2 3 30 positive
For column concatenation with index alignment:
features = pd.DataFrame({
    'feature1': [10, 20, 30]
}, index=[1, 2, 3])

labels = pd.DataFrame({
    'label': ['positive', 'negative', 'positive', 'negative']
}, index=[1, 2, 3, 4])
# Outer join on index
combined = pd.concat([features, labels], axis=1, join='outer')
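A quick sketch of how the two join modes treat the mismatched indexes: the outer join keeps label 4 and fills `feature1` with NaN, while an inner join keeps only the shared labels 1 through 3.

```python
import pandas as pd

features = pd.DataFrame({'feature1': [10, 20, 30]}, index=[1, 2, 3])
labels = pd.DataFrame(
    {'label': ['positive', 'negative', 'positive', 'negative']},
    index=[1, 2, 3, 4])

# Outer join: index union, missing feature1 at label 4
outer = pd.concat([features, labels], axis=1, join='outer')
print(outer['feature1'].isna().sum())  # 1

# Inner join: index intersection, only labels 1-3 survive
inner = pd.concat([features, labels], axis=1, join='inner')
print(len(inner))  # 3
```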
Practical Use Case: Log Aggregation
A real-world scenario demonstrating efficient DataFrame appending for log file processing:
import glob

def process_log_files(pattern):
    """Aggregate multiple log files into a single DataFrame."""
    dataframes = []
    for filepath in glob.glob(pattern):
        # Read and parse log file
        df = pd.read_csv(
            filepath,
            parse_dates=['timestamp'],
            dtype={'user_id': 'category', 'action': 'category'}
        )
        # Add source file metadata
        df['source_file'] = filepath
        dataframes.append(df)

    # Single concatenation operation
    if not dataframes:
        return pd.DataFrame()

    combined = pd.concat(dataframes, ignore_index=True)
    # Sort by timestamp for chronological analysis
    combined.sort_values('timestamp', inplace=True)
    return combined

# Usage
logs = process_log_files('logs/*.csv')
daily_stats = logs.groupby(logs['timestamp'].dt.date).size()
This pattern efficiently processes multiple files, maintains data types, and provides a clean aggregated dataset for analysis.
Handling Duplicate Data
When appending DataFrames, duplicates may occur. Address this with the drop_duplicates() method post-concatenation:
result = pd.concat([df1, df2, df1], ignore_index=True)
result_clean = result.drop_duplicates(subset=['id'], keep='first')
For more control, pass `verify_integrity=True` to reject duplicate index labels at concatenation time, or flag duplicates explicitly with `DataFrame.duplicated()` so that data-quality decisions are made deliberately rather than silently.
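As a sketch of that explicit workflow, `duplicated(keep=False)` flags every row involved in a collision, so duplicates can be inspected before anything is dropped:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'score': [85, 92, 78]})
df2 = pd.DataFrame({'id': [3, 4], 'score': [78, 88]})

result = pd.concat([df1, df2], ignore_index=True)

# keep=False marks BOTH copies of any duplicated id for review
dupes = result[result.duplicated(subset=['id'], keep=False)]
print(len(dupes))  # 2

# After inspection, keep the first occurrence of each id
clean = result.drop_duplicates(subset=['id'], keep='first')
print(len(clean))  # 4
```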