Pandas - Append DataFrames
Appending DataFrames is a fundamental operation in data manipulation workflows. The primary method is `pd.concat()`, which concatenates pandas objects along a particular axis with optional set logic along the other axes.
Key Insights
- The `pd.concat()` function is the recommended method for appending DataFrames, offering flexibility for both row-wise and column-wise concatenation with control over index handling and duplicate management.
- The deprecated `DataFrame.append()` method should be avoided in favor of `pd.concat()` or list-accumulation patterns that prevent quadratic time complexity in loops.
- Performance optimization requires understanding when to use `ignore_index=True`, how to handle memory efficiently with large datasets, and when to preallocate space or use alternative approaches like list comprehension.
Understanding DataFrame Concatenation
The examples in this section build up from the simplest case: stacking two DataFrames row-wise with `pd.concat()`.
import pandas as pd
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [85, 92, 78]
})

df2 = pd.DataFrame({
    'id': [4, 5],
    'name': ['David', 'Eve'],
    'score': [88, 95]
})
# Basic concatenation
result = pd.concat([df1, df2])
print(result)
Output:
id name score
0 1 Alice 85
1 2 Bob 92
2 3 Charlie 78
0 4 David 88
1 5 Eve 95
Notice the index values repeat. This is often undesirable and can be addressed with the ignore_index parameter.
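To see concretely why the repeated labels matter, here is a small sketch: with duplicate labels, a label-based lookup returns every matching row instead of a single row, which silently changes the shape of downstream results.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]})
df2 = pd.DataFrame({'id': [4, 5]})

result = pd.concat([df1, df2])

# Label 0 now appears twice, so .loc[0] returns a
# two-row DataFrame rather than a single Series.
rows_at_zero = result.loc[0]
print(len(rows_at_zero))  # 2
```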
Controlling Index Behavior
When appending DataFrames, managing the resulting index is critical for subsequent operations. The ignore_index=True parameter creates a new sequential integer index.
result = pd.concat([df1, df2], ignore_index=True)
print(result)
Output:
id name score
0 1 Alice 85
1 2 Bob 92
2 3 Charlie 78
3 4 David 88
4 5 Eve 95
For maintaining the original index with verification:
result = pd.concat([df1, df2], verify_integrity=True)
This raises a ValueError if there are duplicate index values, ensuring data integrity when index preservation matters.
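A short sketch of both outcomes: overlapping labels trigger the ValueError, while disjoint indexes pass the check unchanged.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'id': [4, 5]}, index=[0, 1])  # overlaps df1's index

raised = False
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError:
    raised = True  # labels 0 and 1 appear in both inputs

# Non-overlapping indexes pass the integrity check
df3 = pd.DataFrame({'id': [4, 5]}, index=[3, 4])
ok = pd.concat([df1, df3], verify_integrity=True)
print(raised, len(ok))  # True 5
```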
Handling Column Mismatches
Real-world scenarios often involve DataFrames with different column sets. The join parameter controls how to handle this situation.
df3 = pd.DataFrame({
    'id': [6, 7],
    'name': ['Frank', 'Grace'],
    'department': ['Engineering', 'Sales']
})
# Outer join (default) - keeps all columns
result_outer = pd.concat([df1, df3], ignore_index=True)
print(result_outer)
Output:
id name score department
0 1 Alice 85.0 NaN
1 2 Bob 92.0 NaN
2 3 Charlie 78.0 NaN
3 6 Frank NaN Engineering
4 7 Grace NaN Sales
For inner join, keeping only common columns:
result_inner = pd.concat([df1, df3], join='inner', ignore_index=True)
print(result_inner)
Output:
id name
0 1 Alice
1 2 Bob
2 3 Charlie
3 6 Frank
4 7 Grace
Performance Considerations in Loops
A common anti-pattern involves repeatedly appending DataFrames in loops, which creates quadratic time complexity due to repeated memory allocation.
Anti-pattern:
# Avoid this - O(n²) complexity
result = pd.DataFrame()
for i in range(1000):
    temp_df = pd.DataFrame({'value': [i]})
    result = pd.concat([result, temp_df], ignore_index=True)
Optimized approach:
# Collect DataFrames in a list first
dataframes = []
for i in range(1000):
    temp_df = pd.DataFrame({'value': [i]})
    dataframes.append(temp_df)

# Single concatenation operation
result = pd.concat(dataframes, ignore_index=True)
For simple data generation, direct construction is even more efficient:
# Most efficient for simple cases
result = pd.DataFrame({'value': range(1000)})
Appending with MultiIndex
When working with hierarchical indexes, pd.concat() provides the keys parameter to create a MultiIndex.
sales_q1 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [10000, 15000, 12000]
})

sales_q2 = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'revenue': [11000, 16000, 13000]
})
result = pd.concat(
    [sales_q1, sales_q2],
    keys=['Q1', 'Q2'],
    names=['Quarter', 'Index']
)
print(result)
Output:
product revenue
Quarter Index
Q1 0 A 10000
1 B 15000
2 C 12000
Q2 0 A 11000
1 B 16000
2 C 13000
This structure enables powerful grouping and filtering operations:
# Access Q1 data
q1_data = result.loc['Q1']
# Calculate quarterly totals
quarterly_totals = result.groupby('Quarter')['revenue'].sum()
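As a sketch of what the keyed index enables, `DataFrame.xs` can select one quarter without manual tuple indexing, and grouping by the named level gives per-quarter aggregates:

```python
import pandas as pd

sales_q1 = pd.DataFrame({'product': ['A', 'B', 'C'],
                         'revenue': [10000, 15000, 12000]})
sales_q2 = pd.DataFrame({'product': ['A', 'B', 'C'],
                         'revenue': [11000, 16000, 13000]})

result = pd.concat([sales_q1, sales_q2],
                   keys=['Q1', 'Q2'], names=['Quarter', 'Index'])

# Cross-section: all Q2 rows, with the Quarter level dropped
q2 = result.xs('Q2', level='Quarter')

# Per-quarter revenue totals via the named index level
totals = result.groupby(level='Quarter')['revenue'].sum()
print(totals['Q1'], totals['Q2'])  # 37000 40000
```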
Memory-Efficient Concatenation
For large datasets, memory efficiency becomes paramount. Using categorical data types and chunked processing can significantly reduce memory footprint.
# Efficient concatenation with categorical conversion
df_list = []
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    # Convert string columns to categorical
    for col in chunk.select_dtypes(include=['object']).columns:
        chunk[col] = chunk[col].astype('category')
    df_list.append(chunk)

result = pd.concat(df_list, ignore_index=True)
For extremely large datasets, consider using copy=False when appropriate:
# Avoid unnecessary copying
result = pd.concat([df1, df2], copy=False)
Note that `copy=False` doesn't guarantee that no copying occurs; it is a hint to avoid copies when possible.
Vertical vs Horizontal Concatenation
While row-wise concatenation (axis=0) is most common for appending, column-wise concatenation (axis=1) is useful for adding features.
features = pd.DataFrame({
    'id': [1, 2, 3],
    'feature1': [10, 20, 30]
})

labels = pd.DataFrame({
    'label': ['positive', 'negative', 'positive']
})
# Column-wise concatenation
combined = pd.concat([features, labels], axis=1)
print(combined)
Output:
id feature1 label
0 1 10 positive
1 2 20 negative
2 3 30 positive
For column concatenation with index alignment:
features = pd.DataFrame({
    'feature1': [10, 20, 30]
}, index=[1, 2, 3])

labels = pd.DataFrame({
    'label': ['positive', 'negative', 'positive', 'negative']
}, index=[1, 2, 3, 4])
# Outer join on index
combined = pd.concat([features, labels], axis=1, join='outer')
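A quick sketch of how the two join modes treat the mismatched indexes: the outer join keeps label 4 and fills `feature1` with NaN, while an inner join keeps only the shared labels 1 through 3.

```python
import pandas as pd

features = pd.DataFrame({'feature1': [10, 20, 30]}, index=[1, 2, 3])
labels = pd.DataFrame(
    {'label': ['positive', 'negative', 'positive', 'negative']},
    index=[1, 2, 3, 4])

# Outer join: index union, missing feature1 at label 4
outer = pd.concat([features, labels], axis=1, join='outer')
print(outer['feature1'].isna().sum())  # 1

# Inner join: index intersection, only labels 1-3 survive
inner = pd.concat([features, labels], axis=1, join='inner')
print(len(inner))  # 3
```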
Practical Use Case: Log Aggregation
A real-world scenario demonstrating efficient DataFrame appending for log file processing:
import glob

def process_log_files(pattern):
    """Aggregate multiple log files into a single DataFrame."""
    dataframes = []
    for filepath in glob.glob(pattern):
        # Read and parse log file
        df = pd.read_csv(
            filepath,
            parse_dates=['timestamp'],
            dtype={'user_id': 'category', 'action': 'category'}
        )
        # Add source file metadata
        df['source_file'] = filepath
        dataframes.append(df)

    # Single concatenation operation
    if not dataframes:
        return pd.DataFrame()

    combined = pd.concat(dataframes, ignore_index=True)
    # Sort by timestamp for chronological analysis
    combined.sort_values('timestamp', inplace=True)
    return combined

# Usage
logs = process_log_files('logs/*.csv')
daily_stats = logs.groupby(logs['timestamp'].dt.date).size()
This pattern efficiently processes multiple files, maintains data types, and provides a clean aggregated dataset for analysis.
Handling Duplicate Data
When appending DataFrames, duplicates may occur. Address this with the drop_duplicates() method post-concatenation:
result = pd.concat([df1, df2, df1], ignore_index=True)
result_clean = result.drop_duplicates(subset=['id'], keep='first')
For more control, pass `verify_integrity=True` to reject duplicate index labels at concatenation time, or flag duplicates explicitly with `DataFrame.duplicated()` so that data-quality decisions are made deliberately rather than silently.
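As a sketch of that explicit workflow, `duplicated(keep=False)` flags every row involved in a collision, so duplicates can be inspected before anything is dropped:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'score': [85, 92, 78]})
df2 = pd.DataFrame({'id': [3, 4], 'score': [78, 88]})

result = pd.concat([df1, df2], ignore_index=True)

# keep=False marks BOTH copies of any duplicated id for review
dupes = result[result.duplicated(subset=['id'], keep=False)]
print(len(dupes))  # 2

# After inspection, keep the first occurrence of each id
clean = result.drop_duplicates(subset=['id'], keep='first')
print(len(clean))  # 4
```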