Pandas - Compare Two DataFrames for Differences
Key Insights
- Use `DataFrame.equals()` for quick equality checks, but combine `compare()`, `merge()`, and boolean indexing to identify specific differences between DataFrames with matching or mismatched indices
- The `compare()` method (pandas 1.1.0+) provides the most intuitive way to spot cell-level changes between aligned DataFrames, showing only differing values side-by-side
- For production data validation pipelines, implement multi-layered comparison strategies that check schema alignment, index matching, and value differences separately to pinpoint exact discrepancies
Quick Equality Check with equals()
The simplest comparison uses DataFrame.equals() to determine if two DataFrames are identical:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.5, 20.0, 15.75],
'quantity': [100, 50, 75]
})
df2 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.5, 20.0, 15.75],
'quantity': [100, 50, 75]
})
df3 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.5, 22.0, 15.75], # Different price for B
'quantity': [100, 50, 75]
})
print(df1.equals(df2)) # True
print(df1.equals(df3)) # False
This method returns a boolean but doesn’t tell you what differs. It’s useful for assertions in tests or conditional logic, but inadequate for debugging or auditing.
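For test suites specifically, pandas also ships `pd.testing.assert_frame_equal`, which raises an `AssertionError` describing the first mismatch rather than returning a bare boolean. A minimal sketch:

```python
import pandas as pd

df_a = pd.DataFrame({'x': [1, 2], 'y': [0.1, 0.2]})
df_b = pd.DataFrame({'x': [1, 2], 'y': [0.1, 0.25]})

# Passes silently when the frames match
pd.testing.assert_frame_equal(df_a, df_a)

# Raises AssertionError naming the mismatching column and values
try:
    pd.testing.assert_frame_equal(df_a, df_b)
except AssertionError as e:
    print("Mismatch detected:", str(e).splitlines()[0])
```

Unlike `equals()`, the failure message points at the offending column, which makes test failures far easier to triage.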
Using compare() for Cell-Level Differences
The compare() method shows exactly which cells differ between two DataFrames with matching indices:
df_old = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': [10.5, 20.0, 15.75, 30.0],
'quantity': [100, 50, 75, 25]
})
df_new = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': [10.5, 22.0, 15.75, 30.0], # B price changed
'quantity': [100, 50, 80, 25] # C quantity changed
})
differences = df_old.compare(df_new)
print(differences)
Output:
price quantity
self other self other
1 20.0 22.0 NaN NaN
2 NaN NaN 75.0 80.0
The compare() method creates a multi-level column structure where “self” represents the calling DataFrame and “other” represents the argument. Only rows with differences appear in the result.
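Because the result carries a column MultiIndex (a self/other pair under each differing column), you can slice out one column's changes or flatten the columns for downstream use. A small self-contained sketch:

```python
import pandas as pd

old = pd.DataFrame({'price': [10.5, 20.0], 'qty': [100, 50]})
new = pd.DataFrame({'price': [10.5, 22.0], 'qty': [100, 50]})

diffs = old.compare(new)

# Slice one top-level column to get just its self/other pair
print(diffs['price'])

# Flatten the MultiIndex into single-level names like 'price_self'
diffs.columns = [f"{col}_{side}" for col, side in diffs.columns]
print(diffs)
```

Note that columns with no differences (`qty` here) are dropped from the result entirely unless you pass `keep_shape=True`.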
Customize the output with parameters:
# Keep the full shape (all rows and columns); equal values appear as NaN
differences = df_old.compare(df_new, keep_shape=True)
# Keep the full shape and show equal values instead of NaN
differences = df_old.compare(df_new, keep_equal=True, keep_shape=True)
# Change column names from 'self'/'other' (pandas 1.5.0+)
differences = df_old.compare(df_new, result_names=('old', 'new'))
print(differences)
Identifying Rows Present in One DataFrame But Not Another
When DataFrames have different rows entirely, use merge with an indicator:
df_baseline = pd.DataFrame({
'id': [1, 2, 3, 4],
'value': ['alpha', 'beta', 'gamma', 'delta']
})
df_current = pd.DataFrame({
'id': [1, 2, 4, 5], # Missing 3, added 5
'value': ['alpha', 'beta', 'delta', 'epsilon']
})
# Find rows in baseline but not in current
merge_result = df_baseline.merge(
df_current,
on='id',
how='outer',
indicator=True,
suffixes=('_baseline', '_current')
)
print(merge_result)
# Filter for rows only in left (baseline)
only_in_baseline = merge_result[merge_result['_merge'] == 'left_only']
print("\nOnly in baseline:")
print(only_in_baseline)
# Filter for rows only in right (current)
only_in_current = merge_result[merge_result['_merge'] == 'right_only']
print("\nOnly in current:")
print(only_in_current)
# Rows in both but potentially different values
in_both = merge_result[merge_result['_merge'] == 'both']
print("\nIn both:")
print(in_both)
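This filter-by-`_merge` pattern is easy to wrap in a helper that returns the added, removed, and shared key values directly. A sketch (the `diff_ids` name is ours, not a pandas API):

```python
import pandas as pd

def diff_ids(df_left, df_right, key):
    """Return the key values removed, added, and shared between two frames."""
    m = df_left.merge(df_right, on=key, how='outer', indicator=True)
    return {
        'removed': sorted(m.loc[m['_merge'] == 'left_only', key].tolist()),
        'added': sorted(m.loc[m['_merge'] == 'right_only', key].tolist()),
        'shared': sorted(m.loc[m['_merge'] == 'both', key].tolist()),
    }

df_baseline = pd.DataFrame({'id': [1, 2, 3, 4],
                            'value': ['alpha', 'beta', 'gamma', 'delta']})
df_current = pd.DataFrame({'id': [1, 2, 4, 5],
                           'value': ['alpha', 'beta', 'delta', 'epsilon']})
print(diff_ids(df_baseline, df_current, 'id'))
# {'removed': [3], 'added': [5], 'shared': [1, 2, 4]}
```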
Boolean Indexing for Custom Difference Detection
For precise control over what constitutes a “difference,” use boolean indexing:
df1 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.50, 20.00, 15.75],
'quantity': [100, 50, 75]
})
df2 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.52, 20.00, 15.80], # Small price differences
'quantity': [100, 50, 75]
})
# Find differences with a relative tolerance for floating-point noise.
# Note: the sample differences are small (about 0.2% and 0.3%), so a 1%
# tolerance would flag nothing; 0.1% catches them.
price_diff = ~np.isclose(df1['price'], df2['price'], rtol=0.001)
rows_with_price_diff = df1[price_diff]
print("Rows with price differences > 0.1%:")
print(rows_with_price_diff)
# Create a comparison summary
comparison = pd.DataFrame({
'product': df1['product'],
'price_old': df1['price'],
'price_new': df2['price'],
'price_changed': price_diff,
'price_delta': df2['price'] - df1['price']
})
print("\nComparison summary:")
print(comparison)
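The same `np.isclose` approach extends to every numeric column at once. A sketch, assuming both frames share the same shape and column order (the `numeric_diff_mask` helper name is ours):

```python
import numpy as np
import pandas as pd

def numeric_diff_mask(a, b, rtol=0.001):
    """Boolean mask marking numeric cells that differ beyond rtol."""
    num_cols = a.select_dtypes('number').columns
    # Non-numeric columns default to False (not compared here)
    mask = pd.DataFrame(False, index=a.index, columns=a.columns)
    mask[num_cols] = ~np.isclose(a[num_cols], b[num_cols], rtol=rtol)
    return mask

df_a = pd.DataFrame({'product': ['A', 'B', 'C'],
                     'price': [10.50, 20.00, 15.75],
                     'quantity': [100, 50, 75]})
df_b = pd.DataFrame({'product': ['A', 'B', 'C'],
                     'price': [10.52, 20.00, 15.80],
                     'quantity': [100, 50, 75]})

mask = numeric_diff_mask(df_a, df_b)
print(df_a[mask.any(axis=1)])  # rows with at least one numeric difference
```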
Comparing DataFrames with Different Indices
When indices don’t align, reset them or merge on common columns:
df1 = pd.DataFrame({
'sku': ['A001', 'B002', 'C003'],
'stock': [100, 50, 75]
}, index=[0, 1, 2])
df2 = pd.DataFrame({
'sku': ['A001', 'C003', 'D004'],
'stock': [95, 75, 60]
}, index=[10, 11, 12])
# Reset index and merge on 'sku'
merged = df1.reset_index(drop=True).merge(
df2.reset_index(drop=True),
on='sku',
how='outer',
suffixes=('_old', '_new'),
indicator=True
)
# Calculate differences where both exist
merged['stock_delta'] = merged['stock_new'] - merged['stock_old']
print(merged)
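When the key column uniquely identifies rows, another option is to promote it to the index and run `compare()` on the overlap. A self-contained sketch of this alternative:

```python
import pandas as pd

df_a = pd.DataFrame({'sku': ['A001', 'B002', 'C003'], 'stock': [100, 50, 75]})
df_b = pd.DataFrame({'sku': ['A001', 'C003', 'D004'], 'stock': [95, 75, 60]})

a = df_a.set_index('sku')
b = df_b.set_index('sku')

# compare() requires identical labels, so restrict both to the shared SKUs
common = a.index.intersection(b.index)
print(a.loc[common].compare(b.loc[common]))
```

This gives cell-level detail for shared keys; keys present in only one frame still need the merge-with-indicator approach above.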
Comprehensive Comparison Function
Here’s a reusable function that combines multiple comparison strategies:
def comprehensive_compare(df1, df2, key_columns=None, tolerance=None):
    """
    Compare two DataFrames and return a detailed difference report.

    Parameters
    ----------
    df1, df2 : DataFrame
        DataFrames to compare
    key_columns : list, optional
        Columns to use as merge keys. If None, compares on the index.
    tolerance : float, optional
        Relative tolerance for float comparisons
    """
    results = {}

    # Schema comparison
    results['columns_only_in_df1'] = list(set(df1.columns) - set(df2.columns))
    results['columns_only_in_df2'] = list(set(df2.columns) - set(df1.columns))
    results['common_columns'] = list(set(df1.columns) & set(df2.columns))

    if key_columns:
        # Merge-based comparison
        merged = df1.merge(
            df2,
            on=key_columns,
            how='outer',
            suffixes=('_df1', '_df2'),
            indicator=True
        )
        results['rows_only_in_df1'] = len(merged[merged['_merge'] == 'left_only'])
        results['rows_only_in_df2'] = len(merged[merged['_merge'] == 'right_only'])
        results['rows_in_both'] = len(merged[merged['_merge'] == 'both'])

        # Value differences, counted only for rows present in both frames
        # (otherwise NaN fill values on one side inflate the counts)
        in_both = merged[merged['_merge'] == 'both']
        value_diffs = []
        for col in results['common_columns']:
            if col not in key_columns:
                col_df1 = f"{col}_df1"
                col_df2 = f"{col}_df2"
                if col_df1 in merged.columns and col_df2 in merged.columns:
                    if tolerance and pd.api.types.is_numeric_dtype(in_both[col_df1]):
                        diff_mask = ~np.isclose(
                            in_both[col_df1],
                            in_both[col_df2],
                            rtol=tolerance,
                            equal_nan=True
                        )
                    else:
                        diff_mask = in_both[col_df1] != in_both[col_df2]
                    num_diffs = int(diff_mask.sum())
                    if num_diffs > 0:
                        value_diffs.append({
                            'column': col,
                            'num_differences': num_diffs
                        })
        results['value_differences'] = value_diffs
        results['detailed_merge'] = merged
    else:
        # Index-based comparison using compare()
        if df1.shape == df2.shape and df1.index.equals(df2.index):
            try:
                comparison = df1.compare(df2)
                results['cell_differences'] = len(comparison)
                results['detailed_comparison'] = comparison
            except Exception as e:
                results['error'] = str(e)
        else:
            results['error'] = "Shape or index mismatch"
    return results
# Example usage
df_old = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'score': [85.5, 92.0, 78.3]
})
df_new = pd.DataFrame({
'id': [1, 2, 4],
'name': ['Alice', 'Bob', 'David'],
'score': [85.5, 93.5, 88.0]
})
report = comprehensive_compare(df_old, df_new, key_columns=['id'], tolerance=0.01)
print(f"Rows only in old: {report['rows_only_in_df1']}")
print(f"Rows only in new: {report['rows_only_in_df2']}")
print(f"Value differences: {report['value_differences']}")
Performance Considerations for Large DataFrames
When comparing large DataFrames, optimize memory usage:
# For very large DataFrames, compare in chunks
# (assumes both frames have identical shape and index)
def chunked_compare(df1, df2, chunksize=10000):
    differences = []
    for start in range(0, len(df1), chunksize):
        end = min(start + chunksize, len(df1))
        chunk1 = df1.iloc[start:end]
        chunk2 = df2.iloc[start:end]
        chunk_diff = chunk1.compare(chunk2)
        if not chunk_diff.empty:
            differences.append(chunk_diff)
    return pd.concat(differences) if differences else pd.DataFrame()
# Use categorical dtypes for memory efficiency ('category_col' stands in
# for any low-cardinality string column in your own data)
df1['category_col'] = df1['category_col'].astype('category')
df2['category_col'] = df2['category_col'].astype('category')
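For a fast, fully vectorized difference count on identically shaped frames, a plain boolean mask avoids per-chunk overhead entirely. A sketch (the `count_cell_diffs` name is ours; NaN vs NaN is treated as equal, matching `equals()` semantics):

```python
import pandas as pd

def count_cell_diffs(a, b):
    """Count differing cells between two same-shaped, same-labeled frames."""
    # ne() marks NaN != NaN as True, so mask out positions where both are NaN
    neq = a.ne(b) & ~(a.isna() & b.isna())
    return int(neq.to_numpy().sum())

a = pd.DataFrame({'x': [1, 2, float('nan')]})
b = pd.DataFrame({'x': [1, 3, float('nan')]})
print(count_cell_diffs(a, b))  # → 1
```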
Choose the comparison method based on your specific needs: equals() for binary checks, compare() for aligned DataFrames, merge() for different row sets, and boolean indexing for custom tolerance logic. Combine these techniques for robust data validation pipelines.