Pandas - Compare Two DataFrames for Differences
Key Insights
- Use `DataFrame.equals()` for quick equality checks, but combine `compare()`, `merge()`, and boolean indexing to identify specific differences between DataFrames with matching or mismatched indices
- The `compare()` method (pandas 1.1.0+) provides the most intuitive way to spot cell-level changes between aligned DataFrames, showing only differing values side-by-side
- For production data validation pipelines, implement multi-layered comparison strategies that check schema alignment, index matching, and value differences separately to pinpoint exact discrepancies
Quick Equality Check with equals()
The simplest comparison uses DataFrame.equals() to determine if two DataFrames are identical:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.5, 20.0, 15.75],
'quantity': [100, 50, 75]
})
df2 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.5, 20.0, 15.75],
'quantity': [100, 50, 75]
})
df3 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.5, 22.0, 15.75], # Different price for B
'quantity': [100, 50, 75]
})
print(df1.equals(df2)) # True
print(df1.equals(df3)) # False
This method returns a boolean but doesn’t tell you what differs. It’s useful for assertions in tests or conditional logic, but inadequate for debugging or auditing.
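For test suites specifically, pandas also ships `pd.testing.assert_frame_equal`, which raises an `AssertionError` describing the first mismatch rather than returning a bare boolean. A minimal sketch:

```python
import pandas as pd

df_a = pd.DataFrame({'x': [1, 2], 'y': [0.1, 0.2]})
df_b = pd.DataFrame({'x': [1, 2], 'y': [0.1, 0.25]})

# Passes silently when the frames match
pd.testing.assert_frame_equal(df_a, df_a)

# Raises AssertionError naming the mismatching column and values
try:
    pd.testing.assert_frame_equal(df_a, df_b)
except AssertionError as e:
    print("Mismatch detected:", str(e).splitlines()[0])
```

Unlike `equals()`, the failure message points at the offending column, which makes test failures far easier to triage.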
Using compare() for Cell-Level Differences
The compare() method shows exactly which cells differ between two DataFrames with matching indices:
df_old = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': [10.5, 20.0, 15.75, 30.0],
'quantity': [100, 50, 75, 25]
})
df_new = pd.DataFrame({
'product': ['A', 'B', 'C', 'D'],
'price': [10.5, 22.0, 15.75, 30.0], # B price changed
'quantity': [100, 50, 80, 25] # C quantity changed
})
differences = df_old.compare(df_new)
print(differences)
Output:
price quantity
self other self other
1 20.0 22.0 NaN NaN
2 NaN NaN 75.0 80.0
The compare() method creates a multi-level column structure where “self” represents the calling DataFrame and “other” represents the argument. Only rows with differences appear in the result.
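Because the result carries a column MultiIndex (a self/other pair under each differing column), you can slice out one column's changes or flatten the columns for downstream use. A small self-contained sketch:

```python
import pandas as pd

old = pd.DataFrame({'price': [10.5, 20.0], 'qty': [100, 50]})
new = pd.DataFrame({'price': [10.5, 22.0], 'qty': [100, 50]})

diffs = old.compare(new)

# Slice one top-level column to get just its self/other pair
print(diffs['price'])

# Flatten the MultiIndex into single-level names like 'price_self'
diffs.columns = [f"{col}_{side}" for col, side in diffs.columns]
print(diffs)
```

Note that columns with no differences (`qty` here) are dropped from the result entirely unless you pass `keep_shape=True`.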
Customize the output with parameters:
# Keep the full shape (all rows and columns); equal values appear as NaN
differences = df_old.compare(df_new, keep_shape=True)
# Keep the full shape and show equal values instead of NaN
differences = df_old.compare(df_new, keep_equal=True, keep_shape=True)
# Change column names from 'self'/'other' (pandas 1.5.0+)
differences = df_old.compare(df_new, result_names=('old', 'new'))
print(differences)
Identifying Rows Present in One DataFrame But Not Another
When DataFrames have different rows entirely, use merge with an indicator:
df_baseline = pd.DataFrame({
'id': [1, 2, 3, 4],
'value': ['alpha', 'beta', 'gamma', 'delta']
})
df_current = pd.DataFrame({
'id': [1, 2, 4, 5], # Missing 3, added 5
'value': ['alpha', 'beta', 'delta', 'epsilon']
})
# Find rows in baseline but not in current
merge_result = df_baseline.merge(
df_current,
on='id',
how='outer',
indicator=True,
suffixes=('_baseline', '_current')
)
print(merge_result)
# Filter for rows only in left (baseline)
only_in_baseline = merge_result[merge_result['_merge'] == 'left_only']
print("\nOnly in baseline:")
print(only_in_baseline)
# Filter for rows only in right (current)
only_in_current = merge_result[merge_result['_merge'] == 'right_only']
print("\nOnly in current:")
print(only_in_current)
# Rows in both but potentially different values
in_both = merge_result[merge_result['_merge'] == 'both']
print("\nIn both:")
print(in_both)
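This filter-by-`_merge` pattern is easy to wrap in a helper that returns the added, removed, and shared key values directly. A sketch (the `diff_ids` name is ours, not a pandas API):

```python
import pandas as pd

def diff_ids(df_left, df_right, key):
    """Return the key values removed, added, and shared between two frames."""
    m = df_left.merge(df_right, on=key, how='outer', indicator=True)
    return {
        'removed': sorted(m.loc[m['_merge'] == 'left_only', key].tolist()),
        'added': sorted(m.loc[m['_merge'] == 'right_only', key].tolist()),
        'shared': sorted(m.loc[m['_merge'] == 'both', key].tolist()),
    }

df_baseline = pd.DataFrame({'id': [1, 2, 3, 4],
                            'value': ['alpha', 'beta', 'gamma', 'delta']})
df_current = pd.DataFrame({'id': [1, 2, 4, 5],
                           'value': ['alpha', 'beta', 'delta', 'epsilon']})
print(diff_ids(df_baseline, df_current, 'id'))
# {'removed': [3], 'added': [5], 'shared': [1, 2, 4]}
```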
Boolean Indexing for Custom Difference Detection
For precise control over what constitutes a “difference,” use boolean indexing:
df1 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.50, 20.00, 15.75],
'quantity': [100, 50, 75]
})
df2 = pd.DataFrame({
'product': ['A', 'B', 'C'],
'price': [10.52, 20.00, 15.80], # Small price differences
'quantity': [100, 50, 75]
})
# Find differences with a relative tolerance for floating-point noise.
# Note: the sample differences are small (about 0.2% and 0.3%), so a 1%
# tolerance would flag nothing; 0.1% catches them.
price_diff = ~np.isclose(df1['price'], df2['price'], rtol=0.001)
rows_with_price_diff = df1[price_diff]
print("Rows with price differences > 0.1%:")
print(rows_with_price_diff)
# Create a comparison summary
comparison = pd.DataFrame({
'product': df1['product'],
'price_old': df1['price'],
'price_new': df2['price'],
'price_changed': price_diff,
'price_delta': df2['price'] - df1['price']
})
print("\nComparison summary:")
print(comparison)
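The same `np.isclose` approach extends to every numeric column at once. A sketch, assuming both frames share the same shape and column order (the `numeric_diff_mask` helper name is ours):

```python
import numpy as np
import pandas as pd

def numeric_diff_mask(a, b, rtol=0.001):
    """Boolean mask marking numeric cells that differ beyond rtol."""
    num_cols = a.select_dtypes('number').columns
    # Non-numeric columns default to False (not compared here)
    mask = pd.DataFrame(False, index=a.index, columns=a.columns)
    mask[num_cols] = ~np.isclose(a[num_cols], b[num_cols], rtol=rtol)
    return mask

df_a = pd.DataFrame({'product': ['A', 'B', 'C'],
                     'price': [10.50, 20.00, 15.75],
                     'quantity': [100, 50, 75]})
df_b = pd.DataFrame({'product': ['A', 'B', 'C'],
                     'price': [10.52, 20.00, 15.80],
                     'quantity': [100, 50, 75]})

mask = numeric_diff_mask(df_a, df_b)
print(df_a[mask.any(axis=1)])  # rows with at least one numeric difference
```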
Comparing DataFrames with Different Indices
When indices don’t align, reset them or merge on common columns:
df1 = pd.DataFrame({
'sku': ['A001', 'B002', 'C003'],
'stock': [100, 50, 75]
}, index=[0, 1, 2])
df2 = pd.DataFrame({
'sku': ['A001', 'C003', 'D004'],
'stock': [95, 75, 60]
}, index=[10, 11, 12])
# Reset index and merge on 'sku'
merged = df1.reset_index(drop=True).merge(
df2.reset_index(drop=True),
on='sku',
how='outer',
suffixes=('_old', '_new'),
indicator=True
)
# Calculate differences where both exist
merged['stock_delta'] = merged['stock_new'] - merged['stock_old']
print(merged)
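When the key column uniquely identifies rows, another option is to promote it to the index and run `compare()` on the overlap. A self-contained sketch of this alternative:

```python
import pandas as pd

df_a = pd.DataFrame({'sku': ['A001', 'B002', 'C003'], 'stock': [100, 50, 75]})
df_b = pd.DataFrame({'sku': ['A001', 'C003', 'D004'], 'stock': [95, 75, 60]})

a = df_a.set_index('sku')
b = df_b.set_index('sku')

# compare() requires identical labels, so restrict both to the shared SKUs
common = a.index.intersection(b.index)
print(a.loc[common].compare(b.loc[common]))
```

This gives cell-level detail for shared keys; keys present in only one frame still need the merge-with-indicator approach above.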
Comprehensive Comparison Function
Here’s a reusable function that combines multiple comparison strategies:
def comprehensive_compare(df1, df2, key_columns=None, tolerance=None):
    """
    Compare two DataFrames and return a detailed difference report.

    Parameters
    ----------
    df1, df2 : DataFrame
        DataFrames to compare
    key_columns : list, optional
        Columns to use as merge keys. If None, compares on the index.
    tolerance : float, optional
        Relative tolerance for float comparisons
    """
    results = {}

    # Schema comparison
    results['columns_only_in_df1'] = list(set(df1.columns) - set(df2.columns))
    results['columns_only_in_df2'] = list(set(df2.columns) - set(df1.columns))
    results['common_columns'] = list(set(df1.columns) & set(df2.columns))

    if key_columns:
        # Merge-based comparison
        merged = df1.merge(
            df2,
            on=key_columns,
            how='outer',
            suffixes=('_df1', '_df2'),
            indicator=True
        )
        results['rows_only_in_df1'] = len(merged[merged['_merge'] == 'left_only'])
        results['rows_only_in_df2'] = len(merged[merged['_merge'] == 'right_only'])
        results['rows_in_both'] = len(merged[merged['_merge'] == 'both'])

        # Value differences, counted only for rows present in both frames
        # (otherwise NaN fill values on one side inflate the counts)
        in_both = merged[merged['_merge'] == 'both']
        value_diffs = []
        for col in results['common_columns']:
            if col not in key_columns:
                col_df1 = f"{col}_df1"
                col_df2 = f"{col}_df2"
                if col_df1 in merged.columns and col_df2 in merged.columns:
                    if tolerance and pd.api.types.is_numeric_dtype(in_both[col_df1]):
                        diff_mask = ~np.isclose(
                            in_both[col_df1],
                            in_both[col_df2],
                            rtol=tolerance,
                            equal_nan=True
                        )
                    else:
                        diff_mask = in_both[col_df1] != in_both[col_df2]
                    num_diffs = int(diff_mask.sum())
                    if num_diffs > 0:
                        value_diffs.append({
                            'column': col,
                            'num_differences': num_diffs
                        })
        results['value_differences'] = value_diffs
        results['detailed_merge'] = merged
    else:
        # Index-based comparison using compare()
        if df1.shape == df2.shape and df1.index.equals(df2.index):
            try:
                comparison = df1.compare(df2)
                results['cell_differences'] = len(comparison)
                results['detailed_comparison'] = comparison
            except Exception as e:
                results['error'] = str(e)
        else:
            results['error'] = "Shape or index mismatch"
    return results
# Example usage
df_old = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'score': [85.5, 92.0, 78.3]
})
df_new = pd.DataFrame({
'id': [1, 2, 4],
'name': ['Alice', 'Bob', 'David'],
'score': [85.5, 93.5, 88.0]
})
report = comprehensive_compare(df_old, df_new, key_columns=['id'], tolerance=0.01)
print(f"Rows only in old: {report['rows_only_in_df1']}")
print(f"Rows only in new: {report['rows_only_in_df2']}")
print(f"Value differences: {report['value_differences']}")
Performance Considerations for Large DataFrames
When comparing large DataFrames, optimize memory usage:
# For very large DataFrames, compare in chunks
# (assumes both frames have identical shape and index)
def chunked_compare(df1, df2, chunksize=10000):
    differences = []
    for start in range(0, len(df1), chunksize):
        end = min(start + chunksize, len(df1))
        chunk1 = df1.iloc[start:end]
        chunk2 = df2.iloc[start:end]
        chunk_diff = chunk1.compare(chunk2)
        if not chunk_diff.empty:
            differences.append(chunk_diff)
    return pd.concat(differences) if differences else pd.DataFrame()
# Use categorical dtypes for memory efficiency ('category_col' stands in
# for any low-cardinality string column in your own data)
df1['category_col'] = df1['category_col'].astype('category')
df2['category_col'] = df2['category_col'].astype('category')
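For a fast, fully vectorized difference count on identically shaped frames, a plain boolean mask avoids per-chunk overhead entirely. A sketch (the `count_cell_diffs` name is ours; NaN vs NaN is treated as equal, matching `equals()` semantics):

```python
import pandas as pd

def count_cell_diffs(a, b):
    """Count differing cells between two same-shaped, same-labeled frames."""
    # ne() marks NaN != NaN as True, so mask out positions where both are NaN
    neq = a.ne(b) & ~(a.isna() & b.isna())
    return int(neq.to_numpy().sum())

a = pd.DataFrame({'x': [1, 2, float('nan')]})
b = pd.DataFrame({'x': [1, 3, float('nan')]})
print(count_cell_diffs(a, b))  # → 1
```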
Choose the comparison method based on your specific needs: equals() for binary checks, compare() for aligned DataFrames, merge() for different row sets, and boolean indexing for custom tolerance logic. Combine these techniques for robust data validation pipelines.