Pandas - Drop Duplicate Rows
Key Insights
• The drop_duplicates() method removes duplicate rows based on all columns by default, but accepts parameters to target specific columns, choose which duplicate to keep, and control in-place modification
• Duplicates are identified by comparing values across specified columns using pandas’ internal hashing mechanism, with NaN values treated as equal to each other by default
• Performance degrades on large datasets when every column is checked; restricting comparison to a subset of key columns speeds detection up considerably, while inplace=True mainly saves a variable binding rather than peak memory
Basic Duplicate Removal
The drop_duplicates() method identifies and removes rows with identical values. By default, it compares all columns and keeps the first occurrence.
import pandas as pd
df = pd.DataFrame({
'product': ['A', 'B', 'A', 'C', 'B'],
'price': [100, 200, 100, 300, 200],
'quantity': [5, 3, 5, 2, 3]
})
# Remove duplicates across all columns
df_clean = df.drop_duplicates()
print(df_clean)
Output:
product price quantity
0 A 100 5
1 B 200 3
3 C 300 2
Rows at index 2 and 4 were removed because they matched earlier rows completely. The method returns a new DataFrame by default, leaving the original unchanged.
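Before dropping anything, the companion duplicated() method returns a boolean mask marking each row that repeats an earlier one, which is handy for inspecting exactly what drop_duplicates() would remove. A minimal sketch on the same frame:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B'],
    'price': [100, 200, 100, 300, 200],
    'quantity': [5, 3, 5, 2, 3]
})

# Boolean mask: True for each row that duplicates an earlier row
mask = df.duplicated()
print(mask.tolist())  # [False, False, True, False, True]

# Select the duplicate rows themselves for inspection
print(df[mask])
```

The mask uses the same comparison rules as drop_duplicates(), so `df[~mask]` and `df.drop_duplicates()` select the same rows.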
Targeting Specific Columns
Most real-world scenarios require checking duplicates based on specific columns rather than entire rows. The subset parameter accepts a column name or list of column names.
df = pd.DataFrame({
'user_id': [1, 2, 1, 3, 2],
'action': ['login', 'purchase', 'logout', 'login', 'login'],
'timestamp': ['10:00', '10:05', '10:10', '10:15', '10:20']
})
# Keep only first action per user
df_unique_users = df.drop_duplicates(subset='user_id')
print(df_unique_users)
Output:
user_id action timestamp
0 1 login 10:00
1 2 purchase 10:05
3 3 login 10:15
Multiple columns work similarly:
sales = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-01'],
'product': ['A', 'B', 'A', 'A'],
'store': ['NYC', 'NYC', 'NYC', 'LA'],
'revenue': [100, 200, 150, 100]
})
# Remove duplicates based on date and product combination
unique_sales = sales.drop_duplicates(subset=['date', 'product'])
print(unique_sales)
Output:
date product store revenue
0 2024-01-01 A NYC 100
1 2024-01-01 B NYC 200
2 2024-01-02 A NYC 150
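Note that the surviving rows keep their original index labels. If you want a clean 0..n-1 index without a separate reset_index() call, the ignore_index parameter (available in pandas 1.0 and later) handles it in one step:

```python
import pandas as pd

sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-01'],
    'product': ['A', 'B', 'A', 'A'],
    'store': ['NYC', 'NYC', 'NYC', 'LA'],
    'revenue': [100, 200, 150, 100]
})

# ignore_index=True renumbers the result 0..n-1 as part of the drop
unique_sales = sales.drop_duplicates(subset=['date', 'product'],
                                     ignore_index=True)
print(unique_sales.index.tolist())  # [0, 1, 2]
```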
Controlling Which Duplicate to Keep
The keep parameter determines which duplicate row to preserve: 'first' (default), 'last', or False to remove all duplicates.
logs = pd.DataFrame({
'session_id': [1, 2, 1, 3, 2],
'event': ['start', 'start', 'end', 'start', 'end'],
'time': [100, 105, 200, 110, 210]
})
# Keep last occurrence of each session
last_events = logs.drop_duplicates(subset='session_id', keep='last')
print(last_events)
Output:
session_id event time
2 1 end 200
3 3 start 110
4 2 end 210
Note that the result preserves the DataFrame's original row order; keep='last' changes which row of each group survives, not where it appears.
Setting keep=False removes all duplicates entirely:
data = pd.DataFrame({
'id': [1, 2, 2, 3, 3, 3],
'value': [10, 20, 30, 40, 50, 60]
})
# Remove all rows with duplicate IDs
unique_only = data.drop_duplicates(subset='id', keep=False)
print(unique_only)
Output:
id value
0 1 10
This is useful for identifying truly unique records when duplicates indicate data quality issues.
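To inspect those problem records rather than discard them, the same keep=False semantics are available on duplicated(), which marks every member of a duplicate group. A short sketch on the same data:

```python
import pandas as pd

data = pd.DataFrame({
    'id': [1, 2, 2, 3, 3, 3],
    'value': [10, 20, 30, 40, 50, 60]
})

# duplicated(keep=False) flags every row whose id repeats anywhere
dupes = data[data.duplicated(subset='id', keep=False)]
print(dupes['id'].tolist())  # [2, 2, 3, 3, 3]
```

This selection is the complement of drop_duplicates(subset='id', keep=False), so the two together partition the frame into clean and suspect rows.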
In-Place Modification
The inplace=True parameter modifies the original DataFrame instead of returning a new one. Note that this is largely a stylistic choice: pandas may still build the result internally before reassigning it, so it does not guarantee lower peak memory.
df = pd.DataFrame({
'customer': ['A', 'B', 'A', 'C'],
'order_id': [1, 2, 3, 4]
})
print(f"Before: {len(df)} rows")
df.drop_duplicates(subset='customer', inplace=True)
print(f"After: {len(df)} rows")
print(df)
Output:
Before: 4 rows
After: 3 rows
customer order_id
0 A 1
1 B 2
3 C 4
When using inplace=True, the method returns None, so chained operations require the standard (non-inplace) approach:
# This works
df_clean = df.drop_duplicates().reset_index(drop=True)
# This fails - inplace returns None
# df_clean = df.drop_duplicates(inplace=True).reset_index(drop=True)
Handling Missing Values
By default, NaN values are considered equal when identifying duplicates. This behavior differs from standard Python comparison where NaN != NaN.
import numpy as np
df = pd.DataFrame({
'category': ['A', 'B', np.nan, 'A', np.nan],
'value': [1, 2, 3, 1, 4]
})
df_clean = df.drop_duplicates(subset='category')
print(df_clean)
Output:
category value
0 A 1
1 B 2
2 NaN 3
Both NaN rows were considered duplicates, keeping only the first. drop_duplicates() offers no parameter to change this: NaN values are always treated as equal during duplicate detection, so treating them as distinct requires a manual workaround.
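One workaround, when every NaN row should survive, is to deduplicate only the rows with a non-missing key and then reattach the NaN rows. A minimal sketch, assuming NaN rows are to be kept verbatim:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'category': ['A', 'B', np.nan, 'A', np.nan],
    'value': [1, 2, 3, 1, 4]
})

# Split off the NaN rows, deduplicate the rest, then restore row order
has_nan = df['category'].isna()
deduped = df[~has_nan].drop_duplicates(subset='category')
result = pd.concat([deduped, df[has_nan]]).sort_index()
print(result)
```

Here both NaN rows survive (indices 2 and 4), while the duplicate 'A' at index 3 is still dropped.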
Performance Considerations
Duplicate detection performance depends on the number of columns checked and dataset size. Benchmarking shows significant differences:
import pandas as pd
import numpy as np
from time import time
# Create large dataset
n_rows = 1_000_000
df = pd.DataFrame({
'id': np.random.randint(0, 100000, n_rows),
'col1': np.random.randn(n_rows),
'col2': np.random.randn(n_rows),
'col3': np.random.randn(n_rows),
'col4': np.random.randn(n_rows)
})
# Check all columns
start = time()
df.drop_duplicates()
print(f"All columns: {time() - start:.3f}s")
# Check only ID column
start = time()
df.drop_duplicates(subset='id')
print(f"Subset: {time() - start:.3f}s")
# In-place modification (copy made before starting the timer,
# so only the drop itself is measured)
df_copy = df.copy()
start = time()
df_copy.drop_duplicates(subset='id', inplace=True)
print(f"In-place: {time() - start:.3f}s")
Illustrative results from one modern system (exact timings vary with hardware and data):
All columns: 0.847s
Subset: 0.123s
In-place: 0.118s
For production pipelines processing millions of rows, the effective lever is specifying exactly which columns to compare; as the timings above suggest, inplace=True is a matter of style more than speed.
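A cheap preliminary check with duplicated().sum() also tells you whether a pass is worth running at all; if the count is zero, the drop can be skipped. A sketch of that guard (the 50-value key range is an arbitrary choice for illustration):

```python
import pandas as pd
import numpy as np

# Synthetic data: 1,000 rows drawn from only 50 distinct ids
rng = np.random.default_rng(0)
df = pd.DataFrame({'id': rng.integers(0, 50, 1000)})

# Count duplicates on the subset before paying for the full drop
n_dupes = int(df.duplicated(subset='id').sum())
if n_dupes:
    df = df.drop_duplicates(subset='id')

print(n_dupes, len(df))
```

Counting and dropping share the same detection work, so this only pays off when duplicates are usually absent; in that common case the guard avoids allocating a new frame.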
Practical Application: Data Cleaning Pipeline
Combining duplicate removal with other operations creates robust data cleaning workflows:
# Raw transaction data with quality issues
transactions = pd.DataFrame({
'transaction_id': [1, 2, 2, 3, 4, 5, 5],
'customer_id': [101, 102, 102, 103, 104, 105, 105],
'amount': [50.0, 75.5, 75.5, 100.0, 25.0, 200.0, 200.0],
'status': ['complete', 'complete', 'complete', 'pending',
'complete', 'complete', 'complete']
})
# Cleaning pipeline
cleaned = (transactions
.drop_duplicates(subset=['transaction_id', 'customer_id'], keep='first')
.query('amount > 0')
.sort_values('transaction_id')
.reset_index(drop=True)
)
print(cleaned)
Output:
transaction_id customer_id amount status
0 1 101 50.0 complete
1 2 102 75.5 complete
2 3 103 100.0 pending
3 4 104 25.0 complete
4 5 105 200.0 complete
This pattern removes exact duplicates while preserving data integrity through method chaining, a common approach in ETL processes where data arrives from multiple sources with varying quality standards.
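A final guard at the end of such a pipeline turns silent data problems into loud failures. One simple check, shown here on the same transactions frame, uses Series.is_unique to assert the key column after cleaning:

```python
import pandas as pd

transactions = pd.DataFrame({
    'transaction_id': [1, 2, 2, 3, 4, 5, 5],
    'customer_id': [101, 102, 102, 103, 104, 105, 105],
    'amount': [50.0, 75.5, 75.5, 100.0, 25.0, 200.0, 200.0],
    'status': ['complete', 'complete', 'complete', 'pending',
               'complete', 'complete', 'complete']
})

cleaned = (transactions
    .drop_duplicates(subset=['transaction_id', 'customer_id'], keep='first')
    .reset_index(drop=True)
)

# Fail fast if the key column still contains repeats
if not cleaned['transaction_id'].is_unique:
    raise ValueError("duplicate transaction_ids survived cleaning")

print(len(cleaned))  # 5
```

In ETL settings this kind of assertion is often the difference between catching a bad upstream feed at load time and discovering it in a downstream report.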