How to Drop Duplicates Based on Specific Columns in Pandas
Key Insights
- The `subset` parameter in `drop_duplicates()` lets you identify duplicates based on specific columns while preserving other column data, which is essential for real-world data cleaning where exact row matches are rare.
- Combining `sort_values()` with `drop_duplicates(keep='last')` gives you precise control over which record survives deduplication, which is critical when you need to keep the most recent or highest-priority entry.
- NaN values are treated as equal during duplicate detection, meaning two rows with NaN in the same subset column will be considered duplicates, which can lead to unexpected data loss if you're not careful.
Introduction
Duplicate data silently corrupts analysis. You calculate average order values, but some customers appear three times. You count unique users, but the same email shows up with different capitalization. You join tables and suddenly have phantom records multiplying your row count.
Pandas provides drop_duplicates() to handle this, but the default behavior—checking all columns for exact matches—rarely matches real-world needs. Most duplicates share some identifying information while differing in other columns. A customer might have the same email but different phone numbers across records. A transaction might have the same order ID but different timestamps.
The subset parameter solves this by letting you define which columns determine uniqueness. Here’s a dataset we’ll use throughout this article:
import pandas as pd
df = pd.DataFrame({
'customer_id': [101, 102, 101, 103, 102, 104],
'email': ['alice@example.com', 'bob@example.com', 'alice@example.com',
'carol@example.com', 'bob@example.com', 'dave@example.com'],
'order_amount': [150.00, 200.00, 175.00, 300.00, 225.00, 100.00],
'order_date': ['2024-01-15', '2024-01-16', '2024-01-20',
'2024-01-18', '2024-01-22', '2024-01-19']
})
print(df)
Output:
customer_id email order_amount order_date
0 101 alice@example.com 150.00 2024-01-15
1 102 bob@example.com 200.00 2024-01-16
2 101 alice@example.com 175.00 2024-01-20
3 103 carol@example.com 300.00 2024-01-18
4 102 bob@example.com 225.00 2024-01-22
5 104 dave@example.com 100.00 2024-01-19
Notice that customers 101 and 102 appear twice with different order amounts and dates. Without subset, drop_duplicates() would keep all six rows since no two rows are identical across all columns.
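You can confirm this directly: with no `subset` argument, the default call checks every column and leaves this frame untouched.

```python
import pandas as pd

# Same DataFrame as above
df = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 102, 104],
    'email': ['alice@example.com', 'bob@example.com', 'alice@example.com',
              'carol@example.com', 'bob@example.com', 'dave@example.com'],
    'order_amount': [150.00, 200.00, 175.00, 300.00, 225.00, 100.00],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-20',
                   '2024-01-18', '2024-01-22', '2024-01-19']
})

# Every row differs in at least one column, so nothing is dropped
print(len(df.drop_duplicates()))  # 6
```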
Basic Syntax: Using the subset Parameter
The subset parameter accepts a column label or a list of column labels. Only these columns are considered when identifying duplicates.
Single column deduplication:
# Keep one row per customer_id
unique_customers = df.drop_duplicates(subset=['customer_id'])
print(unique_customers)
Output:
customer_id email order_amount order_date
0 101 alice@example.com 150.00 2024-01-15
1 102 bob@example.com 200.00 2024-01-16
3 103 carol@example.com 300.00 2024-01-18
5 104 dave@example.com 100.00 2024-01-19
We now have exactly four rows—one per unique customer. The method kept the first occurrence of each customer_id and discarded later ones.
Multiple column deduplication:
Sometimes uniqueness depends on a combination of columns. Consider a scenario where the same customer could legitimately have multiple emails, but the same customer_id and email combination indicates a true duplicate:
# Keep one row per customer_id + email combination
unique_by_combo = df.drop_duplicates(subset=['customer_id', 'email'])
print(unique_by_combo)
This produces the same result in our example because customer_id and email always match together. But in messier real-world data, you’ll often need multiple columns to properly define uniqueness.
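To see the difference, here is a hypothetical messier dataset (not part of the example above) where one customer legitimately uses two email addresses:

```python
import pandas as pd

# Hypothetical messier data: customer 101 appears under two emails
messy = pd.DataFrame({
    'customer_id': [101, 101, 101, 102],
    'email': ['alice@example.com', 'alice@work.com', 'alice@example.com',
              'bob@example.com'],
    'order_amount': [150.00, 175.00, 160.00, 200.00]
})

# Deduplicating on customer_id alone collapses legitimate records
print(len(messy.drop_duplicates(subset=['customer_id'])))           # 2
# The two-column subset keeps the distinct (101, alice@work.com) row
print(len(messy.drop_duplicates(subset=['customer_id', 'email'])))  # 3
```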
You can also pass a single string instead of a list for single-column deduplication:
# These are equivalent
df.drop_duplicates(subset='customer_id')
df.drop_duplicates(subset=['customer_id'])
I recommend always using a list for consistency and readability, especially since you’ll often need to add columns later.
Controlling Which Duplicate to Keep
The keep parameter determines which duplicate row survives. It accepts three values:
- `'first'` (default): Keep the first occurrence, drop subsequent duplicates
- `'last'`: Keep the last occurrence, drop earlier duplicates
- `False`: Drop all duplicates, keeping none
# Sample data with clear ordering
df_ordered = pd.DataFrame({
'id': [1, 2, 1, 2, 3],
'value': ['A', 'B', 'C', 'D', 'E'],
'timestamp': [1, 2, 3, 4, 5]
})
print("Original:")
print(df_ordered)
print("\nkeep='first':")
print(df_ordered.drop_duplicates(subset=['id'], keep='first'))
print("\nkeep='last':")
print(df_ordered.drop_duplicates(subset=['id'], keep='last'))
print("\nkeep=False:")
print(df_ordered.drop_duplicates(subset=['id'], keep=False))
Output:
Original:
id value timestamp
0 1 A 1
1 2 B 2
2 1 C 3
3 2 D 4
4 3 E 5
keep='first':
id value timestamp
0 1 A 1
1 2 B 2
4 3 E 5
keep='last':
id value timestamp
2 1 C 3
3 2 D 4
4 3 E 5
keep=False:
id value timestamp
4 3 E 5
Use keep='first' when earlier records are more authoritative (original registrations, first submissions). Use keep='last' when later records should override earlier ones (most recent updates, latest status). Use keep=False when you want to identify and remove all records that have any duplicates—useful for finding problematic data that needs manual review.
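For that manual-review workflow, the related `duplicated()` method is worth knowing: it takes the same `subset` and `keep` arguments but returns a boolean mask, so you can inspect the offending rows before deleting anything. A short sketch using the same `df_ordered` frame:

```python
import pandas as pd

df_ordered = pd.DataFrame({
    'id': [1, 2, 1, 2, 3],
    'value': ['A', 'B', 'C', 'D', 'E'],
    'timestamp': [1, 2, 3, 4, 5]
})

# keep=False marks every member of a duplicate group, not just the extras
needs_review = df_ordered[df_ordered.duplicated(subset=['id'], keep=False)]
print(needs_review)  # ids 1 and 2 each appear twice: four rows to review
```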
In-Place Modification vs. Returning a New DataFrame
By default, drop_duplicates() returns a new DataFrame, leaving the original unchanged:
# Assignment approach (recommended)
df_clean = df.drop_duplicates(subset=['customer_id'])
print(f"Original rows: {len(df)}")
print(f"Cleaned rows: {len(df_clean)}")
Output:
Original rows: 6
Cleaned rows: 4
The inplace=True parameter modifies the DataFrame directly:
# In-place modification
df_copy = df.copy()
df_copy.drop_duplicates(subset=['customer_id'], inplace=True)
print(f"Modified rows: {len(df_copy)}")
My recommendation: avoid inplace=True. It’s being gradually deprecated across pandas, makes code harder to chain, and provides no meaningful performance benefit. The assignment pattern is clearer and more flexible:
# Chain operations cleanly
df_processed = (df
.drop_duplicates(subset=['customer_id'])
.reset_index(drop=True)
.assign(processed=True)
)
For very large datasets where memory is genuinely constrained, consider processing in chunks rather than relying on inplace modifications.
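One way to sketch that chunked approach, assuming a single key column, `keep='first'` semantics, and a CSV file whose path you supply (the function name and path are placeholders, not a pandas API):

```python
import pandas as pd

def dedup_csv_by_key(path, key, chunksize=100_000):
    """Keep the first row per key while reading a large CSV in chunks.

    Only the set of seen keys stays in memory, not the whole file.
    Sketch only: assumes one key column and keep='first' behavior.
    """
    seen = set()
    pieces = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        chunk = chunk.drop_duplicates(subset=[key])   # dedup within the chunk
        chunk = chunk[~chunk[key].isin(seen)]         # dedup across chunks
        seen.update(chunk[key].tolist())
        pieces.append(chunk)
    return pd.concat(pieces, ignore_index=True)
```

The cross-chunk filter matters because duplicates can span chunk boundaries; deduplicating each chunk in isolation would miss them.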
Real-World Example: Cleaning Customer Data
Here’s a practical scenario: you have customer records from multiple data sources, and you need to keep only the most recent record for each email address.
# Simulated customer data with duplicates
customers = pd.DataFrame({
'email': ['alice@example.com', 'bob@example.com', 'alice@example.com',
'carol@example.com', 'bob@example.com', 'alice@example.com'],
'name': ['Alice Smith', 'Bob Jones', 'Alice S.',
'Carol White', 'Robert Jones', 'Alice Smith-Johnson'],
'phone': ['555-0101', '555-0102', '555-0103',
'555-0104', '555-0105', '555-0106'],
'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-06-10',
'2023-03-05', '2023-08-15', '2023-12-01']),
'source': ['web', 'mobile', 'web', 'referral', 'mobile', 'web']
})
print("Original customer data:")
print(customers)
To keep the most recent record per email:
# Sort by signup_date, then keep the last (most recent) per email
customers_deduped = (customers
.sort_values('signup_date')
.drop_duplicates(subset=['email'], keep='last')
.reset_index(drop=True)
)
print("\nDeduplicated (keeping most recent):")
print(customers_deduped)
Output:
Deduplicated (keeping most recent):
email name phone signup_date source
0 carol@example.com Carol White 555-0104 2023-03-05 referral
1 bob@example.com Robert Jones 555-0105 2023-08-15 mobile
2 alice@example.com Alice Smith-Johnson 555-0106 2023-12-01 web
The pattern sort_values().drop_duplicates(keep='last') is powerful. By sorting first, you control exactly which record “wins” the deduplication. You could sort by multiple columns to break ties:
# Keep most recent; for identical dates the 'web' row wins because
# 'web' sorts last alphabetically ('mobile' < 'referral' < 'web')
customers_priority = (customers
    .sort_values(['signup_date', 'source'], ascending=[True, True])
    .drop_duplicates(subset=['email'], keep='last')
)
Performance Tips and Common Pitfalls
NaN handling will surprise you. Pandas treats NaN values as equal during duplicate detection:
df_with_nulls = pd.DataFrame({
'id': [1, 2, None, None, 3],
'value': ['A', 'B', 'C', 'D', 'E']
})
print("Original:")
print(df_with_nulls)
print("\nAfter deduplication on 'id':")
print(df_with_nulls.drop_duplicates(subset=['id']))
Output:
Original:
id value
0 1.0 A
1 2.0 B
2 NaN C
3 NaN D
4 3.0 E
After deduplication on 'id':
id value
0 1.0 A
1 2.0 B
2 NaN C
4 3.0 E
Row 3 was dropped because it’s considered a duplicate of row 2—both have NaN in the id column. If you need to preserve rows with null values, handle them separately:
# Preserve all rows with null IDs
df_with_ids = df_with_nulls[df_with_nulls['id'].notna()]
df_null_ids = df_with_nulls[df_with_nulls['id'].isna()]
df_deduped = pd.concat([
df_with_ids.drop_duplicates(subset=['id']),
df_null_ids
])
Common mistakes to avoid:
- Forgetting assignment: `df.drop_duplicates(subset=['id'])` does nothing without assignment or `inplace=True`.
- Typos in column names: Pandas raises `KeyError` for missing columns, but watch for subtle typos that match other columns.
- Assuming row order: After deduplication, row order depends on `keep` and original ordering. Always sort explicitly if order matters.
Performance considerations: For datasets over a million rows, drop_duplicates() performs well because it uses hash-based detection. However, sorting first (for keep='last' patterns) adds O(n log n) complexity. If you only need keep='first' behavior, skip the sort.
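A rough way to see the sorting cost yourself (timings vary by machine; this is a sketch on synthetic data, not a rigorous benchmark):

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
big = pd.DataFrame({
    'id': rng.integers(0, 100_000, size=1_000_000),
    'value': rng.random(1_000_000),
})

t0 = time.perf_counter()
first = big.drop_duplicates(subset=['id'])                  # hash-based, no sort
t1 = time.perf_counter()
last = big.sort_values('value').drop_duplicates(subset=['id'], keep='last')
t2 = time.perf_counter()

print(f"keep='first' without sort: {t1 - t0:.3f}s")
print(f"sort + keep='last':        {t2 - t1:.3f}s")
```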
Conclusion
The subset parameter transforms drop_duplicates() from a blunt instrument into a precise data cleaning tool. Here’s your quick reference:
| Parameter | Values | Purpose |
|---|---|---|
| `subset` | column name or list | Columns to check for duplicates |
| `keep` | `'first'`, `'last'`, `False` | Which duplicate to retain |
| `inplace` | `True`/`False` | Modify in place (avoid this) |
The most useful pattern for real-world cleaning: df.sort_values('date').drop_duplicates(subset=['id'], keep='last'). This gives you the most recent record per identifier, which handles the majority of deduplication scenarios you’ll encounter.
For edge cases involving NaN values or complex deduplication logic, consider using groupby() with aggregation functions instead—it provides more explicit control over how duplicate groups are resolved.
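A minimal sketch of that `groupby()` alternative, using a cut-down version of the customer data from earlier: unlike `drop_duplicates()`, it lets each column of a duplicate group be resolved with its own rule.

```python
import pandas as pd

# Cut-down customer data: alice appears twice
customers = pd.DataFrame({
    'email': ['alice@example.com', 'bob@example.com', 'alice@example.com'],
    'order_amount': [150.00, 200.00, 175.00],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-06-10']),
})

# Resolve each duplicate group explicitly: total the amounts
# but keep only the latest signup date per email
resolved = customers.groupby('email', as_index=False).agg(
    order_amount=('order_amount', 'sum'),
    signup_date=('signup_date', 'max'),
)
print(resolved)
```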