How to Drop Duplicates Based on Specific Columns in Pandas

Key Insights

  • The subset parameter in drop_duplicates() lets you identify duplicates based on specific columns while preserving other column data, which is essential for real-world data cleaning where exact row matches are rare.
  • Combining sort_values() with drop_duplicates(keep='last') gives you precise control over which record survives deduplication—critical when you need to keep the most recent or highest-priority entry.
  • NaN values are treated as equal during duplicate detection, meaning two rows with NaN in the same subset column will be considered duplicates, which can lead to unexpected data loss if you’re not careful.

Introduction

Duplicate data silently corrupts analysis. You calculate average order values, but some customers appear three times. You count unique users, but the same email shows up with different capitalization. You join tables and suddenly have phantom records multiplying your row count.

Pandas provides drop_duplicates() to handle this, but the default behavior—checking all columns for exact matches—rarely matches real-world needs. Most duplicates share some identifying information while differing in other columns. A customer might have the same email but different phone numbers across records. A transaction might have the same order ID but different timestamps.

The subset parameter solves this by letting you define which columns determine uniqueness. Here’s a dataset we’ll use throughout this article:

import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 102, 104],
    'email': ['alice@example.com', 'bob@example.com', 'alice@example.com', 
              'carol@example.com', 'bob@example.com', 'dave@example.com'],
    'order_amount': [150.00, 200.00, 175.00, 300.00, 225.00, 100.00],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-20', 
                   '2024-01-18', '2024-01-22', '2024-01-19']
})

print(df)

Output:

   customer_id              email  order_amount  order_date
0          101  alice@example.com         150.0  2024-01-15
1          102    bob@example.com         200.0  2024-01-16
2          101  alice@example.com         175.0  2024-01-20
3          103  carol@example.com         300.0  2024-01-18
4          102    bob@example.com         225.0  2024-01-22
5          104   dave@example.com         100.0  2024-01-19

Notice that customers 101 and 102 appear twice with different order amounts and dates. Without subset, drop_duplicates() would keep all six rows since no two rows are identical across all columns.
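You can verify this by calling drop_duplicates() with no arguments on the dataset above:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 102, 104],
    'email': ['alice@example.com', 'bob@example.com', 'alice@example.com',
              'carol@example.com', 'bob@example.com', 'dave@example.com'],
    'order_amount': [150.00, 200.00, 175.00, 300.00, 225.00, 100.00],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-20',
                   '2024-01-18', '2024-01-22', '2024-01-19']
})

# No subset: rows must match across every column, so nothing is dropped
print(len(df.drop_duplicates()))  # 6
```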

Basic Syntax: Using the subset Parameter

The subset parameter accepts a column label or a list of column labels. Only these columns are considered when identifying duplicates.

Single column deduplication:

# Keep one row per customer_id
unique_customers = df.drop_duplicates(subset=['customer_id'])
print(unique_customers)

Output:

   customer_id              email  order_amount  order_date
0          101  alice@example.com         150.0  2024-01-15
1          102    bob@example.com         200.0  2024-01-16
3          103  carol@example.com         300.0  2024-01-18
5          104   dave@example.com         100.0  2024-01-19

We now have exactly four rows—one per unique customer. The method kept the first occurrence of each customer_id and discarded later ones.

Multiple column deduplication:

Sometimes uniqueness depends on a combination of columns. Consider a scenario where the same customer could legitimately have multiple emails, but the same customer_id and email combination indicates a true duplicate:

# Keep one row per customer_id + email combination
unique_by_combo = df.drop_duplicates(subset=['customer_id', 'email'])
print(unique_by_combo)

This produces the same result in our example because customer_id and email always match together. But in messier real-world data, you’ll often need multiple columns to properly define uniqueness.
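Here is a small illustrative case (data invented for this sketch) where the single-column and multi-column subsets diverge: one customer_id carries two legitimate emails.

```python
import pandas as pd

# Hypothetical data: customer 201 has two distinct, valid emails
df2 = pd.DataFrame({
    'customer_id': [201, 201, 201],
    'email': ['a@example.com', 'a@example.com', 'b@example.com'],
})

# Deduplicating on customer_id alone collapses everything to one row,
# while the two-column subset preserves each distinct email
print(len(df2.drop_duplicates(subset=['customer_id'])))           # 1
print(len(df2.drop_duplicates(subset=['customer_id', 'email'])))  # 2
```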

You can also pass a single string instead of a list for single-column deduplication:

# These are equivalent
df.drop_duplicates(subset='customer_id')
df.drop_duplicates(subset=['customer_id'])

I recommend always using a list for consistency and readability, especially since you’ll often need to add columns later.

Controlling Which Duplicate to Keep

The keep parameter determines which duplicate row survives. It accepts three values:

  • 'first' (default): Keep the first occurrence, drop subsequent duplicates
  • 'last': Keep the last occurrence, drop earlier duplicates
  • False: Drop all duplicates, keeping none

# Sample data with clear ordering
df_ordered = pd.DataFrame({
    'id': [1, 2, 1, 2, 3],
    'value': ['A', 'B', 'C', 'D', 'E'],
    'timestamp': [1, 2, 3, 4, 5]
})

print("Original:")
print(df_ordered)
print("\nkeep='first':")
print(df_ordered.drop_duplicates(subset=['id'], keep='first'))
print("\nkeep='last':")
print(df_ordered.drop_duplicates(subset=['id'], keep='last'))
print("\nkeep=False:")
print(df_ordered.drop_duplicates(subset=['id'], keep=False))

Output:

Original:
   id value  timestamp
0   1     A          1
1   2     B          2
2   1     C          3
3   2     D          4
4   3     E          5

keep='first':
   id value  timestamp
0   1     A          1
1   2     B          2
4   3     E          5

keep='last':
   id value  timestamp
2   1     C          3
3   2     D          4
4   3     E          5

keep=False:
   id value  timestamp
4   3     E          5

Use keep='first' when earlier records are more authoritative (original registrations, first submissions). Use keep='last' when later records should override earlier ones (most recent updates, latest status). Use keep=False when you want to identify and remove all records that have any duplicates—useful for finding problematic data that needs manual review.
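For the review workflow, you often want to inspect the conflicting rows before deleting anything. The related duplicated() method returns a boolean mask, and keep=False marks every member of each duplicate group:

```python
import pandas as pd

df_ordered = pd.DataFrame({
    'id': [1, 2, 1, 2, 3],
    'value': ['A', 'B', 'C', 'D', 'E'],
    'timestamp': [1, 2, 3, 4, 5]
})

# Mask marking every row whose id appears more than once
dupe_mask = df_ordered.duplicated(subset=['id'], keep=False)
print(df_ordered[dupe_mask])  # rows 0-3: the conflicting ids 1 and 2
```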

In-Place Modification vs. Returning a New DataFrame

By default, drop_duplicates() returns a new DataFrame, leaving the original unchanged:

# Assignment approach (recommended)
df_clean = df.drop_duplicates(subset=['customer_id'])
print(f"Original rows: {len(df)}")
print(f"Cleaned rows: {len(df_clean)}")

Output:

Original rows: 6
Cleaned rows: 4

The inplace=True parameter modifies the DataFrame directly:

# In-place modification
df_copy = df.copy()
df_copy.drop_duplicates(subset=['customer_id'], inplace=True)
print(f"Modified rows: {len(df_copy)}")

My recommendation: avoid inplace=True. The pandas team discourages it (and has proposed deprecating it for most methods), it makes code harder to chain, and it provides no meaningful performance benefit. The assignment pattern is clearer and more flexible:

# Chain operations cleanly
df_processed = (df
    .drop_duplicates(subset=['customer_id'])
    .reset_index(drop=True)
    .assign(processed=True)
)

For very large datasets where memory is genuinely constrained, consider processing in chunks rather than relying on inplace modifications.
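A minimal sketch of the chunked approach, using StringIO to stand in for a large file on disk (the column names here mirror the article's example data):

```python
import pandas as pd
from io import StringIO

# StringIO stands in for a large CSV file on disk
csv_data = StringIO(
    "customer_id,order_amount\n"
    "101,150\n102,200\n101,175\n103,300\n102,225\n"
)

# Deduplicate within each chunk as it streams in
parts = []
for chunk in pd.read_csv(csv_data, chunksize=2):  # tiny chunks for illustration
    parts.append(chunk.drop_duplicates(subset=['customer_id']))

# Duplicates can span chunk boundaries, so deduplicate once more at the end
result = pd.concat(parts).drop_duplicates(subset=['customer_id'])
print(len(result))  # 3 unique customers
```

Only the per-chunk survivors are held in memory at once, which keeps the final concat-and-deduplicate pass small.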

Real-World Example: Cleaning Customer Data

Here’s a practical scenario: you have customer records from multiple data sources, and you need to keep only the most recent record for each email address.

# Simulated customer data with duplicates
customers = pd.DataFrame({
    'email': ['alice@example.com', 'bob@example.com', 'alice@example.com',
              'carol@example.com', 'bob@example.com', 'alice@example.com'],
    'name': ['Alice Smith', 'Bob Jones', 'Alice S.', 
             'Carol White', 'Robert Jones', 'Alice Smith-Johnson'],
    'phone': ['555-0101', '555-0102', '555-0103', 
              '555-0104', '555-0105', '555-0106'],
    'signup_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-06-10',
                                   '2023-03-05', '2023-08-15', '2023-12-01']),
    'source': ['web', 'mobile', 'web', 'referral', 'mobile', 'web']
})

print("Original customer data:")
print(customers)

To keep the most recent record per email:

# Sort by signup_date, then keep the last (most recent) per email
customers_deduped = (customers
    .sort_values('signup_date')
    .drop_duplicates(subset=['email'], keep='last')
    .reset_index(drop=True)
)

print("\nDeduplicated (keeping most recent):")
print(customers_deduped)

Output:

Deduplicated (keeping most recent):
               email                 name     phone signup_date   source
0  carol@example.com          Carol White  555-0104  2023-03-05  referral
1    bob@example.com         Robert Jones  555-0105  2023-08-15   mobile
2  alice@example.com  Alice Smith-Johnson  555-0106  2023-12-01      web

The pattern sort_values().drop_duplicates(keep='last') is powerful. By sorting first, you control exactly which record “wins” the deduplication. You could sort by multiple columns to break ties:

# Break signup_date ties: 'web' sorts alphabetically after 'mobile' and
# 'referral', so with keep='last' a 'web' record wins any date tie
customers_priority = (customers
    .sort_values(['signup_date', 'source'], ascending=[True, True])
    .drop_duplicates(subset=['email'], keep='last')
)

Performance Tips and Common Pitfalls

NaN handling will surprise you. Pandas treats NaN values as equal during duplicate detection:

df_with_nulls = pd.DataFrame({
    'id': [1, 2, None, None, 3],
    'value': ['A', 'B', 'C', 'D', 'E']
})

print("Original:")
print(df_with_nulls)
print("\nAfter deduplication on 'id':")
print(df_with_nulls.drop_duplicates(subset=['id']))

Output:

Original:
    id value
0  1.0     A
1  2.0     B
2  NaN     C
3  NaN     D
4  3.0     E

After deduplication on 'id':
    id value
0  1.0     A
1  2.0     B
2  NaN     C
4  3.0     E

Row 3 was dropped because it’s considered a duplicate of row 2—both have NaN in the id column. If you need to preserve rows with null values, handle them separately:

# Preserve all rows with null IDs
df_with_ids = df_with_nulls[df_with_nulls['id'].notna()]
df_null_ids = df_with_nulls[df_with_nulls['id'].isna()]

df_deduped = pd.concat([
    df_with_ids.drop_duplicates(subset=['id']),
    df_null_ids
])

Common mistakes to avoid:

  1. Forgetting assignment: df.drop_duplicates(subset=['id']) does nothing without assignment or inplace=True.

  2. Typos in column names: Pandas raises KeyError for missing columns, but watch for subtle typos that match other columns.

  3. Assuming row order: After deduplication, row order depends on keep and original ordering. Always sort explicitly if order matters.

Performance considerations: For datasets over a million rows, drop_duplicates() performs well because it uses hash-based detection. However, sorting first (for keep='last' patterns) adds O(n log n) complexity. If you only need keep='first' behavior, skip the sort.

Conclusion

The subset parameter transforms drop_duplicates() from a blunt instrument into a precise data cleaning tool. Here’s your quick reference:

Parameter   Values                    Purpose
subset      column name or list       Columns to check for duplicates
keep        'first', 'last', False    Which duplicate to retain
inplace     True / False              Modify in place (avoid this)

The most useful pattern for real-world cleaning: df.sort_values('date').drop_duplicates(subset=['id'], keep='last'). This gives you the most recent record per identifier, which handles the majority of deduplication scenarios you’ll encounter.

For edge cases involving NaN values or complex deduplication logic, consider using groupby() with aggregation functions instead—it provides more explicit control over how duplicate groups are resolved.
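As one sketch of the groupby alternative (invented data), idxmax() on the date column selects the row index of the most recent record per group, which you can then feed to .loc:

```python
import pandas as pd

customers = pd.DataFrame({
    'email': ['a@example.com', 'a@example.com', 'b@example.com'],
    'signup_date': pd.to_datetime(['2023-01-01', '2023-06-01', '2023-02-01']),
    'phone': ['111', '222', '333'],
})

# For each email, pick the index of the row with the latest signup_date
latest = customers.loc[customers.groupby('email')['signup_date'].idxmax()]
print(latest)
```

Unlike sort-then-drop, this makes the "keep the maximum date" rule explicit, and you can swap in idxmin() or a custom aggregation just as easily.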
