How to Drop Duplicate Rows in Pandas

Key Insights

  • The drop_duplicates() method checks all columns by default, but using the subset parameter lets you target specific columns for deduplication—essential when you need to keep unique records by ID while allowing other fields to vary.
  • The keep parameter controls which duplicate survives: 'first' (default), 'last', or False to remove all duplicates entirely—choose based on whether your data has a meaningful order.
  • Always use duplicated() to preview which rows will be removed before actually dropping them; this prevents accidental data loss and helps you understand your duplicate patterns.

Introduction to Duplicate Data

Duplicate rows are inevitable in real-world datasets. They creep in through database merges, manual data entry errors, repeated API calls, or CSV imports that accidentally run twice. Left unchecked, duplicates skew your analysis—inflating counts, distorting averages, and creating misleading visualizations.

Pandas provides the drop_duplicates() method to handle this problem cleanly. Before diving into the method itself, let’s create a sample DataFrame with intentional duplicates:

import pandas as pd

# Sample customer data with duplicates
data = {
    'customer_id': [101, 102, 101, 103, 102, 104],
    'email': ['alice@email.com', 'bob@email.com', 'alice@email.com', 
              'carol@email.com', 'bob@email.com', 'dave@email.com'],
    'purchase_amount': [150.00, 200.00, 150.00, 75.00, 200.00, 300.00],
    'purchase_date': ['2024-01-15', '2024-01-16', '2024-01-15', 
                      '2024-01-17', '2024-01-16', '2024-01-18']
}

df = pd.DataFrame(data)
print(df)

Output:

   customer_id            email  purchase_amount purchase_date
0          101  alice@email.com           150.00    2024-01-15
1          102    bob@email.com           200.00    2024-01-16
2          101  alice@email.com           150.00    2024-01-15
3          103  carol@email.com            75.00    2024-01-17
4          102    bob@email.com           200.00    2024-01-16
5          104   dave@email.com           300.00    2024-01-18

Rows 0 and 2 are exact duplicates, as are rows 1 and 4. This is the kind of data pollution that drop_duplicates() was built to handle.

Basic Usage of drop_duplicates()

The simplest form of drop_duplicates() requires no arguments. By default, it examines all columns, keeps the first occurrence of each duplicate set, and returns a new DataFrame:

# Remove all duplicate rows
df_clean = df.drop_duplicates()
print(df_clean)

Output:

   customer_id            email  purchase_amount purchase_date
0          101  alice@email.com           150.00    2024-01-15
1          102    bob@email.com           200.00    2024-01-16
3          103  carol@email.com            75.00    2024-01-17
5          104   dave@email.com           300.00    2024-01-18

Notice that rows 2 and 4 are gone. The original DataFrame df remains unchanged—drop_duplicates() returns a new DataFrame by default.

If you want to modify the DataFrame in place, use the inplace parameter:

# Modify the original DataFrame directly
df.drop_duplicates(inplace=True)

I generally avoid inplace=True in modern Pandas code. The Pandas team has discussed deprecating the inplace pattern, and explicit reassignment (df = df.drop_duplicates()) is clearer and plays better with method chaining.
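As a quick illustration of why reassignment composes well, here is a minimal chained pipeline (the data and column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2],
    'value': [10, 10, 20],
})

# Explicit reassignment lets each step feed the next:
# deduplicate, sort, then rebuild a clean 0-based index.
df = (
    df.drop_duplicates()
      .sort_values('value')
      .reset_index(drop=True)
)
print(len(df))  # 2 unique rows remain
```

With inplace=True, each step would have to be a separate statement, since in-place methods return None and cannot be chained.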

Selecting Columns with subset

Real-world deduplication rarely means checking every column. More often, you want to identify duplicates based on specific fields—like removing duplicate customers by email address, regardless of when they made purchases.

The subset parameter accepts a column name or list of column names:

# Create data where only some columns have duplicates
data = {
    'customer_id': [101, 102, 103, 104, 105],
    'email': ['alice@email.com', 'bob@email.com', 'alice@email.com', 
              'carol@email.com', 'bob@email.com'],
    'purchase_amount': [150.00, 200.00, 175.00, 75.00, 250.00],
    'purchase_date': ['2024-01-15', '2024-01-16', '2024-01-20', 
                      '2024-01-17', '2024-01-22']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Drop duplicates based only on email
df_unique_emails = df.drop_duplicates(subset=['email'])
print("\nAfter dropping duplicates by email:")
print(df_unique_emails)

Output:

Original DataFrame:
   customer_id            email  purchase_amount purchase_date
0          101  alice@email.com           150.00    2024-01-15
1          102    bob@email.com           200.00    2024-01-16
2          103  alice@email.com           175.00    2024-01-20
3          104  carol@email.com            75.00    2024-01-17
4          105    bob@email.com           250.00    2024-01-22

After dropping duplicates by email:
   customer_id            email  purchase_amount purchase_date
0          101  alice@email.com           150.00    2024-01-15
1          102    bob@email.com           200.00    2024-01-16
3          104  carol@email.com            75.00    2024-01-17

Rows 2 and 4 were removed because their emails matched earlier rows, even though the other columns differed. This is incredibly useful for scenarios like:

  • Keeping the first order per customer
  • Deduplicating contacts by phone number
  • Removing repeated log entries by timestamp and event type
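The first pattern above, keeping each customer's earliest order, usually combines sort_values with drop_duplicates. A minimal sketch with made-up data:

```python
import pandas as pd

orders = pd.DataFrame({
    'customer_id': [101, 102, 101, 102],
    'order_date': ['2024-02-10', '2024-01-05', '2024-01-03', '2024-03-01'],
    'amount': [50.0, 80.0, 30.0, 120.0],
})

# Sort chronologically so the earliest order comes first,
# then keep only the first row per customer.
first_orders = (
    orders.sort_values('order_date')
          .drop_duplicates(subset=['customer_id'], keep='first')
)
print(first_orders)
```

Sorting first is what gives keep='first' its meaning here; without it, "first" just means whichever row happened to appear earlier in the DataFrame.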

You can specify multiple columns in the subset:

# Treat rows as duplicates only when BOTH customer_id and purchase_date match
df_dedup = df.drop_duplicates(subset=['customer_id', 'purchase_date'])

Controlling Which Rows to Keep

The keep parameter determines which duplicate survives the culling. It accepts three values:

  • 'first' (default): Keep the first occurrence
  • 'last': Keep the last occurrence
  • False: Remove all duplicates entirely

Here’s a practical comparison:

data = {
    'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget'],
    'price': [10.00, 25.00, 12.00, 25.00, 11.00],
    'updated': ['2024-01-01', '2024-01-01', '2024-01-15', 
                '2024-01-20', '2024-01-30']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

print("\nkeep='first' (default):")
print(df.drop_duplicates(subset=['product'], keep='first'))

print("\nkeep='last':")
print(df.drop_duplicates(subset=['product'], keep='last'))

print("\nkeep=False (remove all duplicates):")
print(df.drop_duplicates(subset=['product'], keep=False))

Output:

Original DataFrame:
  product  price     updated
0  Widget  10.00  2024-01-01
1  Gadget  25.00  2024-01-01
2  Widget  12.00  2024-01-15
3  Gadget  25.00  2024-01-20
4  Widget  11.00  2024-01-30

keep='first' (default):
  product  price     updated
0  Widget  10.00  2024-01-01
1  Gadget  25.00  2024-01-01

keep='last':
  product  price     updated
3  Gadget  25.00  2024-01-20
4  Widget  11.00  2024-01-30

keep=False (remove all duplicates):
Empty DataFrame
Columns: [product, price, updated]
Index: []

Use keep='last' when your data is chronologically ordered and you want the most recent version. Use keep=False when duplicates indicate data quality issues and you want to flag or remove all affected records for manual review.

Identifying Duplicates Before Dropping

Blindly dropping rows is risky. The duplicated() method lets you preview which rows would be affected:

data = {
    'order_id': [1001, 1002, 1001, 1003, 1002],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
    'quantity': [1, 2, 1, 1, 2]
}

df = pd.DataFrame(data)

# See which rows are duplicates
print("Duplicate mask:")
print(df.duplicated())

# Filter to show only duplicate rows
print("\nDuplicate rows:")
print(df[df.duplicated()])

# Show ALL rows involved in duplication (including first occurrences)
print("\nAll rows with duplicates (including originals):")
print(df[df.duplicated(keep=False)])

Output:

Duplicate mask:
0    False
1    False
2     True
3    False
4     True
dtype: bool

Duplicate rows:
   order_id product  quantity
2      1001  Laptop         1
4      1002   Mouse         2

All rows with duplicates (including originals):
   order_id product  quantity
0      1001  Laptop         1
1      1002   Mouse         2
2      1001  Laptop         1
4      1002   Mouse         2

The duplicated() method accepts the same subset and keep parameters as drop_duplicates(). Use it to audit your data before making irreversible changes.
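For instance, you can combine those parameters with .sum() to quantify duplication before deciding what to drop (the data here mirrors the example above):

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1001, 1002, 1001, 1003, 1002],
    'product': ['Laptop', 'Mouse', 'Laptop', 'Keyboard', 'Mouse'],
})

# How many rows the default drop_duplicates() call would remove
n_dupes = df.duplicated(subset=['order_id']).sum()
print(n_dupes)  # 2

# How many rows are involved in any duplication at all
n_involved = df.duplicated(subset=['order_id'], keep=False).sum()
print(n_involved)  # 4
```

Summing a boolean mask counts its True values, so these two numbers tell you how many rows a drop would remove and how many records deserve a closer look.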

Performance Considerations

For large datasets, a few practices improve performance and prevent surprises.

First, avoid inplace=True. Contrary to intuition, it generally does not save memory; Pandas typically creates an intermediate copy internally anyway. Explicit reassignment is clearer:

# Preferred approach
df = df.drop_duplicates()

Second, reset your index after dropping duplicates. The original index values are preserved by default, which can cause confusion:

# Clean up the index
df = df.drop_duplicates().reset_index(drop=True)
print(df)

The drop=True argument prevents the old index from being added as a new column.

Third, for very large datasets, consider dropping duplicates early in your pipeline—before expensive transformations. If you’re reading from a file and know duplicates exist, handle them immediately:

df = pd.read_csv('large_file.csv').drop_duplicates(subset=['id'])

Conclusion and Quick Reference

Handling duplicates is a fundamental data cleaning task. Here’s a quick reference for the drop_duplicates() parameters:

Parameter      Values                     Description
subset         column name or list        Columns to consider when identifying duplicates
keep           'first', 'last', False     Which duplicate to keep, or False to remove all
inplace        True, False                Modify the DataFrame in place (discouraged)
ignore_index   True, False                Reset the index in the result (Pandas 1.0+)

The ignore_index=True parameter (available in Pandas 1.0+) is a cleaner alternative to chaining reset_index(drop=True):

df_clean = df.drop_duplicates(ignore_index=True)

For the official documentation and edge cases, consult the Pandas drop_duplicates documentation.
