How to Drop Duplicates in Polars
Duplicate rows corrupt analysis. They inflate counts, skew aggregations, and break joins. Every data pipeline needs a reliable deduplication strategy.
Key Insights
- Polars provides the unique() method for dropping duplicates, offering more control than pandas with parameters like keep, subset, and maintain_order
- The keep parameter gives you four options (first, last, any, none) compared to pandas’ three, with any being the fastest when row order doesn’t matter
- Using lazy evaluation with unique() enables Polars to optimize query execution for large datasets, often yielding significant performance improvements
Why Duplicate Handling Matters
Polars handles duplicates efficiently because it processes data in parallel and uses Apache Arrow’s columnar format. Where pandas might struggle with millions of rows, Polars stays fast. The unique() method is your primary tool, and understanding its parameters will save you hours of debugging and performance tuning.
Let’s work through the practical ways to drop duplicates in Polars, from basic usage to optimized lazy evaluation.
Basic Duplicate Removal with unique()
The simplest approach removes rows that are identical across all columns. Call unique() with no arguments:
import polars as pl
df = pl.DataFrame({
"user_id": [1, 2, 2, 3, 3, 3],
"action": ["login", "click", "click", "purchase", "purchase", "purchase"],
"timestamp": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03", "2024-01-03"]
})
print("Original DataFrame:")
print(df)
print("\nAfter removing duplicates:")
print(df.unique())
Output:
Original DataFrame:
shape: (6, 3)
┌─────────┬──────────┬────────────┐
│ user_id ┆ action ┆ timestamp │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════════╪══════════╪════════════╡
│ 1 ┆ login ┆ 2024-01-01 │
│ 2 ┆ click ┆ 2024-01-02 │
│ 2 ┆ click ┆ 2024-01-02 │
│ 3 ┆ purchase ┆ 2024-01-03 │
│ 3 ┆ purchase ┆ 2024-01-03 │
│ 3 ┆ purchase ┆ 2024-01-03 │
└─────────┴──────────┴────────────┘
After removing duplicates:
shape: (3, 3)
┌─────────┬──────────┬────────────┐
│ user_id ┆ action ┆ timestamp │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════════╪══════════╪════════════╡
│ 1 ┆ login ┆ 2024-01-01 │
│ 2 ┆ click ┆ 2024-01-02 │
│ 3 ┆ purchase ┆ 2024-01-03 │
└─────────┴──────────┴────────────┘
This compares every column. Rows must match exactly to be considered duplicates.
Dropping Duplicates on Specific Columns
Real-world deduplication rarely involves all columns. You typically want unique combinations of key fields while keeping other data. The subset parameter handles this:
df = pl.DataFrame({
"user_id": [1, 1, 2, 2, 3],
"email": ["a@test.com", "a@test.com", "b@test.com", "b@test.com", "c@test.com"],
"login_count": [5, 10, 3, 7, 2],
"last_seen": ["2024-01-01", "2024-01-15", "2024-01-02", "2024-01-20", "2024-01-03"]
})
print("Original DataFrame:")
print(df)
# Keep unique users based on user_id only
print("\nUnique by user_id:")
print(df.unique(subset=["user_id"]))
# Keep unique combinations of user_id and email
print("\nUnique by user_id and email:")
print(df.unique(subset=["user_id", "email"]))
Output:
Original DataFrame:
shape: (5, 4)
┌─────────┬────────────┬─────────────┬────────────┐
│ user_id ┆ email ┆ login_count ┆ last_seen │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞═════════╪════════════╪═════════════╪════════════╡
│ 1 ┆ a@test.com ┆ 5 ┆ 2024-01-01 │
│ 1 ┆ a@test.com ┆ 10 ┆ 2024-01-15 │
│ 2 ┆ b@test.com ┆ 3 ┆ 2024-01-02 │
│ 2 ┆ b@test.com ┆ 7 ┆ 2024-01-20 │
│ 3 ┆ c@test.com ┆ 2 ┆ 2024-01-03 │
└─────────┴────────────┴─────────────┴────────────┘
Unique by user_id:
shape: (3, 4)
┌─────────┬────────────┬─────────────┬────────────┐
│ user_id ┆ email ┆ login_count ┆ last_seen │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞═════════╪════════════╪═════════════╪════════════╡
│ 1 ┆ a@test.com ┆ 5 ┆ 2024-01-01 │
│ 2 ┆ b@test.com ┆ 3 ┆ 2024-01-02 │
│ 3 ┆ c@test.com ┆ 2 ┆ 2024-01-03 │
└─────────┴────────────┴─────────────┴────────────┘
When using subset, Polars keeps one row per unique combination of the specified columns. The values in other columns come from whichever row Polars selects—which brings us to the keep parameter.
Controlling Which Duplicate to Keep
The keep parameter determines which row survives when duplicates exist. Polars offers four options:
- first: Keep the first occurrence (based on row order)
- last: Keep the last occurrence
- any: Keep any occurrence (fastest, non-deterministic)
- none: Remove all duplicates entirely
df = pl.DataFrame({
"product_id": ["A", "A", "A", "B", "B"],
"price": [100, 95, 90, 50, 55],
"date": ["2024-01-01", "2024-01-15", "2024-01-30", "2024-01-01", "2024-01-15"]
})
print("Original DataFrame:")
print(df)
print("\nKeep first occurrence:")
print(df.unique(subset=["product_id"], keep="first", maintain_order=True))
print("\nKeep last occurrence:")
print(df.unique(subset=["product_id"], keep="last", maintain_order=True))
print("\nKeep none (remove all duplicates):")
print(df.unique(subset=["product_id"], keep="none"))
Output:
Original DataFrame:
shape: (5, 3)
┌────────────┬───────┬────────────┐
│ product_id ┆ price ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════╪═══════╪════════════╡
│ A ┆ 100 ┆ 2024-01-01 │
│ A ┆ 95 ┆ 2024-01-15 │
│ A ┆ 90 ┆ 2024-01-30 │
│ B ┆ 50 ┆ 2024-01-01 │
│ B ┆ 55 ┆ 2024-01-15 │
└────────────┴───────┴────────────┘
Keep first occurrence:
shape: (2, 3)
┌────────────┬───────┬────────────┐
│ product_id ┆ price ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════╪═══════╪════════════╡
│ A ┆ 100 ┆ 2024-01-01 │
│ B ┆ 50 ┆ 2024-01-01 │
└────────────┴───────┴────────────┘
Keep last occurrence:
shape: (2, 3)
┌────────────┬───────┬────────────┐
│ product_id ┆ price ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════╪═══════╪════════════╡
│ A ┆ 90 ┆ 2024-01-30 │
│ B ┆ 55 ┆ 2024-01-15 │
└────────────┴───────┴────────────┘
Keep none (remove all duplicates):
shape: (0, 3)
┌────────────┬───────┬──────┐
│ product_id ┆ price ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════╪═══════╪══════╡
└────────────┴───────┴──────┘
The keep="none" option is particularly useful for finding and removing problematic data. If a product_id should be unique and isn’t, keep="none" eliminates all offending rows so you can investigate.
Use keep="any" when you genuinely don’t care which row survives. It’s the fastest option because Polars can parallelize without tracking row positions.
Maintaining Row Order
By default, Polars doesn’t guarantee output order. This enables parallel processing optimizations. If order matters, set maintain_order=True:
import polars as pl
df = pl.DataFrame({
"id": [3, 1, 2, 1, 3, 2],
"value": ["c", "a", "b", "a2", "c2", "b2"]
})
print("Original DataFrame:")
print(df)
print("\nWithout maintain_order (order may vary):")
print(df.unique(subset=["id"], keep="first"))
print("\nWith maintain_order=True:")
print(df.unique(subset=["id"], keep="first", maintain_order=True))
Output:
Original DataFrame:
shape: (6, 2)
┌─────┬───────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 3 ┆ c │
│ 1 ┆ a │
│ 2 ┆ b │
│ 1 ┆ a2 │
│ 3 ┆ c2 │
│ 2 ┆ b2 │
└─────┴───────┘
Without maintain_order, the row order of the result is nondeterministic, so that output is omitted here.
With maintain_order=True:
shape: (3, 2)
┌─────┬───────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 3 ┆ c │
│ 1 ┆ a │
│ 2 ┆ b │
└─────┴───────┘
The performance cost of maintain_order=True is real but often acceptable. For most datasets under 10 million rows, you won’t notice. For larger datasets or performance-critical pipelines, benchmark both approaches.
Important: When using keep="first" or keep="last", you almost always want maintain_order=True. “First” and “last” are defined relative to the original row order, but without maintain_order=True the output rows come back in an arbitrary order, which surprises any downstream code that depends on ordering.
Lazy Frame Optimization
For large datasets, lazy evaluation lets Polars optimize your entire query before execution. The unique() method works seamlessly with lazy frames:
import polars as pl
# Simulate a larger dataset
df = pl.DataFrame({
"user_id": list(range(100000)) * 3,
"session_id": list(range(300000)),
"action": ["click"] * 300000
})
# Eager execution
result_eager = df.unique(subset=["user_id"])
# Lazy execution
result_lazy = (
df.lazy()
.unique(subset=["user_id"], maintain_order=True)
.collect()
)
print(f"Eager result shape: {result_eager.shape}")
print(f"Lazy result shape: {result_lazy.shape}")
The lazy approach shines when you chain multiple operations:
result = (
pl.scan_parquet("large_file.parquet")
.filter(pl.col("status") == "active")
.unique(subset=["customer_id"], keep="last")
.select(["customer_id", "email", "last_purchase"])
.collect()
)
Polars optimizes this entire chain. It might push the filter before reading all columns, or combine operations to minimize memory usage. You get these optimizations for free by using lazy evaluation.
For datasets beyond a few million rows, prefer the lazy API. The query optimizer can cut execution time substantially, particularly when predicate and projection pushdown reduce how much data gets read in the first place.
Conclusion
Dropping duplicates in Polars is straightforward once you understand the parameters:
- Use unique() with no arguments for exact row deduplication
- Use subset to deduplicate based on specific columns
- Use keep to control which duplicate survives (first, last, any, none)
- Use maintain_order=True when row order matters, especially with keep="first" or keep="last"
- Use lazy evaluation for large datasets to benefit from query optimization
The default keep="any" is fastest when you don’t care which row survives. Switch to keep="first" or keep="last" when you need deterministic results based on row position.
For production pipelines processing millions of rows, combine lazy evaluation with explicit subset and keep parameters. This gives you both performance and predictable behavior.