How to Drop Duplicates in Polars
Duplicate rows corrupt analysis. They inflate counts, skew aggregations, and break joins. Every data pipeline needs a reliable deduplication strategy.
Key Insights
- Polars provides the unique() method for dropping duplicates, offering more control than pandas with parameters like keep, subset, and maintain_order
- The keep parameter gives you four options (first, last, any, none) compared to pandas’ three, with any being the fastest when row order doesn’t matter
- Using lazy evaluation with unique() enables Polars to optimize query execution for large datasets, often yielding significant performance improvements
Why Duplicate Handling Matters
Polars handles duplicates efficiently because it processes data in parallel and uses Apache Arrow’s columnar format. Where pandas might struggle with millions of rows, Polars stays fast. The unique() method is your primary tool, and understanding its parameters will save you hours of debugging and performance tuning.
Let’s work through the practical ways to drop duplicates in Polars, from basic usage to optimized lazy evaluation.
Basic Duplicate Removal with unique()
The simplest approach removes rows that are identical across all columns. Call unique() with no arguments:
import polars as pl
df = pl.DataFrame({
"user_id": [1, 2, 2, 3, 3, 3],
"action": ["login", "click", "click", "purchase", "purchase", "purchase"],
"timestamp": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03", "2024-01-03"]
})
print("Original DataFrame:")
print(df)
print("\nAfter removing duplicates:")
print(df.unique())
Output:
Original DataFrame:
shape: (6, 3)
┌─────────┬──────────┬────────────┐
│ user_id ┆ action ┆ timestamp │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════════╪══════════╪════════════╡
│ 1 ┆ login ┆ 2024-01-01 │
│ 2 ┆ click ┆ 2024-01-02 │
│ 2 ┆ click ┆ 2024-01-02 │
│ 3 ┆ purchase ┆ 2024-01-03 │
│ 3 ┆ purchase ┆ 2024-01-03 │
│ 3 ┆ purchase ┆ 2024-01-03 │
└─────────┴──────────┴────────────┘
After removing duplicates:
shape: (3, 3)
┌─────────┬──────────┬────────────┐
│ user_id ┆ action ┆ timestamp │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════════╪══════════╪════════════╡
│ 1 ┆ login ┆ 2024-01-01 │
│ 2 ┆ click ┆ 2024-01-02 │
│ 3 ┆ purchase ┆ 2024-01-03 │
└─────────┴──────────┴────────────┘
This compares every column. Rows must match exactly to be considered duplicates.
Dropping Duplicates on Specific Columns
Real-world deduplication rarely involves all columns. You typically want unique combinations of key fields while keeping other data. The subset parameter handles this:
df = pl.DataFrame({
"user_id": [1, 1, 2, 2, 3],
"email": ["a@test.com", "a@test.com", "b@test.com", "b@test.com", "c@test.com"],
"login_count": [5, 10, 3, 7, 2],
"last_seen": ["2024-01-01", "2024-01-15", "2024-01-02", "2024-01-20", "2024-01-03"]
})
print("Original DataFrame:")
print(df)
# Keep unique users based on user_id only
print("\nUnique by user_id:")
print(df.unique(subset=["user_id"]))
# Keep unique combinations of user_id and email
print("\nUnique by user_id and email:")
print(df.unique(subset=["user_id", "email"]))
Output:
Original DataFrame:
shape: (5, 4)
┌─────────┬────────────┬─────────────┬────────────┐
│ user_id ┆ email ┆ login_count ┆ last_seen │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞═════════╪════════════╪═════════════╪════════════╡
│ 1 ┆ a@test.com ┆ 5 ┆ 2024-01-01 │
│ 1 ┆ a@test.com ┆ 10 ┆ 2024-01-15 │
│ 2 ┆ b@test.com ┆ 3 ┆ 2024-01-02 │
│ 2 ┆ b@test.com ┆ 7 ┆ 2024-01-20 │
│ 3 ┆ c@test.com ┆ 2 ┆ 2024-01-03 │
└─────────┴────────────┴─────────────┴────────────┘
Unique by user_id:
shape: (3, 4)
┌─────────┬────────────┬─────────────┬────────────┐
│ user_id ┆ email ┆ login_count ┆ last_seen │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞═════════╪════════════╪═════════════╪════════════╡
│ 1 ┆ a@test.com ┆ 5 ┆ 2024-01-01 │
│ 2 ┆ b@test.com ┆ 3 ┆ 2024-01-02 │
│ 3 ┆ c@test.com ┆ 2 ┆ 2024-01-03 │
└─────────┴────────────┴─────────────┴────────────┘
When using subset, Polars keeps one row per unique combination of the specified columns. The values in other columns come from whichever row Polars selects—which brings us to the keep parameter.
Controlling Which Duplicate to Keep
The keep parameter determines which row survives when duplicates exist. Polars offers four options:
- first: Keep the first occurrence (based on row order)
- last: Keep the last occurrence
- any: Keep any occurrence (fastest, non-deterministic)
- none: Remove all duplicates entirely
df = pl.DataFrame({
"product_id": ["A", "A", "A", "B", "B"],
"price": [100, 95, 90, 50, 55],
"date": ["2024-01-01", "2024-01-15", "2024-01-30", "2024-01-01", "2024-01-15"]
})
print("Original DataFrame:")
print(df)
print("\nKeep first occurrence:")
print(df.unique(subset=["product_id"], keep="first", maintain_order=True))
print("\nKeep last occurrence:")
print(df.unique(subset=["product_id"], keep="last", maintain_order=True))
print("\nKeep none (remove all duplicates):")
print(df.unique(subset=["product_id"], keep="none"))
Output:
Original DataFrame:
shape: (5, 3)
┌────────────┬───────┬────────────┐
│ product_id ┆ price ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════╪═══════╪════════════╡
│ A ┆ 100 ┆ 2024-01-01 │
│ A ┆ 95 ┆ 2024-01-15 │
│ A ┆ 90 ┆ 2024-01-30 │
│ B ┆ 50 ┆ 2024-01-01 │
│ B ┆ 55 ┆ 2024-01-15 │
└────────────┴───────┴────────────┘
Keep first occurrence:
shape: (2, 3)
┌────────────┬───────┬────────────┐
│ product_id ┆ price ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════╪═══════╪════════════╡
│ A ┆ 100 ┆ 2024-01-01 │
│ B ┆ 50 ┆ 2024-01-01 │
└────────────┴───────┴────────────┘
Keep last occurrence:
shape: (2, 3)
┌────────────┬───────┬────────────┐
│ product_id ┆ price ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════╪═══════╪════════════╡
│ A ┆ 90 ┆ 2024-01-30 │
│ B ┆ 55 ┆ 2024-01-15 │
└────────────┴───────┴────────────┘
Keep none (remove all duplicates):
shape: (0, 3)
┌────────────┬───────┬──────┐
│ product_id ┆ price ┆ date │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞════════════╪═══════╪══════╡
└────────────┴───────┴──────┘
The keep="none" option is particularly useful for finding and removing problematic data. If a product_id should be unique and isn’t, keep="none" eliminates all offending rows so you can investigate.
Use keep="any" when you genuinely don’t care which row survives. It’s the fastest option because Polars can parallelize without tracking row positions.
Maintaining Row Order
By default, Polars doesn’t guarantee output order. This enables parallel processing optimizations. If order matters, set maintain_order=True:
import polars as pl
df = pl.DataFrame({
"id": [3, 1, 2, 1, 3, 2],
"value": ["c", "a", "b", "a2", "c2", "b2"]
})
print("Original DataFrame:")
print(df)
print("\nWithout maintain_order (order may vary):")
print(df.unique(subset=["id"], keep="first"))
print("\nWith maintain_order=True:")
print(df.unique(subset=["id"], keep="first", maintain_order=True))
Output:
Original DataFrame:
shape: (6, 2)
┌─────┬───────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 3 ┆ c │
│ 1 ┆ a │
│ 2 ┆ b │
│ 1 ┆ a2 │
│ 3 ┆ c2 │
│ 2 ┆ b2 │
└─────┴───────┘
Without maintain_order, the row order of the result is nondeterministic, so that output is omitted here.
With maintain_order=True:
shape: (3, 2)
┌─────┬───────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 3 ┆ c │
│ 1 ┆ a │
│ 2 ┆ b │
└─────┴───────┘
The performance cost of maintain_order=True is real but often acceptable. For most datasets under 10 million rows, you won’t notice. For larger datasets or performance-critical pipelines, benchmark both approaches.
Important: When using keep="first" or keep="last", you almost always want maintain_order=True. “First” and “last” are defined relative to the original row order, but without maintain_order=True the output rows come back in an arbitrary order, which surprises any downstream code that depends on ordering.
Lazy Frame Optimization
For large datasets, lazy evaluation lets Polars optimize your entire query before execution. The unique() method works seamlessly with lazy frames:
import polars as pl
# Simulate a larger dataset
df = pl.DataFrame({
"user_id": list(range(100000)) * 3,
"session_id": list(range(300000)),
"action": ["click"] * 300000
})
# Eager execution
result_eager = df.unique(subset=["user_id"])
# Lazy execution
result_lazy = (
df.lazy()
.unique(subset=["user_id"], maintain_order=True)
.collect()
)
print(f"Eager result shape: {result_eager.shape}")
print(f"Lazy result shape: {result_lazy.shape}")
The lazy approach shines when you chain multiple operations:
result = (
pl.scan_parquet("large_file.parquet")
.filter(pl.col("status") == "active")
.unique(subset=["customer_id"], keep="last")
.select(["customer_id", "email", "last_purchase"])
.collect()
)
Polars optimizes this entire chain. It might push the filter before reading all columns, or combine operations to minimize memory usage. You get these optimizations for free by using lazy evaluation.
For datasets beyond a few million rows, prefer the lazy API. The query optimizer can cut execution time substantially, particularly when predicate and projection pushdown reduce how much data gets read in the first place.
Conclusion
Dropping duplicates in Polars is straightforward once you understand the parameters:
- Use unique() with no arguments for exact row deduplication
- Use subset to deduplicate based on specific columns
- Use keep to control which duplicate survives (first, last, any, none)
- Use maintain_order=True when row order matters, especially with keep="first" or keep="last"
- Use lazy evaluation for large datasets to benefit from query optimization
The default keep="any" is fastest when you don’t care which row survives. Switch to keep="first" or keep="last" when you need deterministic results based on row position.
For production pipelines processing millions of rows, combine lazy evaluation with explicit subset and keep parameters. This gives you both performance and predictable behavior.