SQL - SELECT DISTINCT with Examples
SELECT DISTINCT filters duplicate rows from your result set. The operation examines all columns in your SELECT clause and returns only unique combinations.
Key Insights
- SELECT DISTINCT eliminates duplicate rows from query results by comparing all selected columns, with NULL values treated as equal to each other
- DISTINCT applies to the entire row, not individual columns—using DISTINCT on multiple columns returns unique combinations of those columns
- Performance implications vary significantly: DISTINCT requires sorting or hashing operations, making it expensive on large datasets without proper indexing
Understanding DISTINCT Fundamentals
SELECT DISTINCT filters duplicate rows from your result set. The operation examines all columns in your SELECT clause and returns only unique combinations.
-- Basic DISTINCT usage
SELECT DISTINCT country
FROM customers;
-- Returns each country exactly once
-- USA
-- Canada
-- Mexico
DISTINCT operates on the entire row. When you specify multiple columns, the uniqueness check applies to the combination of all columns:
-- DISTINCT on multiple columns
SELECT DISTINCT country, city
FROM customers;
-- Returns unique country-city pairs
-- USA, New York
-- USA, Los Angeles
-- Canada, Toronto
-- Canada, Toronto -- Won't appear twice
This differs fundamentally from applying DISTINCT to individual columns. You cannot write SELECT DISTINCT country, city and expect DISTINCT to apply only to country—it evaluates the entire row.
DISTINCT vs GROUP BY
GROUP BY achieves similar deduplication but serves different purposes. Use DISTINCT for simple deduplication; use GROUP BY when you need aggregations.
-- These queries return identical results
SELECT DISTINCT department_id
FROM employees;
SELECT department_id
FROM employees
GROUP BY department_id;
-- GROUP BY becomes necessary with aggregations
SELECT department_id, COUNT(*) as employee_count
FROM employees
GROUP BY department_id;
-- DISTINCT cannot perform aggregations
-- This is INVALID:
-- SELECT DISTINCT department_id, COUNT(*) FROM employees;
Performance characteristics differ. DISTINCT typically performs a single pass with hash-based deduplication, while GROUP BY may use sorting or hashing depending on the optimizer’s decision. For simple deduplication without aggregations, DISTINCT often performs better.
Handling NULL Values
DISTINCT treats all NULL values as equal. If multiple rows contain NULL in the DISTINCT column(s), only one NULL appears in results.
CREATE TABLE products (
id INT,
category VARCHAR(50),
subcategory VARCHAR(50)
);
INSERT INTO products VALUES
(1, 'Electronics', 'Phones'),
(2, 'Electronics', NULL),
(3, 'Clothing', NULL),
(4, 'Electronics', NULL);
SELECT DISTINCT category, subcategory
FROM products;
-- Results:
-- Electronics, Phones
-- Electronics, NULL -- Only one NULL combination
-- Clothing, NULL
This behavior proves useful when cleaning data with missing values, but be aware that DISTINCT collapses all NULL variations into a single row.
DISTINCT with Expressions and Functions
DISTINCT works with expressions, function results, and calculated columns. The uniqueness check applies to the computed values.
-- DISTINCT on expressions
SELECT DISTINCT UPPER(email_domain)
FROM users;
-- DISTINCT with calculations
SELECT DISTINCT ROUND(price, -1) as price_rounded
FROM products;
-- DISTINCT with CASE expressions
SELECT DISTINCT
CASE
WHEN age < 18 THEN 'Minor'
WHEN age < 65 THEN 'Adult'
ELSE 'Senior'
END as age_group
FROM customers;
The database evaluates the expression for each row, then applies DISTINCT to the results. This means the expression executes for every row before deduplication occurs—a performance consideration for expensive functions.
DISTINCT in Subqueries and CTEs
DISTINCT frequently appears in subqueries to ensure unique values for IN clauses or joins.
-- Subquery with DISTINCT
SELECT *
FROM orders
WHERE customer_id IN (
SELECT DISTINCT customer_id
FROM customers
WHERE country = 'USA'
);
-- CTE with DISTINCT
WITH unique_categories AS (
SELECT DISTINCT category, supplier_id
FROM products
WHERE active = true
)
SELECT c.category, COUNT(*) as supplier_count
FROM unique_categories c
GROUP BY c.category;
Using DISTINCT in subqueries can improve performance by reducing the number of rows passed to outer queries, but it adds deduplication overhead. Profile both approaches.
DISTINCT COUNT Patterns
Counting distinct values represents one of the most common DISTINCT use cases.
-- Count distinct values
SELECT COUNT(DISTINCT customer_id) as unique_customers
FROM orders
WHERE order_date >= '2024-01-01';
-- Multiple distinct counts
SELECT
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(DISTINCT product_id) as unique_products,
COUNT(*) as total_orders
FROM orders;
-- DISTINCT count with conditions
SELECT COUNT(DISTINCT CASE
WHEN order_status = 'completed'
THEN customer_id
END) as customers_with_completed_orders
FROM orders;
COUNT(DISTINCT column) performs differently than SELECT DISTINCT with COUNT. The former counts unique values; the latter requires grouping to achieve similar results.
Performance Considerations
DISTINCT operations require the database to identify and eliminate duplicates, which involves sorting or hashing—expensive operations on large datasets.
-- Inefficient: DISTINCT on large table without index
SELECT DISTINCT email
FROM users; -- Table scan + deduplication
-- Better: Index supports the operation
CREATE INDEX idx_users_email ON users(email);
SELECT DISTINCT email
FROM users; -- Index scan + deduplication
-- Best: Eliminate need for DISTINCT through schema design
-- Use a separate emails table with unique constraint
CREATE TABLE user_emails (
email VARCHAR(255) PRIMARY KEY
);
Execution plans reveal DISTINCT costs:
-- PostgreSQL
EXPLAIN ANALYZE
SELECT DISTINCT category FROM products;
-- MySQL
EXPLAIN FORMAT=JSON
SELECT DISTINCT category FROM products;
-- SQL Server
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT DISTINCT category FROM products;
Look for “HashAggregate”, “Sort”, or “Distinct” operations in the plan. High row counts entering these operations indicate performance problems.
Alternatives to DISTINCT
Sometimes EXISTS, window functions, or schema changes provide better solutions than DISTINCT.
-- Instead of DISTINCT in subquery
SELECT * FROM customers c
WHERE customer_id IN (
SELECT DISTINCT customer_id FROM orders
);
-- Use EXISTS (often faster)
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
);
-- Instead of DISTINCT for ranking
SELECT DISTINCT ON (category) category, product_name, price
FROM products
ORDER BY category, price DESC; -- PostgreSQL only
-- Use ROW_NUMBER (standard SQL)
WITH ranked AS (
SELECT category, product_name, price,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) as rn
FROM products
)
SELECT category, product_name, price
FROM ranked
WHERE rn = 1;
Common Pitfalls
DISTINCT doesn’t work as many developers expect when combined with other clauses:
-- DISTINCT with ORDER BY requires ordered columns in SELECT
SELECT DISTINCT country
FROM customers
ORDER BY city; -- ERROR: city not in SELECT
-- Fix: Include the column
SELECT DISTINCT country, city
FROM customers
ORDER BY city;
-- DISTINCT with LIMIT returns distinct rows first, then limits
SELECT DISTINCT category
FROM products
LIMIT 10; -- Gets 10 distinct categories
-- Not the same as:
SELECT category
FROM products
LIMIT 10; -- Gets first 10 rows, then DISTINCT would apply
DISTINCT applies before LIMIT, which matters for result sets and performance. The database must find all distinct values before limiting results.