SQL - SELECT DISTINCT with Examples | Application Architect

Key Insights

SELECT DISTINCT eliminates duplicate rows from query results by comparing all selected columns, with NULL values treated as equal to each other
DISTINCT applies to the entire row, not individual columns—using DISTINCT on multiple columns returns unique combinations of those columns
Performance implications vary significantly: DISTINCT requires sorting or hashing operations, making it expensive on large datasets without proper indexing

Understanding DISTINCT Fundamentals

SELECT DISTINCT filters duplicate rows from your result set. The operation examines all columns in your SELECT clause and returns only unique combinations.

-- Basic DISTINCT usage
SELECT DISTINCT country 
FROM customers;

-- Returns each country exactly once
-- USA
-- Canada
-- Mexico

DISTINCT operates on the entire row. When you specify multiple columns, the uniqueness check applies to the combination of all columns:

-- DISTINCT on multiple columns
SELECT DISTINCT country, city 
FROM customers;

-- Returns unique country-city pairs
-- USA, New York
-- USA, Los Angeles
-- Canada, Toronto
-- Canada, Toronto  -- Won't appear twice

This differs fundamentally from applying DISTINCT to individual columns. You cannot write SELECT DISTINCT country, city and expect DISTINCT to apply only to country—it evaluates the entire row.

DISTINCT vs GROUP BY

GROUP BY achieves similar deduplication but serves different purposes. Use DISTINCT for simple deduplication; use GROUP BY when you need aggregations.

-- These queries return identical results
SELECT DISTINCT department_id 
FROM employees;

SELECT department_id 
FROM employees 
GROUP BY department_id;

-- GROUP BY becomes necessary with aggregations
SELECT department_id, COUNT(*) as employee_count
FROM employees
GROUP BY department_id;

-- DISTINCT cannot perform aggregations
-- This is INVALID:
-- SELECT DISTINCT department_id, COUNT(*) FROM employees;

Performance characteristics differ. DISTINCT typically performs a single pass with hash-based deduplication, while GROUP BY may use sorting or hashing depending on the optimizer’s decision. For simple deduplication without aggregations, DISTINCT often performs better.

Handling NULL Values

DISTINCT treats all NULL values as equal. If multiple rows contain NULL in the DISTINCT column(s), only one NULL appears in results.

CREATE TABLE products (
    id INT,
    category VARCHAR(50),
    subcategory VARCHAR(50)
);

INSERT INTO products VALUES
(1, 'Electronics', 'Phones'),
(2, 'Electronics', NULL),
(3, 'Clothing', NULL),
(4, 'Electronics', NULL);

SELECT DISTINCT category, subcategory 
FROM products;

-- Results:
-- Electronics, Phones
-- Electronics, NULL    -- Only one NULL combination
-- Clothing, NULL

This behavior proves useful when cleaning data with missing values, but be aware that DISTINCT collapses all NULL variations into a single row.

DISTINCT with Expressions and Functions

DISTINCT works with expressions, function results, and calculated columns. The uniqueness check applies to the computed values.

-- DISTINCT on expressions
SELECT DISTINCT UPPER(email_domain) 
FROM users;

-- DISTINCT with calculations
SELECT DISTINCT ROUND(price, -1) as price_rounded
FROM products;

-- DISTINCT with CASE expressions
SELECT DISTINCT 
    CASE 
        WHEN age < 18 THEN 'Minor'
        WHEN age < 65 THEN 'Adult'
        ELSE 'Senior'
    END as age_group
FROM customers;

The database evaluates the expression for each row, then applies DISTINCT to the results. This means the expression executes for every row before deduplication occurs—a performance consideration for expensive functions.

DISTINCT in Subqueries and CTEs

DISTINCT frequently appears in subqueries to ensure unique values for IN clauses or joins.

-- Subquery with DISTINCT
SELECT * 
FROM orders 
WHERE customer_id IN (
    SELECT DISTINCT customer_id 
    FROM customers 
    WHERE country = 'USA'
);

-- CTE with DISTINCT
WITH unique_categories AS (
    SELECT DISTINCT category, supplier_id
    FROM products
    WHERE active = true
)
SELECT c.category, COUNT(*) as supplier_count
FROM unique_categories c
GROUP BY c.category;

Using DISTINCT in subqueries can improve performance by reducing the number of rows passed to outer queries, but it adds deduplication overhead. Profile both approaches.

DISTINCT COUNT Patterns

Counting distinct values represents one of the most common DISTINCT use cases.

-- Count distinct values
SELECT COUNT(DISTINCT customer_id) as unique_customers
FROM orders
WHERE order_date >= '2024-01-01';

-- Multiple distinct counts
SELECT 
    COUNT(DISTINCT customer_id) as unique_customers,
    COUNT(DISTINCT product_id) as unique_products,
    COUNT(*) as total_orders
FROM orders;

-- DISTINCT count with conditions
SELECT COUNT(DISTINCT CASE 
    WHEN order_status = 'completed' 
    THEN customer_id 
END) as customers_with_completed_orders
FROM orders;

COUNT(DISTINCT column) performs differently than SELECT DISTINCT with COUNT. The former counts unique values; the latter requires grouping to achieve similar results.

Performance Considerations

DISTINCT operations require the database to identify and eliminate duplicates, which involves sorting or hashing—expensive operations on large datasets.

-- Inefficient: DISTINCT on large table without index
SELECT DISTINCT email 
FROM users;  -- Table scan + deduplication

-- Better: Index supports the operation
CREATE INDEX idx_users_email ON users(email);
SELECT DISTINCT email 
FROM users;  -- Index scan + deduplication

-- Best: Eliminate need for DISTINCT through schema design
-- Use a separate emails table with unique constraint
CREATE TABLE user_emails (
    email VARCHAR(255) PRIMARY KEY
);

Execution plans reveal DISTINCT costs:

-- PostgreSQL
EXPLAIN ANALYZE
SELECT DISTINCT category FROM products;

-- MySQL
EXPLAIN FORMAT=JSON
SELECT DISTINCT category FROM products;

-- SQL Server
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT DISTINCT category FROM products;

Look for “HashAggregate”, “Sort”, or “Distinct” operations in the plan. High row counts entering these operations indicate performance problems.

Alternatives to DISTINCT

Sometimes EXISTS, window functions, or schema changes provide better solutions than DISTINCT.

-- Instead of DISTINCT in subquery
SELECT * FROM customers c
WHERE customer_id IN (
    SELECT DISTINCT customer_id FROM orders
);

-- Use EXISTS (often faster)
SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o 
    WHERE o.customer_id = c.customer_id
);

-- Instead of DISTINCT for ranking
SELECT DISTINCT ON (category) category, product_name, price
FROM products
ORDER BY category, price DESC;  -- PostgreSQL only

-- Use ROW_NUMBER (standard SQL)
WITH ranked AS (
    SELECT category, product_name, price,
           ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) as rn
    FROM products
)
SELECT category, product_name, price
FROM ranked
WHERE rn = 1;

Common Pitfalls

DISTINCT doesn’t work as many developers expect when combined with other clauses:

-- DISTINCT with ORDER BY requires ordered columns in SELECT
SELECT DISTINCT country 
FROM customers 
ORDER BY city;  -- ERROR: city not in SELECT

-- Fix: Include the column
SELECT DISTINCT country, city 
FROM customers 
ORDER BY city;

-- DISTINCT with LIMIT returns distinct rows first, then limits
SELECT DISTINCT category 
FROM products 
LIMIT 10;  -- Gets 10 distinct categories

-- Not the same as:
SELECT category 
FROM products 
LIMIT 10;  -- Gets first 10 rows, then DISTINCT would apply

DISTINCT applies before LIMIT, which matters for result sets and performance. The database must find all distinct values before limiting results.