SQL: UNION vs UNION ALL

Key Insights

UNION removes duplicates by performing an implicit DISTINCT operation, while UNION ALL preserves all rows including duplicates, making it significantly faster for large datasets
Both operators require queries to have the same number of columns with compatible data types, and ORDER BY clauses must appear after the final query in the set
Default to UNION ALL unless you specifically need deduplication—unnecessary duplicate removal can create performance bottlenecks that scale poorly with data volume

Introduction

Set operations are fundamental to SQL, allowing you to combine results from multiple queries into a single result set. Whether you’re merging customer records from different regional databases, consolidating log entries, or building complex reporting queries, you’ll frequently need to stack query results vertically. SQL provides two primary operators for this task: UNION and UNION ALL.

The difference between these operators is deceptively simple—one removes duplicates, the other doesn’t—but the performance and practical implications run deep. Choose the wrong operator and you might introduce unnecessary computational overhead that tanks your query performance. Choose correctly and you’ll write efficient, maintainable SQL that scales with your data.

UNION: Combining with Deduplication

UNION merges the results of two or more SELECT statements while automatically removing duplicate rows. It performs an implicit DISTINCT operation across the entire combined result set, comparing every column to identify and eliminate duplicates.

Here’s a straightforward example using employee data:

-- First query: employees from the engineering department
SELECT employee_id, name, email
FROM employees
WHERE department = 'Engineering'

UNION

-- Second query: employees working on Project X
SELECT employee_id, name, email
FROM project_assignments
WHERE project_name = 'Project X';

If Alice works in Engineering and is assigned to Project X, she appears in both individual query results. UNION ensures she appears only once in the final output. The database engine sorts and compares all rows to identify and remove this duplication.

This deduplication comes at a computational cost. The database must:

Execute both queries and store their results
Sort the combined result set (or use a hash table)
Compare rows to identify duplicates
Remove duplicate entries
Return the final deduplicated set

For small result sets, this overhead is negligible. For queries returning thousands or millions of rows, the performance impact becomes substantial.

UNION ALL: Combining Without Deduplication

UNION ALL takes a simpler approach: it combines all rows from multiple queries without any duplicate checking or removal. Every row from every query appears in the final result set, even if identical rows exist.

Using the same employee scenario:

-- First query: employees from the engineering department
SELECT employee_id, name, email
FROM employees
WHERE department = 'Engineering'

UNION ALL

-- Second query: employees working on Project X
SELECT employee_id, name, email
FROM project_assignments
WHERE project_name = 'Project X';

Now if Alice appears in both queries, she’ll appear twice in the result set. The database simply executes both queries and concatenates the results—no sorting, no comparison, no deduplication overhead.

This makes UNION ALL significantly faster, especially with large datasets. The database only needs to:

Execute both queries
Append the second result set to the first
Return the combined set

The performance difference scales linearly with your data volume. With millions of rows, UNION ALL can be orders of magnitude faster than UNION.

Performance Comparison

Let’s examine the actual execution plans in SQLite to understand the performance difference. First, create some test data:

CREATE TABLE orders_2023 (
    order_id INTEGER,
    customer_id INTEGER,
    amount DECIMAL(10,2)
);

CREATE TABLE orders_2024 (
    order_id INTEGER,
    customer_id INTEGER,
    amount DECIMAL(10,2)
);

-- Insert sample data
INSERT INTO orders_2023 VALUES (1, 100, 250.00), (2, 101, 175.50);
INSERT INTO orders_2024 VALUES (3, 100, 300.00), (4, 102, 425.75);

Now examine the query plans:

-- UNION execution plan
EXPLAIN QUERY PLAN
SELECT customer_id, amount FROM orders_2023
UNION
SELECT customer_id, amount FROM orders_2024;

-- Output shows:
-- SCAN orders_2023
-- SCAN orders_2024
-- USE TEMP B-TREE FOR ORDER BY

-- UNION ALL execution plan
EXPLAIN QUERY PLAN
SELECT customer_id, amount FROM orders_2023
UNION ALL
SELECT customer_id, amount FROM orders_2024;

-- Output shows:
-- SCAN orders_2023
-- SCAN orders_2024

The UNION query creates a temporary B-tree structure for sorting and deduplication. This requires additional memory allocation, disk I/O (if the temp structure spills to disk), and CPU cycles for comparison operations. UNION ALL skips all of this, simply scanning both tables and returning the combined results.

In production systems with millions of rows, this difference translates to queries that take seconds versus minutes, or minutes versus hours.

Rules and Requirements

Both UNION and UNION ALL have strict requirements that must be satisfied:

Column Count Must Match: Every SELECT statement must return the same number of columns.

-- WRONG: Column count mismatch
SELECT customer_id, name FROM customers
UNION
SELECT order_id FROM orders;  -- Error: different column count

-- CORRECT: Matching column counts
SELECT customer_id, name FROM customers
UNION
SELECT customer_id, contact_name FROM suppliers;

Data Types Must Be Compatible: Corresponding columns must have compatible data types. SQLite is more forgiving than other databases, performing implicit conversions, but relying on this is poor practice.

-- Problematic: Mixing incompatible types
SELECT customer_id, name FROM customers  -- customer_id is INTEGER
UNION
SELECT account_code, company FROM vendors;  -- account_code is TEXT

-- Better: Explicit casting
SELECT CAST(customer_id AS TEXT), name FROM customers
UNION
SELECT account_code, company FROM vendors;

ORDER BY Placement: You can only use ORDER BY after the final query in the set. It applies to the entire combined result.

-- WRONG: ORDER BY in individual queries
SELECT name FROM customers ORDER BY name
UNION
SELECT name FROM suppliers ORDER BY name;

-- CORRECT: ORDER BY after the union
SELECT name FROM customers
UNION
SELECT name FROM suppliers
ORDER BY name;

Column Names: The final result set uses column names from the first SELECT statement. Subsequent queries’ column names are ignored.

SELECT customer_id AS id, customer_name AS name FROM customers
UNION ALL
SELECT supplier_id, supplier_company FROM suppliers;
-- Result columns are named: id, name

Practical Use Cases

Understanding when to use each operator is critical for writing efficient queries.

Use UNION when data integrity requires deduplication:

-- Merging customer data from regional databases
-- Must eliminate duplicates for customers in multiple regions
SELECT customer_id, email, phone, 'US' as region
FROM us_customers
UNION
SELECT customer_id, email, phone, 'EU' as region
FROM eu_customers
UNION
SELECT customer_id, email, phone, 'ASIA' as region
FROM asia_customers;

If a customer has accounts in multiple regions, UNION ensures they appear once in your consolidated view. The deduplication overhead is justified because duplicate customer records would corrupt your analysis or reporting.

Use UNION ALL when sources are known to be distinct or duplicates don’t matter:

-- Combining application logs from different services
-- Each log entry is unique by definition
SELECT timestamp, service_name, log_level, message
FROM auth_service_logs
WHERE timestamp >= datetime('now', '-1 day')
UNION ALL
SELECT timestamp, service_name, log_level, message
FROM api_service_logs
WHERE timestamp >= datetime('now', '-1 day')
UNION ALL
SELECT timestamp, service_name, log_level, message
FROM worker_service_logs
WHERE timestamp >= datetime('now', '-1 day')
ORDER BY timestamp DESC;

Log entries from different services are inherently distinct. Even if identical messages occur, they represent separate events that should both appear in your consolidated log view. UNION ALL provides the performance you need when processing high-volume log data.

Another common UNION ALL scenario is partitioned tables:

-- Querying across time-partitioned order tables
SELECT order_id, order_date, total
FROM orders_q1_2024
WHERE total > 1000
UNION ALL
SELECT order_id, order_date, total
FROM orders_q2_2024
WHERE total > 1000
UNION ALL
SELECT order_id, order_date, total
FROM orders_q3_2024
WHERE total > 1000;

Since each partition contains distinct time periods, duplicates are impossible. UNION ALL gives you the fastest possible query across partitions.

Best Practices and Recommendations

Follow this decision framework when choosing between UNION and UNION ALL:

Default to UNION ALL. It’s faster and more explicit about your intentions. Only use UNION when you have a specific requirement for deduplication.

If you need UNION, question whether your data model is correct. Frequent need for UNION often indicates denormalized data or missing constraints. Consider whether better schema design could eliminate duplicate sources at the data layer.

Profile before optimizing. If your UNION query performs adequately and the deduplication is semantically necessary, don’t prematurely optimize to UNION ALL. Correctness trumps performance.

Document your choice. When using UNION ALL on sources that might contain duplicates, add a comment explaining why duplicates are acceptable. When using UNION, note why deduplication is required. Future maintainers will thank you.

Consider application-level deduplication for complex cases. If you need sophisticated duplicate detection (fuzzy matching, complex business rules), doing it in application code may be clearer and more maintainable than forcing it into SQL.

The UNION vs UNION ALL decision seems minor, but it has outsized impact on query performance and data correctness. Choose deliberately, understand the tradeoffs, and your SQL will be both fast and correct.