SQL - MERGE / UPSERT Statement
Key Insights
- MERGE statements combine INSERT, UPDATE, and DELETE operations into a single atomic statement, eliminating the race conditions that occur when checking existence before inserting or updating
- Modern databases offer varying MERGE syntax—SQL Server uses MERGE, PostgreSQL uses INSERT…ON CONFLICT, MySQL uses INSERT…ON DUPLICATE KEY UPDATE or REPLACE
- Performance characteristics differ significantly between implementations; understanding execution plans and proper indexing is critical for production workloads handling millions of rows
Understanding MERGE Operations
MERGE statements solve a common data synchronization problem: you need to insert a row if it doesn’t exist, or update it if it does. The naive approach—checking existence with SELECT, then branching to INSERT or UPDATE—creates a race condition window where concurrent operations can cause duplicate key violations or lost updates.
A MERGE operation atomically matches source data against a target table based on specified conditions. For matched rows, it performs updates. For unmatched rows, it performs inserts. Some implementations support deleting unmatched target rows.
-- SQL Server MERGE syntax
MERGE INTO target_table AS target
USING source_table AS source
ON target.id = source.id
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.updated_at = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
INSERT (id, name, created_at, updated_at)
VALUES (source.id, source.name, GETDATE(), GETDATE())
WHEN NOT MATCHED BY SOURCE THEN
DELETE;
This single statement handles all three scenarios atomically. Atomicity alone does not prevent every race, however: concurrent SQL Server MERGE statements can still conflict under default locking, so the HOLDLOCK hint is commonly recommended for serialized upserts.
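The insert-or-update behavior can be exercised end to end with Python's stdlib sqlite3 module, a minimal stand-in since SQLite (3.24+) supports the same single-statement upsert idea via INSERT...ON CONFLICT; the table and column names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO target VALUES (1, 'old')")

def upsert(conn, row_id, name):
    # One atomic statement: insert, or update on primary-key collision.
    # No SELECT-then-branch window for a concurrent writer to exploit.
    conn.execute(
        "INSERT INTO target (id, name) VALUES (?, ?) "
        "ON CONFLICT (id) DO UPDATE SET name = excluded.name",
        (row_id, name),
    )

upsert(conn, 1, "updated")   # existing row -> update path
upsert(conn, 2, "inserted")  # new row -> insert path
print(conn.execute("SELECT id, name FROM target ORDER BY id").fetchall())
# -> [(1, 'updated'), (2, 'inserted')]
```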
SQL Server MERGE Implementation
SQL Server’s MERGE statement is the most feature-complete implementation, supporting complex matching conditions and multiple WHEN clauses.
-- Product inventory synchronization
MERGE INTO inventory AS inv
USING (
SELECT product_id, quantity, warehouse_id, last_updated
FROM staging_inventory
WHERE import_batch_id = @BatchId
) AS staging
ON inv.product_id = staging.product_id
AND inv.warehouse_id = staging.warehouse_id
WHEN MATCHED AND staging.last_updated > inv.last_updated THEN
UPDATE SET
inv.quantity = staging.quantity,
inv.last_updated = staging.last_updated,
inv.modified_by = SYSTEM_USER
WHEN NOT MATCHED BY TARGET THEN
INSERT (product_id, warehouse_id, quantity, last_updated, created_by)
VALUES (staging.product_id, staging.warehouse_id, staging.quantity,
staging.last_updated, SYSTEM_USER)
WHEN NOT MATCHED BY SOURCE AND inv.warehouse_id = @WarehouseId THEN
DELETE
OUTPUT $action, inserted.*, deleted.*;
The OUTPUT clause captures what happened during the MERGE, which is crucial for auditing and debugging. The $action pseudo-column reports whether each affected row was handled by an INSERT, UPDATE, or DELETE.
Key considerations for SQL Server MERGE:
- Always terminate with a semicolon (required for MERGE statements)
- Index the join columns in both source and target tables
- Use additional predicates in WHEN clauses to implement business logic
- Be aware of the Halloween Problem when updating the same table used in the source
PostgreSQL INSERT…ON CONFLICT
PostgreSQL took a different approach with INSERT...ON CONFLICT (available since 9.5), focusing on the most common use case: insert or update based on unique constraint violations.
-- User profile upsert
INSERT INTO user_profiles (user_id, display_name, bio, avatar_url, updated_at)
VALUES ($1, $2, $3, $4, NOW())
ON CONFLICT (user_id)
DO UPDATE SET
display_name = EXCLUDED.display_name,
bio = EXCLUDED.bio,
avatar_url = EXCLUDED.avatar_url,
updated_at = NOW()
WHERE user_profiles.updated_at < EXCLUDED.updated_at;
The EXCLUDED keyword references the row that would have been inserted. The optional WHERE clause on DO UPDATE prevents updates when conditions aren’t met—in this case, preventing overwrites with stale data.
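SQLite shares this EXCLUDED syntax, so the stale-data guard can be sketched with stdlib sqlite3 (schema and values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_profiles ("
    "user_id INTEGER PRIMARY KEY, display_name TEXT, updated_at INTEGER)"
)
conn.execute("INSERT INTO user_profiles VALUES (1, 'current', 100)")

UPSERT = (
    "INSERT INTO user_profiles (user_id, display_name, updated_at) "
    "VALUES (?, ?, ?) "
    "ON CONFLICT (user_id) DO UPDATE SET "
    "display_name = excluded.display_name, updated_at = excluded.updated_at "
    # The WHERE guard rejects updates carrying an older timestamp.
    "WHERE user_profiles.updated_at < excluded.updated_at"
)

conn.execute(UPSERT, (1, "stale", 50))   # older timestamp: guard blocks update
conn.execute(UPSERT, (1, "fresh", 200))  # newer timestamp: update applies
print(conn.execute("SELECT display_name, updated_at FROM user_profiles").fetchone())
# -> ('fresh', 200)
```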
For composite keys and partial indexes:
-- Event logging with deduplication
INSERT INTO event_log (event_type, user_id, resource_id, event_data, occurred_at)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (event_type, user_id, resource_id, occurred_at)
WHERE event_type IN ('login', 'logout')
DO NOTHING;
The conflict target can specify a partial unique index, allowing sophisticated deduplication logic.
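DO NOTHING deduplication is easy to verify with sqlite3; this sketch uses a full unique constraint rather than a partial index, and the event names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event_log ("
    "event_type TEXT, user_id INTEGER, occurred_at INTEGER, "
    "UNIQUE (event_type, user_id, occurred_at))"
)

def log_event(conn, etype, uid, ts):
    # Duplicate (type, user, timestamp) tuples are silently dropped
    # instead of raising a unique-constraint error.
    conn.execute(
        "INSERT INTO event_log VALUES (?, ?, ?) "
        "ON CONFLICT (event_type, user_id, occurred_at) DO NOTHING",
        (etype, uid, ts),
    )

log_event(conn, "login", 1, 1000)
log_event(conn, "login", 1, 1000)  # duplicate delivery, deduplicated
log_event(conn, "logout", 1, 2000)
print(conn.execute("SELECT COUNT(*) FROM event_log").fetchone()[0])  # -> 2
```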
PostgreSQL MERGE (15+)
PostgreSQL 15 introduced full MERGE support for compatibility with the SQL standard:
MERGE INTO customer_balances AS cb
USING daily_transactions AS dt
ON cb.customer_id = dt.customer_id
WHEN MATCHED THEN
UPDATE SET
balance = cb.balance + dt.amount,
last_transaction = dt.transaction_date
WHEN NOT MATCHED THEN
INSERT (customer_id, balance, last_transaction)
VALUES (dt.customer_id, dt.amount, dt.transaction_date);
For most PostgreSQL use cases, INSERT…ON CONFLICT remains simpler and more performant for single-table upserts.
MySQL Approaches
MySQL offers multiple upsert mechanisms, each with different semantics.
INSERT…ON DUPLICATE KEY UPDATE:
INSERT INTO page_views (page_id, view_date, view_count, unique_visitors)
VALUES (?, CURDATE(), 1, 1)
ON DUPLICATE KEY UPDATE
view_count = view_count + VALUES(view_count),
unique_visitors = unique_visitors + VALUES(unique_visitors);
This requires a UNIQUE index or PRIMARY KEY on the conflict columns. The VALUES() function references the value from the INSERT clause; note that VALUES() in this context is deprecated as of MySQL 8.0.20 in favor of row aliases (INSERT ... VALUES (...) AS new ... new.view_count).
REPLACE statement:
REPLACE INTO user_sessions (session_id, user_id, last_activity, session_data)
VALUES (?, ?, NOW(), ?);
REPLACE deletes the existing row and inserts a new one. This has important implications:
- Auto-increment IDs change
- Triggers fire for DELETE then INSERT
- Foreign key constraints may cascade
- Performance impact from deletion
Use REPLACE only when you truly want delete-then-insert semantics.
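SQLite's REPLACE has the same delete-then-insert semantics, which makes the data-loss hazard easy to demonstrate from Python (the session schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_sessions ("
    "session_id TEXT PRIMARY KEY, user_id INTEGER, "
    "login_count INTEGER DEFAULT 1)"
)
conn.execute("INSERT INTO user_sessions VALUES ('abc', 1, 5)")

# REPLACE deletes the old row and inserts a new one: any column not
# supplied reverts to its column default instead of keeping its value.
conn.execute("REPLACE INTO user_sessions (session_id, user_id) VALUES ('abc', 1)")
print(conn.execute(
    "SELECT login_count FROM user_sessions WHERE session_id = 'abc'"
).fetchone()[0])
# -> 1 (the old value 5 was discarded along with the deleted row)
```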
Performance Considerations
Benchmark results vary dramatically based on workload characteristics:
-- Create test table with proper indexing (PostgreSQL)
CREATE TABLE metrics (
metric_id INT PRIMARY KEY,
metric_value DECIMAL(10,2),
updated_at TIMESTAMP
);
CREATE INDEX idx_updated ON metrics (updated_at);
-- Bulk upsert pattern (PostgreSQL)
INSERT INTO metrics (metric_id, metric_value, updated_at)
SELECT id, value, NOW()
FROM staging_metrics
ON CONFLICT (metric_id)
DO UPDATE SET
metric_value = EXCLUDED.metric_value,
updated_at = EXCLUDED.updated_at;
For bulk operations:
- Batch upserts in transactions (1000-10000 rows per batch)
- Drop secondary indexes before bulk-loading an empty table, then rebuild them afterward
- Use UNLOGGED tables (PostgreSQL) for staging data
- Consider partitioning for tables exceeding 100M rows
- Monitor execution plans—optimizer may choose full table scans over index seeks
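The batching pattern from the first bullet can be sketched with sqlite3 as a stand-in; the batch size of 10 is only for demonstration, where production batches would be in the thousands:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (metric_id INTEGER PRIMARY KEY, metric_value REAL)")

rows = [(i, float(i)) for i in range(25)]
BATCH = 10  # illustrative; production batches are typically 1,000-10,000 rows

for start in range(0, len(rows), BATCH):
    batch = rows[start:start + BATCH]
    with conn:  # one transaction per batch limits lock time and log volume
        conn.executemany(
            "INSERT INTO metrics (metric_id, metric_value) VALUES (?, ?) "
            "ON CONFLICT (metric_id) DO UPDATE SET "
            "metric_value = excluded.metric_value",
            batch,
        )

print(conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0])  # -> 25
```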
Handling Concurrency
MERGE operations use table-level or row-level locks depending on isolation level and database implementation:
-- SQL Server: explicit locking hints
MERGE INTO inventory WITH (HOLDLOCK) AS target
USING source_data AS source
ON target.product_id = source.product_id
WHEN MATCHED THEN UPDATE SET quantity = source.quantity
WHEN NOT MATCHED THEN INSERT (product_id, quantity)
VALUES (source.product_id, source.quantity);
For high-concurrency scenarios:
- Use READ COMMITTED isolation for most workloads
- Implement retry logic for deadlock victims
- Keep transactions short
- Consider queue-based architectures for write-heavy workloads
- Use optimistic locking with version columns when appropriate
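The last bullet, optimistic locking with a version column, can be sketched as follows (sqlite3 stand-in, illustrative schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL, version INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0, 1)")

def update_balance(conn, acct_id, new_balance, expected_version):
    """Apply the write only if nobody changed the row since we read it."""
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, acct_id, expected_version),
    )
    # rowcount 0 means a concurrent writer won: re-read and retry.
    return cur.rowcount == 1

print(update_balance(conn, 1, 150.0, 1))  # -> True (version matched, now 2)
print(update_balance(conn, 1, 175.0, 1))  # -> False (stale version)
```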
Common Pitfalls
Updating the matching column:
-- WRONG: Creates logical inconsistency
MERGE INTO products AS p
USING updates AS u
ON p.product_id = u.old_product_id
WHEN MATCHED THEN
UPDATE SET product_id = u.new_product_id; -- Changes join key
Never update columns used in the ON clause—this violates MERGE semantics.
Missing semicolon in SQL Server:
-- WRONG: Syntax error
UPDATE settings SET value = 1 WHERE id = 1
MERGE INTO target... -- Error: MERGE requires preceding statement to end with semicolon
Always terminate the previous statement with a semicolon before MERGE.
Assuming MERGE is always faster:
For single-row operations, a simple UPDATE followed by a conditional INSERT often outperforms MERGE thanks to simpler execution plans and lower overhead. Profile your specific workload.
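A minimal sketch of that UPDATE-then-INSERT pattern, again using sqlite3 as a stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (id INTEGER PRIMARY KEY, value INTEGER)")

def set_value(conn, key, value):
    # Try the (usually more common) UPDATE first; fall back to INSERT.
    cur = conn.execute("UPDATE settings SET value = ? WHERE id = ?", (value, key))
    if cur.rowcount == 0:
        # Under concurrency this window can still race; here the PRIMARY
        # KEY makes a duplicate insert fail loudly rather than silently,
        # so callers would wrap this in retry logic.
        conn.execute("INSERT INTO settings (id, value) VALUES (?, ?)", (key, value))

set_value(conn, 1, 10)  # no row yet -> INSERT path
set_value(conn, 1, 20)  # row exists -> UPDATE path
print(conn.execute("SELECT value FROM settings WHERE id = 1").fetchone()[0])  # -> 20
```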