How to Use Regular Expressions in PostgreSQL
Key Insights
- PostgreSQL offers four regex operators (`~`, `~*`, `!~`, `!~*`) plus powerful functions like `regexp_match()`, `regexp_replace()`, and `regexp_split_to_table()` that go far beyond basic LIKE patterns
- Use `regexp_match()` for single matches with capture groups and `regexp_matches()` with the ‘g’ flag when you need all occurrences from a string
- Regex queries can be slow on large tables—use functional indexes with the `pg_trgm` extension or consider full-text search for better performance on pattern-heavy workloads
Introduction to PostgreSQL Regular Expression Operators
PostgreSQL supports POSIX regular expressions, giving you far more flexibility than simple LIKE patterns. While LIKE is limited to % (any characters) and _ (single character), regex operators let you match complex patterns with precision.
The four basic operators are:
- `~` matches pattern (case-sensitive)
- `~*` matches pattern (case-insensitive)
- `!~` doesn’t match pattern (case-sensitive)
- `!~*` doesn’t match pattern (case-insensitive)
Here’s a direct comparison:
-- LIKE: limited pattern matching
SELECT * FROM users WHERE email LIKE '%@gmail.com';
-- Regex: more precise control
SELECT * FROM users WHERE email ~ '^[A-Za-z0-9._%+-]+@gmail\.com$';
-- Case-insensitive match
SELECT * FROM products WHERE name ~* '^(premium|deluxe)';
-- Negative match: find non-standard phone formats
SELECT * FROM contacts WHERE phone !~ '^\d{3}-\d{3}-\d{4}$';
The regex approach validates email structure properly, while LIKE would match invalid entries like “john doe@gmail.com”—the embedded space passes the `%` wildcard but fails the regex character class.
Basic Pattern Matching with POSIX Regular Expressions
PostgreSQL uses POSIX Extended Regular Expressions (ERE). The fundamental building blocks include:
- Character classes: `[A-Z]`, `[0-9]`, `[a-z]`
- Quantifiers: `*` (0+), `+` (1+), `?` (0 or 1), `{n,m}` (between n and m)
- Anchors: `^` (start), `$` (end)
- Wildcards: `.` (any character)
Let’s validate email addresses:
SELECT
email,
email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' AS is_valid
FROM users;
For phone number pattern matching:
-- Find US phone numbers in various formats
SELECT * FROM contacts
WHERE phone ~ '^\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$';
-- Extract records with product codes (e.g., "PRD-12345")
SELECT * FROM inventory
WHERE sku ~ '^[A-Z]{3}-[0-9]{5}$';
-- Find records with URLs
SELECT * FROM posts
WHERE content ~ 'https?://[^\s]+';
The phone pattern handles formats like “555-123-4567”, “(555) 123-4567”, and “5551234567”.
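To confirm which formats pass, you can run the pattern against literal sample values (the table-free `VALUES` list here is just for illustration):

```sql
-- Quick sanity check of the phone pattern against sample inputs
SELECT fmt,
       fmt ~ '^\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$' AS matches
FROM (VALUES ('555-123-4567'),
             ('(555) 123-4567'),
             ('5551234567'),
             ('555-12-34567')) AS t(fmt);
-- the first three match; the malformed last value does not
```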
Advanced Pattern Matching with regexp_match() and regexp_matches()
The regexp_match() function returns the first match as a text array, while regexp_matches() can return all matches when you use the ‘g’ (global) flag. Both support capture groups with parentheses.
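Before applying these to tables, it helps to see the return shapes with literal values:

```sql
-- regexp_match returns the first match as a text array (or NULL if no match)
SELECT regexp_match('order 42, order 7', '\d+');          -- {42}

-- regexp_matches with 'g' returns one row per match
SELECT regexp_matches('order 42, order 7', '\d+', 'g');   -- {42}, then {7}
```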
Extract domain names from emails:
SELECT
email,
(regexp_match(email, '@([A-Za-z0-9.-]+\.[A-Za-z]{2,})'))[1] AS domain
FROM users;
The [1] index accesses the first capture group (the parenthesized part).
Parse structured data from text fields:
-- Extract area code, exchange, and number from phone strings
SELECT
phone,
(regexp_match(phone, '(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})'))[1] AS area_code,
(regexp_match(phone, '(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})'))[2] AS exchange,
(regexp_match(phone, '(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})'))[3] AS number
FROM contacts;
Using regexp_matches() to find all occurrences:
-- Find all hashtags in social media posts
SELECT
post_id,
regexp_matches(content, '#([A-Za-z0-9_]+)', 'g') AS hashtags
FROM posts;
-- Extract all email addresses from a text field
SELECT
id,
regexp_matches(notes, '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', 'g') AS emails
FROM customer_notes;
The ‘g’ flag makes regexp_matches() return one row per match, useful for unnesting multiple occurrences.
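Because each match comes back as a text array, you can subscript the result to pull the captured group out as plain text—a sketch built on the same hashtag pattern:

```sql
-- One row per hashtag, extracting the captured group as plain text
SELECT
  post_id,
  (regexp_matches(content, '#([A-Za-z0-9_]+)', 'g'))[1] AS hashtag
FROM posts;
```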
String Replacement with regexp_replace()
The regexp_replace() function finds and replaces text using patterns. It supports backreferences (\1, \2, etc.) to reuse captured groups.
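A minimal literal example shows how backreferences reorder captured groups:

```sql
-- Reorder captured groups: ISO date to DD/MM/YYYY
SELECT regexp_replace('2024-01-15', '(\d{4})-(\d{2})-(\d{2})', '\3/\2/\1');
-- → 15/01/2024
```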
Format phone numbers consistently:
-- Convert various formats to (555) 123-4567
UPDATE contacts
SET phone = regexp_replace(
phone,
'^(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})$',
'(\1) \2-\3'
)
WHERE phone ~ '^\d{3}[^\d]*\d{3}[^\d]*\d{4}$';
Redact sensitive information:
-- Mask credit card numbers, showing only last 4 digits
SELECT
regexp_replace(
payment_info,
'\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?(\d{4})',
'****-****-****-\1'
) AS masked_card
FROM transactions;
-- Redact SSNs
UPDATE records
SET ssn = regexp_replace(ssn, '\d{3}-\d{2}-(\d{4})', 'XXX-XX-\1')
WHERE ssn IS NOT NULL;
Clean and normalize data:
-- Remove multiple spaces
UPDATE products
SET description = regexp_replace(description, '\s+', ' ', 'g');
-- Strip HTML tags
SELECT regexp_replace(content, '<[^>]+>', '', 'g') AS plain_text
FROM articles;
-- Normalize whitespace and punctuation
UPDATE comments
SET text = regexp_replace(
regexp_replace(text, '\s+', ' ', 'g'),
'[.!?]+',
'. ',
'g'
);
Splitting and Extracting with regexp_split_to_table() and regexp_split_to_array()
These functions break strings into pieces based on regex delimiters. Use regexp_split_to_array() for array results and regexp_split_to_table() to create rows.
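A quick literal example makes the difference between the two return shapes concrete:

```sql
-- Array result: whitespace around commas is absorbed by the delimiter pattern
SELECT regexp_split_to_array('red,  green ,blue', '\s*,\s*');
-- → {red,green,blue}

-- Table result: one row per piece
SELECT regexp_split_to_table('red,  green ,blue', '\s*,\s*');
```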
Split comma-separated values with variable whitespace:
-- Split tags into an array
SELECT
id,
regexp_split_to_array(tags, '\s*,\s*') AS tag_array
FROM articles;
-- Create one row per tag
SELECT
id,
regexp_split_to_table(tags, '\s*,\s*') AS tag
FROM articles;
Parse log entries:
-- Split Apache log format into components
SELECT
regexp_split_to_array(
log_line,
'\s+'
) AS log_parts
FROM server_logs;
-- Extract specific fields from structured logs
WITH parsed AS (
SELECT
id,
regexp_split_to_array(log_entry, '\s*\|\s*') AS fields
FROM application_logs
)
SELECT
id,
fields[1] AS timestamp,
fields[2] AS level,
fields[3] AS message
FROM parsed;
Convert delimited strings to normalized tables:
-- Normalize a comma-separated skills column
INSERT INTO user_skills (user_id, skill)
SELECT
u.id,
regexp_split_to_table(u.skills, '\s*,\s*')
FROM users u
WHERE u.skills IS NOT NULL;
Performance Considerations and Best Practices
Regex queries can be expensive. PostgreSQL must check each row’s value against the pattern, which doesn’t benefit from standard B-tree indexes.
Create a functional index for common patterns:
-- Enable pg_trgm extension for trigram indexes
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Create a GIN index for pattern matching
CREATE INDEX idx_users_email_trgm ON users USING gin (email gin_trgm_ops);
-- This can now use the index for some regex queries
SELECT * FROM users WHERE email ~ '@gmail\.com$';
Functional indexes for specific patterns:
-- Index on extracted domain
CREATE INDEX idx_email_domain ON users (
(regexp_match(email, '@([A-Za-z0-9.-]+)'))[1]
);
Performance comparison:
-- Slowest: Complex regex on large table
EXPLAIN ANALYZE
SELECT * FROM logs WHERE message ~ '^ERROR.*database.*timeout';
-- Faster: Use LIKE when possible
EXPLAIN ANALYZE
SELECT * FROM logs WHERE message LIKE 'ERROR%' AND message LIKE '%database%';
-- Fastest for text search: Use full-text search
CREATE INDEX idx_logs_message_fts ON logs USING gin (to_tsvector('english', message));
EXPLAIN ANALYZE
SELECT * FROM logs WHERE to_tsvector('english', message) @@ to_tsquery('error & database & timeout');
Best practices:
- Use simpler operators when possible: `LIKE` or `=` are faster than regex
- Anchor patterns with `^` and `$` so non-matching strings can be rejected early
- Compile frequently-used patterns into stored functions
- Consider full-text search for complex text queries
- Test regex patterns on small datasets first—a slow pattern inside an UPDATE can hold row locks far longer than expected
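One way to “compile” a frequently-used pattern is to wrap it in a stored SQL function, so every caller shares the same rule (the function name here is hypothetical):

```sql
-- Hypothetical helper: centralizes the email pattern in one place;
-- IMMUTABLE also allows it to appear in index expressions
CREATE OR REPLACE FUNCTION is_valid_email(addr text)
RETURNS boolean
LANGUAGE sql IMMUTABLE
AS $$
  SELECT addr ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
$$;

SELECT * FROM users WHERE is_valid_email(email);
```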
Real-World Use Cases
Here’s a complete data validation pipeline:
-- Validate and clean user input in one pass
WITH validated AS (
SELECT
id,
email,
phone,
CASE
WHEN email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
THEN email
ELSE NULL
END AS clean_email,
CASE
WHEN phone ~ '^\d{3}[^\d]*\d{3}[^\d]*\d{4}$'
THEN regexp_replace(phone, '^(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})$', '\1-\2-\3')
ELSE NULL
END AS clean_phone
FROM user_imports
)
INSERT INTO users (email, phone)
SELECT clean_email, clean_phone
FROM validated
WHERE clean_email IS NOT NULL AND clean_phone IS NOT NULL;
ETL process for cleaning imported data:
-- Multi-step data cleaning
UPDATE imported_products
SET
-- Normalize SKUs
sku = UPPER(regexp_replace(sku, '[^A-Z0-9-]', '', 'g')),
-- Clean prices (remove currency symbols and commas)
price = regexp_replace(price, '[^0-9.]', '', 'g')::numeric,
-- Standardize descriptions
description = regexp_replace(
regexp_replace(description, '<[^>]+>', '', 'g'),
'\s+', ' ', 'g'
),
-- Extract and normalize URLs
product_url = (regexp_match(raw_data, 'https?://[^\s]+'))[1]
WHERE imported_at > NOW() - INTERVAL '1 day';
Regular expressions in PostgreSQL are powerful tools for data validation, cleaning, and extraction. Master these operators and functions, understand their performance characteristics, and you’ll handle complex text processing directly in your database with confidence.