How to Use Regular Expressions in PostgreSQL
Key Insights
- PostgreSQL offers four regex operators (`~`, `~*`, `!~`, `!~*`) plus powerful functions like `regexp_match()`, `regexp_replace()`, and `regexp_split_to_table()` that go far beyond basic LIKE patterns
- Use `regexp_match()` for single matches with capture groups and `regexp_matches()` with the ‘g’ flag when you need all occurrences from a string
- Regex queries can be slow on large tables—use functional indexes with the `pg_trgm` extension or consider full-text search for better performance on pattern-heavy workloads
Introduction to PostgreSQL Regular Expression Operators
PostgreSQL supports POSIX regular expressions, giving you far more flexibility than simple LIKE patterns. While LIKE is limited to % (any characters) and _ (single character), regex operators let you match complex patterns with precision.
The four basic operators are:
- `~` matches pattern (case-sensitive)
- `~*` matches pattern (case-insensitive)
- `!~` doesn’t match pattern (case-sensitive)
- `!~*` doesn’t match pattern (case-insensitive)
Here’s a direct comparison:
-- LIKE: limited pattern matching
SELECT * FROM users WHERE email LIKE '%@gmail.com';
-- Regex: more precise control
SELECT * FROM users WHERE email ~ '^[A-Za-z0-9._%+-]+@gmail\.com$';
-- Case-insensitive match
SELECT * FROM products WHERE name ~* '^(premium|deluxe)';
-- Negative match: find non-standard phone formats
SELECT * FROM contacts WHERE phone !~ '^\d{3}-\d{3}-\d{4}$';
The regex approach validates email structure properly, while LIKE would match invalid entries like “john doe@gmail.com”—the embedded space passes the `%` wildcard but fails the regex character class.
Basic Pattern Matching with POSIX Regular Expressions
PostgreSQL uses POSIX Extended Regular Expressions (ERE). The fundamental building blocks include:
- Character classes: `[A-Z]`, `[0-9]`, `[a-z]`
- Quantifiers: `*` (0+), `+` (1+), `?` (0 or 1), `{n,m}` (between n and m)
- Anchors: `^` (start), `$` (end)
- Wildcards: `.` (any character)
Let’s validate email addresses:
SELECT
email,
email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$' AS is_valid
FROM users;
For phone number pattern matching:
-- Find US phone numbers in various formats
SELECT * FROM contacts
WHERE phone ~ '^\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$';
-- Extract records with product codes (e.g., "PRD-12345")
SELECT * FROM inventory
WHERE sku ~ '^[A-Z]{3}-[0-9]{5}$';
-- Find records with URLs
SELECT * FROM posts
WHERE content ~ 'https?://[^\s]+';
The phone pattern handles formats like “555-123-4567”, “(555) 123-4567”, and “5551234567”.
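To confirm which formats pass, you can run the pattern against literal sample values (the table-free `VALUES` list here is just for illustration):

```sql
-- Quick sanity check of the phone pattern against sample inputs
SELECT fmt,
       fmt ~ '^\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}$' AS matches
FROM (VALUES ('555-123-4567'),
             ('(555) 123-4567'),
             ('5551234567'),
             ('555-12-34567')) AS t(fmt);
-- the first three match; the malformed last value does not
```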
Advanced Pattern Matching with regexp_match() and regexp_matches()
The regexp_match() function returns the first match as a text array, while regexp_matches() can return all matches when you use the ‘g’ (global) flag. Both support capture groups with parentheses.
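Before applying these to tables, it helps to see the return shapes with literal values:

```sql
-- regexp_match returns the first match as a text array (or NULL if no match)
SELECT regexp_match('order 42, order 7', '\d+');          -- {42}

-- regexp_matches with 'g' returns one row per match
SELECT regexp_matches('order 42, order 7', '\d+', 'g');   -- {42}, then {7}
```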
Extract domain names from emails:
SELECT
email,
(regexp_match(email, '@([A-Za-z0-9.-]+\.[A-Za-z]{2,})'))[1] AS domain
FROM users;
The [1] index accesses the first capture group (the parenthesized part).
Parse structured data from text fields:
-- Extract area code, exchange, and number from phone strings
SELECT
phone,
(regexp_match(phone, '(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})'))[1] AS area_code,
(regexp_match(phone, '(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})'))[2] AS exchange,
(regexp_match(phone, '(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})'))[3] AS number
FROM contacts;
Using regexp_matches() to find all occurrences:
-- Find all hashtags in social media posts
SELECT
post_id,
regexp_matches(content, '#([A-Za-z0-9_]+)', 'g') AS hashtags
FROM posts;
-- Extract all email addresses from a text field
SELECT
id,
regexp_matches(notes, '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', 'g') AS emails
FROM customer_notes;
The ‘g’ flag makes regexp_matches() return one row per match, useful for unnesting multiple occurrences.
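Because each match comes back as a text array, you can subscript the result to pull the captured group out as plain text—a sketch built on the same hashtag pattern:

```sql
-- One row per hashtag, extracting the captured group as plain text
SELECT
  post_id,
  (regexp_matches(content, '#([A-Za-z0-9_]+)', 'g'))[1] AS hashtag
FROM posts;
```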
String Replacement with regexp_replace()
The regexp_replace() function finds and replaces text using patterns. It supports backreferences (\1, \2, etc.) to reuse captured groups.
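A minimal literal example shows how backreferences reorder captured groups:

```sql
-- Reorder captured groups: ISO date to DD/MM/YYYY
SELECT regexp_replace('2024-01-15', '(\d{4})-(\d{2})-(\d{2})', '\3/\2/\1');
-- → 15/01/2024
```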
Format phone numbers consistently:
-- Convert various formats to (555) 123-4567
UPDATE contacts
SET phone = regexp_replace(
phone,
'^(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})$',
'(\1) \2-\3'
)
WHERE phone ~ '^\d{3}[^\d]*\d{3}[^\d]*\d{4}$';
Redact sensitive information:
-- Mask credit card numbers, showing only last 4 digits
SELECT
regexp_replace(
payment_info,
'\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?(\d{4})',
'****-****-****-\1'
) AS masked_card
FROM transactions;
-- Redact SSNs
UPDATE records
SET ssn = regexp_replace(ssn, '\d{3}-\d{2}-(\d{4})', 'XXX-XX-\1')
WHERE ssn IS NOT NULL;
Clean and normalize data:
-- Remove multiple spaces
UPDATE products
SET description = regexp_replace(description, '\s+', ' ', 'g');
-- Strip HTML tags
SELECT regexp_replace(content, '<[^>]+>', '', 'g') AS plain_text
FROM articles;
-- Normalize whitespace and punctuation
UPDATE comments
SET text = regexp_replace(
regexp_replace(text, '\s+', ' ', 'g'),
'[.!?]+',
'. ',
'g'
);
Splitting and Extracting with regexp_split_to_table() and regexp_split_to_array()
These functions break strings into pieces based on regex delimiters. Use regexp_split_to_array() for array results and regexp_split_to_table() to create rows.
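A quick literal example makes the difference between the two return shapes concrete:

```sql
-- Array result: whitespace around commas is absorbed by the delimiter pattern
SELECT regexp_split_to_array('red,  green ,blue', '\s*,\s*');
-- → {red,green,blue}

-- Table result: one row per piece
SELECT regexp_split_to_table('red,  green ,blue', '\s*,\s*');
```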
Split comma-separated values with variable whitespace:
-- Split tags into an array
SELECT
id,
regexp_split_to_array(tags, '\s*,\s*') AS tag_array
FROM articles;
-- Create one row per tag
SELECT
id,
regexp_split_to_table(tags, '\s*,\s*') AS tag
FROM articles;
Parse log entries:
-- Split Apache log format into components
SELECT
regexp_split_to_array(
log_line,
'\s+'
) AS log_parts
FROM server_logs;
-- Extract specific fields from structured logs
WITH parsed AS (
SELECT
id,
regexp_split_to_array(log_entry, '\s*\|\s*') AS fields
FROM application_logs
)
SELECT
id,
fields[1] AS timestamp,
fields[2] AS level,
fields[3] AS message
FROM parsed;
Convert delimited strings to normalized tables:
-- Normalize a comma-separated skills column
INSERT INTO user_skills (user_id, skill)
SELECT
u.id,
regexp_split_to_table(u.skills, '\s*,\s*')
FROM users u
WHERE u.skills IS NOT NULL;
Performance Considerations and Best Practices
Regex queries can be expensive. PostgreSQL must check each row’s value against the pattern, which doesn’t benefit from standard B-tree indexes.
Create a functional index for common patterns:
-- Enable pg_trgm extension for trigram indexes
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Create a GIN index for pattern matching
CREATE INDEX idx_users_email_trgm ON users USING gin (email gin_trgm_ops);
-- This can now use the index for some regex queries
SELECT * FROM users WHERE email ~ '@gmail\.com$';
Functional indexes for specific patterns:
-- Index on extracted domain
CREATE INDEX idx_email_domain ON users (
(regexp_match(email, '@([A-Za-z0-9.-]+)'))[1]
);
Performance comparison:
-- Slowest: Complex regex on large table
EXPLAIN ANALYZE
SELECT * FROM logs WHERE message ~ '^ERROR.*database.*timeout';
-- Faster: Use LIKE when possible
EXPLAIN ANALYZE
SELECT * FROM logs WHERE message LIKE 'ERROR%' AND message LIKE '%database%';
-- Fastest for text search: Use full-text search
CREATE INDEX idx_logs_message_fts ON logs USING gin (to_tsvector('english', message));
EXPLAIN ANALYZE
SELECT * FROM logs WHERE to_tsvector('english', message) @@ to_tsquery('error & database & timeout');
Best practices:
- Use simpler operators when possible: `LIKE` or `=` are faster than regex
- Anchor patterns with `^` and `$` so non-matching strings can be rejected early
- Compile frequently-used patterns into stored functions
- Consider full-text search for complex text queries
- Test regex patterns on small datasets first—a slow pattern inside an UPDATE can hold row locks far longer than expected
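One way to “compile” a frequently-used pattern is to wrap it in a stored SQL function, so every caller shares the same rule (the function name here is hypothetical):

```sql
-- Hypothetical helper: centralizes the email pattern in one place;
-- IMMUTABLE also allows it to appear in index expressions
CREATE OR REPLACE FUNCTION is_valid_email(addr text)
RETURNS boolean
LANGUAGE sql IMMUTABLE
AS $$
  SELECT addr ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
$$;

SELECT * FROM users WHERE is_valid_email(email);
```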
Real-World Use Cases
Here’s a complete data validation pipeline:
-- Validate and clean user input in one pass
WITH validated AS (
SELECT
id,
email,
phone,
CASE
WHEN email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'
THEN email
ELSE NULL
END AS clean_email,
CASE
WHEN phone ~ '^\d{3}[^\d]*\d{3}[^\d]*\d{4}$'
THEN regexp_replace(phone, '^(\d{3})[^\d]*(\d{3})[^\d]*(\d{4})$', '\1-\2-\3')
ELSE NULL
END AS clean_phone
FROM user_imports
)
INSERT INTO users (email, phone)
SELECT clean_email, clean_phone
FROM validated
WHERE clean_email IS NOT NULL AND clean_phone IS NOT NULL;
ETL process for cleaning imported data:
-- Multi-step data cleaning
UPDATE imported_products
SET
-- Normalize SKUs
sku = UPPER(regexp_replace(sku, '[^A-Z0-9-]', '', 'g')),
-- Clean prices (remove currency symbols and commas)
price = regexp_replace(price, '[^0-9.]', '', 'g')::numeric,
-- Standardize descriptions
description = regexp_replace(
regexp_replace(description, '<[^>]+>', '', 'g'),
'\s+', ' ', 'g'
),
-- Extract and normalize URLs
product_url = (regexp_match(raw_data, 'https?://[^\s]+'))[1]
WHERE imported_at > NOW() - INTERVAL '1 day';
Regular expressions in PostgreSQL are powerful tools for data validation, cleaning, and extraction. Master these operators and functions, understand their performance characteristics, and you’ll handle complex text processing directly in your database with confidence.