SQL String Functions: CONCAT, SUBSTRING, TRIM, REPLACE
String manipulation is one of the most common tasks in SQL, whether you're cleaning imported data, formatting output for reports, or standardizing user input. While modern ORMs and application...
Key Insights
- String functions like CONCAT, SUBSTRING, TRIM, and REPLACE are essential for data cleaning and transformation directly in SQL, eliminating the need for application-level processing in many cases
- Each major database system (PostgreSQL, MySQL, SQL Server, Oracle) implements these functions with slight syntax variations—understanding these differences prevents migration headaches
- Combining multiple string functions in a single query enables powerful data transformations, but overuse can impact performance on large datasets where application-level processing might be more efficient
Introduction
String manipulation is one of the most common tasks in SQL, whether you’re cleaning imported data, formatting output for reports, or standardizing user input. While modern ORMs and application frameworks offer string manipulation capabilities, performing these operations directly in SQL often proves more efficient—especially when dealing with large datasets or generating complex reports.
The four functions covered here—CONCAT, SUBSTRING, TRIM, and REPLACE—form the foundation of SQL string manipulation. Master these, and you’ll handle 90% of real-world string transformation scenarios. However, be aware that different database systems implement these functions with varying syntax. I’ll focus primarily on standard SQL while noting important database-specific variations for PostgreSQL, MySQL, SQL Server, and Oracle.
CONCAT: Combining Strings
CONCAT merges two or more strings into a single value. This is fundamental for creating readable output from normalized database structures where data is split across multiple columns.
The basic syntax is straightforward:
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;
This combines first and last names with a space separator. For building more complex strings like full addresses, you can chain multiple values:
SELECT CONCAT(street_address, ', ', city, ', ', state, ' ', zip_code) AS full_address
FROM customers;
Here’s where things get tricky: NULL handling varies by database. In standard SQL and MySQL, if any argument to CONCAT is NULL, the entire result becomes NULL:
-- Returns NULL if middle_name is NULL
SELECT CONCAT(first_name, ' ', middle_name, ' ', last_name) AS full_name
FROM employees;
PostgreSQL and Oracle use the || concatenation operator, which also returns NULL if any operand is NULL:
-- PostgreSQL/Oracle syntax
SELECT first_name || ' ' || last_name AS full_name
FROM employees;
SQL Server’s + operator behaves similarly. To handle NULLs gracefully, wrap potentially NULL columns with COALESCE:
SELECT CONCAT(first_name, ' ', COALESCE(middle_name, ''), ' ', last_name) AS full_name
FROM employees;
Better yet, use CONCAT_WS (CONCAT With Separator), which automatically skips NULL values and inserts a separator between non-NULL arguments:
SELECT CONCAT_WS(' ', first_name, middle_name, last_name) AS full_name
FROM employees;
This is cleaner and more reliable for building delimited strings. CONCAT_WS is available in MySQL, PostgreSQL, and SQL Server 2017+.
SUBSTRING: Extracting Portions of Text
SUBSTRING extracts a portion of a string based on position and length. The standard syntax is:
SUBSTRING(string, start_position, length)
Note that SQL uses 1-based indexing (the first character is position 1, not 0 like most programming languages).
Extract area codes from phone numbers stored as 10-digit strings:
SELECT
phone_number,
SUBSTRING(phone_number, 1, 3) AS area_code,
SUBSTRING(phone_number, 4, 3) AS exchange,
SUBSTRING(phone_number, 7, 4) AS line_number
FROM contacts;
For preview text in a blog system, grab the first 100 characters:
SELECT
title,
SUBSTRING(content, 1, 100) AS preview
FROM blog_posts;
Extract middle portions by specifying both position and length:
-- Extract month from YYYYMMDD date string
SELECT SUBSTRING(date_string, 5, 2) AS month
FROM legacy_dates;
Database-specific variations exist. Oracle uses SUBSTR (no “ING”):
-- Oracle
SELECT SUBSTR(phone_number, 1, 3) AS area_code
FROM contacts;
MySQL and PostgreSQL support both SUBSTRING and SUBSTR. For convenience, most databases also provide LEFT and RIGHT functions:
-- Get first 3 characters
SELECT LEFT(phone_number, 3) AS area_code
FROM contacts;
-- Get last 4 characters
SELECT RIGHT(phone_number, 4) AS line_number
FROM contacts;
TRIM: Removing Whitespace
TRIM removes unwanted characters (typically whitespace) from the beginning and/or end of strings. This is critical for cleaning user input and imported data.
Basic usage removes leading and trailing spaces:
SELECT TRIM(user_input) AS cleaned_input
FROM form_submissions;
Use LTRIM and RTRIM for left-only or right-only trimming:
SELECT
LTRIM(column_name) AS left_trimmed,
RTRIM(column_name) AS right_trimmed,
TRIM(column_name) AS both_trimmed
FROM messy_data;
When importing CSV data or dealing with legacy systems, you often encounter inconsistent spacing:
-- Before: ' John ', 'Mary ', ' Bob'
-- After: 'John', 'Mary', 'Bob'
UPDATE customers
SET first_name = TRIM(first_name);
TRIM can also remove specific characters. The standard SQL syntax allows you to specify what to trim:
-- Remove leading/trailing dots
SELECT TRIM('.' FROM product_code) AS cleaned_code
FROM products;
-- PostgreSQL also supports BOTH, LEADING, TRAILING keywords
SELECT TRIM(BOTH '0' FROM account_number) AS trimmed_account
FROM accounts;
For thorough cleaning, nest TRIM operations:
-- Remove spaces, then remove specific characters
SELECT TRIM('*' FROM TRIM(description)) AS cleaned_description
FROM items;
REPLACE: Finding and Substituting Text
REPLACE searches for all occurrences of a substring and replaces them with another string. The syntax is consistent across most databases:
REPLACE(string, search_string, replacement_string)
Standardize phone number formats by removing common separators:
SELECT
phone_number,
REPLACE(REPLACE(REPLACE(phone_number, '-', ''), '(', ''), ')', '') AS digits_only
FROM contacts;
This chains three REPLACE calls to remove hyphens, opening parentheses, and closing parentheses.
Mask sensitive data for reports:
-- Show only last 4 digits of credit card
SELECT
CONCAT(
REPEAT('*', LENGTH(credit_card) - 4),
RIGHT(credit_card, 4)
) AS masked_card
FROM payment_methods;
-- Or replace specific positions
SELECT
REPLACE(ssn, SUBSTRING(ssn, 1, 5), '*****') AS masked_ssn
FROM employees;
Fix common data entry errors in bulk:
-- Standardize state abbreviations
UPDATE addresses
SET state = REPLACE(REPLACE(state, 'California', 'CA'), 'calif', 'CA');
REPLACE is case-sensitive in most databases. For case-insensitive replacement, combine with LOWER or UPPER:
-- Replace regardless of case (simple approach)
UPDATE products
SET description = REPLACE(LOWER(description), 'old_term', 'new_term');
Practical Use Case: Complete Data Transformation
Here’s a realistic scenario: you’ve imported customer data from multiple sources, and it’s a mess. Phone numbers have inconsistent formats, names have extra whitespace, and addresses need standardization.
SELECT
customer_id,
-- Clean and format name
CONCAT_WS(' ',
TRIM(first_name),
TRIM(last_name)
) AS full_name,
-- Standardize phone to (XXX) XXX-XXXX format
CONCAT(
'(',
SUBSTRING(REPLACE(REPLACE(REPLACE(TRIM(phone), '-', ''), '(', ''), ')', ''), 1, 3),
') ',
SUBSTRING(REPLACE(REPLACE(REPLACE(TRIM(phone), '-', ''), '(', ''), ')', ''), 4, 3),
'-',
SUBSTRING(REPLACE(REPLACE(REPLACE(TRIM(phone), '-', ''), '(', ''), ')', ''), 7, 4)
) AS formatted_phone,
-- Clean and standardize address
CONCAT_WS(', ',
TRIM(street_address),
TRIM(city),
CONCAT(TRIM(state), ' ', TRIM(zip_code))
) AS formatted_address,
-- Clean email (lowercase, trim)
LOWER(TRIM(email)) AS cleaned_email
FROM raw_customer_imports
WHERE LENGTH(REPLACE(REPLACE(REPLACE(TRIM(phone), '-', ''), '(', ''), ')', '')) = 10;
This single query transforms messy data into a consistent format. The WHERE clause filters for valid 10-digit phone numbers after stripping formatting.
Performance Considerations & Best Practices
String functions are computationally expensive compared to simple column retrieval. Here’s when and how to use them wisely:
Indexing: String functions prevent index usage. This query can’t use an index on email:
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
Instead, store a computed column or use a functional index (PostgreSQL, Oracle):
-- PostgreSQL
CREATE INDEX idx_email_lower ON users (LOWER(email));
Application vs. Database: For one-time transformations or small datasets, SQL string functions are fine. For repeated formatting in high-traffic applications, consider doing it in application code where you can cache results and avoid repeated database computation.
Bulk Operations: When cleaning large datasets, update in batches:
-- Update in chunks to avoid long-running transactions
UPDATE customers
SET phone = REPLACE(REPLACE(phone, '-', ''), '(', '')
WHERE customer_id BETWEEN 1 AND 10000;
Database-Specific Optimizations: Some databases offer optimized string functions. PostgreSQL’s regexp_replace() can replace multiple patterns in one pass. SQL Server’s STRING_AGG() efficiently concatenates grouped rows.
Use string functions for data cleaning, one-time migrations, and report generation. For high-frequency operations, consider computed columns, triggers, or application-level processing. Always test performance on production-scale data before deploying string-heavy queries to production systems.