Linux Text Processing with cut, sort, uniq, and wc
Key Insights
- Master `cut`, `sort`, `uniq`, and `wc` to build powerful text processing pipelines that replace complex scripts with single-line commands
- Always pipe data through `sort` before using `uniq`: uniq only detects adjacent duplicates, a mistake that causes silent data corruption
- Command order matters for performance: filter with `cut` first to reduce data volume, then `sort`, then `uniq` to minimize memory usage
The Power of Text Processing Pipelines
Linux text processing commands are the Swiss Army knife of data analysis. While modern tools like jq and Python scripts have their place, the classic utilities—cut, sort, uniq, and wc—remain unmatched for quick data exploration, log analysis, and system monitoring. They’re fast, memory-efficient, and available on every Unix-like system.
The real power emerges when you pipe these commands together. Here’s a glimpse of what’s possible:
# Analyze Apache access logs: find top 10 IP addresses by request count
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -10
This single line extracts IP addresses, counts occurrences, and ranks them—a task that would require dozens of lines in most programming languages. Let’s break down each command and build toward complex real-world scenarios.
The cut Command: Extracting Columns and Fields
The cut command extracts specific portions of each line from a file or stream. Think of it as selecting columns in a spreadsheet, but for text files.
The most common use case is extracting fields from delimited data:
# Extract usernames from /etc/passwd (fields are colon-delimited)
cut -d':' -f1 /etc/passwd
# Extract email domain from a list
echo "user@example.com" | cut -d'@' -f2
The -d flag specifies the delimiter, and -f specifies which field(s) to extract. You can select multiple fields:
# Extract username and home directory (fields 1 and 6)
cut -d':' -f1,6 /etc/passwd
# Extract a range of fields
cut -d',' -f2-5 data.csv
For fixed-width data, use character or byte positions:
# Extract characters 1-10 from each line
cut -c1-10 file.txt
# Extract bytes (useful for multi-byte characters)
cut -b1-20 file.txt
Here’s a practical example parsing CSV data:
# Sample sales data: date,product,quantity,price
cat sales.csv
# 2024-01-15,Widget,5,29.99
# 2024-01-15,Gadget,3,49.99
# 2024-01-16,Widget,2,29.99
# Extract just product names
cut -d',' -f2 sales.csv
Pro tip: cut doesn’t handle quoted CSV fields with embedded delimiters. For complex CSV parsing, use csvkit or awk.
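A quick illustration of the problem, plus a workaround using GNU awk's FPAT feature (this assumes gawk is installed; the sample record is hypothetical):

```shell
# cut splits on every comma, even inside quotes -- wrong result
echo '"Smith, John",42' | cut -d',' -f1
# "Smith

# GNU awk's FPAT defines what a field looks like, so quoted fields stay intact
echo '"Smith, John",42' | gawk -v FPAT='([^,]*)|("[^"]*")' '{print $1}'
# "Smith, John"
```

FPAT inverts the usual logic: instead of describing the separator, it describes the field itself, which is what makes embedded delimiters tractable.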
The sort Command: Ordering Your Data
The sort command does exactly what you’d expect, but with powerful options that handle various data types and sorting strategies.
Basic alphabetical sorting is the default:
sort names.txt
For numeric data, use -n to sort numerically instead of lexicographically:
# Wrong: lexicographic sort treats "10" as less than "2"
echo -e "10\n2\n100\n20" | sort
# Output: 10, 100, 2, 20
# Right: numeric sort
echo -e "10\n2\n100\n20" | sort -n
# Output: 2, 10, 20, 100
Reverse the order with -r:
# Find largest files in current directory
ls -l | tail -n +2 | sort -k5 -rn | head -5
The -k flag sorts by a specific field, which is crucial for structured data:
# Sort by second field (age) numerically
cat data.txt
# Alice 25 Engineer
# Bob 30 Designer
# Carol 22 Manager
sort -k2 -n data.txt
# Carol 22 Manager
# Alice 25 Engineer
# Bob 30 Designer
You can sort by multiple fields:
# Sort by department (field 3), then by age (field 2) numerically
sort -k3,3 -k2,2n employees.txt
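A worked example makes the key syntax concrete (the file contents here are made up for illustration):

```shell
# employees.txt: name age department
printf 'Alice 25 Sales\nBob 30 Sales\nCarol 22 Eng\n' > employees.txt

# Primary key: department (field 3, alphabetical); tiebreaker: age (field 2, numeric)
sort -k3,3 -k2,2n employees.txt
# Carol 22 Eng
# Alice 25 Sales
# Bob 30 Sales
```

Note the `-k3,3` form: without the end position (`-k3` alone), sort would compare from field 3 to the end of the line, which silently changes the tiebreaking.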
Case-insensitive sorting uses -f:
sort -f mixed_case.txt
The -u flag removes duplicates while sorting, combining sort and uniq:
# Get unique sorted values in one pass
sort -u duplicates.txt
The uniq Command: Finding and Counting Duplicates
The uniq command filters out adjacent duplicate lines. This is critical: uniq only detects duplicates that sit next to each other, so it effectively requires sorted input. Forgetting to sort first is the most common beginner mistake.
# Wrong: unsorted input misses duplicates
echo -e "apple\nbanana\napple" | uniq
# Output: apple, banana, apple (apple appears twice!)
# Right: sort first
echo -e "apple\nbanana\napple" | sort | uniq
# Output: apple, banana
Count occurrences with -c:
# Count frequency of each line
echo -e "apple\nbanana\napple\napple\nbanana" | sort | uniq -c
# 3 apple
# 2 banana
Show only duplicates with -d:
# Find which items appear more than once
sort data.txt | uniq -d
Show only unique lines (appearing exactly once) with -u:
# Find items that appear only once
sort data.txt | uniq -u
Here’s a practical example finding duplicate IP addresses in a log:
# Extract IPs, find duplicates, count them
cut -d' ' -f1 access.log | sort | uniq -d -c | sort -rn
The wc Command: Counting Lines, Words, and Characters
The wc (word count) command provides quick statistics about text files.
# Count lines, words, and characters
wc file.txt
# 42 291 1847 file.txt
# (lines, words, bytes, filename)
Use specific flags for targeted counts:
# Count only lines
wc -l file.txt
# Count only words
wc -w file.txt
# Count only characters/bytes
wc -c file.txt
# Count characters (handles multi-byte UTF-8 correctly)
wc -m file.txt
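The byte/character distinction is easy to see with accented text (the exact `-m` result depends on your locale being UTF-8; in the C locale it falls back to counting bytes):

```shell
# "héllo" is 5 characters but 6 bytes, because é is two bytes in UTF-8
printf 'héllo' | wc -c
# 6

# In a UTF-8 locale this prints 5; in the C locale it matches wc -c
printf 'héllo' | wc -m
```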
Common use cases:
# How many users on the system?
wc -l /etc/passwd
# How many Python files in this project?
find . -name "*.py" | wc -l
# Total lines of code (excluding blank lines)
find . -name "*.py" -exec cat {} \; | grep -v '^$' | wc -l
# How many unique IP addresses in the log?
cut -d' ' -f1 access.log | sort -u | wc -l
Real-World Pipeline Examples
Now let’s combine these commands to solve actual problems.
Finding top 10 error messages in application logs:
# Log format: timestamp [LEVEL] message
grep ERROR app.log | cut -d']' -f2- | sort | uniq -c | sort -rn | head -10
Analyzing web server access patterns:
# Find top 10 requested URLs
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10
# Find top user agents
cut -d'"' -f6 access.log | sort | uniq -c | sort -rn | head -10
# Count requests by hour
cut -d'[' -f2 access.log | cut -d':' -f1-2 | sort | uniq -c
Generating user activity report:
# Format: username,action,timestamp
# Count actions per user
cut -d',' -f1 activity.csv | sort | uniq -c | sort -rn
# Find users with more than 100 actions
cut -d',' -f1 activity.csv | sort | uniq -c | awk '$1 > 100'
Processing CSV for summary statistics:
# Calculate total quantity sold per product
cut -d',' -f2,3 sales.csv | \
awk -F',' '{sum[$1]+=$2} END {for (p in sum) print p, sum[p]}' | \
sort -k2 -rn
Finding duplicate files by size:
# Find potentially duplicate files
find . -type f -exec ls -l {} \; | \
awk '{printf "%010d %s\n", $5, $9}' | \
sort -n | \
uniq -w10 -D
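Matching sizes only flags candidates; files of equal size can still differ. A content-based check is more reliable (this sketch assumes GNU coreutils, which provides md5sum):

```shell
# Confirm duplicates by content: identical files produce identical hashes.
# md5sum prints a 32-character hash first, so uniq -w32 groups lines by hash.
find . -type f -exec md5sum {} + | sort | uniq -w32 -D
```

The `-exec ... +` form batches many files into one md5sum invocation, which is considerably faster than spawning a process per file.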
Performance Tips and Best Practices
Command order significantly impacts performance. Always filter data as early as possible:
# Slow: processes entire file through sort and uniq
sort huge_log.txt | uniq | grep ERROR
# Fast: filters first, reducing data volume
grep ERROR huge_log.txt | sort | uniq
Use cut early to reduce data width:
# Better: extract needed fields first
cut -d',' -f2 huge_file.csv | sort | uniq -c
# Worse: sort entire lines
sort huge_file.csv | cut -d',' -f2 | uniq -c
For very large files, consider these alternatives:
- Use `sort -u` instead of `sort | uniq` for better performance
- For counting, `awk` can be faster than `sort | uniq -c` on unsorted data
- Consider `parallel` for multi-core processing of large datasets
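The awk counting alternative looks like this; it builds the frequency table in a single pass, so only the small summary needs sorting (the activity.csv layout is borrowed from the earlier report example):

```shell
# Single-pass frequency count: no sort needed until the final ranking
awk -F',' '{count[$1]++} END {for (u in count) print count[u], u}' activity.csv | sort -rn
```

For a file with many repeated keys this avoids sorting millions of raw lines, which is where `sort | uniq -c` spends most of its time.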
Common pitfalls to avoid:
- Forgetting to sort before uniq: This silently produces wrong results
- Using lexicographic sort for numbers: Always use `sort -n` for numeric data
- Assuming cut handles quoted CSV: It doesn't; use proper CSV tools
- Not considering locale: Sorting behavior changes with `LC_COLLATE`
Set consistent locale for predictable sorting:
export LC_ALL=C
sort data.txt # Now uses byte-order sorting
These four commands form the foundation of command-line data processing. Master them, and you’ll find yourself solving complex data problems with elegant one-liners instead of writing throwaway scripts. The key is practice—start incorporating these into your daily workflow, and they’ll become second nature.