Linux Text Processing with cut, sort, uniq, and wc

Key Insights

  • Master cut, sort, uniq, and wc to build powerful text processing pipelines that replace complex scripts with single-line commands
  • Always pipe data through sort before using uniq—uniq only detects adjacent duplicates, a mistake that causes silent data corruption
  • Command order matters for performance: filter with cut first to reduce data volume, then sort, then uniq to minimize memory usage

The Power of Text Processing Pipelines

Linux text processing commands are the Swiss Army knife of data analysis. While modern tools like jq and Python scripts have their place, the classic utilities—cut, sort, uniq, and wc—remain unmatched for quick data exploration, log analysis, and system monitoring. They’re fast, memory-efficient, and available on every Unix-like system.

The real power emerges when you pipe these commands together. Here’s a glimpse of what’s possible:

# Analyze Apache access logs: find top 10 IP addresses by request count
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -10

This single line extracts IP addresses, counts occurrences, and ranks them—a task that would require dozens of lines in most programming languages. Let’s break down each command and build toward complex real-world scenarios.

The cut Command: Extracting Columns and Fields

The cut command extracts specific portions of each line from a file or stream. Think of it as selecting columns in a spreadsheet, but for text files.

The most common use case is extracting fields from delimited data:

# Extract usernames from /etc/passwd (fields are colon-delimited)
cut -d':' -f1 /etc/passwd

# Extract email domain from a list
echo "user@example.com" | cut -d'@' -f2

The -d flag specifies the delimiter, and -f specifies which field(s) to extract. You can select multiple fields:

# Extract username and home directory (fields 1 and 6)
cut -d':' -f1,6 /etc/passwd

# Extract a range of fields
cut -d',' -f2-5 data.csv

For fixed-width data, use character or byte positions:

# Extract characters 1-10 from each line
cut -c1-10 file.txt

# Extract the first 20 bytes (caution: byte ranges can split
# multi-byte UTF-8 characters; GNU cut's -c currently behaves like -b)
cut -b1-20 file.txt

Here’s a practical example parsing CSV data:

# Sample sales data: date,product,quantity,price
cat sales.csv
# 2024-01-15,Widget,5,29.99
# 2024-01-15,Gadget,3,49.99
# 2024-01-16,Widget,2,29.99

# Extract just product names
cut -d',' -f2 sales.csv

Pro tip: cut doesn’t handle quoted CSV fields with embedded delimiters. For complex CSV parsing, use csvkit or awk.
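To see why, watch cut split a quoted field at its embedded comma (a made-up record, for illustration):

```shell
# cut treats every comma as a field separator, even inside quotes
echo 'name,"Smith, John",42' | cut -d',' -f2
# prints: "Smith   (the quoted field is cut in half)
```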

The sort Command: Ordering Your Data

The sort command does exactly what you’d expect, but with powerful options that handle various data types and sorting strategies.

Basic alphabetical sorting is the default:

sort names.txt

For numeric data, use -n to sort numerically instead of lexicographically:

# Wrong: lexicographic sort treats "10" as less than "2"
echo -e "10\n2\n100\n20" | sort
# Output: 10, 100, 2, 20

# Right: numeric sort
echo -e "10\n2\n100\n20" | sort -n
# Output: 2, 10, 20, 100

Reverse the order with -r:

# Find largest files in current directory
ls -l | tail -n +2 | sort -k5 -rn | head -5

The -k flag sorts by a specific field, which is crucial for structured data:

# Sort by the second field (age) numerically
cat data.txt
# Alice 25 Engineer
# Bob 30 Designer
# Carol 22 Manager

sort -k2 -n data.txt
# Carol 22 Manager
# Alice 25 Engineer
# Bob 30 Designer

You can sort by multiple fields:

# Sort by department (field 3), then by age (field 2) numerically
sort -k3,3 -k2,2n employees.txt
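With a small made-up roster (name, age, department), this groups rows by department and orders each group youngest first:

```shell
# Primary key: field 3 (department, alphabetical)
# Secondary key: field 2 (age, numeric)
printf 'Alice 25 Engineering\nBob 30 Design\nCarol 22 Engineering\n' | \
  sort -k3,3 -k2,2n
# Bob 30 Design
# Carol 22 Engineering
# Alice 25 Engineering
```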

Case-insensitive sorting uses -f:

sort -f mixed_case.txt

The -u flag removes duplicates while sorting, combining sort and uniq:

# Get unique sorted values in one pass
sort -u duplicates.txt
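A quick check that sort -u gives the same result as the two-command pipeline, here combined with numeric ordering:

```shell
# Numeric sort with duplicates removed in a single pass
printf '3\n1\n3\n2\n1\n' | sort -n -u
# 1
# 2
# 3
```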

The uniq Command: Finding and Counting Duplicates

The uniq command filters out adjacent duplicate lines. This is critical: because uniq only compares neighboring lines, it is only reliable on sorted input. Skipping the sort is the most common mistake beginners make.

# Wrong: unsorted input misses duplicates
echo -e "apple\nbanana\napple" | uniq
# Output: apple, banana, apple (apple appears twice!)

# Right: sort first
echo -e "apple\nbanana\napple" | sort | uniq
# Output: apple, banana

Count occurrences with -c:

# Count frequency of each line
echo -e "apple\nbanana\napple\napple\nbanana" | sort | uniq -c
#   3 apple
#   2 banana

Show only duplicates with -d:

# Find which items appear more than once
sort data.txt | uniq -d

Show only unique lines (appearing exactly once) with -u:

# Find items that appear only once
sort data.txt | uniq -u
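Running both flags on the same toy input makes the distinction concrete (the input is already sorted):

```shell
# -d: lines that appear more than once, printed once each
printf 'apple\nbanana\nbanana\ncherry\n' | uniq -d
# banana

# -u: lines that appear exactly once
printf 'apple\nbanana\nbanana\ncherry\n' | uniq -u
# apple
# cherry
```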

Here’s a practical example finding duplicate IP addresses in a log:

# Extract IPs, find duplicates, count them
cut -d' ' -f1 access.log | sort | uniq -d -c | sort -rn

The wc Command: Counting Lines, Words, and Characters

The wc (word count) command provides quick statistics about text files.

# Count lines, words, and characters
wc file.txt
#  42  291 1847 file.txt
# (lines, words, bytes, filename)

Use specific flags for targeted counts:

# Count only lines
wc -l file.txt

# Count only words
wc -w file.txt

# Count only characters/bytes
wc -c file.txt

# Count characters (handles multi-byte UTF-8 correctly)
wc -m file.txt
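The difference between -c and -m only shows up with multi-byte text. In UTF-8, é is encoded as two bytes, so the counts diverge (character counting with -m requires a UTF-8 locale):

```shell
printf 'héllo' | wc -c   # 6 bytes (é takes two bytes in UTF-8)
printf 'héllo' | wc -m   # 5 characters, in a UTF-8 locale
```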

Common use cases:

# How many users on the system?
wc -l /etc/passwd

# How many Python files in this project?
find . -name "*.py" | wc -l

# Total lines of code (excluding blank lines)
find . -name "*.py" -exec cat {} \; | grep -v '^$' | wc -l

# How many unique IP addresses in the log?
cut -d' ' -f1 access.log | sort -u | wc -l

Real-World Pipeline Examples

Now let’s combine these commands to solve actual problems.

Finding top 10 error messages in application logs:

# Log format: timestamp [LEVEL] message
grep ERROR app.log | cut -d']' -f2- | sort | uniq -c | sort -rn | head -10

Analyzing web server access patterns:

# Find top 10 requested URLs
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10

# Find top user agents
cut -d'"' -f6 access.log | sort | uniq -c | sort -rn | head -10

# Count requests by hour
cut -d'[' -f2 access.log | cut -d':' -f1-2 | sort | uniq -c

Generating user activity report:

# Format: username,action,timestamp
# Count actions per user
cut -d',' -f1 activity.csv | sort | uniq -c | sort -rn

# Find users with more than 100 actions
cut -d',' -f1 activity.csv | sort | uniq -c | awk '$1 > 100'

Processing CSV for summary statistics:

# Calculate total quantity sold per product
# (tail -n +2 skips a header row; drop it if your CSV has none)
tail -n +2 sales.csv | cut -d',' -f2,3 | \
  awk -F',' '{sum[$1]+=$2} END {for (p in sum) print p, sum[p]}' | \
  sort -k2 -rn

Finding duplicate files by size:

# Find potentially duplicate files (zero-pad sizes so that
# uniq -w10 compares a fixed-width size column)
find . -type f -exec ls -l {} \; | \
  awk '{printf "%010d %s\n", $5, $9}' | \
  sort | \
  uniq -w10 -D

Performance Tips and Best Practices

Command order significantly impacts performance. Always filter data as early as possible:

# Slow: processes entire file through sort and uniq
sort huge_log.txt | uniq | grep ERROR

# Fast: filters first, reducing data volume
grep ERROR huge_log.txt | sort | uniq

Use cut early to reduce data width:

# Better: extract needed fields first
cut -d',' -f2 huge_file.csv | sort | uniq -c

# Worse: sort entire lines
sort huge_file.csv | cut -d',' -f2 | uniq -c

For very large files, consider these alternatives:

  • Use sort -u instead of sort | uniq for better performance
  • For counting, awk can be faster than sort | uniq -c on unsorted data
  • Consider parallel for multi-core processing of large datasets
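The awk alternative mentioned above counts in a single pass using an in-memory hash, so nothing is sorted until the final (much smaller) frequency table:

```shell
# Count line frequencies without pre-sorting, then rank the results
awk '{count[$0]++} END {for (line in count) print count[line], line}' file.txt | \
  sort -rn
```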

Common pitfalls to avoid:

  1. Forgetting to sort before uniq: This silently produces wrong results
  2. Using lexicographic sort for numbers: Always use sort -n for numeric data
  3. Assuming cut handles quoted CSV: It doesn’t—use proper CSV tools
  4. Not considering locale: Sorting behavior changes with LC_COLLATE

Set consistent locale for predictable sorting:

export LC_ALL=C
sort data.txt  # Now uses byte-order sorting
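In the C locale, comparison is by raw byte value, so every uppercase letter (bytes 65-90) sorts before any lowercase one (97-122):

```shell
printf 'banana\nApple\nCherry\n' | LC_ALL=C sort
# Apple
# Cherry
# banana
```

In a typical UTF-8 locale the same input sorts case-insensitively as Apple, banana, Cherry.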

These four commands form the foundation of command-line data processing. Master them, and you’ll find yourself solving complex data problems with elegant one-liners instead of writing throwaway scripts. The key is practice—start incorporating these into your daily workflow, and they’ll become second nature.
