R - Read Fixed-Width File
Key Insights
- Fixed-width files store data in columns with predetermined character positions, common in legacy systems, government datasets, and mainframe outputs where each field occupies exact spacing regardless of content length
- R provides multiple approaches through read.fwf(), readr::read_fwf(), and data.table's fread(), each with distinct performance characteristics and handling of edge cases like embedded spaces and missing values
- Proper column specification using position vectors or width vectors is critical: misaligned specifications cause data corruption that's often difficult to detect until analysis reveals anomalies
Understanding Fixed-Width Format Structure
Fixed-width files allocate specific character positions for each field. Unlike CSV files that use delimiters, these files rely on consistent positioning. A record might look like this:
John Smith 00123451985-03-15New York
Jane Doe 00987651990-07-22Los Angeles
Here, name occupies positions 1-14, ID positions 15-21, date positions 22-31, and city positions 32-45. Parsing requires knowing these exact boundaries.
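Once these boundaries are known, a single record can be sliced manually with substr(); a minimal sketch using the first sample record (the padding spaces are assumed to match the positions above):

```r
record <- "John Smith    00123451985-03-15New York      "

# Slice each field by its known character positions
name       <- trimws(substr(record, 1, 14))
id         <- as.integer(substr(record, 15, 21))
birth_date <- as.Date(substr(record, 22, 31))
city       <- trimws(substr(record, 32, 45))
```

The dedicated readers below automate exactly this slicing, plus type conversion and whitespace trimming, across every row.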
Basic Parsing with read.fwf()
Base R’s read.fwf() handles fixed-width files but reads the entire file into memory before processing, making it slow for large datasets.
# Define column widths
widths <- c(14, 7, 10, 14)
# Define column names
col_names <- c("name", "id", "birth_date", "city")
# Read the file
data <- read.fwf(
"employees.txt",
widths = widths,
col.names = col_names,
stringsAsFactors = FALSE,
strip.white = TRUE
)
head(data)
The strip.white = TRUE parameter removes leading and trailing spaces, which is essential since fixed-width files pad fields with spaces. Without it, “John” is read as “John” followed by the trailing spaces that pad out the 14-character field.
For files with headers, skip the first line:
data <- read.fwf(
"employees.txt",
widths = widths,
col.names = col_names,
skip = 1,
stringsAsFactors = FALSE,
strip.white = TRUE
)
Using Column Positions Instead of Widths
When you know start and end positions rather than widths, specify them directly:
# Columns: 1-14, 15-21, 22-31, 32-45
positions <- list(
c(1, 14),
c(15, 21),
c(22, 31),
c(32, 45)
)
# Convert to widths format
widths_from_pos <- c(14, 7, 10, 14)
# Or use fwf_positions with readr (shown later)
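Rather than converting by hand, the widths can be derived from the position pairs programmatically, which avoids off-by-one mistakes:

```r
positions <- list(
  c(1, 14),
  c(15, 21),
  c(22, 31),
  c(32, 45)
)

# Width of each field is end - start + 1
widths_from_pos <- sapply(positions, function(p) p[2] - p[1] + 1)
widths_from_pos  # 14 7 10 14
```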
High-Performance Parsing with readr
The readr package provides read_fwf(), which processes files in chunks and handles type conversion automatically:
library(readr)
# Using column positions
data <- read_fwf(
"employees.txt",
fwf_positions(
start = c(1, 15, 22, 32),
end = c(14, 21, 31, 45),
col_names = c("name", "id", "birth_date", "city")
),
trim_ws = TRUE
)
# Using column widths
data <- read_fwf(
"employees.txt",
fwf_widths(
widths = c(14, 7, 10, 14),
col_names = c("name", "id", "birth_date", "city")
),
trim_ws = TRUE
)
The readr approach automatically detects column types:
spec(data)
# Shows detected column specifications
Override automatic type detection when needed:
data <- read_fwf(
"employees.txt",
fwf_widths(c(14, 7, 10, 14),
c("name", "id", "birth_date", "city")),
col_types = cols(
name = col_character(),
id = col_integer(),
birth_date = col_date(format = "%Y-%m-%d"),
city = col_character()
),
trim_ws = TRUE
)
Handling Complex Scenarios
Files Without Column Specifications
When documentation is missing, analyze the file structure manually:
# Read first few lines as raw text
lines <- readLines("employees.txt", n = 10)
print(lines)
# Identify column boundaries by examining alignment
# Look for consistent spacing patterns
Create a visual ruler to identify positions:
ruler <- paste0(rep(seq(0, 9), 10), collapse = "")
cat(ruler, "\n")
cat(lines[1], "\n")
cat(lines[2], "\n")
Ragged Right Files
Some fixed-width files don’t pad the last column:
# Last column can be variable length
data <- read_fwf(
"employees.txt",
fwf_widths(
c(14, 7, 10, NA), # NA means "read to end of line"
c("name", "id", "birth_date", "city")
),
trim_ws = TRUE
)
Embedded Column Specifications
Some files include column specs in headers:
NAME ID BIRTHDATE CITY
--------------+------+---------+-------------
John Smith 001234 1985-03-15New York
Parse the header to extract positions:
header <- readLines("employees.txt", n = 2)
separator_line <- header[2]
# Find '+' positions to determine boundaries
boundaries <- gregexpr("\\+", separator_line)[[1]]
starts <- c(1, boundaries + 1)
ends <- c(boundaries - 1, nchar(separator_line))
# Extract column names
col_names <- trimws(unlist(strsplit(header[1], "\\s{2,}")))
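The extracted boundaries and names can then feed fwf_positions() directly; a sketch assuming the two-line header layout shown above:

```r
library(readr)

# starts, ends, and col_names come from parsing the header lines
data <- read_fwf(
  "employees.txt",
  fwf_positions(start = starts, end = ends, col_names = col_names),
  skip = 2,       # skip the name row and the separator row
  trim_ws = TRUE
)
```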
Large File Processing with data.table
For multi-gigabyte files, data.table::fread() with custom processing offers superior performance:
library(data.table)
# Read as single column, then split
dt <- fread(
"employees.txt",
sep = NULL,
header = FALSE,
col.names = "raw"
)
# Extract columns using substring
dt[, `:=`(
name = trimws(substr(raw, 1, 14)),
id = as.integer(trimws(substr(raw, 15, 21))),
birth_date = as.Date(trimws(substr(raw, 22, 31))),
city = trimws(substr(raw, 32, 45))
)]
dt[, raw := NULL] # Remove raw column
This approach reads data once and processes in-memory, significantly faster than read.fwf() for large files.
Validating Parsed Data
After parsing, validate that column boundaries are correct:
# Check for unexpected patterns
summary(nchar(data$name)) # Should be consistent or vary logically
summary(nchar(data$id)) # Should match expected length
# Check for data bleeding between columns
# Look for numbers in text fields or text in numeric fields
data[grepl("\\d{5}", data$city), ] # IDs bleeding into city field?
Verify row counts match expectations:
# Count lines in original file
line_count <- length(readLines("employees.txt"))
# Compare with parsed rows
nrow(data)
# Account for headers/footers if present
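These checks can be bundled into hard assertions so that a misaligned specification fails loudly rather than silently; a sketch assuming a single header line and the field widths used throughout:

```r
line_count <- length(readLines("employees.txt"))
n_header <- 1  # adjust for your file's headers/footers

stopifnot(
  nrow(data) == line_count - n_header,       # no dropped or merged rows
  !any(grepl("\\d{5}", data$city)),          # no IDs bleeding into city
  all(nchar(as.character(data$id)) <= 7)     # IDs fit their 7-char field
)
```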
Writing Fixed-Width Files
Create fixed-width output using sprintf():
# Format each column to exact width
formatted <- data.frame(
name = sprintf("%-14s", substr(data$name, 1, 14)),
id = sprintf("%07d", data$id),
birth_date = sprintf("%-10s", as.character(data$birth_date)),
city = sprintf("%-14s", substr(data$city, 1, 14))
)
# Combine and write
output <- do.call(paste0, formatted)
writeLines(output, "output.txt")
The %-14s format left-aligns strings in 14 characters, while %07d right-aligns integers with zero-padding.
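A quick round-trip check, re-reading the file just written with the same widths, helps confirm the output layout matches the specification:

```r
# Re-read the written file with the same widths used to format it
check <- read.fwf(
  "output.txt",
  widths = c(14, 7, 10, 14),
  col.names = c("name", "id", "birth_date", "city"),
  stringsAsFactors = FALSE,
  strip.white = TRUE
)
# Values should survive the write/read cycle unchanged
stopifnot(all(check$name == trimws(data$name)))
```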
Performance Comparison
Benchmark different approaches on a 100MB file with 1 million records:
library(microbenchmark)
microbenchmark(
base_r = read.fwf("large.txt", widths = c(14, 7, 10, 14)),
readr = read_fwf("large.txt", fwf_widths(c(14, 7, 10, 14))),
data_table = {
dt <- fread("large.txt", sep = NULL, header = FALSE)
dt[, .(
name = substr(V1, 1, 14),
id = substr(V1, 15, 21)
)]
},
times = 5
)
Typical results show data.table 10-50x faster than read.fwf(), with readr falling between them.
Common Pitfalls
Incorrect width calculations: Off-by-one errors cause columns to shift. Always verify with sample data before processing large files.
Character encoding issues: Legacy systems often use non-UTF-8 encodings. Specify encoding explicitly:
data <- read_fwf("employees.txt",
fwf_widths(c(14, 7, 10, 14)),
locale = locale(encoding = "latin1"))
Missing value handling: Fixed-width files may use spaces or special codes for missing data. Define explicitly:
data <- read_fwf("employees.txt",
fwf_widths(c(14, 7, 10, 14)),
na = c("", "NA", "9999", " "))
Fixed-width files remain prevalent in enterprise environments. Understanding column specifications, choosing appropriate parsing tools, and validating results ensures reliable data extraction from these legacy formats.