R - Read Fixed-Width File
Key Insights
- Fixed-width files store data in columns with predetermined character positions, common in legacy systems, government datasets, and mainframe outputs where each field occupies exact spacing regardless of content length
- R provides multiple approaches through read.fwf(), readr::read_fwf(), and data.table's fread(), each with distinct performance characteristics and handling of edge cases like embedded spaces and missing values
- Proper column specification using position vectors or width vectors is critical: misaligned specifications cause data corruption that's often difficult to detect until analysis reveals anomalies
Understanding Fixed-Width Format Structure
Fixed-width files allocate specific character positions for each field. Unlike CSV files that use delimiters, these files rely on consistent positioning. A record might look like this:
John Smith 00123451985-03-15New York
Jane Doe 00987651990-07-22Los Angeles
Here, name occupies positions 1-14, ID positions 15-21, date positions 22-31, and city positions 32-45. Parsing requires knowing these exact boundaries.
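Once these boundaries are known, a single record can be sliced manually with substr(); a minimal sketch using the first sample record (the padding spaces are assumed to match the positions above):

```r
record <- "John Smith    00123451985-03-15New York      "

# Slice each field by its known character positions
name       <- trimws(substr(record, 1, 14))
id         <- as.integer(substr(record, 15, 21))
birth_date <- as.Date(substr(record, 22, 31))
city       <- trimws(substr(record, 32, 45))
```

The dedicated readers below automate exactly this slicing, plus type conversion and whitespace trimming, across every row.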
Basic Parsing with read.fwf()
Base R’s read.fwf() handles fixed-width files but reads the entire file into memory before processing, making it slow for large datasets.
# Define column widths
widths <- c(14, 7, 10, 14)
# Define column names
col_names <- c("name", "id", "birth_date", "city")
# Read the file
data <- read.fwf(
"employees.txt",
widths = widths,
col.names = col_names,
stringsAsFactors = FALSE,
strip.white = TRUE
)
head(data)
The strip.white = TRUE parameter removes leading and trailing spaces, which is essential since fixed-width files pad fields with spaces. Without it, “John” is read as “John” followed by the trailing spaces that pad out the 14-character field.
For files with headers, skip the first line:
data <- read.fwf(
"employees.txt",
widths = widths,
col.names = col_names,
skip = 1,
stringsAsFactors = FALSE,
strip.white = TRUE
)
Using Column Positions Instead of Widths
When you know start and end positions rather than widths, specify them directly:
# Columns: 1-14, 15-21, 22-31, 32-45
positions <- list(
c(1, 14),
c(15, 21),
c(22, 31),
c(32, 45)
)
# Convert to widths format
widths_from_pos <- c(14, 7, 10, 14)
# Or use fwf_positions with readr (shown later)
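Rather than converting by hand, the widths can be derived from the position pairs programmatically, which avoids off-by-one mistakes:

```r
positions <- list(
  c(1, 14),
  c(15, 21),
  c(22, 31),
  c(32, 45)
)

# Width of each field is end - start + 1
widths_from_pos <- sapply(positions, function(p) p[2] - p[1] + 1)
widths_from_pos  # 14 7 10 14
```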
High-Performance Parsing with readr
The readr package provides read_fwf(), which processes files in chunks and handles type conversion automatically:
library(readr)
# Using column positions
data <- read_fwf(
"employees.txt",
fwf_positions(
start = c(1, 15, 22, 32),
end = c(14, 21, 31, 45),
col_names = c("name", "id", "birth_date", "city")
),
trim_ws = TRUE
)
# Using column widths
data <- read_fwf(
"employees.txt",
fwf_widths(
widths = c(14, 7, 10, 14),
col_names = c("name", "id", "birth_date", "city")
),
trim_ws = TRUE
)
The readr approach automatically detects column types:
spec(data)
# Shows detected column specifications
Override automatic type detection when needed:
data <- read_fwf(
"employees.txt",
fwf_widths(c(14, 7, 10, 14),
c("name", "id", "birth_date", "city")),
col_types = cols(
name = col_character(),
id = col_integer(),
birth_date = col_date(format = "%Y-%m-%d"),
city = col_character()
),
trim_ws = TRUE
)
Handling Complex Scenarios
Files Without Column Specifications
When documentation is missing, analyze the file structure manually:
# Read first few lines as raw text
lines <- readLines("employees.txt", n = 10)
print(lines)
# Identify column boundaries by examining alignment
# Look for consistent spacing patterns
Create a visual ruler to identify positions:
ruler <- paste0(rep(seq(0, 9), 10), collapse = "")
cat(ruler, "\n")
cat(lines[1], "\n")
cat(lines[2], "\n")
Ragged Right Files
Some fixed-width files don’t pad the last column:
# Last column can be variable length
data <- read_fwf(
"employees.txt",
fwf_widths(
c(14, 7, 10, NA), # NA means "read to end of line"
c("name", "id", "birth_date", "city")
),
trim_ws = TRUE
)
Embedded Column Specifications
Some files include column specs in headers:
NAME ID BIRTHDATE CITY
--------------+------+---------+-------------
John Smith 001234 1985-03-15New York
Parse the header to extract positions:
header <- readLines("employees.txt", n = 2)
separator_line <- header[2]
# Find '+' positions to determine boundaries
boundaries <- gregexpr("\\+", separator_line)[[1]]
starts <- c(1, boundaries + 1)
ends <- c(boundaries - 1, nchar(separator_line))
# Extract column names
col_names <- trimws(unlist(strsplit(header[1], "\\s{2,}")))
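The extracted boundaries and names can then feed fwf_positions() directly; a sketch assuming the two-line header layout shown above:

```r
library(readr)

# starts, ends, and col_names come from parsing the header lines
data <- read_fwf(
  "employees.txt",
  fwf_positions(start = starts, end = ends, col_names = col_names),
  skip = 2,       # skip the name row and the separator row
  trim_ws = TRUE
)
```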
Large File Processing with data.table
For multi-gigabyte files, data.table::fread() with custom processing offers superior performance:
library(data.table)
# Read as single column, then split
dt <- fread(
"employees.txt",
sep = NULL,
header = FALSE,
col.names = "raw"
)
# Extract columns using substring
dt[, `:=`(
name = trimws(substr(raw, 1, 14)),
id = as.integer(trimws(substr(raw, 15, 21))),
birth_date = as.Date(trimws(substr(raw, 22, 31))),
city = trimws(substr(raw, 32, 45))
)]
dt[, raw := NULL] # Remove raw column
This approach reads data once and processes in-memory, significantly faster than read.fwf() for large files.
Validating Parsed Data
After parsing, validate that column boundaries are correct:
# Check for unexpected patterns
summary(nchar(data$name)) # Should be consistent or vary logically
summary(nchar(data$id)) # Should match expected length
# Check for data bleeding between columns
# Look for numbers in text fields or text in numeric fields
data[grepl("\\d{5}", data$city), ] # IDs bleeding into city field?
Verify row counts match expectations:
# Count lines in original file
line_count <- length(readLines("employees.txt"))
# Compare with parsed rows
nrow(data)
# Account for headers/footers if present
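These checks can be bundled into hard assertions so that a misaligned specification fails loudly rather than silently; a sketch assuming a single header line and the field widths used throughout:

```r
line_count <- length(readLines("employees.txt"))
n_header <- 1  # adjust for your file's headers/footers

stopifnot(
  nrow(data) == line_count - n_header,       # no dropped or merged rows
  !any(grepl("\\d{5}", data$city)),          # no IDs bleeding into city
  all(nchar(as.character(data$id)) <= 7)     # IDs fit their 7-char field
)
```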
Writing Fixed-Width Files
Create fixed-width output using sprintf():
# Format each column to exact width
formatted <- data.frame(
name = sprintf("%-14s", substr(data$name, 1, 14)),
id = sprintf("%07d", data$id),
birth_date = sprintf("%-10s", as.character(data$birth_date)),
city = sprintf("%-14s", substr(data$city, 1, 14))
)
# Combine and write
output <- do.call(paste0, formatted)
writeLines(output, "output.txt")
The %-14s format left-aligns strings in 14 characters, while %07d right-aligns integers with zero-padding.
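A quick round-trip check, re-reading the file just written with the same widths, helps confirm the output layout matches the specification:

```r
# Re-read the written file with the same widths used to format it
check <- read.fwf(
  "output.txt",
  widths = c(14, 7, 10, 14),
  col.names = c("name", "id", "birth_date", "city"),
  stringsAsFactors = FALSE,
  strip.white = TRUE
)
# Values should survive the write/read cycle unchanged
stopifnot(all(check$name == trimws(data$name)))
```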
Performance Comparison
Benchmark different approaches on a 100MB file with 1 million records:
library(microbenchmark)
microbenchmark(
base_r = read.fwf("large.txt", widths = c(14, 7, 10, 14)),
readr = read_fwf("large.txt", fwf_widths(c(14, 7, 10, 14))),
data_table = {
dt <- fread("large.txt", sep = NULL, header = FALSE)
dt[, .(
name = substr(V1, 1, 14),
id = substr(V1, 15, 21)
)]
},
times = 5
)
Typical results show data.table 10-50x faster than read.fwf(), with readr falling between them.
Common Pitfalls
Incorrect width calculations: Off-by-one errors cause columns to shift. Always verify with sample data before processing large files.
Character encoding issues: Legacy systems often use non-UTF-8 encodings. Specify encoding explicitly:
data <- read_fwf("employees.txt",
fwf_widths(c(14, 7, 10, 14)),
locale = locale(encoding = "latin1"))
Missing value handling: Fixed-width files may use spaces or special codes for missing data. Define explicitly:
data <- read_fwf("employees.txt",
fwf_widths(c(14, 7, 10, 14)),
na = c("", "NA", "9999", " "))
Fixed-width files remain prevalent in enterprise environments. Understanding column specifications, choosing appropriate parsing tools, and validating results ensures reliable data extraction from these legacy formats.