R - Factors with Examples | Application Architect

Key Insights

Factors are R’s specialized data type for categorical data, storing values as integers with labeled levels for memory efficiency and statistical modeling
Understanding factor ordering (nominal vs ordinal) and level manipulation is critical for correct analysis, visualization, and modeling outcomes
Common pitfalls include unintended factor conversion during data import and surprising behavior when converting factors to numeric values

Understanding Factors in R

Factors represent categorical variables in R, internally stored as integer vectors with associated character labels called levels. This dual nature makes factors memory-efficient while maintaining human-readable labels. R uses factors extensively in statistical modeling, plotting, and data manipulation.

# Creating a basic factor
colors <- factor(c("red", "blue", "red", "green", "blue", "red"))
print(colors)
# [1] red   blue  red   green blue  red
# Levels: blue green red

# Check internal structure
str(colors)
# Factor w/ 3 levels "blue","green",..: 3 1 3 2 1 3

# View underlying integers
as.integer(colors)
# [1] 3 1 3 2 1 3

By default, the levels are the sorted unique values, so the integer codes follow alphabetical order: “blue” = 1, “green” = 2, “red” = 3.
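To make the integer-plus-labels structure concrete, the following minimal sketch rebuilds the original values from the integer codes. It uses only base R; the variable names `codes`, `labels`, and `rebuilt` are illustrative.

```r
# A factor is an integer vector plus a "levels" attribute
colors <- factor(c("red", "blue", "red", "green", "blue", "red"))

codes  <- as.integer(colors)   # integer codes: 3 1 3 2 1 3
labels <- levels(colors)       # sorted unique labels: "blue" "green" "red"

# Indexing the labels by the codes recovers the original values
rebuilt <- labels[codes]
identical(rebuilt, as.character(colors))
# [1] TRUE

# Number of levels, and counts per level (reported in level order)
nlevels(colors)
# [1] 3
table(colors)
```

This is also why `as.character()` on a factor is cheap to reason about: it is a lookup of each code in the levels vector.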

Creating Factors with Explicit Levels

Controlling level order is essential for analysis and visualization. Specify levels explicitly to override alphabetical ordering.

# Default alphabetical ordering
sizes_default <- factor(c("medium", "small", "large", "medium"))
levels(sizes_default)
# [1] "large"  "medium" "small"

# Explicit logical ordering
sizes_ordered <- factor(c("medium", "small", "large", "medium"),
                       levels = c("small", "medium", "large"))
levels(sizes_ordered)
# [1] "small"  "medium" "large"

# Ordered factors for ordinal data
sizes_ordinal <- factor(c("medium", "small", "large", "medium"),
                       levels = c("small", "medium", "large"),
                       ordered = TRUE)
print(sizes_ordinal)
# [1] medium small  large  medium
# Levels: small < medium < large

# Comparison works with ordered factors
sizes_ordinal[1] > sizes_ordinal[2]
# [1] TRUE
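The difference between nominal and ordinal factors shows up as soon as values are compared. The sketch below, using illustrative names `nominal` and `ordinal`, contrasts the two; it assumes base R only.

```r
sizes <- c("medium", "small", "large", "medium")
size_levels <- c("small", "medium", "large")

nominal <- factor(sizes, levels = size_levels)
ordinal <- factor(sizes, levels = size_levels, ordered = TRUE)

# Relational operators are not meaningful for unordered factors:
# the result is NA, and R issues a warning (suppressed here)
suppressWarnings(nominal[1] > nominal[2])
# [1] NA

# Ordered factors support comparisons against a level name
ordinal >= "medium"
# [1]  TRUE FALSE  TRUE  TRUE

# Range functions such as max() also work on ordered factors
max(ordinal)
is.ordered(ordinal)
# [1] TRUE
```

Setting explicit `levels` on a nominal factor only changes display and reference order; `ordered = TRUE` is what enables comparisons.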

Factor Levels Manipulation

Managing levels is crucial when combining datasets, creating visualizations, or preparing data for modeling.

# Renaming levels
status <- factor(c("A", "B", "A", "C", "B"))
levels(status) <- c("Active", "Pending", "Closed")
print(status)
# [1] Active  Pending Active  Closed  Pending
# Levels: Active Pending Closed

# Adding levels without adding data
responses <- factor(c("yes", "no", "yes"), 
                   levels = c("yes", "no", "maybe"))
levels(responses)
# [1] "yes"   "no"    "maybe"

# Dropping unused levels
subset_responses <- responses[responses != "maybe"]
print(subset_responses)
# [1] yes no  yes
# Levels: yes no maybe

subset_responses <- droplevels(subset_responses)
levels(subset_responses)
# [1] "yes" "no"

# Reordering levels by frequency
library(forcats)
colors <- factor(c("red", "blue", "red", "green", "blue", "red", "red"))
colors_freq <- fct_infreq(colors)
levels(colors_freq)
# [1] "red"   "blue"  "green"
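Combining factors from different datasets is a case where level management matters, because the two factors may not share the same level set. One possible base R approach, sketched here with hypothetical factors `a` and `b`, is to build the union of levels explicitly rather than rely on `c()`, whose behavior for factors changed in R 4.1.0.

```r
a <- factor(c("yes", "no"))
b <- factor(c("maybe", "yes"))

# Build a shared level set, then combine the underlying labels
all_levels <- union(levels(a), levels(b))
combined <- factor(c(as.character(a), as.character(b)), levels = all_levels)

levels(combined)
# [1] "no"    "yes"   "maybe"
table(combined)
```

Going through `as.character()` avoids mixing up integer codes that mean different things in each factor.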

Converting Between Data Types

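Converting a factor directly to numeric returns its internal integer codes rather than the values shown in its labels, which silently corrupts the data. Convert through the labels instead.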

# Common mistake: converting factor to numeric directly
ratings <- factor(c("5", "3", "4", "5", "3"))
as.numeric(ratings)  # WRONG - returns level indices (levels are "3", "4", "5")
# [1] 3 1 2 3 1

# Correct conversion: factor to character to numeric
as.numeric(as.character(ratings))  # CORRECT
# [1] 5 3 4 5 3

# Alternative using levels indexing
as.numeric(levels(ratings)[ratings])
# [1] 5 3 4 5 3

# Character to factor
char_vec <- c("high", "low", "medium", "high")
factor_vec <- as.factor(char_vec)

# Numeric to factor with labels
ages <- c(1, 2, 3, 1, 2)
age_groups <- factor(ages, 
                    levels = c(1, 2, 3),
                    labels = c("Young", "Middle", "Senior"))
print(age_groups)
# [1] Young  Middle Senior Young  Middle
# Levels: Young Middle Senior
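The factor-to-numeric pattern above can be wrapped in a small helper that also fails loudly when a label is not a number. `factor_to_numeric` is a hypothetical name introduced for illustration, not a base R function.

```r
# Hypothetical helper: convert a factor with numeric-looking labels to numbers
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  # Convert each level once, then index by the integer codes
  level_values <- suppressWarnings(as.numeric(levels(f)))
  bad <- levels(f)[is.na(level_values)]
  if (length(bad) > 0) {
    stop("Non-numeric levels: ", paste(bad, collapse = ", "))
  }
  level_values[as.integer(f)]
}

ratings <- factor(c("5", "3", "4", "5", "3"))
factor_to_numeric(ratings)
# [1] 5 3 4 5 3
```

Converting the levels rather than every element does the string parsing once per distinct value, which is the same idea as the `levels(ratings)[ratings]` form.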

Factors in Data Frames

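Before R 4.0.0, base functions such as read.csv() and data.frame() converted character columns to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, and tidyverse readers such as readr::read_csv() keep strings as characters.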

# Base R read.csv creates factors by default (R < 4.0)
# df <- read.csv("data.csv")  # strings become factors

# Prevent automatic conversion
df <- read.csv("data.csv", stringsAsFactors = FALSE)

# Modern approach with readr
library(readr)
df <- read_csv("data.csv")  # strings remain characters

# Converting specific columns to factors
df$category <- as.factor(df$category)

# Using dplyr for multiple columns
library(dplyr)
df <- df %>%
  mutate(across(c(category, status), as.factor))

# Example data frame manipulation
sales <- data.frame(
  region = factor(c("North", "South", "North", "West")),
  product = factor(c("A", "B", "A", "A")),
  revenue = c(1000, 1500, 1200, 900)
)

# Grouping operations respect factor levels (dplyr is already loaded above)
sales %>%
  group_by(region) %>%
  summarize(total = sum(revenue))
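The import behavior described at the start of this section can be checked without a file on disk. This sketch reads a small inline CSV through the `text` argument of `read.csv()`; the data and the names `as_chr` and `as_fct` are made up for illustration.

```r
csv_text <- "region,revenue
North,1000
South,1500
North,1200"

# Passing stringsAsFactors explicitly gives the same result on any R version
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$region)
# [1] "character"
class(as_fct$region)
# [1] "factor"
levels(as_fct$region)
# [1] "North" "South"
```

Stating the argument explicitly is a reasonable habit for scripts that may run on both older and newer R installations.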

Factors in Statistical Modeling

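Modeling functions such as lm() expand each factor into dummy (contrast) columns. Under the default treatment contrasts, the first level serves as the reference category, so level order affects how coefficients are interpreted.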

# Linear regression with factors
data <- data.frame(
  treatment = factor(c("A", "B", "C", "A", "B", "C")),
  response = c(5, 7, 9, 6, 8, 10)
)

model <- lm(response ~ treatment, data = data)
summary(model)

# Releveling to change reference category
data$treatment <- relevel(data$treatment, ref = "C")
model_releveled <- lm(response ~ treatment, data = data)

# Contrast coding
contrasts(data$treatment)

# Custom contrasts
contrasts(data$treatment) <- contr.sum(3)

# Two-way ANOVA with two factors and their interaction
set.seed(123)
experiment <- data.frame(
  temperature = factor(rep(c("Low", "Medium", "High"), each = 10)),
  catalyst = factor(rep(c("X", "Y"), times = 15)),
  yield = rnorm(30, mean = 50, sd = 5)
)

anova_model <- aov(yield ~ temperature * catalyst, data = experiment)
summary(anova_model)
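To see how a factor is encoded before it reaches the model, `model.matrix()` can be inspected directly. The sketch below assumes R's default contrast settings (treatment contrasts for unordered factors); `treatment_c` is an illustrative name.

```r
treatment <- factor(c("A", "B", "C", "A", "B", "C"))

# Default treatment contrasts: the first level ("A") is the reference
mm <- model.matrix(~ treatment)
colnames(mm)
# [1] "(Intercept)" "treatmentB"  "treatmentC"

# After releveling, "C" becomes the reference and gets no column
treatment_c <- relevel(treatment, ref = "C")
colnames(model.matrix(~ treatment_c))
# [1] "(Intercept)"  "treatment_cA" "treatment_cB"
```

Each non-reference level gets its own 0/1 column, and its coefficient is read as a difference from the reference level.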

Visualization with Factors

In ggplot2, the order of factor levels determines the order of categories on discrete axes and in legends.

library(ggplot2)

# Data with unordered factors
survey <- data.frame(
  satisfaction = factor(c("Low", "High", "Medium", "Low", "High", "Medium")),
  count = c(10, 25, 15, 12, 30, 18)
)

# Plot with default alphabetical ordering
ggplot(survey, aes(x = satisfaction, y = count)) +
  geom_col()

# Plot with logical ordering
survey$satisfaction <- factor(survey$satisfaction,
                             levels = c("Low", "Medium", "High"))

ggplot(survey, aes(x = satisfaction, y = count)) +
  geom_col()

# Reorder by another variable using forcats (by default, the median of count within each level)
library(forcats)
survey$satisfaction <- fct_reorder(survey$satisfaction, survey$count)

ggplot(survey, aes(x = satisfaction, y = count)) +
  geom_col() +
  coord_flip()
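The level order a plot will use can be checked without drawing anything. As a rough base R counterpart to the forcats call above, this sketch uses `reorder()` with `sum` as the summary function (an assumption made here for illustration; `fct_reorder()` defaults to the median).

```r
survey <- data.frame(
  satisfaction = factor(c("Low", "High", "Medium", "Low", "High", "Medium")),
  count = c(10, 25, 15, 12, 30, 18)
)

# Totals per category are reported in level order (alphabetical here)
totals <- tapply(survey$count, survey$satisfaction, sum)
names(totals)
# [1] "High"   "Low"    "Medium"

# reorder() sorts the levels by a summary of another variable, ascending
by_total <- reorder(survey$satisfaction, survey$count, FUN = sum)
levels(by_total)
# [1] "Low"    "Medium" "High"
```

Inspecting `levels()` before plotting is a quick way to confirm the axis order that a discrete scale will show.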

Advanced Factor Operations with forcats

The forcats package provides powerful tools for factor manipulation.

library(forcats)

# Combine factor levels
fruits <- factor(c("apple", "banana", "orange", "apple", "grape"))
fct_collapse(fruits,
            citrus = "orange",
            other = c("apple", "banana", "grape"))

# Lump infrequent levels
responses <- factor(c(rep("A", 50), rep("B", 30), rep("C", 5), rep("D", 3)))
fct_lump_min(responses, min = 10)  # Keep levels with >= 10 occurrences

# Recode levels
sizes <- factor(c("S", "M", "L", "XL"))
fct_recode(sizes,
          Small = "S",
          Medium = "M",
          Large = "L",
          "Extra Large" = "XL")

# Reverse level order
fct_rev(sizes)

# Explicit level for missing data
data_with_na <- factor(c("A", "B", NA, "A"))
# fct_explicit_na() is deprecated as of forcats 1.0.0; use fct_na_value_to_level()
fct_na_value_to_level(data_with_na, level = "Missing")
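For environments where forcats is not available, lumping infrequent levels can be approximated in base R. This is a sketch of one way to do it, not how forcats implements it; the threshold of 10 and the label "Other" are choices made for the example.

```r
responses <- factor(c(rep("A", 50), rep("B", 30), rep("C", 5), rep("D", 3)))

# Find levels that occur fewer than 10 times
counts <- table(responses)
rare <- names(counts)[counts < 10]

# Assigning duplicate names to levels() merges those levels into one
new_levels <- levels(responses)
new_levels[new_levels %in% rare] <- "Other"
lumped <- responses
levels(lumped) <- new_levels

table(lumped)
```

Here "C" and "D" are merged into a single "Other" level with 8 observations, while "A" and "B" are left unchanged.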

Factors are fundamental to R’s statistical capabilities. Master their creation, manipulation, and conversion to avoid subtle bugs and leverage their power in modeling and visualization workflows.