R - Factors with Examples
Key Insights
- Factors are R’s specialized data type for categorical data, storing values as integers with labeled levels for memory efficiency and statistical modeling
- Understanding factor ordering (nominal vs ordinal) and level manipulation is critical for correct analysis, visualization, and modeling outcomes
- Common pitfalls include unintended factor conversion during data import and surprising behavior when converting factors to numeric values
Understanding Factors in R
Factors represent categorical variables in R, internally stored as integer vectors with associated character labels called levels. This dual nature makes factors memory-efficient while maintaining human-readable labels. R uses factors extensively in statistical modeling, plotting, and data manipulation.
# Creating a basic factor
colors <- factor(c("red", "blue", "red", "green", "blue", "red"))
print(colors)
# [1] red blue red green blue red
# Levels: blue green red
# Check internal structure
str(colors)
# Factor w/ 3 levels "blue","green",..: 3 1 3 2 1 3
# View underlying integers
as.integer(colors)
# [1] 3 1 3 2 1 3
By default, the levels are the sorted unique values of the input, and each integer code is the position of a value's level in that list. Here the sort is alphabetical: “blue” = 1, “green” = 2, “red” = 3.
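The two-part structure can be inspected directly. The sketch below rebuilds the character values from the integer codes and the levels. The size comparison at the end is illustrative only, since exact byte counts depend on the platform and R version.

```r
colors <- factor(c("red", "blue", "red", "green", "blue", "red"))

# A factor is an integer vector carrying a class and a levels attribute
typeof(colors)      # "integer"
class(colors)       # "factor"
attributes(colors)  # $levels and $class

# Indexing the levels by the integer codes recovers the original strings
codes <- as.integer(colors)
rebuilt <- levels(colors)[codes]
identical(rebuilt, as.character(colors))  # TRUE

# Rough size comparison on a longer vector (numbers vary by platform)
big_chr <- sample(c("red", "blue", "green"), 1e5, replace = TRUE)
big_fct <- factor(big_chr)
object.size(big_chr)
object.size(big_fct)
```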
Creating Factors with Explicit Levels
Controlling level order is essential for analysis and visualization. Specify levels explicitly to override alphabetical ordering.
# Default alphabetical ordering
sizes_default <- factor(c("medium", "small", "large", "medium"))
levels(sizes_default)
# [1] "large" "medium" "small"
# Explicit logical ordering
sizes_ordered <- factor(c("medium", "small", "large", "medium"),
                        levels = c("small", "medium", "large"))
levels(sizes_ordered)
# [1] "small" "medium" "large"
# Ordered factors for ordinal data
sizes_ordinal <- factor(c("medium", "small", "large", "medium"),
                        levels = c("small", "medium", "large"),
                        ordered = TRUE)
print(sizes_ordinal)
# [1] medium small large medium
# Levels: small < medium < large
# Comparison works with ordered factors
sizes_ordinal[1] > sizes_ordinal[2]
# [1] TRUE
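Ordering affects more than pairwise comparisons. The sketch below contrasts an ordered factor with an unordered one. The variable names are illustrative, and the warning from the unordered comparison is suppressed only to keep the output short.

```r
sizes <- factor(c("medium", "small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

# Compare every element against a level name
at_least_medium <- sizes >= "medium"
at_least_medium  # TRUE FALSE TRUE TRUE

# Summaries that rely on ordering work on ordered factors
largest <- max(sizes)
as.character(largest)  # "large"
sorted <- sort(sizes)  # small, medium, medium, large

# The same kind of comparison on an unordered factor yields NA (with a warning)
plain <- factor(c("medium", "small"))
plain_cmp <- suppressWarnings(plain > "small")
plain_cmp  # NA NA
```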
Factor Levels Manipulation
Managing levels is crucial when combining datasets, creating visualizations, or preparing data for modeling.
# Renaming levels
status <- factor(c("A", "B", "A", "C", "B"))
levels(status) <- c("Active", "Pending", "Closed")
print(status)
# [1] Active Pending Active Closed Pending
# Levels: Active Pending Closed
# Adding levels without adding data
responses <- factor(c("yes", "no", "yes"),
                    levels = c("yes", "no", "maybe"))
levels(responses)
# [1] "yes" "no" "maybe"
# Dropping unused levels
subset_responses <- responses[responses != "maybe"]
print(subset_responses)
# [1] yes no yes
# Levels: yes no maybe
subset_responses <- droplevels(subset_responses)
levels(subset_responses)
# [1] "yes" "no"
# Reordering levels by frequency
library(forcats)
colors <- factor(c("red", "blue", "red", "green", "blue", "red", "red"))
colors_freq <- fct_infreq(colors)
levels(colors_freq)
# [1] "red" "blue" "green"
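Combining factors from two datasets is where mismatched levels tend to surface. One conservative base-R approach, sketched here with made-up vectors `a` and `b`, is to build the union of the level sets and re-create the factor from the labels. The behavior of `c()` on factors changed in R 4.1.0, so going through characters avoids depending on the R version. `forcats::fct_c()` is an alternative.

```r
a <- factor(c("x", "y", "x"))
b <- factor(c("y", "z"))

# Build the combined level set first, then re-create the factor from labels
all_levels <- union(levels(a), levels(b))
combined <- factor(c(as.character(a), as.character(b)), levels = all_levels)

levels(combined)  # "x" "y" "z"
table(combined)   # counts per level: x = 2, y = 2, z = 1
```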
Converting Between Data Types
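Converting a factor needs care because R converts the underlying integer codes, not the labels, unless told otherwise. The result is silently wrong rather than an error.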
# Common mistake: converting factor to numeric directly
ratings <- factor(c("5", "3", "4", "5", "3"))
as.numeric(ratings) # WRONG - returns level indices (levels are "3", "4", "5")
# [1] 3 1 2 3 1
# Correct conversion: factor to character to numeric
as.numeric(as.character(ratings)) # CORRECT
# [1] 5 3 4 5 3
# Alternative using levels indexing
as.numeric(levels(ratings)[ratings])
# [1] 5 3 4 5 3
# Character to factor
char_vec <- c("high", "low", "medium", "high")
factor_vec <- as.factor(char_vec)
# Numeric to factor with labels
ages <- c(1, 2, 3, 1, 2)
age_groups <- factor(ages,
                     levels = c(1, 2, 3),
                     labels = c("Young", "Middle", "Senior"))
print(age_groups)
# [1] Young Middle Senior Young Middle
# Levels: Young Middle Senior
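The factor-to-numeric conversion can be wrapped in a small helper so the two-step idiom is not forgotten. `factor_to_numeric()` below is a hypothetical name, not a base R or forcats function. It converts each level once and then expands the result through the integer codes, which is the same idea as `as.numeric(levels(f))[f]`.

```r
# Hypothetical helper: convert a factor with numeric-looking labels to numeric
factor_to_numeric <- function(f) {
  stopifnot(is.factor(f))
  level_values <- suppressWarnings(as.numeric(levels(f)))
  bad <- levels(f)[is.na(level_values)]
  if (length(bad) > 0) {
    warning("Non-numeric levels become NA: ", paste(bad, collapse = ", "))
  }
  # Convert once per level, then expand through the integer codes
  level_values[as.integer(f)]
}

ratings <- factor(c("5", "3", "4", "5", "3"))
factor_to_numeric(ratings)  # 5 3 4 5 3
```

Converting per level rather than per element does less work on long vectors with few distinct values, and the warning makes non-numeric labels visible instead of letting them turn into `NA` unnoticed.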
Factors in Data Frames
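Whether base import functions turn strings into factors depends on the R version: before R 4.0.0, read.csv() and data.frame() did so by default, and from R 4.0.0 onward the default is stringsAsFactors = FALSE. readr's read_csv() keeps strings as characters.
# Before R 4.0.0, read.csv() converted strings to factors by default
# df <- read.csv("data.csv") # strings become factors only in R < 4.0.0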
# Prevent automatic conversion
df <- read.csv("data.csv", stringsAsFactors = FALSE)
# Modern approach with readr
library(readr)
df <- read_csv("data.csv") # strings remain characters
# Converting specific columns to factors
df$category <- as.factor(df$category)
# Using dplyr for multiple columns
library(dplyr)
df <- df %>%
  mutate(across(c(category, status), as.factor))
# Example data frame manipulation
sales <- data.frame(
  region = factor(c("North", "South", "North", "West")),
  product = factor(c("A", "B", "A", "A")),
  revenue = c(1000, 1500, 1200, 900)
)
# Grouping operations respect factor levels
library(dplyr)
sales %>%
  group_by(region) %>%
  summarize(total = sum(revenue))
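The import behavior can be checked without a file on disk by passing CSV text to `read.csv()`. The sketch below uses invented data and base R only. It also shows that filtering rows keeps the full level set, which is what it means for summaries to respect factor levels.

```r
csv_text <- "region,revenue\nNorth,1000\nSouth,1500\nNorth,1200\nWest,900"

# Being explicit gives the same result on any R version
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_chr$region)  # "character"
class(as_fct$region)  # "factor"

# Filtering rows keeps the full level set, so summaries still list "West"
no_west <- as_fct[as_fct$region != "West", ]
counts <- table(no_west$region)
counts  # North = 2, South = 1, West = 0

# droplevels() on the data frame removes the unused level
counts_dropped <- table(droplevels(no_west)$region)
names(counts_dropped)  # "North" "South"
```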
Factors in Statistical Modeling
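Factors determine how categorical predictors are encoded in regression models. By default, lm() applies treatment contrasts to unordered factors: the first level becomes the reference category, and each remaining level gets its own indicator column.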
# Linear regression with factors
data <- data.frame(
  treatment = factor(c("A", "B", "C", "A", "B", "C")),
  response = c(5, 7, 9, 6, 8, 10)
)
model <- lm(response ~ treatment, data = data)
summary(model)
# Releveling to change reference category
data$treatment <- relevel(data$treatment, ref = "C")
model_releveled <- lm(response ~ treatment, data = data)
# Contrast coding
contrasts(data$treatment)
# Custom contrasts
contrasts(data$treatment) <- contr.sum(3)
# Multiple factor levels in ANOVA
set.seed(123)
experiment <- data.frame(
  temperature = factor(rep(c("Low", "Medium", "High"), each = 10)),
  catalyst = factor(rep(c("X", "Y"), times = 15)),
  yield = rnorm(30, mean = 50, sd = 5)
)
anova_model <- aov(yield ~ temperature * catalyst, data = experiment)
summary(anova_model)
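To see the encoding a model actually receives, inspect the design matrix with `model.matrix()`. The sketch below reuses the small treatment data set and assumes the default contrast settings (`options("contrasts")` left unchanged).

```r
data <- data.frame(
  treatment = factor(c("A", "B", "C", "A", "B", "C")),
  response = c(5, 7, 9, 6, 8, 10)
)

# Design matrix under the default treatment contrasts: "A" is the reference
design <- model.matrix(~ treatment, data = data)
colnames(design)  # "(Intercept)" "treatmentB" "treatmentC"

# Coefficients are the reference mean and the differences from it
fit <- lm(response ~ treatment, data = data)
coef(fit)  # intercept 5.5 (mean of A), treatmentB 2, treatmentC 4

# After releveling, "C" becomes the reference
data$treatment <- relevel(data$treatment, ref = "C")
design_releveled <- model.matrix(~ treatment, data = data)
colnames(design_releveled)  # "(Intercept)" "treatmentA" "treatmentB"
```

Releveling changes which comparisons the coefficients express, not the fitted values.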
Visualization with Factors
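Factor level order determines the order in which ggplot2 draws categories on discrete axes and in legends, so reordering levels is how you reorder bars, boxes, and legend entries.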
library(ggplot2)
# Data with unordered factors
survey <- data.frame(
  satisfaction = factor(c("Low", "High", "Medium", "Low", "High", "Medium")),
  count = c(10, 25, 15, 12, 30, 18)
)
# Plot with default alphabetical ordering
ggplot(survey, aes(x = satisfaction, y = count)) +
  geom_col()
# Plot with logical ordering
survey$satisfaction <- factor(survey$satisfaction,
                              levels = c("Low", "Medium", "High"))
ggplot(survey, aes(x = satisfaction, y = count)) +
  geom_col()
# Reorder by another variable using forcats
library(forcats)
survey$satisfaction <- fct_reorder(survey$satisfaction, survey$count)
ggplot(survey, aes(x = satisfaction, y = count)) +
  geom_col() +
  coord_flip()
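The order a plot will use can be checked without drawing anything by looking at `levels()`. The sketch below uses base R only, with `reorder()` standing in for `fct_reorder()`. Note that `reorder()` summarizes with the mean by default while `fct_reorder()` uses the median. For this data both give the same order.

```r
survey <- data.frame(
  satisfaction = factor(c("Low", "High", "Medium", "Low", "High", "Medium")),
  count = c(10, 25, 15, 12, 30, 18)
)

# Default: alphabetical levels, so bars would appear as High, Low, Medium
levels(survey$satisfaction)

# Totals per category come back in level order, the same order a plot uses
totals_default <- tapply(survey$count, survey$satisfaction, sum)

# reorder() sorts levels by a summary of another variable (mean by default)
by_count <- reorder(survey$satisfaction, survey$count)
levels(by_count)  # "Low" "Medium" "High"
totals_sorted <- tapply(survey$count, by_count, sum)
```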
Advanced Factor Operations with forcats
The forcats package provides powerful tools for factor manipulation.
library(forcats)
# Combine factor levels
fruits <- factor(c("apple", "banana", "orange", "apple", "grape"))
fct_collapse(fruits,
             citrus = "orange",
             other = c("apple", "banana", "grape"))
# Lump infrequent levels
responses <- factor(c(rep("A", 50), rep("B", 30), rep("C", 5), rep("D", 3)))
fct_lump_min(responses, min = 10) # Keep levels with >= 10 occurrences
# Recode levels
sizes <- factor(c("S", "M", "L", "XL"))
fct_recode(sizes,
           Small = "S",
           Medium = "M",
           Large = "L",
           "Extra Large" = "XL")
# Reverse level order
fct_rev(sizes)
# Explicit level for missing data
data_with_na <- factor(c("A", "B", NA, "A"))
# forcats >= 1.0.0; the older fct_explicit_na(data_with_na, na_level = "Missing") is deprecated
fct_na_value_to_level(data_with_na, level = "Missing")
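Several of these operations have base-R counterparts, which can help when forcats is not available or when reading older code. The sketch below is one way to write three of them. It is not a description of how forcats implements them.

```r
fruits <- factor(c("apple", "banana", "orange", "apple", "grape"))

# Collapse levels: assigning a named list maps old levels onto new ones
collapsed <- fruits
levels(collapsed) <- list(citrus = "orange",
                          other = c("apple", "banana", "grape"))
levels(collapsed)  # "citrus" "other"

# Reverse level order
sizes <- factor(c("S", "M", "L", "XL"))
reversed <- factor(sizes, levels = rev(levels(sizes)))
levels(reversed)  # "XL" "S" "M" "L"

# Turn NA into an explicit level by going through character labels
with_na <- factor(c("A", "B", NA, "A"))
labels_chr <- as.character(with_na)
labels_chr[is.na(labels_chr)] <- "Missing"
explicit <- factor(labels_chr, levels = c(levels(with_na), "Missing"))
levels(explicit)  # "A" "B" "Missing"
```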
Factors are fundamental to R’s statistical capabilities. Master their creation, manipulation, and conversion to avoid subtle bugs and leverage their power in modeling and visualization workflows.