R - cut() - Bin Continuous Data
Key Insights
- The cut() function transforms continuous numeric data into discrete intervals (bins), essential for data analysis, visualization, and statistical modeling when you need to group values into meaningful ranges.
- Control bin boundaries with the breaks parameter using either the number of intervals or explicit cutpoints, while the labels, include.lowest, and right parameters fine-tune categorization behavior.
- Combine cut() with table(), aggregate(), and ggplot2 for powerful data summarization and visualization workflows that reveal patterns obscured in raw continuous data.
Basic Syntax and Default Behavior
The cut() function divides a numeric vector into intervals and returns a factor representing which interval each value falls into. The basic syntax requires two arguments: the data vector and the breaks specification.
# Generate sample data
ages <- c(18, 25, 32, 45, 52, 61, 28, 73, 19, 38, 55, 67)
# Create 4 equal-width intervals
age_groups <- cut(ages, breaks = 4)
print(age_groups)
[1] (17.9,31.8] (17.9,31.8] (31.8,45.5] (31.8,45.5] (45.5,59.2]
[6] (59.2,73.1] (17.9,31.8] (59.2,73.1] (17.9,31.8] (31.8,45.5]
[11] (45.5,59.2] (59.2,73.1]
Levels: (17.9,31.8] (31.8,45.5] (45.5,59.2] (59.2,73.1]
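By default, intervals are left-open and right-closed: (a,b] means values greater than a and less than or equal to b. Given a single number for breaks, the function splits the range of the data into that many equal-width bins and moves the two outer boundaries outward slightly (by 0.1% of the range). That is why the first interval starts at 17.9 rather than 18, and why the minimum value still lands in a bin.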
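To see where a boundary such as 17.9 comes from, the bin edges can be rebuilt by hand. This is a minimal sketch that assumes the behaviour of recent R versions (outer edges moved outward by 0.1% of the data range); the names edges, pad, manual, and auto are illustrative only.

```r
ages <- c(18, 25, 32, 45, 52, 61, 28, 73, 19, 38, 55, 67)

# Equal-width edges across the observed range (4 bins need 5 edges)
rng <- range(ages)
edges <- seq(rng[1], rng[2], length.out = 5)

# Assumption: cut() moves the two outer edges outward by 0.1% of the range
pad <- diff(rng) / 1000
edges[1] <- edges[1] - pad
edges[5] <- edges[5] + pad

manual <- cut(ages, breaks = edges)
auto <- cut(ages, breaks = 4)

# Compare bin membership (integer codes) rather than label text
identical(as.integer(manual), as.integer(auto))
```

Comparing integer codes rather than labels keeps the check independent of how the interval labels are rounded for display.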
Explicit Break Points
Specify exact boundaries for more control over categorization. This approach works better when you have domain knowledge about meaningful thresholds.
# Define custom age brackets
ages <- c(18, 25, 32, 45, 52, 61, 28, 73, 19, 38, 55, 67)
age_categories <- cut(ages,
                      breaks = c(0, 30, 50, 70, 100),
                      labels = c("Young Adult", "Middle Age", "Senior", "Elderly"))
# Create a frequency table
table(age_categories)
Young Adult  Middle Age      Senior     Elderly
          4           3           4           1
When providing explicit breaks, ensure the range covers all your data. Values outside the break range become NA, as does a value exactly equal to the lowest break under the default right = TRUE.
# Demonstrate NA handling with insufficient range
incomplete_cut <- cut(ages, breaks = c(20, 40, 60))
sum(is.na(incomplete_cut)) # Returns 5 (values 18, 19, 61, 67, 73)
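One way to avoid these NAs, sketched here under the assumption that open-ended outer bins are acceptable for the analysis, is to bracket the breaks with -Inf and Inf. The name full_cut is illustrative.

```r
ages <- c(18, 25, 32, 45, 52, 61, 28, 73, 19, 38, 55, 67)

# Open-ended outer bins catch everything at or below 20 and above 60
full_cut <- cut(ages, breaks = c(-Inf, 20, 40, 60, Inf))

# Count values that fell outside every bin
sum(is.na(full_cut))

# Frequency per bin
table(full_cut)
```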
Controlling Interval Boundaries
The right and include.lowest parameters control which endpoint each interval includes, critical when values fall exactly on boundaries.
scores <- c(0, 25, 50, 75, 100)
# Right-closed intervals (default)
cut(scores, breaks = c(0, 50, 100), right = TRUE)
# <NA> (0,50] (0,50] (50,100] (50,100]
# Left-closed intervals
cut(scores, breaks = c(0, 50, 100), right = FALSE)
# [0,50) [0,50) [50,100) [50,100) <NA>
# Include the lowest value in the first interval
cut(scores, breaks = c(0, 50, 100), right = TRUE, include.lowest = TRUE)
# [0,50] [0,50] [0,50] (50,100] (50,100]
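With the default right = TRUE, include.lowest = TRUE closes the leftmost interval, so a value equal to the lowest break is kept instead of becoming NA. With right = FALSE it closes the rightmost interval instead, keeping a value equal to the highest break.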
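The left-closed case can be checked the same way. This short sketch reuses the scores vector from above; left_closed is an illustrative name.

```r
scores <- c(0, 25, 50, 75, 100)

# Left-closed intervals, with the highest break included in the last bin
left_closed <- cut(scores, breaks = c(0, 50, 100),
                   right = FALSE, include.lowest = TRUE)

levels(left_closed)
# "[0,50)"   "[50,100]"

as.character(left_closed)
# "[0,50)"   "[0,50)"   "[50,100]" "[50,100]" "[50,100]"
```

Despite its name, include.lowest here rescues the value 100, not the value 0.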
Custom Labels and Ordered Factors
Replace default interval notation with meaningful labels for cleaner output and better interpretability.
income <- c(25000, 45000, 62000, 38000, 95000, 125000, 31000, 78000)
income_brackets <- cut(income,
                       breaks = c(0, 40000, 75000, 100000, Inf),
                       labels = c("Low", "Medium", "High", "Very High"),
                       ordered_result = TRUE)
print(income_brackets)
[1] Low Medium Medium Low High Very High Low
[8] High
Levels: Low < Medium < High < Very High
Setting ordered_result = TRUE creates an ordered factor, useful for statistical models that recognize ordinal relationships.
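An ordered factor also supports comparisons between levels, which an unordered factor does not. A minimal sketch using the same income data; high_or_above is an illustrative name.

```r
income <- c(25000, 45000, 62000, 38000, 95000, 125000, 31000, 78000)
income_brackets <- cut(income,
                       breaks = c(0, 40000, 75000, 100000, Inf),
                       labels = c("Low", "Medium", "High", "Very High"),
                       ordered_result = TRUE)

# Comparison operators respect the level order Low < Medium < High < Very High
high_or_above <- income_brackets >= "High"
income[high_or_above]
# 95000 125000 78000

# Summaries such as max() are also defined for ordered factors
max(income_brackets)
```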
Practical Application: Data Aggregation
Combine cut() with aggregation functions to summarize continuous data by bins.
# Create sample dataset
set.seed(42)
sales_data <- data.frame(
  revenue = runif(100, 1000, 50000),
  profit_margin = runif(100, 0.05, 0.35)
)
# Bin revenue into quartiles
sales_data$revenue_tier <- cut(sales_data$revenue,
                               breaks = quantile(sales_data$revenue,
                                                 probs = c(0, 0.25, 0.5, 0.75, 1)),
                               labels = c("Q1", "Q2", "Q3", "Q4"),
                               include.lowest = TRUE)
# Calculate average profit margin by revenue tier
aggregate(profit_margin ~ revenue_tier, data = sales_data, FUN = mean)
revenue_tier profit_margin
1 Q1 0.1952847
2 Q2 0.2089431
3 Q3 0.1876542
4 Q4 0.2134891
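Before trusting per-bin averages it helps to confirm how many rows each bin holds. The sketch below rebuilds the same dataset, counts rows per tier with table(), and computes the same means with tapply() as an alternative to aggregate(). The names tier_counts and tier_means are illustrative.

```r
set.seed(42)
sales_data <- data.frame(
  revenue = runif(100, 1000, 50000),
  profit_margin = runif(100, 0.05, 0.35)
)
sales_data$revenue_tier <- cut(sales_data$revenue,
                               breaks = quantile(sales_data$revenue,
                                                 probs = c(0, 0.25, 0.5, 0.75, 1)),
                               labels = c("Q1", "Q2", "Q3", "Q4"),
                               include.lowest = TRUE)

# Rows per tier: quartile breaks on 100 distinct values give 25 each
tier_counts <- table(sales_data$revenue_tier)
tier_counts

# Mean profit margin per tier, returned as a named vector
tier_means <- tapply(sales_data$profit_margin, sales_data$revenue_tier, mean)
tier_means
```

tapply() returns a named vector, which is convenient for quick lookups; aggregate() returns a data frame, which is easier to merge or plot.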
Integration with ggplot2
Binned data integrates seamlessly with visualization workflows for histogram alternatives and faceted plots.
library(ggplot2)
# Generate sample data
set.seed(123)
temperature_data <- data.frame(
  temp_celsius = rnorm(500, mean = 22, sd = 5),
  humidity = rnorm(500, mean = 65, sd = 15)
)
# Create temperature bins
temperature_data$temp_category <- cut(temperature_data$temp_celsius,
                                      breaks = c(-Inf, 15, 20, 25, Inf),
                                      labels = c("Cold", "Cool", "Comfortable", "Warm"))
# Visualize humidity distribution by temperature category
ggplot(temperature_data, aes(x = temp_category, y = humidity, fill = temp_category)) +
  geom_boxplot() +
  labs(title = "Humidity Distribution by Temperature Category",
       x = "Temperature Category",
       y = "Humidity (%)") +
  theme_minimal() +
  theme(legend.position = "none")
Handling Edge Cases
Address common issues with missing values and open-ended (infinite) boundaries.
# Data with NA values
data_with_na <- c(10, 20, NA, 30, 40, NA, 50)
binned_na <- cut(data_with_na, breaks = 3)
# NAs are preserved in output
# Using Inf for open-ended intervals
income_open <- c(15000, 45000, 85000, 150000, 250000)
income_bins <- cut(income_open,
                   breaks = c(0, 50000, 100000, Inf),
                   labels = c("Low", "Medium", "High"))
# Verify bin assignments
data.frame(income = income_open, bracket = income_bins)
income bracket
1 15000 Low
2 45000 Low
3 85000 Medium
4 150000 High
5 250000 High
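Although cut() preserves NAs, table() drops them by default, so missing values can disappear silently from a frequency summary. A minimal sketch of counting them explicitly; na_counts is an illustrative name.

```r
data_with_na <- c(10, 20, NA, 30, 40, NA, 50)
binned_na <- cut(data_with_na, breaks = 3)

# Default table() silently omits the two NA entries
table(binned_na)

# useNA = "ifany" adds an NA column when missing values are present
na_counts <- table(binned_na, useNA = "ifany")
na_counts
```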
Performance Considerations with Large Datasets
For large datasets, cut() performs efficiently, but consider findInterval() for repeated binning with the same breaks when integer bin indices are enough and you do not need a factor.
# Benchmark comparison
large_data <- runif(1e6, 0, 100)
breaks <- c(0, 25, 50, 75, 100)
# Using cut()
system.time({
result_cut <- cut(large_data, breaks = breaks)
})
# Using findInterval() for numeric output
system.time({
result_interval <- findInterval(large_data, breaks)
})
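# findInterval() is typically faster but returns integer bin indices, not a factor
# Note: by default it uses left-closed intervals [a,b), unlike cut()'s default (a,b]
# Use cut() when you need factor levels for modeling or visualization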
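Because the two functions use different interval conventions by default, their results only line up when findInterval() is told to use left-open intervals. The sketch below assumes R 3.3.0 or later (for the left.open argument) and data strictly inside the break range; idx_cut, idx_fi, and labelled are illustrative names.

```r
set.seed(1)
large_data <- runif(1e5, 0, 100)
breaks <- c(0, 25, 50, 75, 100)

# Integer bin codes from cut(), using its default (a,b] intervals
idx_cut <- as.integer(cut(large_data, breaks = breaks))

# left.open = TRUE makes findInterval() use (a,b] intervals as well
idx_fi <- findInterval(large_data, breaks, left.open = TRUE)

identical(idx_cut, idx_fi)

# If a factor is needed afterwards, attach labels to the integer codes
labelled <- factor(idx_fi, levels = 1:4,
                   labels = c("(0,25]", "(25,50]", "(50,75]", "(75,100]"))
```

Values outside the break range behave differently: cut() returns NA, while findInterval() returns 0 or the number of breaks, so the equivalence holds only for in-range data.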
Creating Equal-Frequency Bins
Use quantiles to create bins with approximately equal numbers of observations rather than equal widths.
# Generate skewed data
skewed_data <- rexp(1000, rate = 0.1)
# Equal-width bins (poor distribution)
equal_width <- cut(skewed_data, breaks = 5)
table(equal_width)
# Equal-frequency bins using quantiles
quantile_breaks <- quantile(skewed_data, probs = seq(0, 1, 0.2))
equal_freq <- cut(skewed_data,
                  breaks = quantile_breaks,
                  include.lowest = TRUE,
                  labels = c("Q1", "Q2", "Q3", "Q4", "Q5"))
table(equal_freq)
This approach gives roughly balanced sample sizes across bins, important for statistical analyses requiring similar group sizes. Heavily tied data can produce duplicate quantiles, which cut() rejects as non-unique breaks.
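Quantile breaks can fail when many observations share the same value. The following sketch uses a made-up vector with 60 zeros to show the failure and one possible workaround, collapsing duplicate breaks with unique(), which yields fewer and less balanced bins. The names x, qs, failed, and binned are illustrative.

```r
# Hypothetical zero-inflated data: 60 zeros followed by 1..40
x <- c(rep(0, 60), 1:40)
qs <- quantile(x, probs = seq(0, 1, 0.2))

# Several quantiles coincide at 0
anyDuplicated(qs) > 0

# cut() refuses duplicate breaks; capture the error instead of stopping
failed <- inherits(try(cut(x, breaks = qs), silent = TRUE), "try-error")
failed

# Workaround: drop duplicate breaks (no custom labels, since the bin count changes)
binned <- cut(x, breaks = unique(qs), include.lowest = TRUE)
table(binned)
```

Custom labels are left out here because the number of bins is no longer known in advance once duplicates are removed.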