R - cut() - Bin Continuous Data

Key Insights

  • The cut() function transforms continuous numeric data into discrete intervals (bins), essential for data analysis, visualization, and statistical modeling when you need to group values into meaningful ranges.
  • Control bin boundaries with the breaks parameter, given either as a number of intervals or as explicit cutpoints; the labels, include.lowest, and right parameters fine-tune how values are categorized.
  • Combine cut() with table(), aggregate(), and ggplot2 for powerful data summarization and visualization workflows that reveal patterns obscured in raw continuous data.

Basic Syntax and Default Behavior

The cut() function divides a numeric vector into intervals and returns a factor representing which interval each value falls into. The basic syntax requires two arguments: the data vector and the breaks specification.

# Generate sample data
ages <- c(18, 25, 32, 45, 52, 61, 28, 73, 19, 38, 55, 67)

# Create 4 equal-width intervals
age_groups <- cut(ages, breaks = 4)
print(age_groups)
 [1] (17.9,31.8] (17.9,31.8] (31.8,45.5] (31.8,45.5] (45.5,59.2]
 [6] (59.2,73.1] (17.9,31.8] (59.2,73.1] (17.9,31.8] (31.8,45.5]
[11] (45.5,59.2] (59.2,73.1]
Levels: (17.9,31.8] (31.8,45.5] (45.5,59.2] (59.2,73.1]

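By default, intervals are left-open and right-closed: (a,b] means values greater than a and less than or equal to b. When breaks is a single number, cut() splits the range of the data into that many equal-width bins, then moves the outer limits out by 0.1% of the range so that the minimum and maximum both fall inside a bin. That is why the first bin starts at 17.9 rather than 18.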
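To make the equal-width calculation concrete, the sketch below rebuilds the boundaries by hand and checks that they assign every value to the same bin as breaks = 4. The names manual_breaks, pad, auto_bins, and manual_bins are illustrative, and the 0.1% padding follows the behavior described in ?cut.

```r
ages <- c(18, 25, 32, 45, 52, 61, 28, 73, 19, 38, 55, 67)

# Split the data range into 4 equal-width pieces (5 boundaries)
rng <- range(ages)
manual_breaks <- seq(rng[1], rng[2], length.out = 5)

# Move the two outer limits out by 0.1% of the range
pad <- diff(rng) / 1000
manual_breaks[1] <- manual_breaks[1] - pad
manual_breaks[5] <- manual_breaks[5] + pad

# Compare the bin index of each value under both approaches
auto_bins   <- cut(ages, breaks = 4)
manual_bins <- cut(ages, breaks = manual_breaks)
same_bins <- identical(as.integer(auto_bins), as.integer(manual_bins))
same_bins
```

Comparing integer codes rather than labels keeps the check independent of how the boundaries are rounded for display.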

Explicit Break Points

Specify exact boundaries for more control over categorization. This approach works better when you have domain knowledge about meaningful thresholds.

# Define custom age brackets
ages <- c(18, 25, 32, 45, 52, 61, 28, 73, 19, 38, 55, 67)
age_categories <- cut(ages, 
                      breaks = c(0, 30, 50, 70, 100),
                      labels = c("Young Adult", "Middle Age", "Senior", "Elderly"))

# Create a frequency table
table(age_categories)
Young Adult  Middle Age      Senior     Elderly 
          4           3           4           1 

When providing explicit breaks, ensure the range covers all your data. Values outside the break range become NA, and so does a value exactly equal to the lowest break unless you set include.lowest = TRUE.

# Demonstrate NA handling with insufficient range
incomplete_cut <- cut(ages, breaks = c(20, 40, 60))
sum(is.na(incomplete_cut))  # Returns 5 (values 18, 19, 61, 67, 73)
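One way to guard against silent NAs is to check which values fall outside the breaks before binning, or to extend the breaks with -Inf and Inf. The following is a minimal sketch; the names outside and open_ended are illustrative, and the test assumes the default right = TRUE.

```r
ages <- c(18, 25, 32, 45, 52, 61, 28, 73, 19, 38, 55, 67)
breaks <- c(20, 40, 60)

# With right-closed intervals, anything <= the lowest break
# or > the highest break has no bin
outside <- ages <= min(breaks) | ages > max(breaks)
ages[outside]

# Open-ended outer bins catch every finite value
open_ended <- cut(ages, breaks = c(-Inf, breaks, Inf))
sum(is.na(open_ended))
```

Whether out-of-range values should become NA or be absorbed into open-ended bins depends on the analysis; the point is to make that choice deliberately.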

Controlling Interval Boundaries

The right and include.lowest parameters control which endpoint each interval includes, critical when values fall exactly on boundaries.

scores <- c(0, 25, 50, 75, 100)

# Right-closed intervals (default)
cut(scores, breaks = c(0, 50, 100), right = TRUE)
# <NA> (0,50] (0,50] (50,100] (50,100]   (0 is not in (0,50], so it becomes NA)

# Left-closed intervals
cut(scores, breaks = c(0, 50, 100), right = FALSE)
# [0,50) [0,50) [50,100) [50,100) <NA>   (100 is not in [50,100), so it becomes NA)

# Include the lowest value in the first interval
cut(scores, breaks = c(0, 50, 100), right = TRUE, include.lowest = TRUE)
# [0,50] [0,50] [0,50] (50,100] (50,100]

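Setting include.lowest = TRUE closes the open end of the outermost interval. With right = TRUE the first interval becomes [a,b], so a value equal to the lowest break is no longer turned into NA; with right = FALSE it is the last interval that becomes closed, so a value equal to the highest break is included instead.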
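Despite its name, include.lowest also applies to left-closed intervals, where it closes the top of the last bin. A short sketch using the same scores vector (the name left_closed is illustrative):

```r
scores <- c(0, 25, 50, 75, 100)

# Left-closed intervals, with the highest break included in the last bin
left_closed <- cut(scores, breaks = c(0, 50, 100),
                   right = FALSE, include.lowest = TRUE)

levels(left_closed)   # "[0,50)"   "[50,100]"
anyNA(left_closed)    # FALSE: both 0 and 100 now have a bin
```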

Custom Labels and Ordered Factors

Replace default interval notation with meaningful labels for cleaner output and better interpretability.

income <- c(25000, 45000, 62000, 38000, 95000, 125000, 31000, 78000)

income_brackets <- cut(income,
                       breaks = c(0, 40000, 75000, 100000, Inf),
                       labels = c("Low", "Medium", "High", "Very High"),
                       ordered_result = TRUE)

print(income_brackets)
[1] Low       Medium    Medium    Low       High      Very High Low      
[8] High     
Levels: Low < Medium < High < Very High

Setting ordered_result = TRUE creates an ordered factor, useful for statistical models that recognize ordinal relationships.
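An ordered factor also supports comparison operators, which is convenient for filtering. The sketch below reuses the income example; high_earners is an illustrative name.

```r
income <- c(25000, 45000, 62000, 38000, 95000, 125000, 31000, 78000)

income_brackets <- cut(income,
                       breaks = c(0, 40000, 75000, 100000, Inf),
                       labels = c("Low", "Medium", "High", "Very High"),
                       ordered_result = TRUE)

is.ordered(income_brackets)

# Comparisons follow the level order Low < Medium < High < Very High
high_earners <- income[income_brackets >= "High"]
high_earners

# max() is defined for ordered factors and returns the highest level present
max(income_brackets)
```

The same comparison on an unordered factor returns NA with a warning, so ordered_result = TRUE matters whenever you rely on the ranking.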

Practical Application: Data Aggregation

Combine cut() with aggregation functions to summarize continuous data by bins.

# Create sample dataset
set.seed(42)
sales_data <- data.frame(
  revenue = runif(100, 1000, 50000),
  profit_margin = runif(100, 0.05, 0.35)
)

# Bin revenue into quartiles
sales_data$revenue_tier <- cut(sales_data$revenue,
                                breaks = quantile(sales_data$revenue, 
                                                 probs = c(0, 0.25, 0.5, 0.75, 1)),
                                labels = c("Q1", "Q2", "Q3", "Q4"),
                                include.lowest = TRUE)

# Calculate average profit margin by revenue tier
aggregate(profit_margin ~ revenue_tier, data = sales_data, FUN = mean)
  revenue_tier profit_margin
1           Q1     0.1952847
2           Q2     0.2089431
3           Q3     0.1876542
4           Q4     0.2134891
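Before trusting per-tier summaries, it can help to confirm that the quartile bins are balanced. This sketch rebuilds the same data and counts the rows per tier; it also uses tapply() as an alternative to aggregate() when a named vector is enough. The names tier_counts, tier_means, and agg are illustrative.

```r
set.seed(42)
sales_data <- data.frame(
  revenue = runif(100, 1000, 50000),
  profit_margin = runif(100, 0.05, 0.35)
)

sales_data$revenue_tier <- cut(sales_data$revenue,
                               breaks = quantile(sales_data$revenue,
                                                 probs = c(0, 0.25, 0.5, 0.75, 1)),
                               labels = c("Q1", "Q2", "Q3", "Q4"),
                               include.lowest = TRUE)

# With 100 distinct values, each quartile bin should hold 25 rows
tier_counts <- table(sales_data$revenue_tier)
tier_counts

# tapply() gives the same group means as aggregate(), as a named vector
tier_means <- tapply(sales_data$profit_margin, sales_data$revenue_tier, mean)
agg <- aggregate(profit_margin ~ revenue_tier, data = sales_data, FUN = mean)
all.equal(as.vector(tier_means), agg$profit_margin)
```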

Integration with ggplot2

Binned data integrates seamlessly with visualization workflows for histogram alternatives and faceted plots.

library(ggplot2)

# Generate sample data
set.seed(123)
temperature_data <- data.frame(
  temp_celsius = rnorm(500, mean = 22, sd = 5),
  humidity = rnorm(500, mean = 65, sd = 15)
)

# Create temperature bins
temperature_data$temp_category <- cut(temperature_data$temp_celsius,
                                      breaks = c(-Inf, 15, 20, 25, Inf),
                                      labels = c("Cold", "Cool", "Comfortable", "Warm"))

# Visualize humidity distribution by temperature category
ggplot(temperature_data, aes(x = temp_category, y = humidity, fill = temp_category)) +
  geom_boxplot() +
  labs(title = "Humidity Distribution by Temperature Category",
       x = "Temperature Category",
       y = "Humidity (%)") +
  theme_minimal() +
  theme(legend.position = "none")

Handling Edge Cases

Address common issues with missing values, infinite boundaries, and single-value bins.

# Data with NA values
data_with_na <- c(10, 20, NA, 30, 40, NA, 50)
binned_na <- cut(data_with_na, breaks = 3)
# NAs are preserved in output

# Using Inf for open-ended intervals
income_open <- c(15000, 45000, 85000, 150000, 250000)
income_bins <- cut(income_open,
                   breaks = c(0, 50000, 100000, Inf),
                   labels = c("Low", "Medium", "High"))

# Verify bin assignments
data.frame(income = income_open, bracket = income_bins)
  income bracket
1  15000     Low
2  45000     Low
3  85000  Medium
4 150000    High
5 250000    High
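Because table() drops NA by default, missing values can disappear from a summary without notice. The sketch below shows two base R ways to keep them visible; binned_explicit is an illustrative name.

```r
data_with_na <- c(10, 20, NA, 30, 40, NA, 50)
binned_na <- cut(data_with_na, breaks = 3)

# Default: the two NAs are not counted
sum(table(binned_na))

# Ask table() to report NAs when present
sum(table(binned_na, useNA = "ifany"))

# Or turn NA into an explicit factor level
binned_explicit <- addNA(binned_na)
nlevels(binned_explicit)
```

Making NA an explicit level is mainly useful when the missing group should appear in plots or cross-tabulations alongside the real bins.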

Performance Considerations with Large Datasets

For large datasets, cut() performs efficiently, but it builds a labeled factor on every call. When you only need integer bin indices, for example in repeated binning operations with the same breaks, consider findInterval().

# Benchmark comparison
large_data <- runif(1e6, 0, 100)
breaks <- c(0, 25, 50, 75, 100)

# Using cut()
system.time({
  result_cut <- cut(large_data, breaks = breaks)
})

# Using findInterval() for numeric output
system.time({
  result_interval <- findInterval(large_data, breaks)
})

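# findInterval is faster but returns integer bin indices, not a factor
# Note: findInterval uses left-closed intervals [a, b) by default, the opposite of cut()
# Use cut() when you need factor levels for modeling or visualization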
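The two functions treat boundaries differently by default, so their results only line up when cut() is told to use left-closed intervals. A small sketch with values chosen to sit on the boundaries (x, idx_find, and idx_cut are illustrative names):

```r
x <- c(0, 10, 25, 49.9, 50, 75, 99)
breaks <- c(0, 25, 50, 75, 100)

# findInterval() returns the index of the interval [a, b) containing each value
idx_find <- findInterval(x, breaks)

# cut() with right = FALSE uses the same [a, b) convention
idx_cut <- as.integer(cut(x, breaks = breaks, right = FALSE))

identical(idx_find, idx_cut)

# Convert the indices to a labeled factor only when one is needed
bin_labels <- c("0-25", "25-50", "50-75", "75-100")
as_factor <- factor(idx_find, levels = 1:4, labels = bin_labels)
```

Values outside the breaks behave differently as well: findInterval() returns 0 or the number of breaks, while cut() returns NA.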

Creating Equal-Frequency Bins

Use quantiles to create bins with approximately equal numbers of observations rather than equal widths.

# Generate skewed data
skewed_data <- rexp(1000, rate = 0.1)

# Equal-width bins (poor distribution)
equal_width <- cut(skewed_data, breaks = 5)
table(equal_width)

# Equal-frequency bins using quantiles
quantile_breaks <- quantile(skewed_data, probs = seq(0, 1, 0.2))
equal_freq <- cut(skewed_data, 
                  breaks = quantile_breaks,
                  include.lowest = TRUE,
                  labels = c("Q1", "Q2", "Q3", "Q4", "Q5"))
table(equal_freq)

This approach ensures balanced sample sizes across bins, important for statistical analyses requiring similar group sizes.
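Quantile-based breaks can fail when the data contain many tied values, because several quantiles may coincide and cut() rejects duplicate breaks. The sketch below uses a made-up vector, tied_data, to show the problem and one possible workaround: dropping duplicates with unique(), at the cost of fewer and less balanced bins.

```r
# Illustrative data: 60 zeros followed by the integers 1 to 40
tied_data <- c(rep(0, 60), 1:40)

q_breaks <- quantile(tied_data, probs = seq(0, 1, 0.2))
anyDuplicated(q_breaks) > 0   # several quantiles are equal to 0

# cut() stops with an error when breaks are not unique
res <- tryCatch(cut(tied_data, breaks = q_breaks, include.lowest = TRUE),
                error = function(e) e)
inherits(res, "error")

# Workaround: keep only distinct breaks (labels are left at their defaults
# because the number of bins is no longer known in advance)
safe_bins <- cut(tied_data, breaks = unique(q_breaks), include.lowest = TRUE)
table(safe_bins)
```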
