How to Calculate the Median in R
Key Insights
- R’s built-in median() function handles most use cases, but you must explicitly set na.rm = TRUE to avoid NA results when your data contains missing values
- For grouped median calculations, dplyr::summarise() combined with group_by() provides the cleanest syntax, though base R’s aggregate() works without dependencies
- Understanding the manual calculation (sorting, then selecting the middle value or averaging two middle values) helps you debug edge cases and implement custom weighted median functions
Introduction to the Median
The median represents the middle value in a sorted dataset. When you arrange your data from smallest to largest, the median sits exactly at the center—half the values fall below it, half above. For odd-length datasets, it’s the literal middle value. For even-length datasets, it’s the average of the two middle values.
Why use median instead of mean? Robustness to outliers. Consider salaries at a small company: five employees earn $50,000, and the CEO earns $2,000,000. The mean salary is $375,000—a number that describes nobody’s actual compensation. The median is $50,000, which accurately represents the typical employee’s pay.
This robustness makes median essential for analyzing income distributions, housing prices, response times, and any dataset where extreme values might distort your understanding of “typical” behavior.
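The salary example can be checked directly in R. This is a minimal sketch; the vector name salaries is only illustrative.

```r
# Salaries from the example: five employees and one CEO
salaries <- c(rep(50000, 5), 2000000)

mean_salary <- mean(salaries)      # pulled upward by the single outlier
median_salary <- median(salaries)  # unaffected by the outlier

mean_salary
# [1] 375000
median_salary
# [1] 50000
```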
Using the Built-in median() Function
R provides median() in the stats package, which is attached by default, so no library() call is needed. The syntax is straightforward:
# Basic median calculation
values <- c(3, 7, 2, 9, 5)
median(values)
# [1] 5
# With an even number of elements
values_even <- c(3, 7, 2, 9, 5, 11)
median(values_even)
# [1] 6
In the even-length example, R sorts the values to c(2, 3, 5, 7, 9, 11), identifies the two middle values (5 and 7), and returns their average (6).
The critical parameter you’ll use constantly is na.rm. By default, median() returns NA if any value is missing:
# Missing values cause NA result by default
values_with_na <- c(3, 7, NA, 9, 5)
median(values_with_na)
# [1] NA
# Exclude NA values explicitly
median(values_with_na, na.rm = TRUE)
# [1] 6
This default behavior is intentional—R wants you to consciously decide how to handle missing data rather than silently ignoring it. Always set na.rm = TRUE when you’ve determined that excluding missing values is appropriate for your analysis.
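One way to make that decision consciously is to measure how much data is missing before dropping it. The sketch below assumes a hypothetical vector of sensor readings; what counts as an acceptable share of missing values depends on your analysis.

```r
# Hypothetical sensor readings with some missing values
readings <- c(12.1, NA, 11.8, 12.4, NA, 12.0, 11.9)

n_missing <- sum(is.na(readings))       # count of missing values
share_missing <- mean(is.na(readings))  # proportion of missing values

# Drop NAs only once the amount of missing data looks acceptable
clean_median <- median(readings, na.rm = TRUE)

n_missing
# [1] 2
clean_median
# [1] 12
```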
Calculating Median for Data Frames
Real-world data lives in data frames, not isolated vectors. Here’s how to extract medians from structured data.
For a single column, use the $ notation:
# Create sample data frame
df <- data.frame(
  product = c("A", "B", "C", "D", "E"),
  price = c(29.99, 49.99, 19.99, 99.99, 39.99),
  quantity = c(100, 50, 200, 25, 75)
)
# Median of a single column
median(df$price)
# [1] 39.99
median(df$quantity)
# [1] 75
When you need medians across multiple numeric columns, sapply() provides a clean solution:
# Select only numeric columns (drop = FALSE keeps a data frame even if only one column matches)
numeric_cols <- df[, sapply(df, is.numeric), drop = FALSE]
sapply(numeric_cols, median)
#    price quantity
#    39.99    75.00
For matrix-style operations, apply() gives you row-wise or column-wise control:
# Column-wise median (MARGIN = 2)
apply(numeric_cols, 2, median)
#    price quantity
#    39.99    75.00
# Row-wise median (MARGIN = 1) - less common but useful for time series
apply(numeric_cols, 1, median)
# [1] 64.995 49.995 109.995 62.495 57.495
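Both sapply() and apply() forward extra arguments to the function they call, so na.rm = TRUE can be passed through when columns contain missing values. The data frame df_na below is a hypothetical variant of the example data, used only to illustrate this.

```r
# Hypothetical data frame with missing values in both numeric columns
df_na <- data.frame(
  product  = c("A", "B", "C", "D", "E"),
  price    = c(29.99, NA, 19.99, 99.99, 39.99),
  quantity = c(100, 50, NA, 25, 75)
)

numeric_cols_na <- df_na[, sapply(df_na, is.numeric), drop = FALSE]

# Without na.rm, any column containing NA yields NA
medians_default <- sapply(numeric_cols_na, median)

# Arguments after the function name are forwarded to median()
medians_clean <- sapply(numeric_cols_na, median, na.rm = TRUE)
medians_clean
#    price quantity
#    34.99    62.50
```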
Grouped Median Calculations
Calculating median by category is where analysis gets interesting. You’ll compare median prices across product categories, median response times across servers, or median salaries across departments.
The dplyr approach is the most readable:
library(dplyr)
# Sample sales data
sales <- data.frame(
  region = c("North", "North", "South", "South", "East", "East", "West", "West"),
  revenue = c(45000, 52000, 38000, 41000, 67000, 71000, 33000, 29000)
)
# Grouped median with dplyr
sales %>%
  group_by(region) %>%
  summarise(
    median_revenue = median(revenue),
    count = n()
  )
# # A tibble: 4 × 3
#   region median_revenue count
#   <chr>           <dbl> <int>
# 1 East            69000     2
# 2 North           48500     2
# 3 South           39500     2
# 4 West            31000     2
If you’re avoiding dependencies, base R’s aggregate() accomplishes the same task:
# Base R grouped median
aggregate(revenue ~ region, data = sales, FUN = median)
#   region revenue
# 1   East   69000
# 2  North   48500
# 3  South   39500
# 4   West   31000
The aggregate() syntax uses R’s formula notation: response ~ grouping_variable. For multiple grouping variables, extend the formula: revenue ~ region + year.
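As a sketch of the two-variable formula, the example below adds a year column to a small sales table. The data frame sales_by_year and its values are made up for illustration.

```r
# Hypothetical sales data with a year column added for illustration
sales_by_year <- data.frame(
  region  = rep(c("North", "South"), each = 4),
  year    = rep(c(2022, 2022, 2023, 2023), times = 2),
  revenue = c(45000, 52000, 47000, 55000, 38000, 41000, 40000, 44000)
)

# One median per region/year combination
result <- aggregate(revenue ~ region + year, data = sales_by_year, FUN = median)
result
#   region year revenue
# 1  North 2022   48500
# 2  South 2022   39500
# 3  North 2023   51000
# 4  South 2023   42000
```

Each row of the result corresponds to one combination of the grouping variables that actually occurs in the data.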
Handling Edge Cases
Production code must handle edge cases gracefully. Here’s what you’ll encounter:
# Empty vector: returns NA without a warning
median(numeric(0))
# [1] NA
# All NA values: nothing is left after removal, so the result is NA
median(c(NA, NA, NA), na.rm = TRUE)
# [1] NA
# Single value
median(42)
# [1] 42
# Infinite values
median(c(1, 2, Inf, 4, 5))
# [1] 4
median(c(1, 2, Inf, Inf, Inf))
# [1] Inf
Notice that Inf values don’t cause errors. R treats them as valid numeric values that sort to the end, so the median is Inf only when an infinite value lands in the middle position. Whether this behavior is appropriate depends on your domain.
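If infinite values are placeholders rather than real measurements, one option is to filter them out with is.finite() before computing the median. The sketch below assumes hypothetical response times in which Inf marks a timed-out request. Note that is.finite() also drops NA and NaN.

```r
# Hypothetical response times where Inf marks a timed-out request
response_ms <- c(120, 95, Inf, 110, Inf, Inf, Inf)

# Four of the seven values are Inf, so the middle sorted value is Inf
median_all <- median(response_ms)

# Keep only finite values (this also removes NA and NaN)
median_finite <- median(response_ms[is.finite(response_ms)])

median_all
# [1] Inf
median_finite
# [1] 110
```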
For non-numeric data, median() behaves inconsistently, so don’t rely on it:
# Character data: an odd-length vector returns the middle string after sorting
# (an even-length one returns NA with a warning from mean())
median(c("apple", "banana", "cherry"))
# [1] "banana"
# Factor data: raises an error
median(factor(c(1, 2, 3, 4, 5)))
# Error in median.default(factor(c(1, 2, 3, 4, 5))) : need numeric data
Defensive code should validate input types before calculating:
safe_median <- function(x, na.rm = TRUE) {
  if (!is.numeric(x)) {
    warning("Input is not numeric, returning NA")
    return(NA)
  }
  if (length(x) == 0 || all(is.na(x))) {
    return(NA)
  }
  median(x, na.rm = na.rm)
}
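A quick usage sketch shows how safe_median() behaves across mixed inputs. The function body is repeated so the snippet runs on its own, and the inputs list is hypothetical.

```r
# Same definition as above, repeated so this snippet is self-contained
safe_median <- function(x, na.rm = TRUE) {
  if (!is.numeric(x)) {
    warning("Input is not numeric, returning NA")
    return(NA)
  }
  if (length(x) == 0 || all(is.na(x))) {
    return(NA)
  }
  median(x, na.rm = na.rm)
}

# Hypothetical inputs covering the cases the function guards against
inputs <- list(
  numbers = c(4, 8, NA, 15),
  text    = c("a", "b"),
  empty   = numeric(0)
)

# The character entry triggers the warning; it is suppressed here for brevity
results <- suppressWarnings(sapply(inputs, safe_median))
results[["numbers"]]
# [1] 8
```

The text and empty entries both come back as NA, so callers can test the result with is.na() instead of handling errors.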
Manual Median Calculation
Understanding the algorithm helps when you need custom behavior, like weighted medians or streaming calculations:
manual_median <- function(x, na.rm = FALSE) {
  # Handle NA values
  if (na.rm) {
    x <- x[!is.na(x)]
  } else if (any(is.na(x))) {
    return(NA)
  }
  # Handle empty vector
  n <- length(x)
  if (n == 0) return(NA)
  # Sort the vector
  sorted_x <- sort(x)
  # Find middle value(s)
  if (n %% 2 == 1) {
    # Odd length: return middle element
    middle_index <- (n + 1) / 2
    return(sorted_x[middle_index])
  } else {
    # Even length: average of two middle elements
    lower_index <- n / 2
    upper_index <- lower_index + 1
    return((sorted_x[lower_index] + sorted_x[upper_index]) / 2)
  }
}
# Verify it matches base R
test_odd <- c(3, 1, 4, 1, 5, 9, 2)
test_even <- c(3, 1, 4, 1, 5, 9)
identical(manual_median(test_odd), median(test_odd))
# [1] TRUE
identical(manual_median(test_even), median(test_even))
# [1] TRUE
The key insight: this manual version is O(n log n) because it fully sorts the data. Base R’s median() uses a partial sort internally, which is typically faster. For extremely large datasets where you only need an approximate median, consider sampling or streaming algorithms.
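The same sort-then-select idea extends to a weighted median. The sketch below uses one common convention, the lower weighted median: the smallest value at which the cumulative weight reaches half of the total. The function name weighted_median and its argument checks are assumptions for illustration, not part of base R.

```r
# Hypothetical helper: lower weighted median (one of several conventions)
weighted_median <- function(x, w) {
  stopifnot(is.numeric(x), is.numeric(w), length(x) == length(w))
  stopifnot(length(x) > 0, !anyNA(x), !anyNA(w), all(w >= 0), sum(w) > 0)

  ord <- order(x)          # sort values and carry the weights along
  x_sorted <- x[ord]
  cum_w <- cumsum(w[ord])  # running total of weights in sorted order

  # First value at which the running weight reaches half of the total
  x_sorted[which(cum_w >= sum(w) / 2)[1]]
}

weighted_median(c(10, 20, 30), w = c(1, 1, 5))
# [1] 30
weighted_median(c(10, 20, 30), w = c(1, 1, 1))
# [1] 20
```

With equal weights and an even number of values, this convention returns the lower of the two middle values rather than their average, so it will not always match median().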
Practical Application
Let’s analyze a realistic dataset—employee compensation data:
library(dplyr)
# Simulated employee data
set.seed(42)
employees <- data.frame(
  employee_id = 1:500,
  department = sample(c("Engineering", "Sales", "Marketing", "Operations"), 500, replace = TRUE),
  years_experience = sample(1:20, 500, replace = TRUE),
  salary = round(rnorm(500, mean = 75000, sd = 25000))
)
# Add some outliers (executives)
employees$salary[1:5] <- c(350000, 425000, 380000, 290000, 510000)
# Compare mean vs median - see the outlier effect
mean(employees$salary)
# [1] 80347.95
median(employees$salary)
# [1] 74552
# Median salary by department
employees %>%
  group_by(department) %>%
  summarise(
    median_salary = median(salary),
    mean_salary = mean(salary),
    employee_count = n()
  ) %>%
  arrange(desc(median_salary))
# # A tibble: 4 × 4
#   department  median_salary mean_salary employee_count
#   <chr>               <dbl>       <dbl>          <int>
# 1 Engineering        77696.      88589.            131
# 2 Sales              75282       77553.            117
# 3 Operations         73986       77981.            131
# 4 Marketing          71862       77520.            121
# Median by experience brackets
employees %>%
  mutate(experience_bracket = cut(years_experience,
                                  breaks = c(0, 5, 10, 15, 20),
                                  labels = c("0-5", "6-10", "11-15", "16-20"))) %>%
  group_by(experience_bracket) %>%
  summarise(
    median_salary = median(salary),
    count = n()
  )
# # A tibble: 4 × 3
#   experience_bracket median_salary count
#   <fct>                      <dbl> <int>
# 1 0-5                       73346.   126
# 2 6-10                      75122    133
# 3 11-15                     75001    117
# 4 16-20                     74406    124
The mean salary ($80,348) overstates typical compensation by about $5,800 compared to the median ($74,552). Those five executive salaries skew the mean upward. For communicating “what does a typical employee earn,” median is the honest answer.
The median gives you a robust, interpretable measure of central tendency. Use it whenever outliers might distort your analysis—which, in real-world data, is most of the time.