How to Calculate the Median in R

Key Insights

R’s built-in median() function handles most use cases, but you must explicitly set na.rm = TRUE to avoid NA results when your data contains missing values
For grouped median calculations, dplyr::summarise() combined with group_by() provides the cleanest syntax, though base R’s aggregate() works without dependencies
Understanding the manual calculation (sorting, then selecting the middle value or averaging two middle values) helps you debug edge cases and implement custom weighted median functions

Introduction to the Median

The median represents the middle value in a sorted dataset. When you arrange your data from smallest to largest, the median sits exactly at the center—half the values fall below it, half above. For odd-length datasets, it’s the literal middle value. For even-length datasets, it’s the average of the two middle values.

Why use median instead of mean? Robustness to outliers. Consider salaries at a small company: five employees earn $50,000, and the CEO earns $2,000,000. The mean salary is $375,000—a number that describes nobody’s actual compensation. The median is $50,000, which accurately represents the typical employee’s pay.

This robustness makes median essential for analyzing income distributions, housing prices, response times, and any dataset where extreme values might distort your understanding of “typical” behavior.
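
The salary example can be reproduced in a few lines. This is a minimal sketch; the `salaries` vector simply encodes the six figures quoted above and is not part of any later example.

```r
# Five employees at $50,000 plus one CEO at $2,000,000
salaries <- c(rep(50000, 5), 2000000)

mean(salaries)    # 375000
median(salaries)  # 50000
```

The single extreme value moves the mean by hundreds of thousands of dollars while the median stays at the value most employees actually earn.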

Using the Built-in median() Function

R provides median() in the stats package, which is attached by default in every session. The syntax is straightforward:

# Basic median calculation
values <- c(3, 7, 2, 9, 5)
median(values)
# [1] 5

# With an even number of elements
values_even <- c(3, 7, 2, 9, 5, 11)
median(values_even)
# [1] 6

In the even-length example, R sorts the values to c(2, 3, 5, 7, 9, 11), identifies the two middle values (5 and 7), and returns their average (6).
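
The same steps can be traced by hand. The sketch below only illustrates the arithmetic and is not a description of how median() works internally; `sorted_values` and `middle` are names introduced here.

```r
values_even <- c(3, 7, 2, 9, 5, 11)

sorted_values <- sort(values_even)             # 2 3 5 7 9 11
n <- length(sorted_values)                     # 6
middle <- sorted_values[c(n / 2, n / 2 + 1)]   # 5 7

mean(middle)                                   # 6
```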

The critical parameter you’ll use constantly is na.rm. By default, median() returns NA if any value is missing:

# Missing values cause NA result by default
values_with_na <- c(3, 7, NA, 9, 5)
median(values_with_na)
# [1] NA

# Exclude NA values explicitly
median(values_with_na, na.rm = TRUE)
# [1] 6

This default behavior is intentional—R wants you to consciously decide how to handle missing data rather than silently ignoring it. Always set na.rm = TRUE when you’ve determined that excluding missing values is appropriate for your analysis.
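
Before dropping missing values, it can help to check how many there are. A minimal sketch, reusing the `values_with_na` vector from above; `n_missing` and `share_missing` are names introduced here for illustration.

```r
values_with_na <- c(3, 7, NA, 9, 5)

n_missing <- sum(is.na(values_with_na))        # number of NA entries: 1
share_missing <- mean(is.na(values_with_na))   # fraction of NA entries: 0.2

# Only then decide whether excluding them is acceptable
median(values_with_na, na.rm = TRUE)           # 6
```

If a large share of values is missing, the median of what remains may not represent the full dataset, so the count is worth reporting alongside the result.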

Calculating Median for Data Frames

Real-world data lives in data frames, not isolated vectors. Here’s how to extract medians from structured data.

For a single column, use the $ notation:

# Create sample data frame
df <- data.frame(
  product = c("A", "B", "C", "D", "E"),
  price = c(29.99, 49.99, 19.99, 99.99, 39.99),
  quantity = c(100, 50, 200, 25, 75)
)

# Median of a single column
median(df$price)
# [1] 39.99

median(df$quantity)
# [1] 75

When you need medians across multiple numeric columns, sapply() provides a clean solution:

# Select only numeric columns and calculate median for each
# (drop = FALSE keeps a data frame even if only one column is numeric)
numeric_cols <- df[, sapply(df, is.numeric), drop = FALSE]
sapply(numeric_cols, median)
#   price quantity 
#   39.99    75.00
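
If a column contains missing values, extra arguments can be forwarded through sapply() to median(). The sketch below uses a hypothetical data frame `df_na` (not part of the example above) and also shows vapply(), which checks that each result is a single number.

```r
# Hypothetical data frame with one missing price
df_na <- data.frame(
  price = c(29.99, NA, 19.99, 99.99, 39.99),
  quantity = c(100, 50, 200, 25, 75)
)

# Arguments after the function name are passed on to median()
sapply(df_na, median, na.rm = TRUE)

# vapply() additionally enforces the type and length of each result
col_medians <- vapply(df_na, median, FUN.VALUE = numeric(1), na.rm = TRUE)
col_medians
```

With the missing price removed, the price median is the average of 29.99 and 39.99, while the quantity median is unchanged at 75.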

For matrix-style operations, apply() gives you row-wise or column-wise control:

# Column-wise median (MARGIN = 2)
apply(numeric_cols, 2, median)
#   price quantity 
#   39.99    75.00

# Row-wise median (MARGIN = 1) - only meaningful when the columns share a unit,
# such as repeated measurements; shown here for the syntax
apply(numeric_cols, 1, median)
# [1] 64.995 49.995 109.995 62.495 57.495

Grouped Median Calculations

Calculating median by category is where analysis gets interesting. You’ll compare median prices across product categories, median response times across servers, or median salaries across departments.

The dplyr approach is the most readable:

library(dplyr)

# Sample sales data
sales <- data.frame(
  region = c("North", "North", "South", "South", "East", "East", "West", "West"),
  revenue = c(45000, 52000, 38000, 41000, 67000, 71000, 33000, 29000)
)

# Grouped median with dplyr
sales %>%
  group_by(region) %>%
  summarise(
    median_revenue = median(revenue),
    count = n()
  )
# # A tibble: 4 × 3
#   region median_revenue count
#   <chr>           <dbl> <int>
# 1 East           69000      2
# 2 North          48500      2
# 3 South          39500      2
# 4 West           31000      2

If you’re avoiding dependencies, base R’s aggregate() accomplishes the same task:

# Base R grouped median
aggregate(revenue ~ region, data = sales, FUN = median)
#   region revenue
# 1   East   69000
# 2  North   48500
# 3  South   39500
# 4   West   31000

The aggregate() syntax uses R’s formula notation: response ~ grouping_variable. For multiple grouping variables, extend the formula: revenue ~ region + year.
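
As a sketch of the multi-variable form, the example below builds a small data frame with a hypothetical `year` column (the sales data above has no such column) and groups by both variables.

```r
# Sales data with an assumed year column, for illustration only
sales_by_year <- data.frame(
  region  = c("North", "North", "North", "North",
              "South", "South", "South", "South"),
  year    = c(2022, 2022, 2023, 2023, 2022, 2022, 2023, 2023),
  revenue = c(45000, 52000, 47000, 55000, 38000, 41000, 40000, 44000)
)

# One median per region/year combination
result <- aggregate(revenue ~ region + year, data = sales_by_year, FUN = median)
result
```

The result has one row for each of the four region and year combinations, with the median revenue in the `revenue` column.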

Handling Edge Cases

Production code must handle edge cases gracefully. Here’s what you’ll encounter:

# Empty vector: returns NA silently, with no warning
median(numeric(0))
# [1] NA

# All NA values: also returns NA without a warning
median(c(NA, NA, NA), na.rm = TRUE)
# [1] NA

# Single value
median(42)
# [1] 42

# Infinite values
median(c(1, 2, Inf, 4, 5))
# [1] 4

median(c(1, 2, Inf, Inf, 5))
# [1] 5

median(c(1, Inf, Inf, Inf, 5))
# [1] Inf

Notice that Inf values don’t cause errors: R treats them as valid numeric values that sort to the end, so the median is only Inf when an infinite value lands in the middle position. Whether this behavior is appropriate depends on your domain.
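
If infinite values should be excluded in your domain, one option is to filter with is.finite() first, which also drops NA and NaN. A minimal sketch; `x` and `finite_x` are example names, not part of the code above.

```r
x <- c(1, 2, Inf, Inf, 5, NA)

# is.finite() is FALSE for Inf, -Inf, NA, and NaN
finite_x <- x[is.finite(x)]   # 1 2 5

median(finite_x)              # 2
```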

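For non-numeric data, median() behaves inconsistently, so don’t rely on it:

# Character data: an odd-length vector returns the middle string in sort order
median(c("apple", "banana", "cherry"))
# [1] "banana"

# An even-length character vector returns NA, because mean() cannot average strings
median(c("apple", "banana"))
# [1] NA
# Warning message: argument is not numeric or logical: returning NA

# Factor data is rejected outright
median(factor(c(1, 2, 3, 4, 5)))
# Error in median.default(factor(c(1, 2, 3, 4, 5))) : need numeric data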

Defensive code should validate input types before calculating:

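safe_median <- function(x, na.rm = TRUE) {
  # Reject factors, characters, and other non-numeric input up front
  if (!is.numeric(x)) {
    warning("Input is not numeric, returning NA")
    return(NA_real_)
  }
  # Empty or all-missing input has no median
  if (length(x) == 0 || all(is.na(x))) {
    return(NA_real_)
  }
  median(x, na.rm = na.rm)
}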

Manual Median Calculation

Understanding the algorithm helps when you need custom behavior, like weighted medians or streaming calculations:

manual_median <- function(x, na.rm = FALSE) {
  # Handle NA values
  if (na.rm) {
    x <- x[!is.na(x)]
  } else if (any(is.na(x))) {
    return(NA)
  }
  
  # Handle empty vector
  n <- length(x)
  if (n == 0) return(NA)
  
  # Sort the vector
  sorted_x <- sort(x)
  
  # Find middle value(s)
  if (n %% 2 == 1) {
    # Odd length: return middle element
    middle_index <- (n + 1) / 2
    return(sorted_x[middle_index])
  } else {
    # Even length: average of two middle elements
    lower_index <- n / 2
    upper_index <- lower_index + 1
    return((sorted_x[lower_index] + sorted_x[upper_index]) / 2)
  }
}

# Verify it matches base R
test_odd <- c(3, 1, 4, 1, 5, 9, 2)
test_even <- c(3, 1, 4, 1, 5, 9)

identical(manual_median(test_odd), median(test_odd))
# [1] TRUE

identical(manual_median(test_even), median(test_even))
# [1] TRUE

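The key insight: this manual implementation is O(n log n) because it fully sorts the vector. Base R’s median() uses a partial sort internally, and selection algorithms such as quickselect can find the median in O(n) expected time. For extremely large datasets where you only need an approximate median, consider sampling or streaming algorithms.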
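
Building on the sort-then-select approach in manual_median(), a weighted median can be sketched as follows. The function `weighted_median` and its arguments are hypothetical names, not part of base R, and the code implements one common definition (the lower weighted median); other definitions interpolate between neighboring values.

```r
# Lower weighted median: the smallest value whose cumulative weight
# reaches at least half of the total weight.
weighted_median <- function(x, w) {
  stopifnot(is.numeric(x), is.numeric(w), length(x) == length(w))

  # Drop missing pairs and observations with non-positive weight
  keep <- !is.na(x) & !is.na(w) & w > 0
  x <- x[keep]
  w <- w[keep]
  if (length(x) == 0) return(NA_real_)

  # Sort values and carry the weights along
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]

  # First position where the running weight reaches half the total
  cum_w <- cumsum(w)
  x[which(cum_w >= sum(w) / 2)[1]]
}

weighted_median(c(10, 20, 30), w = c(1, 1, 5))   # 30
weighted_median(c(10, 20, 30), w = c(1, 1, 1))   # 20
```

Under this definition, equal weights on an even-length vector return the lower of the two middle values rather than their average, so the result can differ from median() in that case.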

Practical Application

Let’s analyze a realistic dataset—employee compensation data:

library(dplyr)

# Simulated employee data
set.seed(42)
employees <- data.frame(
  employee_id = 1:500,
  department = sample(c("Engineering", "Sales", "Marketing", "Operations"), 500, replace = TRUE),
  years_experience = sample(1:20, 500, replace = TRUE),
  salary = round(rnorm(500, mean = 75000, sd = 25000))
)

# Add some outliers (executives)
employees$salary[1:5] <- c(350000, 425000, 380000, 290000, 510000)

# Compare mean vs median - see the outlier effect
mean(employees$salary)
# [1] 80347.95

median(employees$salary)
# [1] 74552

# Median salary by department
employees %>%
  group_by(department) %>%
  summarise(
    median_salary = median(salary),
    mean_salary = mean(salary),
    employee_count = n()
  ) %>%
  arrange(desc(median_salary))
# # A tibble: 4 × 4
#   department  median_salary mean_salary employee_count
#   <chr>               <dbl>       <dbl>          <int>
# 1 Engineering        77696.      88589.            131
# 2 Sales              75282       77553.            117
# 3 Operations         73986       77981.            131
# 4 Marketing          71862       77520.            121

# Median by experience brackets
employees %>%
  mutate(experience_bracket = cut(years_experience, 
                                   breaks = c(0, 5, 10, 15, 20),
                                   labels = c("0-5", "6-10", "11-15", "16-20"))) %>%
  group_by(experience_bracket) %>%
  summarise(
    median_salary = median(salary),
    count = n()
  )
# # A tibble: 4 × 3
#   experience_bracket median_salary count
#   <fct>                      <dbl> <int>
# 1 0-5                       73346.   126
# 2 6-10                      75122    133
# 3 11-15                     75001    117
# 4 16-20                     74406    124

The mean salary ($80,348) overstates typical compensation by roughly $5,800 compared to the median ($74,552). Those five executive salaries skew the mean upward. For communicating “what does a typical employee earn,” median is the honest answer.
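
To isolate the effect of the outliers, one can compare both statistics before and after injecting them. This is a sketch with its own simulated salaries (a separate random draw from the data above, so the numbers will differ); `base_salary`, `with_execs`, `mean_shift`, and `median_shift` are names introduced here.

```r
set.seed(42)
base_salary <- round(rnorm(500, mean = 75000, sd = 25000))

# Replace five salaries with the executive figures used above
with_execs <- base_salary
with_execs[1:5] <- c(350000, 425000, 380000, 290000, 510000)

# How far does each statistic move once the outliers are present?
mean_shift   <- mean(with_execs) - mean(base_salary)
median_shift <- median(with_execs) - median(base_salary)

c(mean_shift = mean_shift, median_shift = median_shift)
```

The mean shift should come out far larger than the median shift, since the median only changes when values cross the middle of the sorted data.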

The median gives you a robust, interpretable measure of central tendency. Use it whenever outliers might distort your analysis—which, in real-world data, is most of the time.