How to Calculate Variance in R
Key Insights
- R’s built-in var() function calculates sample variance (dividing by n-1) by default; multiply by (n-1)/n to get population variance when you have complete data
- Always use na.rm = TRUE when working with real-world datasets so a single NA doesn’t turn your entire variance calculation into NA
- Combine dplyr’s group_by() and summarise() for efficient grouped variance calculations that integrate cleanly into modern data analysis workflows
Introduction to Variance
Variance quantifies how spread out your data points are from the mean. It’s one of the most fundamental measures of dispersion in statistics, serving as the foundation for standard deviation, hypothesis testing, ANOVA, and countless machine learning algorithms.
When you calculate variance, you’re answering a simple question: how much do individual observations deviate from the average? A low variance indicates data points cluster tightly around the mean. A high variance means they’re scattered widely.
In R, you have multiple ways to calculate variance depending on your data structure and analysis needs. This article covers all of them—from manual calculations that reinforce the underlying math to modern tidyverse approaches for production code.
Understanding the Variance Formula
Before diving into R functions, you need to understand what you’re calculating. There are two variance formulas, and confusing them is a common source of errors.
Population variance (σ²) applies when you have data for an entire population:
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$
Sample variance (s²) applies when you’re working with a sample and want to estimate the population variance:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
The difference is the denominator: N for population, n-1 for sample. The n-1 correction (called Bessel’s correction) produces an unbiased estimator of population variance when working with samples.
Here’s how to calculate both manually in R:
# Sample data
data <- c(4, 8, 6, 5, 3, 2, 8, 9, 5, 10)
# Calculate the mean
mean_val <- mean(data)
n <- length(data)
# Calculate squared deviations from the mean
squared_deviations <- (data - mean_val)^2
# Sample variance (divide by n-1)
sample_variance <- sum(squared_deviations) / (n - 1)
print(sample_variance)
# [1] 7.111111
# Population variance (divide by n)
population_variance <- sum(squared_deviations) / n
print(population_variance)
# [1] 6.4
Understanding this manual calculation helps you debug issues and verify results. In practice, you’ll use R’s built-in functions.
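The claim that n-1 yields an unbiased estimator is also easy to check empirically. The following sketch (seed, sample size, and replication count are arbitrary choices) repeatedly draws small samples from a standard normal distribution, whose true variance is 1, and compares the average of the two estimators:

```r
# Empirical check of Bessel's correction (illustrative seed and sizes)
set.seed(42)
n <- 5          # small samples make the bias easy to see
reps <- 100000
# var() divides by n-1
sample_vars <- replicate(reps, var(rnorm(n)))
# rescale to the population formula, which divides by n
biased_vars <- sample_vars * (n - 1) / n
mean(sample_vars)  # close to 1: unbiased
mean(biased_vars)  # close to 0.8: biased low by the factor (n-1)/n
```

Averaged over many samples, the n-1 version centers on the true variance while the n version systematically underestimates it, which is exactly what Bessel's correction addresses.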
Using the Built-in var() Function
R’s var() function calculates sample variance by default. This is the right choice for most statistical work since you’re typically analyzing samples, not complete populations.
# Basic usage
data <- c(4, 8, 6, 5, 3, 2, 8, 9, 5, 10)
# Sample variance using var()
sample_var <- var(data)
print(sample_var)
# [1] 7.111111
# Verify it matches our manual calculation
manual_var <- sum((data - mean(data))^2) / (length(data) - 1)
# all.equal() is safer than == for floating-point comparisons
print(isTRUE(all.equal(sample_var, manual_var)))
# [1] TRUE
When you genuinely have population data—census data, complete transaction records, or simulation outputs—you need to adjust the result:
# Population variance from var()
n <- length(data)
pop_var <- var(data) * (n - 1) / n
print(pop_var)
# [1] 6.4
# Or create a reusable function
pop_variance <- function(x, na.rm = FALSE) {
if (na.rm) x <- x[!is.na(x)]
n <- length(x)
var(x) * (n - 1) / n
}
print(pop_variance(data))
# [1] 6.4
The custom function approach is cleaner for repeated use and handles NA values properly.
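As a quick check, here is the helper applied to a vector containing NAs (the function definition is repeated so the snippet runs standalone; the vector matches the one used in the missing-values section below):

```r
# pop_variance() from above, applied to data with missing values
pop_variance <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  n <- length(x)
  var(x) * (n - 1) / n
}
data_with_na <- c(4, 8, NA, 5, 3, 2, 8, NA, 5, 10)
print(pop_variance(data_with_na))               # NA propagates by default
# [1] NA
print(pop_variance(data_with_na, na.rm = TRUE))
# [1] 6.734375
```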
Calculating Variance for Data Frames and Matrices
Real analysis rarely involves single vectors. You’ll typically work with data frames containing multiple variables. Here’s how to calculate variance across different data structures.
Variance for data frame columns:
# Using mtcars as an example
df <- mtcars[, c("mpg", "hp", "wt", "qsec")]
# var() on a data frame returns a covariance matrix
cov_matrix <- var(df)
print(cov_matrix)
# mpg hp wt qsec
# mpg 36.324103 -320.73206 -5.116685 4.509149
# hp -320.732056 4700.86694 44.192661 -86.770081
# wt -5.116685 44.19266 0.957379 -0.305261
# qsec 4.509149 -86.77008 -0.305261 3.193166
# Extract just the variances (diagonal elements)
variances <- diag(var(df))
print(variances)
# mpg hp wt qsec
# 36.324103 4700.866935 0.957379 3.193166
Using sapply() for column-wise variance:
# More explicit approach
col_variances <- sapply(df, var)
print(col_variances)
# mpg hp wt qsec
# 36.324103 4700.866935 0.957379 3.193166
Using apply() for matrices:
# Create a matrix
mat <- matrix(1:20, nrow = 4, ncol = 5)
print(mat)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 5 9 13 17
# [2,] 2 6 10 14 18
# [3,] 3 7 11 15 19
# [4,] 4 8 12 16 20
# Variance by column (MARGIN = 2)
col_var <- apply(mat, 2, var)
print(col_var)
# [1] 1.666667 1.666667 1.666667 1.666667 1.666667
# Variance by row (MARGIN = 1)
row_var <- apply(mat, 1, var)
print(row_var)
# [1] 40 40 40 40
The MARGIN argument determines the direction: 1 for rows, 2 for columns. This pattern works with any function, not just var().
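To illustrate that generality, here are a few other summary functions dropped into the same apply() pattern (the choice of functions is arbitrary):

```r
# The same MARGIN pattern with functions other than var()
mat <- matrix(1:20, nrow = 4, ncol = 5)
print(apply(mat, 2, sd))     # column standard deviations, sqrt(1.666667) each
print(apply(mat, 1, range))  # per-row min and max, returned as a 2 x 4 matrix
# Anonymous functions work too, e.g. to pass extra arguments
print(apply(mat, 2, function(x) var(x, na.rm = TRUE)))
```

When the applied function returns more than one value per row or column, as range() does, apply() stacks the results into a matrix rather than a vector.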
Handling Missing Values
Real datasets contain missing values. By default, var() returns NA if any element is NA—a safe behavior that forces you to make an explicit decision about how to handle missingness.
# Data with missing values
data_with_na <- c(4, 8, NA, 5, 3, 2, 8, NA, 5, 10)
# Default behavior returns NA
print(var(data_with_na))
# [1] NA
# Use na.rm = TRUE to exclude NA values
var_clean <- var(data_with_na, na.rm = TRUE)
print(var_clean)
# [1] 7.696429
# Verify by manually removing NAs
data_clean <- data_with_na[!is.na(data_with_na)]
print(var(data_clean))
# [1] 7.696429
Important considerations when removing NAs:
- The variance is calculated on fewer observations, which affects precision
- If NAs aren’t missing at random, your variance estimate may be biased
- Document your missing data strategy for reproducibility
# Check how many values you're actually using
data_with_na <- c(4, 8, NA, 5, 3, 2, 8, NA, 5, 10)
n_total <- length(data_with_na)
n_valid <- sum(!is.na(data_with_na))
n_missing <- sum(is.na(data_with_na))
cat("Total observations:", n_total, "\n")
cat("Valid observations:", n_valid, "\n")
cat("Missing observations:", n_missing,
"(", round(n_missing/n_total * 100, 1), "%)\n")
# Total observations: 10
# Valid observations: 8
# Missing observations: 2 ( 20 %)
For data frames with multiple columns, sapply() respects the na.rm argument:
# Create data frame with missing values
df_na <- data.frame(
a = c(1, 2, NA, 4, 5),
b = c(NA, 2, 3, 4, 5),
c = c(1, 2, 3, 4, NA)
)
# Calculate variance for each column, removing NAs
variances <- sapply(df_na, var, na.rm = TRUE)
print(variances)
# a b c
# 3.333333 1.666667 1.666667
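Alternatively, var() itself accepts a use argument that controls missing-value handling when computing a covariance matrix from a data frame or matrix. A brief sketch on the same data frame:

```r
# NA handling built into var() for multi-column input
df_na <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(NA, 2, 3, 4, 5),
  c = c(1, 2, 3, 4, NA)
)
# Each entry uses all rows where both variables are observed;
# the diagonal matches the sapply() per-column variances
print(var(df_na, use = "pairwise.complete.obs"))
# Drop every row containing any NA before computing
print(var(df_na, use = "complete.obs"))
```

One caveat: a pairwise-complete covariance matrix is not guaranteed to be positive semi-definite, so prefer complete cases when the matrix feeds into downstream methods that assume it.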
Variance with dplyr and tidyverse
Modern R workflows use the tidyverse for data manipulation. The dplyr package provides summarise() for aggregations and group_by() for grouped calculations—a powerful combination for variance analysis.
library(dplyr)
# Basic variance calculation
mtcars %>%
summarise(
mpg_variance = var(mpg),
hp_variance = var(hp),
wt_variance = var(wt)
)
# mpg_variance hp_variance wt_variance
# 1 36.32410 4700.867 0.9573790
Grouped variance calculations:
# Variance by number of cylinders
mtcars %>%
group_by(cyl) %>%
summarise(
n = n(),
mpg_mean = mean(mpg),
mpg_var = var(mpg),
mpg_sd = sd(mpg),
.groups = "drop"
)
# # A tibble: 3 × 5
# cyl n mpg_mean mpg_var mpg_sd
# <dbl> <int> <dbl> <dbl> <dbl>
# 1 4 11 26.7 20.3 4.51
# 2 6 7 19.7 2.11 1.45
# 3 8 14 15.1 6.55 2.56
This reveals something important: 4-cylinder cars have the highest variance in fuel efficiency, while 6-cylinder cars are remarkably consistent.
Using across() for multiple columns:
# Calculate variance for multiple numeric columns at once
mtcars %>%
group_by(cyl) %>%
summarise(
across(c(mpg, hp, wt),
list(var = var, sd = sd),
.names = "{.col}_{.fn}"),
.groups = "drop"
)
# # A tibble: 3 × 7
# cyl mpg_var mpg_sd hp_var hp_sd wt_var wt_sd
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 4 20.3 4.51 438. 20.9 0.324 0.570
# 2 6 2.11 1.45 589. 24.3 0.127 0.356
# 3 8 6.55 2.56 2599. 51.0 0.577 0.759
Handling NAs in dplyr:
# With missing values
df_missing <- tibble(
group = rep(c("A", "B"), each = 5),
value = c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10)
)
df_missing %>%
group_by(group) %>%
summarise(
n_total = n(),
n_valid = sum(!is.na(value)),
variance = var(value, na.rm = TRUE),
.groups = "drop"
)
# # A tibble: 2 × 4
# group n_total n_valid variance
# <chr> <int> <int> <dbl>
# 1 A 5 4 3.33
# 2 B 5 4 2.92
Practical Applications and Next Steps
Variance rarely stands alone in analysis. Here’s how it connects to related functions and common use cases:
Standard deviation is the square root of variance, returning to the original units:
data <- c(4, 8, 6, 5, 3, 2, 8, 9, 5, 10)
print(sqrt(var(data)))
# [1] 2.666667
print(sd(data))
# [1] 2.666667
Covariance measures how two variables vary together:
# Covariance between mpg and weight
cov(mtcars$mpg, mtcars$wt)
# [1] -5.116685
Coefficient of variation normalizes variance for comparison across different scales:
cv <- function(x, na.rm = FALSE) {
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm) * 100
}
# Compare variability across different-scaled variables
sapply(mtcars[, c("mpg", "hp", "wt")], cv)
# mpg hp wt
# 29.99881 46.74077 30.41285
Use variance when you need to quantify spread for statistical tests, compare variability between groups, assess model residuals, or feed into algorithms that require variance estimates. The techniques in this article give you the tools to calculate it correctly in any R workflow.
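As one concrete example of feeding variance into a statistical test, base R's var.test() runs an F test for equality of two variances, a common step before choosing a t-test variant. A sketch comparing mpg spread across mtcars transmission types:

```r
# F test for equality of two group variances (base R)
automatic <- mtcars$mpg[mtcars$am == 0]
manual    <- mtcars$mpg[mtcars$am == 1]
print(var(automatic))  # spread of mpg within each transmission type
print(var(manual))
# Null hypothesis: the two variances are equal
print(var.test(automatic, manual))
```

Note that the F test assumes both groups are approximately normal; with skewed data, a robust alternative such as Levene's test is often preferred.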