Normal Distribution in R: Complete Guide

Key Insights

  • R provides four core functions for the normal distribution (dnorm, pnorm, qnorm, rnorm) that follow a consistent naming pattern used across all probability distributions in the language.
  • The pnorm() and qnorm() functions handle most practical statistical work: converting values to probabilities and probabilities back to cutoff values.
  • Always test your data for normality before applying parametric methods; the Shapiro-Wilk test combined with Q-Q plots gives you both statistical rigor and visual confirmation.

Introduction to the Normal Distribution

The normal distribution—the bell curve—underpins most of classical statistics. It describes everything from measurement errors to human heights to stock returns. Understanding how to work with it in R is fundamental to statistical programming.

The normal distribution is defined by two parameters: the mean (μ), which determines where the curve centers, and the standard deviation (σ), which controls the spread. The standard normal distribution has μ = 0 and σ = 1.

R handles normal distributions elegantly through a consistent function interface. Once you learn this pattern, you can apply it to dozens of other distributions.
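To see how the convention transfers, here is the same d/p/q/r pattern applied to two other distributions (Student's t and the uniform), shown as a quick sketch:

```r
# The d/p/q/r convention carries over to other distributions.
# Student's t with 10 degrees of freedom:
dt(0, df = 10)          # density at 0
pt(1.812, df = 10)      # cumulative probability, roughly 0.95
qt(0.95, df = 10)       # 95th percentile, roughly 1.812
rt(3, df = 10)          # three random draws

# Uniform on [0, 1]:
punif(0.25)             # [1] 0.25
qunif(0.25)             # [1] 0.25
```

Swap the suffix (norm, t, unif, binom, pois, ...) and the prefix keeps its meaning.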

R’s Built-in Normal Distribution Functions

R uses a four-function convention for every probability distribution. For the normal distribution:

Function   Purpose                        Mnemonic
dnorm()    Density (height of the curve)  density
pnorm()    Cumulative probability         probability
qnorm()    Quantile (inverse of pnorm)    quantile
rnorm()    Random sample generation       random

Here’s the basic syntax for each:

# dnorm: probability density at x
dnorm(x = 0, mean = 0, sd = 1)
# [1] 0.3989423

# pnorm: P(X <= x), cumulative probability
pnorm(q = 1.96, mean = 0, sd = 1)
# [1] 0.9750021

# qnorm: find x such that P(X <= x) = p
qnorm(p = 0.975, mean = 0, sd = 1)
# [1] 1.959964

# rnorm: generate random values
set.seed(42)
rnorm(n = 5, mean = 100, sd = 15)
# [1] 120.56438  91.52953 105.44693 109.49294 106.06402

The dnorm() function returns the height of the probability density function—useful for plotting but rarely for calculations. The other three functions do the heavy lifting in practical statistics.

Generating and Visualizing Random Normal Data

Generating normally distributed data is straightforward with rnorm(). Always set a seed for reproducibility:

set.seed(123)
sample_data <- rnorm(n = 1000, mean = 50, sd = 10)

# Quick summary
summary(sample_data)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   17.95   43.26   50.08   50.09   56.82   81.54

For visualization, start with a histogram and overlay the theoretical density:

# Base R approach
hist(sample_data, 
     breaks = 30, 
     probability = TRUE,
     main = "Sample vs Theoretical Normal",
     xlab = "Value",
     col = "lightblue",
     border = "white")

# Overlay theoretical curve
curve(dnorm(x, mean = 50, sd = 10), 
      add = TRUE, 
      col = "red", 
      lwd = 2)

For publication-quality graphics, use ggplot2:

library(ggplot2)

df <- data.frame(value = sample_data)

ggplot(df, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 30, 
                 fill = "steelblue", 
                 color = "white",
                 alpha = 0.7) +
  stat_function(fun = dnorm, 
                args = list(mean = 50, sd = 10),
                color = "red", 
                linewidth = 1) +
  labs(title = "Sample Distribution with Theoretical Overlay",
       x = "Value", 
       y = "Density") +
  theme_minimal()

Probability Calculations with pnorm() and qnorm()

These two functions are inverses of each other. pnorm() answers “what’s the probability of observing a value at or below this one?” while qnorm() answers “what value corresponds to this cumulative probability?”

Finding Probabilities

# P(X < 60) for X ~ N(50, 10)
pnorm(60, mean = 50, sd = 10)
# [1] 0.8413447

# P(X > 60) - use lower.tail = FALSE
pnorm(60, mean = 50, sd = 10, lower.tail = FALSE)
# [1] 0.1586553

# P(40 < X < 60) - subtract cumulative probabilities
pnorm(60, mean = 50, sd = 10) - pnorm(40, mean = 50, sd = 10)
# [1] 0.6826895

That last result—approximately 68%—confirms the empirical rule: about 68% of data falls within one standard deviation of the mean.
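The full 68–95–99.7 rule can be confirmed the same way, by vectorizing the subtraction over one, two, and three standard deviations:

```r
# Coverage within k standard deviations of the mean, for X ~ N(50, 10)
k <- 1:3
coverage <- pnorm(50 + k * 10, mean = 50, sd = 10) -
            pnorm(50 - k * 10, mean = 50, sd = 10)
round(coverage, 4)
# [1] 0.6827 0.9545 0.9973
```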

Finding Quantiles

# What value has 90% of the distribution below it?
qnorm(0.90, mean = 50, sd = 10)
# [1] 62.81552

# Find the 95th percentile of standard normal
qnorm(0.95)
# [1] 1.644854

# Find symmetric bounds containing 95% of data
qnorm(c(0.025, 0.975), mean = 50, sd = 10)
# [1] 30.40036 69.59964
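Because qnorm() inverts pnorm(), composing the two returns the original input, which makes a handy sanity check when you mix the two in a calculation:

```r
# Round trip: value -> probability -> value
x <- 62.81552
qnorm(pnorm(x, mean = 50, sd = 10), mean = 50, sd = 10)
# [1] 62.81552

# And probability -> value -> probability
pnorm(qnorm(0.90, mean = 50, sd = 10), mean = 50, sd = 10)
# [1] 0.9
```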

Testing for Normality

Before applying parametric tests, verify your data approximates normality. Use both statistical tests and visual methods.

Shapiro-Wilk Test

The Shapiro-Wilk test is among the most powerful normality tests; R's shapiro.test() accepts samples of 3 to 5000 observations:

# Test our generated normal data
shapiro.test(sample_data)
# W = 0.99883, p-value = 0.7621

# Test clearly non-normal data
skewed_data <- rexp(1000, rate = 0.5)
shapiro.test(skewed_data)
# W = 0.85234, p-value < 2.2e-16

A high p-value (> 0.05) means you cannot reject the null hypothesis that the data are normally distributed; that is consistent with normality, not proof of it. A low p-value indicates a significant departure from normality.

Q-Q Plots

Q-Q plots provide visual assessment. Points should fall along the diagonal line for normal data:

# Normal data - should be linear
par(mfrow = c(1, 2))

qqnorm(sample_data, main = "Normal Data")
qqline(sample_data, col = "red", lwd = 2)

# Skewed data - will show curvature
qqnorm(skewed_data, main = "Skewed Data")
qqline(skewed_data, col = "red", lwd = 2)

par(mfrow = c(1, 1))

Deviations at the tails indicate heavy or light tails. S-shaped patterns suggest skewness. Trust the visual pattern over p-values for large samples, where Shapiro-Wilk becomes overly sensitive.

Standardization and Z-Scores

Z-scores express how many standard deviations a value lies from the mean. They allow comparison across different scales.

Manual Calculation

# z = (x - mean) / sd
raw_score <- 75
mean_score <- 50
sd_score <- 10

z_score <- (raw_score - mean_score) / sd_score
z_score
# [1] 2.5

# Interpretation: 75 is 2.5 standard deviations above the mean
# What percentile is this?
pnorm(z_score)
# [1] 0.9937903
# 99.4th percentile
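Z-scores make results from different scales directly comparable. As a hypothetical example (the exam parameters below are made up for illustration), suppose a student takes two exams with different scoring systems:

```r
# Hypothetical: Exam A ~ N(70, 8), Exam B ~ N(500, 100)
z_a <- (82 - 70) / 8       # student's score on exam A
z_b <- (620 - 500) / 100   # student's score on exam B

z_a  # [1] 1.5
z_b  # [1] 1.2

# Exam A is the relatively stronger result: 1.5 SDs vs 1.2 SDs above the mean
pnorm(z_a)
# [1] 0.9331928
pnorm(z_b)
# [1] 0.8849303
```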

Using scale()

For vectors or data frames, scale() standardizes efficiently:

# Standardize a vector
standardized <- scale(sample_data)

# Verify: mean ≈ 0, sd ≈ 1
mean(standardized)
# [1] -1.598721e-16 (essentially 0)
sd(standardized)
# [1] 1

# Standardize specific columns in a data frame
df <- data.frame(
  height = c(165, 170, 175, 180, 185),
  weight = c(60, 70, 75, 85, 90)
)

df_scaled <- as.data.frame(scale(df))
df_scaled
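scale() also stores the centering and scaling values as attributes on its result, which lets you map standardized values back to the original units. A sketch of the reverse transformation using sweep():

```r
df <- data.frame(
  height = c(165, 170, 175, 180, 185),
  weight = c(60, 70, 75, 85, 90)
)
scaled <- scale(df)

attr(scaled, "scaled:center")  # column means
attr(scaled, "scaled:scale")   # column standard deviations

# Undo the standardization: multiply by sd, then add back the mean
original <- sweep(sweep(scaled, 2, attr(scaled, "scaled:scale"), "*"),
                  2, attr(scaled, "scaled:center"), "+")
all.equal(as.data.frame(original), df, check.attributes = FALSE)
# [1] TRUE
```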

Real-World Applications

Confidence Intervals

Calculate a 95% confidence interval for a population mean:

calculate_ci <- function(data, confidence = 0.95) {
  n <- length(data)
  mean_val <- mean(data)
  se <- sd(data) / sqrt(n)
  
  # For large samples, use z-score
  z <- qnorm((1 + confidence) / 2)
  
  margin <- z * se
  
  list(
    mean = mean_val,
    lower = mean_val - margin,
    upper = mean_val + margin,
    margin_of_error = margin
  )
}

# Apply to our sample
ci <- calculate_ci(sample_data)
cat(sprintf("95%% CI: [%.2f, %.2f]\n", ci$lower, ci$upper))
# 95% CI: [49.47, 50.71]
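For small samples (roughly n < 30) the z critical value understates the uncertainty; swapping qnorm() for qt() with n - 1 degrees of freedom gives the usual t interval. A sketch of that variant (calculate_ci_t is a hypothetical helper, not a built-in):

```r
calculate_ci_t <- function(data, confidence = 0.95) {
  n <- length(data)
  se <- sd(data) / sqrt(n)
  t_crit <- qt((1 + confidence) / 2, df = n - 1)
  mean(data) + c(-1, 1) * t_crit * se
}

# Matches the interval reported by t.test():
small_sample <- c(48, 52, 49, 55, 51, 47, 53)
calculate_ci_t(small_sample)
t.test(small_sample)$conf.int
# both give approximately [48.06, 53.37]
```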

One-Sample Z-Test

Test whether a sample mean differs from a hypothesized population mean:

z_test <- function(sample, mu_0, sigma, alternative = "two.sided") {
  n <- length(sample)
  sample_mean <- mean(sample)
  z <- (sample_mean - mu_0) / (sigma / sqrt(n))
  
  p_value <- switch(alternative,
    "two.sided" = 2 * pnorm(-abs(z)),
    "less" = pnorm(z),
    "greater" = pnorm(z, lower.tail = FALSE)
  )
  
  list(
    z_statistic = z,
    p_value = p_value,
    sample_mean = sample_mean,
    null_mean = mu_0
  )
}

# Test if our sample mean differs from 50
result <- z_test(sample_data, mu_0 = 50, sigma = 10)
cat(sprintf("Z = %.3f, p = %.4f\n", result$z_statistic, result$p_value))

Quality Control Example

In manufacturing, specifications often define acceptable ranges. Calculate the proportion of products outside tolerance:

# Product weight: target 100g, tolerance ±3g
# Process produces weights ~ N(100.5, 1.2)

# Proportion below lower spec (97g)
below_spec <- pnorm(97, mean = 100.5, sd = 1.2)

# Proportion above upper spec (103g)
above_spec <- pnorm(103, mean = 100.5, sd = 1.2, lower.tail = FALSE)

# Total defect rate
defect_rate <- below_spec + above_spec
cat(sprintf("Defect rate: %.2f%%\n", defect_rate * 100))
# Defect rate: 2.04%
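The same inputs feed the standard process capability index Cpk, which measures how many three-sigma half-spreads fit between the process mean and the nearer specification limit (cpk below is a hypothetical helper, not a built-in):

```r
# Cpk = min(USL - mu, mu - LSL) / (3 * sigma)
cpk <- function(mu, sigma, lsl, usl) {
  min(usl - mu, mu - lsl) / (3 * sigma)
}

cpk(mu = 100.5, sigma = 1.2, lsl = 97, usl = 103)
# [1] 0.6944444
```

A Cpk below 1 flags a process that cannot reliably stay within specification; here the off-center mean (100.5 vs the 100 target) is part of the problem.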

The normal distribution functions in R are the foundation for statistical analysis. Master pnorm() and qnorm() first—they handle the probability calculations you’ll need most often. Combine them with normality testing to ensure your statistical methods are appropriate for your data.
