How to Plot the F Distribution in R
The F distribution is a right-skewed probability distribution that arises when comparing the ratio of two chi-squared random variables, each divided by their respective degrees of freedom. In...
Key Insights
- The F distribution requires two degrees of freedom parameters (df1 and df2), and understanding how they shape the curve is essential for interpreting ANOVA and regression results.
- R provides four core functions for working with the F distribution:
df()for density,pf()for cumulative probability,qf()for quantiles, andrf()for random sampling. - Visualizing the F distribution with shaded rejection regions transforms abstract p-values into intuitive graphical representations that strengthen your statistical reporting.
Introduction to the F Distribution
The F distribution is a right-skewed probability distribution that arises when comparing the ratio of two chi-squared random variables, each divided by their respective degrees of freedom. In practical terms, you encounter it constantly in ANOVA, regression analysis, and any test comparing variances between groups.
Unlike the normal or t distribution, the F distribution takes two parameters: the numerator degrees of freedom (df1) and the denominator degrees of freedom (df2). These aren’t interchangeable—swapping them produces a different distribution. The numerator df typically comes from the number of groups minus one, while the denominator df relates to the total sample size minus the number of groups.
The distribution is bounded at zero on the left (you can’t have negative variance ratios) and extends infinitely to the right, though values become increasingly improbable. As both degrees of freedom increase, the distribution becomes more symmetric and approaches normality.
Understanding F Distribution Functions in R
R provides four built-in functions for working with the F distribution, following the standard naming convention used across all distribution families:
# df(x, df1, df2) - Probability Density Function
# Returns the height of the density curve at point x
df(2, df1 = 5, df2 = 20)
# [1] 0.1329807
# pf(q, df1, df2) - Cumulative Distribution Function
# Returns P(X <= q), the probability of observing a value <= q
pf(2, df1 = 5, df2 = 20)
# [1] 0.8787954
# qf(p, df1, df2) - Quantile Function (inverse of pf)
# Returns the value x such that P(X <= x) = p
qf(0.95, df1 = 5, df2 = 20)
# [1] 2.71089
# rf(n, df1, df2) - Random Generation
# Generates n random values from the distribution
set.seed(42)
rf(5, df1 = 5, df2 = 20)
# [1] 1.0204556 0.5764085 1.1476854 0.8710504 1.8492907
The df() function is what you need for plotting density curves. The pf() function calculates p-values, while qf() finds critical values. Use rf() when you need to simulate F-distributed data.
Creating a Basic F Distribution Plot
The fastest way to plot an F distribution uses R’s curve() function, which evaluates a function across a range and plots the result:
# Simple F distribution plot using curve()
curve(df(x, df1 = 5, df2 = 20),
from = 0,
to = 5,
main = "F Distribution (df1 = 5, df2 = 20)",
xlab = "F Value",
ylab = "Density",
lwd = 2,
col = "steelblue")
For more control, generate the x and y values explicitly:
# Manual approach with plot()
x_values <- seq(0, 5, length.out = 500)
y_values <- df(x_values, df1 = 5, df2 = 20)
plot(x_values, y_values,
type = "l",
main = "F Distribution (df1 = 5, df2 = 20)",
xlab = "F Value",
ylab = "Density",
lwd = 2,
col = "steelblue")
The manual approach is preferable when you need to add shaded regions or work with the data programmatically. Start your x-axis at 0 (not negative values) since the F distribution is undefined for negative numbers.
Customizing Your F Distribution Plot
A publication-ready plot needs more than default settings. Here’s how to create a polished visualization:
# Enhanced F distribution plot
x <- seq(0, 6, length.out = 500)
y <- df(x, df1 = 5, df2 = 20)
# Set up the plot with custom margins
par(mar = c(5, 5, 4, 2))
plot(x, y,
type = "l",
main = "F Distribution with Critical Value",
xlab = "F Statistic",
ylab = "Probability Density",
lwd = 3,
col = "#2C3E50",
las = 1, # Horizontal y-axis labels
bty = "l", # L-shaped box
cex.lab = 1.2,
cex.main = 1.4)
# Add gridlines
grid(col = "gray90", lty = 1)
# Redraw the line on top of gridlines
lines(x, y, lwd = 3, col = "#2C3E50")
# Mark the critical value at alpha = 0.05
critical_value <- qf(0.95, df1 = 5, df2 = 20)
abline(v = critical_value, col = "#E74C3C", lwd = 2, lty = 2)
# Add annotation
text(critical_value + 0.3, 0.5,
paste("Critical Value\n=", round(critical_value, 3)),
col = "#E74C3C",
cex = 0.9)
# Add subtitle with parameters
mtext("df1 = 5, df2 = 20, α = 0.05", side = 3, line = 0.3, cex = 0.9)
The las = 1 argument rotates y-axis labels horizontally for better readability. Using bty = "l" removes the top and right borders, creating a cleaner look favored in scientific publications.
Comparing Multiple F Distributions
Understanding how degrees of freedom affect the distribution shape requires overlaying multiple curves:
# Compare F distributions with different degrees of freedom
x <- seq(0, 5, length.out = 500)
# Define parameter combinations
params <- list(
list(df1 = 1, df2 = 10, col = "#E74C3C", label = "df1=1, df2=10"),
list(df1 = 5, df2 = 10, col = "#3498DB", label = "df1=5, df2=10"),
list(df1 = 10, df2 = 10, col = "#27AE60", label = "df1=10, df2=10"),
list(df1 = 20, df2 = 50, col = "#9B59B6", label = "df1=20, df2=50")
)
# Initialize empty plot
plot(NULL,
xlim = c(0, 5),
ylim = c(0, 0.9),
main = "F Distributions with Varying Degrees of Freedom",
xlab = "F Value",
ylab = "Density",
las = 1)
# Add each distribution
for (p in params) {
lines(x, df(x, df1 = p$df1, df2 = p$df2),
col = p$col, lwd = 2.5)
}
# Add legend
legend("topright",
legend = sapply(params, function(p) p$label),
col = sapply(params, function(p) p$col),
lwd = 2.5,
bty = "n",
cex = 0.9)
Notice how smaller df1 values produce more extreme right skew, while larger degrees of freedom create distributions that peak closer to 1 and appear more symmetric. The F(1, df2) distribution is particularly skewed because it represents the square of a t-distributed random variable.
Plotting with ggplot2
For modern, publication-quality graphics, ggplot2 offers a more flexible approach:
library(ggplot2)
# Basic ggplot2 F distribution
ggplot(data.frame(x = c(0, 5)), aes(x = x)) +
stat_function(fun = df,
args = list(df1 = 5, df2 = 20),
linewidth = 1.2,
color = "#2C3E50") +
labs(title = "F Distribution (df1 = 5, df2 = 20)",
x = "F Value",
y = "Density") +
theme_minimal(base_size = 14)
Adding a shaded rejection region makes hypothesis testing visual:
# F distribution with shaded rejection region
df1 <- 5
df2 <- 20
alpha <- 0.05
critical_val <- qf(1 - alpha, df1, df2)
ggplot(data.frame(x = c(0, 6)), aes(x = x)) +
# Main density curve
stat_function(fun = df,
args = list(df1 = df1, df2 = df2),
linewidth = 1.2,
color = "#2C3E50") +
# Shaded rejection region
stat_function(fun = df,
args = list(df1 = df1, df2 = df2),
geom = "area",
xlim = c(critical_val, 6),
fill = "#E74C3C",
alpha = 0.4) +
# Critical value line
geom_vline(xintercept = critical_val,
linetype = "dashed",
color = "#E74C3C",
linewidth = 1) +
# Annotation
annotate("text", x = critical_val + 0.5, y = 0.4,
label = paste("Critical value =", round(critical_val, 3)),
color = "#E74C3C", hjust = 0) +
annotate("text", x = 4, y = 0.05,
label = paste("α =", alpha),
color = "#E74C3C", size = 5) +
labs(title = "F Distribution with Rejection Region",
subtitle = paste("df1 =", df1, ", df2 =", df2),
x = "F Statistic",
y = "Density") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold"))
The stat_function() with geom = "area" creates the shaded region, while xlim restricts it to values beyond the critical threshold.
Practical Application: Visualizing ANOVA Results
The real power of F distribution plots emerges when connected to actual analysis. Here’s a complete workflow that runs an ANOVA and visualizes the result:
# Sample data: plant growth under three treatments
set.seed(123)
treatment_A <- rnorm(15, mean = 20, sd = 3)
treatment_B <- rnorm(15, mean = 23, sd = 3)
treatment_C <- rnorm(15, mean = 22, sd = 3)
# Combine into data frame
plant_data <- data.frame(
growth = c(treatment_A, treatment_B, treatment_C),
treatment = factor(rep(c("A", "B", "C"), each = 15))
)
# Run ANOVA
anova_result <- aov(growth ~ treatment, data = plant_data)
anova_summary <- summary(anova_result)
# Extract F statistic and degrees of freedom
f_stat <- anova_summary[[1]]$`F value`[1]
df1 <- anova_summary[[1]]$Df[1]
df2 <- anova_summary[[1]]$Df[2]
p_value <- anova_summary[[1]]$`Pr(>F)`[1]
# Calculate critical value
critical_val <- qf(0.95, df1, df2)
# Create visualization
library(ggplot2)
ggplot(data.frame(x = c(0, max(f_stat + 2, 6))), aes(x = x)) +
# Density curve
stat_function(fun = df,
args = list(df1 = df1, df2 = df2),
linewidth = 1.2,
color = "#2C3E50") +
# Rejection region
stat_function(fun = df,
args = list(df1 = df1, df2 = df2),
geom = "area",
xlim = c(critical_val, max(f_stat + 2, 6)),
fill = "#E74C3C",
alpha = 0.3) +
# Observed F statistic
geom_vline(xintercept = f_stat,
color = "#27AE60",
linewidth = 1.5) +
# Critical value
geom_vline(xintercept = critical_val,
linetype = "dashed",
color = "#E74C3C",
linewidth = 1) +
# Annotations
annotate("text", x = f_stat + 0.15, y = 0.5,
label = paste("Observed F =", round(f_stat, 3)),
color = "#27AE60", hjust = 0, fontface = "bold") +
annotate("text", x = critical_val + 0.1, y = 0.35,
label = paste("Critical F =", round(critical_val, 3)),
color = "#E74C3C", hjust = 0) +
annotate("label", x = 0.5, y = 0.6,
label = paste("p-value =", format(p_value, digits = 4)),
fill = "white", size = 4) +
labs(title = "ANOVA Result Visualization",
subtitle = paste("F(", df1, ",", df2, ") distribution | α = 0.05"),
x = "F Statistic",
y = "Density") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold"))
This visualization immediately communicates whether your observed F statistic falls in the rejection region. The green line shows where your data landed, the red dashed line marks the decision boundary, and the shaded area represents the probability of observing an F value that extreme under the null hypothesis.
When your observed F statistic lands in the shaded region, you reject the null hypothesis. When it falls to the left of the critical value, you fail to reject. This visual representation makes statistical results accessible to audiences who may struggle with numerical p-values alone.