How to Perform the Kolmogorov-Smirnov Test in R
Key Insights
- The Kolmogorov-Smirnov test compares distributions by measuring the maximum vertical distance between cumulative distribution functions, making it useful for both normality testing and comparing two empirical samples.
- Always use the Lilliefors correction when testing normality with estimated parameters—the standard K-S test produces inflated p-values when you estimate mean and standard deviation from the same data.
- The K-S test struggles with tied values in discrete data; pass the exact = FALSE argument to fall back to asymptotic p-values when working with rounded or categorical-like continuous data (the ties warning itself may still appear).
Introduction to the Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov (K-S) test is a nonparametric test that compares probability distributions. Unlike tests that focus on specific moments like mean or variance, the K-S test examines the entire shape of distributions by comparing their cumulative distribution functions (CDFs).
The test comes in two variants. The one-sample K-S test compares your data against a theoretical distribution—testing whether your sample could have been drawn from a normal, uniform, exponential, or any other specified distribution. The two-sample K-S test compares two empirical samples to determine if they come from the same underlying distribution.
When should you reach for the K-S test instead of alternatives like Shapiro-Wilk or Anderson-Darling? The K-S test shines when you need distribution-free comparisons and when you’re comparing two samples rather than testing against a theoretical distribution. It’s also useful when you care about the entire distribution shape, not just normality. However, for pure normality testing with a single sample, Shapiro-Wilk typically has more statistical power.
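To make the power comparison concrete, here is a small simulation sketch (base R only; rejection counts will vary with the seed). For clearly non-normal data, Shapiro-Wilk should reject at least as often as a one-sample K-S test with estimated parameters:

```r
# Rough power sketch: how often does each test reject normality
# for clearly non-normal (exponential) data?
set.seed(1)
n_sims <- 200
sw_reject <- 0
ks_reject <- 0
for (i in seq_len(n_sims)) {
  x <- rexp(50, rate = 1)  # clearly non-normal sample
  if (shapiro.test(x)$p.value < 0.05) sw_reject <- sw_reject + 1
  # K-S with parameters estimated from the data (illustration only;
  # see the Limitations section on why this inflates p-values)
  if (ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$p.value < 0.05) {
    ks_reject <- ks_reject + 1
  }
}
cat("Shapiro-Wilk rejections:", sw_reject, "/", n_sims, "\n")
cat("K-S rejections:         ", ks_reject, "/", n_sims, "\n")
```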
Prerequisites and Setup
The K-S test lives in R’s built-in stats package, so you don’t need to install anything for basic functionality. For visualization and the Lilliefors correction, you’ll want a few additional packages.
# Core functionality - already loaded with base R
# library(stats)
# For visualization
library(ggplot2)
# For Lilliefors test (K-S with estimated parameters)
library(nortest)
# Generate sample data for examples
set.seed(42)
normal_data <- rnorm(100, mean = 50, sd = 10)
skewed_data <- rexp(100, rate = 0.1)
comparison_data <- rnorm(100, mean = 52, sd = 10)
One-Sample K-S Test: Testing Against a Known Distribution
The one-sample test answers this question: “Could my data have come from this specific distribution?” The test calculates the D statistic—the maximum absolute difference between your sample’s empirical CDF and the theoretical CDF.
# Test if data follows a standard normal distribution
result <- ks.test(normal_data, "pnorm", mean = 50, sd = 10)
print(result)
Exact one-sample Kolmogorov-Smirnov test
data: normal_data
D = 0.063385, p-value = 0.8037
alternative hypothesis: two-sided
The output gives you two key values. The D statistic (0.063) represents the maximum vertical distance between the empirical and theoretical CDFs. Smaller values indicate better fit. The p-value (0.80) tells you the probability of observing a D statistic this extreme if the data truly came from the specified distribution. A high p-value means you cannot reject the null hypothesis—the data is consistent with the theoretical distribution.
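To see exactly what D measures, you can recompute it by hand from the sorted sample. This sketch (self-contained, regenerating the same normal_data as above) checks both the top and the bottom of each ECDF step, which is how the one-sample statistic is defined:

```r
# Recompute the one-sample D statistic by hand and compare with ks.test().
set.seed(42)
normal_data <- rnorm(100, mean = 50, sd = 10)

x <- sort(normal_data)
n <- length(x)
theo <- pnorm(x, mean = 50, sd = 10)  # theoretical CDF at each observation
# The ECDF jumps from (i-1)/n to i/n at the i-th sorted value, so the
# largest gap occurs at the top or bottom of one of those steps.
d_manual <- max(pmax(seq_len(n) / n - theo, theo - (seq_len(n) - 1) / n))

d_ks <- unname(ks.test(normal_data, "pnorm", mean = 50, sd = 10)$statistic)
cat("Manual D:", round(d_manual, 6), " ks.test D:", round(d_ks, 6), "\n")
```

The two values agree, confirming that D is nothing more exotic than the biggest vertical gap between the step-function ECDF and the smooth theoretical CDF.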
You can test against any distribution R knows about:
# Test against uniform distribution
uniform_sample <- runif(100, min = 0, max = 1)
ks.test(uniform_sample, "punif", min = 0, max = 1)
# Test against exponential distribution
exp_sample <- rexp(100, rate = 2)
ks.test(exp_sample, "pexp", rate = 2)
# Test skewed data against normal - should reject
# (note: estimating the parameters from the data inflates the p-value;
# see the Limitations section)
ks.test(skewed_data, "pnorm", mean = mean(skewed_data), sd = sd(skewed_data))
Two-Sample K-S Test: Comparing Two Datasets
The two-sample variant compares two empirical distributions directly. This is invaluable for A/B testing, before/after comparisons, or validating that two data sources follow the same distribution.
# Compare two samples
two_sample_result <- ks.test(normal_data, comparison_data)
print(two_sample_result)
Asymptotic two-sample Kolmogorov-Smirnov test
data: normal_data and comparison_data
D = 0.13, p-value = 0.3521
alternative hypothesis: two-sided
The interpretation changes slightly here. A high p-value (0.35) indicates insufficient evidence to conclude the distributions differ. The samples could plausibly come from the same underlying distribution.
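For contrast, here is a sketch with samples drawn from clearly different distributions, where the test should reject (sample names are illustrative):

```r
# Two samples with a large location shift: the K-S test should reject.
set.seed(7)
sample_a <- rnorm(100, mean = 50, sd = 10)
sample_b <- rnorm(100, mean = 65, sd = 10)  # shifted by 1.5 SDs
shift_result <- ks.test(sample_a, sample_b)
cat("D =", round(unname(shift_result$statistic), 3),
    ", p =", format(shift_result$p.value, digits = 3), "\n")
```

With a shift this large, D lands near the theoretical maximum CDF gap for a 1.5-SD offset, and the p-value is far below any conventional threshold.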
Here’s a practical A/B testing scenario:
# Simulated response times (milliseconds) from two server configurations
set.seed(123)
server_a_times <- rgamma(150, shape = 2, rate = 0.01) # Original config
server_b_times <- rgamma(150, shape = 2.5, rate = 0.012) # New config
ab_result <- ks.test(server_a_times, server_b_times)
print(ab_result)
# Calculate practical metrics alongside
cat("\nServer A - Mean:", round(mean(server_a_times), 1), "ms\n")
cat("Server B - Mean:", round(mean(server_b_times), 1), "ms\n")
cat("Distribution difference detected:", ab_result$p.value < 0.05, "\n")
Visualizing K-S Test Results
Numbers tell part of the story; visualization tells the rest. Plotting empirical CDFs makes the D statistic intuitive—it’s literally the largest vertical gap between the curves.
# Create ECDF plot comparing two samples
library(ggplot2)
# Combine data for plotting
plot_data <- data.frame(
value = c(normal_data, comparison_data),
group = rep(c("Sample A", "Sample B"), each = 100)
)
# Calculate D statistic location for annotation
ecdf_a <- ecdf(normal_data)
ecdf_b <- ecdf(comparison_data)
all_values <- sort(unique(c(normal_data, comparison_data)))
differences <- abs(ecdf_a(all_values) - ecdf_b(all_values))
max_diff_idx <- which.max(differences)
max_diff_x <- all_values[max_diff_idx]
max_diff_y1 <- ecdf_a(max_diff_x)
max_diff_y2 <- ecdf_b(max_diff_x)
# Create the plot
ggplot(plot_data, aes(x = value, color = group)) +
stat_ecdf(linewidth = 1) +
geom_segment(
aes(x = max_diff_x, xend = max_diff_x,
y = max_diff_y1, yend = max_diff_y2),
color = "red", linewidth = 1.5, linetype = "dashed"
) +
annotate("text", x = max_diff_x + 3, y = (max_diff_y1 + max_diff_y2) / 2,
label = paste("D =", round(max(differences), 3)),
color = "red", fontface = "bold") +
labs(
title = "Empirical CDFs with K-S D Statistic",
x = "Value",
y = "Cumulative Probability",
color = "Sample"
) +
theme_minimal() +
theme(legend.position = "bottom")
For one-sample tests, overlay the theoretical CDF:
# One-sample visualization
ggplot(data.frame(x = normal_data), aes(x = x)) +
stat_ecdf(aes(color = "Empirical"), linewidth = 1) +
stat_function(
fun = pnorm,
args = list(mean = 50, sd = 10),
aes(color = "Theoretical Normal"),
linewidth = 1
) +
labs(
title = "Sample vs. Theoretical Normal Distribution",
x = "Value",
y = "Cumulative Probability",
color = "Distribution"
) +
theme_minimal()
Limitations and Common Pitfalls
The K-S test has several gotchas that trip up practitioners.
Parameter estimation bias: When you estimate distribution parameters from your data (like using mean(x) and sd(x) for a normality test), the standard K-S test produces inflated p-values: the test becomes overly conservative and rarely rejects non-normality. Use the Lilliefors test instead:
library(nortest)
# Wrong approach - inflated p-values
wrong_result <- ks.test(normal_data, "pnorm",
mean = mean(normal_data),
sd = sd(normal_data))
# Correct approach - Lilliefors test
correct_result <- lillie.test(normal_data)
cat("Standard K-S p-value:", wrong_result$p.value, "\n")
cat("Lilliefors p-value:", correct_result$p.value, "\n")
Ties in data: The K-S test assumes continuous distributions. Tied values (duplicates) violate this assumption:
# Data with ties triggers a warning
discrete_like <- round(rnorm(100, 50, 10))
ks.test(discrete_like, "pnorm", mean = 50, sd = 10)
# Force asymptotic p-values (the ties warning may still appear;
# wrap the call in suppressWarnings() if needed)
ks.test(discrete_like, "pnorm", mean = 50, sd = 10, exact = FALSE)
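Another common workaround, a heuristic rather than anything mandated by the ks.test() documentation, is to break ties with jitter no larger than the rounding grain. This barely moves the ECDF but removes the duplicates:

```r
# Heuristic: break ties with small uniform jitter before testing.
# Values were rounded to integers, so the rounding error is at most 0.5;
# jitter of +/- 0.5 stays within that grain.
set.seed(42)
discrete_like <- round(rnorm(100, 50, 10))
jittered <- discrete_like + runif(length(discrete_like), -0.5, 0.5)
anyDuplicated(jittered)  # 0: no ties remain
ks.test(jittered, "pnorm", mean = 50, sd = 10)
```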
Sample size sensitivity: With large samples, the test detects trivial differences. With small samples, it lacks power to detect real differences. Always pair statistical significance with effect size considerations.
# Large sample detects tiny difference
set.seed(42)
large_a <- rnorm(10000, mean = 0, sd = 1)
large_b <- rnorm(10000, mean = 0.05, sd = 1) # Barely different
ks.test(large_a, large_b) # Likely significant despite trivial difference
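The flip side is low power at small sample sizes. A sketch, assuming 15 observations per group: even a half standard deviation shift, a real effect by most standards, often fails to reach significance:

```r
# Small samples: a real difference can go undetected (low power).
set.seed(42)
small_a <- rnorm(15, mean = 0, sd = 1)
small_b <- rnorm(15, mean = 0.5, sd = 1)  # half-SD shift, a genuine effect
small_result <- ks.test(small_a, small_b)
cat("p-value with n = 15 per group:",
    format(small_result$p.value, digits = 3), "\n")
```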
Practical Example: End-to-End Workflow
Let’s work through a complete analysis. You’re comparing API response times before and after a performance optimization.
# Complete K-S test workflow for response time analysis
library(ggplot2)
library(nortest)
# Simulated response time data (milliseconds)
set.seed(2024)
before_optimization <- c(
rlnorm(200, meanlog = 5, sdlog = 0.5),
rlnorm(20, meanlog = 6, sdlog = 0.3) # Some slow requests
)
after_optimization <- rlnorm(220, meanlog = 4.8, sdlog = 0.4)
# Step 1: Exploratory summary
cat("=== Response Time Summary ===\n")
cat("Before - Mean:", round(mean(before_optimization), 1), "ms,",
"Median:", round(median(before_optimization), 1), "ms\n")
cat("After - Mean:", round(mean(after_optimization), 1), "ms,",
"Median:", round(median(after_optimization), 1), "ms\n\n")
# Step 2: Test if distributions are normal (they probably aren't)
cat("=== Normality Tests (Lilliefors) ===\n")
before_normal <- lillie.test(before_optimization)
after_normal <- lillie.test(after_optimization)
cat("Before optimization - p-value:", format(before_normal$p.value, digits = 3), "\n")
cat("After optimization - p-value:", format(after_normal$p.value, digits = 3), "\n\n")
# Step 3: Two-sample K-S test
cat("=== Two-Sample K-S Test ===\n")
ks_result <- ks.test(before_optimization, after_optimization)
print(ks_result)
# Step 4: Interpretation
cat("\n=== Interpretation ===\n")
if (ks_result$p.value < 0.05) {
cat("The distributions differ significantly (p =",
format(ks_result$p.value, digits = 3), ")\n")
cat("D statistic:", round(ks_result$statistic, 3), "\n")
cat("The optimization changed the response time distribution.\n")
} else {
cat("No significant difference detected between distributions.\n")
}
# Step 5: Visualization
plot_df <- data.frame(
time = c(before_optimization, after_optimization),
period = rep(c("Before", "After"), c(length(before_optimization),
length(after_optimization)))
)
ggplot(plot_df, aes(x = time, color = period)) +
stat_ecdf(linewidth = 1.2) +
scale_x_log10() +
labs(
title = "API Response Time Distribution: Before vs. After Optimization",
subtitle = paste("K-S Test: D =", round(ks_result$statistic, 3),
", p =", format(ks_result$p.value, digits = 3)),
x = "Response Time (ms, log scale)",
y = "Cumulative Probability",
color = "Period"
) +
theme_minimal() +
theme(legend.position = "bottom")
This workflow gives you a complete picture: summary statistics for context, normality checks to justify using nonparametric methods, the K-S test result, and a visualization that makes the distributional shift immediately apparent.
The K-S test won’t tell you everything about your distributions, but it provides a rigorous, assumption-light method for detecting distributional differences. Combine it with visualizations and domain knowledge for actionable insights.