How to Calculate Skewness in R
Key Insights
- Base R lacks a built-in skewness function, but you can calculate it manually using the third standardized moment formula or rely on the e1071 and moments packages for production code.
- The type parameter in skewness functions matters: different estimation methods can yield noticeably different results, especially with small sample sizes.
- Skewness values beyond ±1 typically indicate substantial asymmetry that may violate assumptions for parametric statistical tests, often requiring log or Box-Cox transformations.
Introduction to Skewness
Skewness measures the asymmetry of a probability distribution around its mean. While mean and standard deviation tell you about central tendency and spread, skewness reveals whether your data leans left or right—information that’s critical for choosing appropriate statistical methods.
A distribution can exhibit three types of skewness:
- Positive skew (right-skewed): The tail extends toward higher values. Income distributions are the classic example—most people earn modest amounts while a few earn substantially more, pulling the mean above the median.
- Negative skew (left-skewed): The tail extends toward lower values. Age at retirement often shows this pattern—most people retire around 65, but some retire much earlier.
- Zero skew: Symmetric distributions, including the normal distribution, have zero skewness. The converse is not guaranteed, so a value near zero suggests symmetry but does not prove it.
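As a quick sanity check of the mean/median relationship described in the list above, the following sketch simulates one sample of each kind. The seed, sample size, and the helper name gap are illustrative assumptions, not part of any dataset discussed in this article.

```r
set.seed(1)
right <- rexp(10000, rate = 1)    # tail extends toward higher values
left  <- -rexp(10000, rate = 1)   # mirror image: tail extends toward lower values
sym   <- rnorm(10000)             # symmetric around zero

# Hypothetical helper: how far the mean sits from the median
gap <- function(x) mean(x) - median(x)

round(c(right = gap(right), left = gap(left), symmetric = gap(sym)), 3)
```

The first gap should come out clearly positive, the second clearly negative, and the third close to zero, which is the pattern the three bullet points describe.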
Understanding skewness matters because many statistical techniques assume approximate normality, either of the data within groups (t-tests, ANOVA) or of the residuals (linear regression). When that assumption fails badly, these methods can produce misleading results, particularly with small samples. Knowing how to detect and quantify skewness is the first step toward addressing it.
The Mathematics Behind Skewness
The most common skewness measure is the third standardized moment, calculated as:
$$\gamma_1 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
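Here s is the standard deviation computed with divisor n (not n - 1), so the expression equals the third central moment divided by the second central moment raised to the power 3/2. The formula cubes the standardized deviations from the mean. Cubing preserves the sign: values below the mean contribute negative terms, values above contribute positive terms. If the distribution is symmetric, these cancel out. If it’s asymmetric, you get a non-zero result indicating the direction and magnitude of skew.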
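To make the sign argument concrete, here is a small worked example. The five values are made up for illustration, and the moments are taken with divisor n, which is an assumption about how s is defined in the formula above.

```r
x <- c(1, 2, 3, 4, 10)   # one large value creates a right tail
dev <- x - mean(x)       # deviations from the mean: -3 -2 -1 0 6
m2 <- mean(dev^2)        # second central moment: (9 + 4 + 1 + 0 + 36) / 5 = 10
m3 <- mean(dev^3)        # third central moment: (-27 - 8 - 1 + 0 + 216) / 5 = 36
g1 <- m3 / m2^1.5        # 36 / 10^1.5
round(g1, 4)
# [1] 1.1384
```

The single deviation of +6 contributes 216 after cubing, which swamps the three negative terms (-36 in total). That dominance of the long tail is what produces a positive result.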
Alternative formulas exist. Pearson’s first coefficient of skewness uses (mean - mode) / sd, while Pearson’s second uses 3(mean - median) / sd. Bowley’s skewness uses quartiles. Each has trade-offs in robustness and interpretability.
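Two of these alternatives are easy to sketch in base R. The example below is illustrative only: the data and variable names are assumptions, and Pearson's first coefficient is left out because the mode is awkward to estimate for continuous data.

```r
set.seed(42)
x <- rexp(1000, rate = 0.5)   # right-skewed sample

# Pearson's second coefficient: 3 * (mean - median) / sd
pearson2 <- 3 * (mean(x) - median(x)) / sd(x)

# Bowley's quartile skewness: (Q3 + Q1 - 2 * Q2) / (Q3 - Q1)
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
bowley <- (q[3] + q[1] - 2 * q[2]) / (q[3] - q[1])

c(pearson2 = pearson2, bowley = bowley)
```

Bowley's measure is bounded between -1 and 1 and depends only on quartiles, so extreme values in the tail barely move it. That robustness is the trade-off against the moment-based measure, which reacts strongly to outliers.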
Here’s how to calculate skewness manually in base R:
# Generate sample data with positive skew
set.seed(42)
data <- rexp(1000, rate = 0.5) # Exponential distribution is right-skewed
# Manual skewness calculation (population formula)
n <- length(data)
mean_val <- mean(data)
sd_val <- sqrt(mean((data - mean_val)^2)) # population SD (divisor n); sd() divides by n - 1
# Third standardized moment
skewness_manual <- sum(((data - mean_val) / sd_val)^3) / n
print(skewness_manual)
# [1] 1.976435
This calculation uses the population formula. Sample-adjusted versions apply correction factors, which we’ll explore with the packages.
Using Base R for Skewness Calculation
Base R doesn’t include a skewness function. This surprises many users given how fundamental the metric is, but it’s easy enough to create your own.
Here’s a robust custom function that handles edge cases and supports different calculation types:
skewness_custom <- function(x, na.rm = FALSE, type = 1) {
  # Handle missing values
  if (na.rm) {
    x <- x[!is.na(x)]
  } else if (any(is.na(x))) {
    return(NA_real_)
  }

  n <- length(x)
  if (n < 3) {
    warning("Skewness requires at least 3 observations")
    return(NA_real_)
  }

  # Central moments with divisor n
  mean_x <- mean(x)
  m3 <- sum((x - mean_x)^3) / n
  m2 <- sum((x - mean_x)^2) / n
  g1 <- m3 / (m2^1.5)

  # Apply bias correction based on type
  if (type == 1) {
    # Moment estimator (no correction)
    return(g1)
  } else if (type == 2) {
    # Sample skewness with bias correction (SAS/SPSS default)
    return(g1 * sqrt(n * (n - 1)) / (n - 2))
  } else if (type == 3) {
    # Standardizes by the n - 1 standard deviation (MINITAB/BMDP; e1071 default)
    return(g1 * ((n - 1) / n)^1.5)
  } else {
    stop("type must be 1, 2, or 3")
  }
}
# Test with our exponential data
skewness_custom(data, type = 1) # [1] 1.976435
skewness_custom(data, type = 2) # [1] 1.979403
The type parameter mirrors what you’ll find in packages. Type 2 applies a bias correction that becomes negligible with large samples but matters for small datasets.
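To see how quickly the corrections fade, the sketch below tabulates the multipliers that the custom function applies to the type 1 value. The chosen sample sizes and the name factors are arbitrary assumptions for illustration.

```r
n <- c(5, 10, 30, 100, 1000)

factors <- data.frame(
  n            = n,
  type2_factor = sqrt(n * (n - 1)) / (n - 2),   # multiplies g1 for type 2
  type3_factor = ((n - 1) / n)^1.5              # multiplies g1 for type 3
)

print(round(factors, 4))
```

At n = 5 the type 2 multiplier is about 1.49, so the reported skewness is nearly 50% larger than the type 1 value. At n = 1000 it is about 1.0015, which is why the choice of type rarely matters for large samples.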
Calculating Skewness with Popular Packages
For production code, use established packages. The two main options are e1071 and moments.
# Install if needed
# install.packages("e1071")
# install.packages("moments")
library(e1071)
library(moments)
# Create datasets with different skew patterns
set.seed(123)
right_skewed <- rexp(500, rate = 1)
left_skewed <- -rexp(500, rate = 1)
symmetric <- rnorm(500, mean = 50, sd = 10)
# Using e1071
e1071::skewness(right_skewed, type = 1) # [1] 2.041
e1071::skewness(right_skewed, type = 2) # [1] 2.047
e1071::skewness(right_skewed, type = 3) # [1] 2.035 (type 1 value times ((n - 1) / n)^1.5)
# Using moments
moments::skewness(right_skewed) # [1] 2.041
# Compare all three datasets
# e1071 defaults to type = 3, so type = 1 is set explicitly here
cat("Right-skewed:", e1071::skewness(right_skewed, type = 1), "\n")
cat("Left-skewed:", e1071::skewness(left_skewed, type = 1), "\n")
cat("Symmetric:", e1071::skewness(symmetric, type = 1), "\n")
Output:
Right-skewed: 2.041234
Left-skewed: -2.087651
Symmetric: 0.04521893
The e1071 package offers three type options matching different statistical software conventions. Type 1 is the plain moment estimator, type 2 corresponds to SAS and SPSS defaults, and type 3 matches MINITAB and BMDP. Note that e1071 defaults to type 3, so pass the type explicitly if you want anything else. The moments package uses type 1 and doesn’t expose alternatives, but it’s simpler if you just need a quick calculation.
My recommendation: use e1071 with type = 2 for most analyses. The bias correction helps with smaller samples, and it matches what collaborators using other statistical software will expect.
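If you want to confirm how the package results relate to the formulas, a cross-check along these lines can help. It is a sketch under assumptions: the data are simulated, the package checks only run when the package is installed, and the correction factors are the ones used in the custom function earlier in this article.

```r
set.seed(123)
x <- rexp(500, rate = 1)
n <- length(x)

# Type 1 (moment) estimator computed directly in base R
g1 <- mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

if (requireNamespace("e1071", quietly = TRUE)) {
  # The type is passed explicitly each time rather than relying on the default
  stopifnot(
    isTRUE(all.equal(e1071::skewness(x, type = 1), g1)),
    isTRUE(all.equal(e1071::skewness(x, type = 2),
                     g1 * sqrt(n * (n - 1)) / (n - 2))),
    isTRUE(all.equal(e1071::skewness(x, type = 3),
                     g1 * ((n - 1) / n)^1.5))
  )
}

if (requireNamespace("moments", quietly = TRUE)) {
  # moments::skewness() should agree with the type 1 value
  stopifnot(isTRUE(all.equal(moments::skewness(x), g1)))
}
```

The script runs silently when every comparison holds. A check like this is also a cheap way to document, in code, which convention an analysis relies on.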
Visualizing Skewness
Numbers alone don’t tell the full story. Always pair skewness calculations with visualization.
library(ggplot2)
# Prepare data
set.seed(456)
income_data <- data.frame(
  income = rlnorm(1000, meanlog = 10.5, sdlog = 0.8)
)
# Calculate statistics
income_mean <- mean(income_data$income)
income_median <- median(income_data$income)
income_skew <- e1071::skewness(income_data$income)
# Create visualization
ggplot(income_data, aes(x = income)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 40,
                 fill = "#2C3E50",
                 alpha = 0.7) +
  geom_density(color = "#E74C3C", linewidth = 1) +
  geom_vline(aes(xintercept = income_mean, color = "Mean"),
             linewidth = 1, linetype = "dashed") +
  geom_vline(aes(xintercept = income_median, color = "Median"),
             linewidth = 1, linetype = "dashed") +
  scale_color_manual(name = "Statistics",
                     values = c("Mean" = "#3498DB", "Median" = "#27AE60")) +
  labs(
    title = sprintf("Income Distribution (Skewness = %.2f)", income_skew),
    subtitle = "Right-skewed: Mean > Median, tail extends toward higher values",
    x = "Income ($)",
    y = "Density"
  ) +
  theme_minimal() +
  theme(legend.position = "top")
The visualization immediately confirms what the skewness value tells us. The mean sits to the right of the median, and the distribution has a long right tail. This visual-numeric pairing should be standard practice in exploratory data analysis.
Practical Application: Interpreting Results
Use these thresholds as a starting point for interpretation:
| Skewness Value | Interpretation |
|---|---|
| -0.5 to 0.5 | Approximately symmetric |
| -1 to -0.5 or 0.5 to 1 | Moderately skewed |
| < -1 or > 1 | Highly skewed |
These aren’t rigid rules. Context matters. A skewness of 0.8 might be acceptable for a large-sample t-test but problematic for a small-sample regression.
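The thresholds in the table can be wrapped in a small helper so that reports label skewness consistently. This is a minimal sketch: classify_skewness is a hypothetical name, and assigning a value of exactly 0.5 or 1 to the lower category is an assumption, since the table leaves the boundaries ambiguous.

```r
# Hypothetical helper: maps skewness values to the rule-of-thumb labels
classify_skewness <- function(g) {
  cut(abs(g),
      breaks = c(0, 0.5, 1, Inf),
      labels = c("approximately symmetric",
                 "moderately skewed",
                 "highly skewed"),
      include.lowest = TRUE)   # intervals: [0, 0.5], (0.5, 1], (1, Inf]
}

result <- classify_skewness(c(0.05, -0.8, 2.04))
as.character(result)
# labels, in order: approximately symmetric, moderately skewed, highly skewed
```

Because the helper works on the absolute value, it reports only the magnitude of the asymmetry. Keep the sign of the original statistic if the direction of the tail matters for your analysis.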
When skewness violates test assumptions, transformation is the standard remedy. Log transformation works well for right-skewed data:
# Original right-skewed data
set.seed(789)
original <- rlnorm(500, meanlog = 3, sdlog = 1.2)
# Log transformation
transformed <- log(original)
# Compare skewness
cat("Original skewness:", e1071::skewness(original), "\n")
cat("Transformed skewness:", e1071::skewness(transformed), "\n")
# Visual comparison
par(mfrow = c(1, 2))
hist(original,
     main = sprintf("Original (skew = %.2f)", e1071::skewness(original)),
     col = "#E74C3C", border = "white", xlab = "Value")
hist(transformed,
     main = sprintf("Log-Transformed (skew = %.2f)", e1071::skewness(transformed)),
     col = "#27AE60", border = "white", xlab = "log(Value)")
par(mfrow = c(1, 1))
Output:
Original skewness: 4.127653
Transformed skewness: 0.02341876
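The log transformation reduced skewness from 4.13 to essentially zero. The logarithm is undefined for zeros and negative values, so for non-negative data that contains zeros, consider log(x + 1) or a square root instead. Box-Cox also requires strictly positive values. When negatives are present, shift the data first or use the Yeo-Johnson transformation.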
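For data with zeros, the comparison can be sketched as follows. The simulated counts, the seed, and the helper name skew are assumptions made for illustration. The helper is the plain moment estimator in base R, so the example needs no packages.

```r
# Base-R moment estimator of skewness (divisor n throughout)
skew <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

set.seed(789)
counts <- rnbinom(1000, size = 1, mu = 5)   # right-skewed counts, includes zeros

round(c(original = skew(counts),
        log1p    = skew(log1p(counts)),     # log(x + 1), defined at zero
        sqrt     = skew(sqrt(counts))), 2)
```

Both transformations should pull the skewness much closer to zero than the original. Which one works better depends on the data, so compare them rather than assuming one is best.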
Conclusion
Calculating skewness in R is straightforward once you know your options. For quick exploratory work, the moments package provides a simple one-liner. For production analyses where you need control over bias correction, use e1071::skewness() with an explicit type parameter.
Always visualize your distributions—skewness is one number summarizing an entire distribution shape, and visual inspection catches nuances that a single metric misses. When you find substantial skewness that threatens your statistical assumptions, log and power transformations are your primary tools for remediation.
The key is making skewness assessment a routine part of your exploratory data analysis workflow. Check it before running parametric tests, and you’ll avoid the embarrassment of presenting results based on violated assumptions.