How to Perform Polynomial Regression in R
Key Insights
- Polynomial regression extends linear regression to capture curved relationships by adding squared, cubed, or higher-order terms. Use poly() for orthogonal polynomials or I(x^n) for raw coefficients.
- The biggest risk is overfitting: a degree-10 polynomial will fit your training data perfectly and generalize terribly. Use cross-validation and AIC to select the right degree.
- Before reaching for polynomials, consider whether splines or GAMs might serve you better; polynomials behave erratically at the edges of your data
Introduction to Polynomial Regression
Linear regression assumes a straight-line relationship between your predictor and response. Reality rarely cooperates. Growth curves plateau, costs accelerate, and biological processes follow S-curves. When your scatter plot shows a clear curve, forcing a linear model onto it wastes information and produces biased predictions.
Polynomial regression solves this by adding powered terms of your predictor:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + … + \beta_n x^n + \epsilon$$
A quadratic (degree 2) captures a single bend. A cubic (degree 3) captures an inflection point. Higher degrees capture increasingly complex wiggles—and increasingly capture noise.
Let’s generate a dataset with an obvious curved relationship:
set.seed(42)
n <- 100
x <- seq(0, 10, length.out = n)
y <- 3 + 2*x - 0.3*x^2 + rnorm(n, sd = 2)
data <- data.frame(x = x, y = y)
head(data)
This simulates a quadratic relationship with some noise. The true data-generating process has a negative quadratic term, meaning the curve bends downward.
Preparing Your Data
Before fitting any model, visualize your data. A scatter plot immediately reveals whether you’re dealing with a linear or non-linear relationship.
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
geom_point(alpha = 0.7, size = 2) +
labs(
title = "Scatter Plot Revealing Non-Linear Pattern",
x = "Predictor (x)",
y = "Response (y)"
) +
theme_minimal()
The inverted-U shape screams “don’t use linear regression.” A linear fit would systematically underpredict in the middle and overpredict at the edges.
Check for missing values and outliers before proceeding:
summary(data)
any(is.na(data))
Outliers matter more in polynomial regression than in linear regression. A single extreme point can dramatically pull your fitted curve, especially at higher degrees.
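Continuing with the simulated data above, one quick check is Cook's distance, which flags observations that disproportionately pull the fit. A common rule of thumb (not a hard cutoff) is to inspect points above 4/n:

```r
# Fit a preliminary quadratic and measure each point's influence on it
prelim <- lm(y ~ poly(x, 2), data = data)
cooks  <- cooks.distance(prelim)

threshold <- 4 / nrow(data)   # rule-of-thumb cutoff
which(cooks > threshold)      # row indices worth a closer look

plot(cooks, type = "h",
     xlab = "Observation", ylab = "Cook's distance")
abline(h = threshold, lty = 2)
```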
Fitting Polynomial Models in R
R gives you two syntaxes for polynomial terms, and they’re not equivalent.
Method 1: Raw polynomials with I()
model_raw <- lm(y ~ x + I(x^2), data = data)
summary(model_raw)
The I() function tells R to interpret x^2 literally rather than as a formula operator. This gives you interpretable coefficients—the coefficient on x^2 directly tells you the quadratic effect.
Method 2: Orthogonal polynomials with poly()
model_ortho <- lm(y ~ poly(x, 2), data = data)
summary(model_ortho)
The poly() function creates orthogonal polynomial terms. The coefficients aren’t directly interpretable, but the terms are uncorrelated with each other. This improves numerical stability and makes it easier to assess whether adding higher-order terms helps.
For raw coefficients from poly(), use raw = TRUE:
model_poly_raw <- lm(y ~ poly(x, 2, raw = TRUE), data = data)
summary(model_poly_raw)
Which should you use? Use poly() with raw = FALSE (the default) while you're building models and comparing degrees. Use I(x^n) or poly(raw = TRUE) when you need interpretable coefficients for reporting.
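To see the numerical-stability point concretely, compare correlations between raw powers of x and between poly()'s orthogonal columns. A quick check on our simulated x:

```r
# Raw powers are strongly correlated with each other...
cor(data$x, data$x^2)

# ...while poly()'s columns are uncorrelated by construction
round(cor(poly(data$x, 2)), 10)
```

The raw correlation is close to 1, which is exactly the multicollinearity that degrades coefficient estimates at higher degrees.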
Let’s fit quadratic and cubic models:
model_quad <- lm(y ~ poly(x, 2), data = data)
model_cubic <- lm(y ~ poly(x, 3), data = data)
summary(model_quad)
summary(model_cubic)
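Nested fits like these can also be compared directly with an F-test via anova(); a non-significant p-value for the added term suggests the simpler model suffices. Applied to the two models just fitted:

```r
# Does the cubic term improve on the quadratic? (F-test on nested models)
anova(model_quad, model_cubic)
```

In this simulated data the cubic term should not reach significance, since the true process is quadratic.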
Choosing the Right Polynomial Degree
Here’s where most practitioners go wrong. A higher-degree polynomial always fits the training data better. A degree-99 polynomial through 100 points achieves perfect fit. It’s also completely useless for prediction.
The bias-variance tradeoff governs your choice:
- Too low a degree: High bias, underfitting, missing the true pattern
- Too high a degree: High variance, overfitting, fitting noise
Use cross-validation and information criteria to find the sweet spot:
library(caret)
# Set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Compare degrees 1 through 5
results <- data.frame(
degree = 1:5,
RMSE = NA,
R2 = NA,
AIC = NA
)
for (d in 1:5) {
# Fit model
model <- lm(y ~ poly(x, d), data = data)
# Cross-validated RMSE
cv_model <- train(
y ~ poly(x, d),
data = data,
method = "lm",
trControl = train_control
)
results$RMSE[d] <- cv_model$results$RMSE
results$R2[d] <- cv_model$results$Rsquared
results$AIC[d] <- AIC(model)
}
print(results)
Look for the degree where cross-validated RMSE stops improving meaningfully. AIC penalizes complexity, so lower is better. In our simulated data, degree 2 should win since that’s the true data-generating process.
A more manual approach using adjusted R²:
adj_r2 <- sapply(1:5, function(d) {
model <- lm(y ~ poly(x, d), data = data)
summary(model)$adj.r.squared
})
plot(1:5, adj_r2, type = "b",
xlab = "Polynomial Degree",
ylab = "Adjusted R²",
main = "Model Selection via Adjusted R²")
Adjusted R² penalizes additional parameters, so it won’t always increase with degree.
Model Evaluation and Diagnostics
Once you’ve selected a degree, evaluate the model thoroughly.
final_model <- lm(y ~ poly(x, 2, raw = TRUE), data = data)
summary(final_model)
The summary shows:
- Coefficients: The intercept, linear, and quadratic effects
- p-values: Whether each term is statistically significant
- R²: Proportion of variance explained
- F-statistic: Overall model significance
Check residual diagnostics:
par(mfrow = c(2, 2))
plot(final_model)
These four plots reveal:
- Residuals vs Fitted: Should show no pattern. Curves here suggest you need a different model
- Q-Q Plot: Points should follow the diagonal for normally distributed residuals
- Scale-Location: Should show constant spread (homoscedasticity)
- Residuals vs Leverage: Identifies influential points
For polynomial regression specifically, watch for patterns in the residuals vs. fitted plot. If you see a curve, you may need a higher degree or a different approach entirely.
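To see what such a pattern looks like, deliberately underfit by forcing a straight line through this curved data; the residuals inherit the quadratic shape the model left out:

```r
# Underfit on purpose: a linear model on curved data
underfit <- lm(y ~ x, data = data)

plot(fitted(underfit), residuals(underfit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residual Curvature from an Underfit Model")
abline(h = 0, lty = 2)
```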
# Formal test for heteroscedasticity
library(lmtest)
bptest(final_model)
# Test for normality of residuals
shapiro.test(residuals(final_model))
Visualization and Prediction
Visualize your fitted model against the data:
# Create prediction data
pred_data <- data.frame(x = seq(min(data$x), max(data$x), length.out = 200))
pred_data$y_pred <- predict(final_model, newdata = pred_data)
# Get confidence intervals
pred_ci <- predict(final_model, newdata = pred_data, interval = "confidence")
pred_data$lower <- pred_ci[, "lwr"]
pred_data$upper <- pred_ci[, "upr"]
# Plot
ggplot() +
geom_point(data = data, aes(x = x, y = y), alpha = 0.6) +
geom_line(data = pred_data, aes(x = x, y = y_pred),
color = "blue", linewidth = 1) +
geom_ribbon(data = pred_data, aes(x = x, ymin = lower, ymax = upper),
alpha = 0.2, fill = "blue") +
labs(
title = "Polynomial Regression Fit with 95% Confidence Interval",
x = "Predictor (x)",
y = "Response (y)"
) +
theme_minimal()
For quick visualization, geom_smooth() handles everything:
ggplot(data, aes(x = x, y = y)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2),
color = "red", fill = "red", alpha = 0.2) +
theme_minimal()
Make predictions for new data:
new_observations <- data.frame(x = c(2.5, 5.0, 7.5))
predictions <- predict(final_model, newdata = new_observations,
interval = "prediction", level = 0.95)
cbind(new_observations, predictions)
Note the difference between confidence intervals (uncertainty in the mean response) and prediction intervals (uncertainty for individual observations). Prediction intervals are always wider.
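You can verify this directly by computing both intervals at the same points and comparing their widths, using final_model from above:

```r
# Confidence vs prediction intervals at the same new points
new_x    <- data.frame(x = c(2.5, 5.0, 7.5))
conf_int <- predict(final_model, newdata = new_x, interval = "confidence")
pred_int <- predict(final_model, newdata = new_x, interval = "prediction")

data.frame(x        = new_x$x,
           ci_width = conf_int[, "upr"] - conf_int[, "lwr"],
           pi_width = pred_int[, "upr"] - pred_int[, "lwr"])
```

The pi_width column is larger at every x, because prediction intervals add the irreducible noise in individual observations on top of the uncertainty in the fitted mean.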
Conclusion and Best Practices
Polynomial regression is a straightforward extension of linear regression that handles curved relationships. But it’s not always the right tool.
Use polynomial regression when:
- You have a clear theoretical reason for a polynomial relationship
- The curve is relatively simple (quadratic or cubic)
- You need interpretable coefficients
Consider alternatives when:
- The relationship has multiple bends or complex patterns—use splines
- You don’t know the functional form—use generalized additive models (GAMs)
- The relationship is inherently non-linear (exponential, logistic)—use appropriate non-linear models
Common pitfalls to avoid:
- Overfitting: Never go above degree 3 or 4 without strong justification and rigorous cross-validation
- Extrapolation: Polynomials explode outside the training data range. A quadratic might predict negative values or absurd extremes
- Multicollinearity: Raw polynomial terms are highly correlated. Use orthogonal polynomials or center your data
- Ignoring diagnostics: Residual patterns indicate model misspecification
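The extrapolation pitfall is easy to demonstrate with final_model, which was trained on x between 0 and 10. A short sketch:

```r
# Predictions well outside the training range follow the fitted parabola
# downward, regardless of what the real process does out there
outside <- data.frame(x = c(12, 15, 20))
predict(final_model, newdata = outside)
```

The fitted curve keeps bending toward large negative values that nothing in the data supports.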
The workflow is simple: visualize, fit multiple degrees, select via cross-validation, diagnose, and visualize again. When polynomials aren’t cutting it, move to splines or GAMs—they’re more flexible and better behaved at the boundaries.