How to Perform Polynomial Regression in R
Key Insights
- Polynomial regression extends linear regression to capture curved relationships by adding squared, cubed, or higher-order terms. Use poly() for orthogonal polynomials or I(x^n) for raw coefficients.
- The biggest risk is overfitting: a degree-10 polynomial will fit your training data perfectly and generalize terribly. Use cross-validation and AIC to select the right degree.
- Before reaching for polynomials, consider whether splines or GAMs might serve you better; polynomials behave erratically at the edges of your data
Introduction to Polynomial Regression
Linear regression assumes a straight-line relationship between your predictor and response. Reality rarely cooperates. Growth curves plateau, costs accelerate, and biological processes follow S-curves. When your scatter plot shows a clear curve, forcing a linear model onto it wastes information and produces biased predictions.
Polynomial regression solves this by adding powered terms of your predictor:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + … + \beta_n x^n + \epsilon$$
A quadratic (degree 2) captures a single bend. A cubic (degree 3) captures an inflection point. Higher degrees capture increasingly complex wiggles—and increasingly capture noise.
Let’s generate a dataset with an obvious curved relationship:
set.seed(42)
n <- 100
x <- seq(0, 10, length.out = n)
y <- 3 + 2*x - 0.3*x^2 + rnorm(n, sd = 2)
data <- data.frame(x = x, y = y)
head(data)
This simulates a quadratic relationship with some noise. The true data-generating process has a negative quadratic term, meaning the curve bends downward.
Preparing Your Data
Before fitting any model, visualize your data. A scatter plot immediately reveals whether you’re dealing with a linear or non-linear relationship.
library(ggplot2)
ggplot(data, aes(x = x, y = y)) +
geom_point(alpha = 0.7, size = 2) +
labs(
title = "Scatter Plot Revealing Non-Linear Pattern",
x = "Predictor (x)",
y = "Response (y)"
) +
theme_minimal()
The inverted-U shape screams “don’t use linear regression.” A linear fit would systematically underpredict in the middle and overpredict at the edges.
Check for missing values and outliers before proceeding:
summary(data)
any(is.na(data))
Outliers matter more in polynomial regression than in linear regression. A single extreme point can dramatically pull your fitted curve, especially at higher degrees.
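Continuing with the simulated data above, one quick check is Cook's distance, which flags observations that disproportionately pull the fit. A common rule of thumb (not a hard cutoff) is to inspect points above 4/n:

```r
# Fit a preliminary quadratic and measure each point's influence on it
prelim <- lm(y ~ poly(x, 2), data = data)
cooks  <- cooks.distance(prelim)

threshold <- 4 / nrow(data)   # rule-of-thumb cutoff
which(cooks > threshold)      # row indices worth a closer look

plot(cooks, type = "h",
     xlab = "Observation", ylab = "Cook's distance")
abline(h = threshold, lty = 2)
```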
Fitting Polynomial Models in R
R gives you two syntaxes for polynomial terms, and they’re not equivalent.
Method 1: Raw polynomials with I()
model_raw <- lm(y ~ x + I(x^2), data = data)
summary(model_raw)
The I() function tells R to interpret x^2 literally rather than as a formula operator. This gives you interpretable coefficients—the coefficient on x^2 directly tells you the quadratic effect.
Method 2: Orthogonal polynomials with poly()
model_ortho <- lm(y ~ poly(x, 2), data = data)
summary(model_ortho)
The poly() function creates orthogonal polynomial terms. The coefficients aren’t directly interpretable, but the terms are uncorrelated with each other. This improves numerical stability and makes it easier to assess whether adding higher-order terms helps.
For raw coefficients from poly(), use raw = TRUE:
model_poly_raw <- lm(y ~ poly(x, 2, raw = TRUE), data = data)
summary(model_poly_raw)
Which should you use? Use poly() with raw = FALSE (the default) while you're building models and comparing degrees. Use I(x^n) or poly(raw = TRUE) when you need interpretable coefficients for reporting.
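To see the numerical-stability point concretely, compare correlations between raw powers of x and between poly()'s orthogonal columns. A quick check on our simulated x:

```r
# Raw powers are strongly correlated with each other...
cor(data$x, data$x^2)

# ...while poly()'s columns are uncorrelated by construction
round(cor(poly(data$x, 2)), 10)
```

The raw correlation is close to 1, which is exactly the multicollinearity that degrades coefficient estimates at higher degrees.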
Let’s fit quadratic and cubic models:
model_quad <- lm(y ~ poly(x, 2), data = data)
model_cubic <- lm(y ~ poly(x, 3), data = data)
summary(model_quad)
summary(model_cubic)
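Nested fits like these can also be compared directly with an F-test via anova(); a non-significant p-value for the added term suggests the simpler model suffices. Applied to the two models just fitted:

```r
# Does the cubic term improve on the quadratic? (F-test on nested models)
anova(model_quad, model_cubic)
```

In this simulated data the cubic term should not reach significance, since the true process is quadratic.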
Choosing the Right Polynomial Degree
Here’s where most practitioners go wrong. A higher-degree polynomial always fits the training data better. A degree-99 polynomial through 100 points achieves perfect fit. It’s also completely useless for prediction.
The bias-variance tradeoff governs your choice:
- Too low a degree: High bias, underfitting, missing the true pattern
- Too high a degree: High variance, overfitting, fitting noise
Use cross-validation and information criteria to find the sweet spot:
library(caret)
# Set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Compare degrees 1 through 5
results <- data.frame(
degree = 1:5,
RMSE = NA,
R2 = NA,
AIC = NA
)
for (d in 1:5) {
# Fit model
model <- lm(y ~ poly(x, d), data = data)
# Cross-validated RMSE
cv_model <- train(
y ~ poly(x, d),
data = data,
method = "lm",
trControl = train_control
)
results$RMSE[d] <- cv_model$results$RMSE
results$R2[d] <- cv_model$results$Rsquared
results$AIC[d] <- AIC(model)
}
print(results)
Look for the degree where cross-validated RMSE stops improving meaningfully. AIC penalizes complexity, so lower is better. In our simulated data, degree 2 should win since that’s the true data-generating process.
A more manual approach using adjusted R²:
adj_r2 <- sapply(1:5, function(d) {
model <- lm(y ~ poly(x, d), data = data)
summary(model)$adj.r.squared
})
plot(1:5, adj_r2, type = "b",
xlab = "Polynomial Degree",
ylab = "Adjusted R²",
main = "Model Selection via Adjusted R²")
Adjusted R² penalizes additional parameters, so it won’t always increase with degree.
Model Evaluation and Diagnostics
Once you’ve selected a degree, evaluate the model thoroughly.
final_model <- lm(y ~ poly(x, 2, raw = TRUE), data = data)
summary(final_model)
The summary shows:
- Coefficients: The intercept, linear, and quadratic effects
- p-values: Whether each term is statistically significant
- R²: Proportion of variance explained
- F-statistic: Overall model significance
Check residual diagnostics:
par(mfrow = c(2, 2))
plot(final_model)
These four plots reveal:
- Residuals vs Fitted: Should show no pattern. Curves here suggest you need a different model
- Q-Q Plot: Points should follow the diagonal for normally distributed residuals
- Scale-Location: Should show constant spread (homoscedasticity)
- Residuals vs Leverage: Identifies influential points
For polynomial regression specifically, watch for patterns in the residuals vs. fitted plot. If you see a curve, you may need a higher degree or a different approach entirely.
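To see what such a pattern looks like, deliberately underfit by forcing a straight line through this curved data; the residuals inherit the quadratic shape the model left out:

```r
# Underfit on purpose: a linear model on curved data
underfit <- lm(y ~ x, data = data)

plot(fitted(underfit), residuals(underfit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residual Curvature from an Underfit Model")
abline(h = 0, lty = 2)
```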
# Formal test for heteroscedasticity
library(lmtest)
bptest(final_model)
# Test for normality of residuals
shapiro.test(residuals(final_model))
Visualization and Prediction
Visualize your fitted model against the data:
# Create prediction data
pred_data <- data.frame(x = seq(min(data$x), max(data$x), length.out = 200))
pred_data$y_pred <- predict(final_model, newdata = pred_data)
# Get confidence intervals
pred_ci <- predict(final_model, newdata = pred_data, interval = "confidence")
pred_data$lower <- pred_ci[, "lwr"]
pred_data$upper <- pred_ci[, "upr"]
# Plot
ggplot() +
geom_point(data = data, aes(x = x, y = y), alpha = 0.6) +
geom_line(data = pred_data, aes(x = x, y = y_pred),
color = "blue", linewidth = 1) +
geom_ribbon(data = pred_data, aes(x = x, ymin = lower, ymax = upper),
alpha = 0.2, fill = "blue") +
labs(
title = "Polynomial Regression Fit with 95% Confidence Interval",
x = "Predictor (x)",
y = "Response (y)"
) +
theme_minimal()
For quick visualization, geom_smooth() handles everything:
ggplot(data, aes(x = x, y = y)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2),
color = "red", fill = "red", alpha = 0.2) +
theme_minimal()
Make predictions for new data:
new_observations <- data.frame(x = c(2.5, 5.0, 7.5))
predictions <- predict(final_model, newdata = new_observations,
interval = "prediction", level = 0.95)
cbind(new_observations, predictions)
Note the difference between confidence intervals (uncertainty in the mean response) and prediction intervals (uncertainty for individual observations). Prediction intervals are always wider.
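You can verify this directly by computing both intervals at the same points and comparing their widths, using final_model from above:

```r
# Confidence vs prediction intervals at the same new points
new_x    <- data.frame(x = c(2.5, 5.0, 7.5))
conf_int <- predict(final_model, newdata = new_x, interval = "confidence")
pred_int <- predict(final_model, newdata = new_x, interval = "prediction")

data.frame(x        = new_x$x,
           ci_width = conf_int[, "upr"] - conf_int[, "lwr"],
           pi_width = pred_int[, "upr"] - pred_int[, "lwr"])
```

The pi_width column is larger at every x, because prediction intervals add the irreducible noise in individual observations on top of the uncertainty in the fitted mean.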
Conclusion and Best Practices
Polynomial regression is a straightforward extension of linear regression that handles curved relationships. But it’s not always the right tool.
Use polynomial regression when:
- You have a clear theoretical reason for a polynomial relationship
- The curve is relatively simple (quadratic or cubic)
- You need interpretable coefficients
Consider alternatives when:
- The relationship has multiple bends or complex patterns—use splines
- You don’t know the functional form—use generalized additive models (GAMs)
- The relationship is inherently non-linear (exponential, logistic)—use appropriate non-linear models
Common pitfalls to avoid:
- Overfitting: Never go above degree 3 or 4 without strong justification and rigorous cross-validation
- Extrapolation: Polynomials explode outside the training data range. A quadratic might predict negative values or absurd extremes
- Multicollinearity: Raw polynomial terms are highly correlated. Use orthogonal polynomials or center your data
- Ignoring diagnostics: Residual patterns indicate model misspecification
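The extrapolation pitfall is easy to demonstrate with final_model, which was trained on x between 0 and 10. A short sketch:

```r
# Predictions well outside the training range follow the fitted parabola
# downward, regardless of what the real process does out there
outside <- data.frame(x = c(12, 15, 20))
predict(final_model, newdata = outside)
```

The fitted curve keeps bending toward large negative values that nothing in the data supports.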
The workflow is simple: visualize, fit multiple degrees, select via cross-validation, diagnose, and visualize again. When polynomials aren’t cutting it, move to splines or GAMs—they’re more flexible and better behaved at the boundaries.