How to Perform Linear Regression in Python with statsmodels
Key Insights
- statsmodels provides comprehensive statistical output including p-values, confidence intervals, and R-squared values that scikit-learn doesn’t offer, making it essential for inference-focused regression analysis.
- The formula API (statsmodels.formula.api) lets you specify models using R-style syntax like 'y ~ x1 + x2', which is more intuitive than manually constructing design matrices.
- Always validate regression assumptions through residual diagnostics; statsmodels includes built-in tools for heteroscedasticity tests, normality checks, and influence analysis.
Introduction to Linear Regression
Linear regression remains the workhorse of statistical modeling. At its core, Ordinary Least Squares (OLS) regression fits a line (or hyperplane) through your data by minimizing the sum of squared residuals. The model assumes a linear relationship between predictors and the response variable.
The key assumptions you need to remember: linearity, independence of errors, homoscedasticity (constant variance of residuals), and normally distributed residuals. Violate these, and your coefficient estimates or standard errors become unreliable.
So why use statsmodels instead of scikit-learn? Simple: scikit-learn is built for prediction, statsmodels is built for inference. When you need p-values to determine if a coefficient is statistically significant, confidence intervals around predictions, or diagnostic tests for model assumptions, statsmodels delivers. Scikit-learn’s LinearRegression gives you coefficients and predictions—nothing more.
If you’re building a production ML pipeline, use scikit-learn. If you’re analyzing relationships between variables or need to defend your model’s statistical validity, statsmodels is the right tool.
Environment Setup & Data Preparation
First, install the required packages:
pip install statsmodels pandas numpy matplotlib seaborn
Let’s work with a synthetic dataset that simulates housing prices. This gives us control over the true relationships and makes interpretation clearer.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
# Set seed for reproducibility
np.random.seed(42)
# Generate synthetic housing data
n = 200
data = pd.DataFrame({
'sqft': np.random.uniform(800, 3000, n),
'bedrooms': np.random.randint(1, 6, n),
'age': np.random.uniform(0, 50, n),
'distance_downtown': np.random.uniform(1, 30, n)
})
# True relationship: price = 50 + 0.15*sqft + 10*bedrooms - 0.5*age - 2*distance + noise
data['price'] = (
50 +
0.15 * data['sqft'] +
10 * data['bedrooms'] -
0.5 * data['age'] -
2 * data['distance_downtown'] +
np.random.normal(0, 30, n)
)
# Quick EDA
print(data.describe())
print(f"\nCorrelation with price:\n{data.corr()['price'].sort_values(ascending=False)}")
Always examine your data before modeling. Check for missing values, outliers, and the correlation structure between variables. This prevents surprises in your regression output.
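A quick sketch of those pre-modeling checks (using a small stand-in DataFrame shaped like the article's, so it runs on its own):

```python
import numpy as np
import pandas as pd

# Stand-in data mirroring the article's columns (illustrative only)
rng = np.random.default_rng(42)
data = pd.DataFrame({
    'sqft': rng.uniform(800, 3000, 200),
    'price': rng.normal(300, 50, 200),
})

# Missing values: any non-zero count needs handling before fitting OLS
print(data.isna().sum())

# Crude outlier screen: flag observations more than 3 SDs from the mean
z = (data - data.mean()) / data.std()
print((z.abs() > 3).sum())
```

A z-score cutoff of 3 is a rough convention; for skewed variables, an IQR-based rule or a look at the scatter plots is often more informative.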
Simple Linear Regression (One Predictor)
Let’s start with a single predictor: square footage predicting price. The formula API makes this readable.
# Simple linear regression using formula API
model_simple = smf.ols('price ~ sqft', data=data).fit()
# Print the full summary
print(model_simple.summary())
The summary output is dense but essential. Here’s what matters:
OLS Regression Results
==============================================================================
Dep. Variable:     price            R-squared:           0.847
Model:             OLS              Adj. R-squared:      0.846
Method:            Least Squares    F-statistic:         1096.
                                    Prob (F-statistic):  1.23e-82
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     23.4521      8.234      2.849      0.005       7.221      39.683
sqft           0.1489      0.004     33.107      0.000       0.140       0.158
==============================================================================
Key metrics to examine:
- R-squared (0.847): 84.7% of price variance is explained by square footage. Strong relationship.
- Coefficient for sqft (0.1489): Each additional square foot adds about 0.149 to the price. Since price is on a thousands-of-dollars scale in our synthetic data, that is roughly $149 per square foot.
- P-value (<0.05): The sqft coefficient is statistically significant.
- Confidence interval [0.140, 0.158]: We’re 95% confident the true coefficient falls in this range.
The intercept (23.45) represents the predicted price when sqft equals zero—not meaningful here, but necessary for the model.
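Rather than parsing the printed summary, you can pull each of these quantities programmatically. A self-contained sketch (regenerating a simple version of the synthetic data so it runs on its own):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild a one-predictor version of the synthetic data
rng = np.random.default_rng(42)
df = pd.DataFrame({'sqft': rng.uniform(800, 3000, 200)})
df['price'] = 50 + 0.15 * df['sqft'] + rng.normal(0, 30, 200)

model = smf.ols('price ~ sqft', data=df).fit()

# Each summary quantity is available as an attribute or method
print(model.params['sqft'])          # slope estimate
print(model.pvalues['sqft'])         # its p-value
print(model.conf_int().loc['sqft'])  # 95% CI as [lower, upper]
print(model.rsquared)
```

This is the form you want in scripts and reports, where the numbers feed into downstream logic instead of being read off a text table.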
Multiple Linear Regression
Real-world models need multiple predictors. Adding them is straightforward:
# Multiple linear regression
model_full = smf.ols('price ~ sqft + bedrooms + age + distance_downtown', data=data).fit()
print(model_full.summary())
The output now shows coefficients for each predictor:
# Extract specific results programmatically
print("Coefficients:")
print(model_full.params)
print("\nP-values:")
print(model_full.pvalues)
print(f"\nAdjusted R-squared: {model_full.rsquared_adj:.4f}")
Interpreting multiple regression coefficients:
Each coefficient represents the effect of that variable holding all others constant. If the sqft coefficient is 0.148, that’s the price increase per square foot when bedrooms, age, and distance remain fixed.
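You can verify the "holding all others constant" interpretation directly: predict for two hypothetical houses that differ by one square foot and nothing else, and the difference in predictions equals the sqft coefficient. A self-contained sketch with a two-predictor version of the synthetic data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Two-predictor version of the synthetic housing data
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'sqft': rng.uniform(800, 3000, n),
    'bedrooms': rng.integers(1, 6, n),
})
df['price'] = 50 + 0.15 * df['sqft'] + 10 * df['bedrooms'] + rng.normal(0, 30, n)

model = smf.ols('price ~ sqft + bedrooms', data=df).fit()

# Two hypothetical houses differing only in sqft
houses = pd.DataFrame({'sqft': [2000, 2001], 'bedrooms': [3, 3]})
preds = model.predict(houses)
print(preds.iloc[1] - preds.iloc[0])  # difference in predictions
print(model.params['sqft'])           # equals the coefficient
```
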
Watch for multicollinearity—when predictors are highly correlated with each other. This inflates standard errors and makes individual coefficients unstable. Check Variance Inflation Factors (VIF):
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF for each predictor
X = data[['sqft', 'bedrooms', 'age', 'distance_downtown']]
X_with_const = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data['Variable'] = X_with_const.columns
vif_data['VIF'] = [variance_inflation_factor(X_with_const.values, i)
for i in range(X_with_const.shape[1])]
print(vif_data)
VIF values above 5-10 indicate problematic multicollinearity. Consider removing or combining correlated predictors.
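To see what a problematic VIF looks like and one way to fix it, here is a sketch with a deliberately redundant predictor (a hypothetical 'rooms' variable built from sqft, not part of the article's dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
sqft = rng.uniform(800, 3000, n)
# 'rooms' is deliberately almost a linear function of sqft
rooms = sqft / 400 + rng.normal(0, 0.3, n)
X = pd.DataFrame({'sqft': sqft, 'rooms': rooms})
Xc = sm.add_constant(X)

vifs = [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])]
print(dict(zip(Xc.columns, vifs)))  # sqft and rooms both show large VIFs

# One remedy: drop the redundant predictor and recompute
Xc2 = sm.add_constant(X[['sqft']])
vif_sqft_alone = variance_inflation_factor(Xc2.values, 1)
print(vif_sqft_alone)  # back near the minimum value of 1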
Model Diagnostics & Assumption Checking
A model is only as good as its assumptions. statsmodels provides tools to validate them.
Residual Analysis
# Get residuals and fitted values
residuals = model_full.resid
fitted = model_full.fittedvalues
# Create diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# 1. Residuals vs Fitted (check linearity and homoscedasticity)
axes[0, 0].scatter(fitted, residuals, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')
# 2. Q-Q Plot (check normality)
sm.qqplot(residuals, line='45', ax=axes[0, 1])
axes[0, 1].set_title('Q-Q Plot')
# 3. Histogram of residuals
axes[1, 0].hist(residuals, bins=30, edgecolor='black')
axes[1, 0].set_xlabel('Residuals')
axes[1, 0].set_title('Residual Distribution')
# 4. Scale-Location plot
axes[1, 1].scatter(fitted, np.sqrt(np.abs(residuals)), alpha=0.5)
axes[1, 1].set_xlabel('Fitted Values')
axes[1, 1].set_ylabel('√|Residuals|')
axes[1, 1].set_title('Scale-Location')
plt.tight_layout()
plt.show()
Formal Statistical Tests
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import jarque_bera
# Breusch-Pagan test for heteroscedasticity
bp_test = het_breuschpagan(residuals, model_full.model.exog)
print(f"Breusch-Pagan test:")
print(f" LM Statistic: {bp_test[0]:.4f}")
print(f" P-value: {bp_test[1]:.4f}")
print(f" Interpretation: {'Heteroscedasticity present' if bp_test[1] < 0.05 else 'Homoscedasticity OK'}")
# Jarque-Bera test for normality
jb_test = jarque_bera(residuals)
print(f"\nJarque-Bera test:")
print(f" Statistic: {jb_test[0]:.4f}")
print(f" P-value: {jb_test[1]:.4f}")
print(f" Interpretation: {'Non-normal residuals' if jb_test[1] < 0.05 else 'Normality OK'}")
If heteroscedasticity is detected, consider using robust standard errors:
# Fit model with heteroscedasticity-robust standard errors
model_robust = smf.ols('price ~ sqft + bedrooms + age + distance_downtown', data=data).fit(cov_type='HC3')
print(model_robust.summary())
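Robust covariance changes only the standard errors, not the coefficient estimates. A self-contained sketch on deliberately heteroscedastic toy data (noise scale growing with x, not the article's housing data) makes the effect visible:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data where the noise scale grows with x: classic heteroscedasticity
rng = np.random.default_rng(42)
n = 500
x = rng.uniform(1, 10, n)
y = 2 + 3 * x + rng.normal(0, 1, n) * x
df = pd.DataFrame({'x': x, 'y': y})

fit_ols = smf.ols('y ~ x', data=df).fit()
fit_hc3 = smf.ols('y ~ x', data=df).fit(cov_type='HC3')

# Point estimates are identical; only the standard errors differ
print(fit_ols.params['x'], fit_hc3.params['x'])
print(fit_ols.bse['x'], fit_hc3.bse['x'])
```

With variance increasing in x, the classical standard error understates the uncertainty in the slope; HC3 corrects for this without refitting the coefficients.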
Making Predictions & Confidence Intervals
Predictions are straightforward, but statsmodels shines when you need uncertainty quantification.
# Create new data for prediction
new_houses = pd.DataFrame({
'sqft': [1500, 2000, 2500],
'bedrooms': [3, 4, 4],
'age': [10, 5, 20],
'distance_downtown': [5, 10, 15]
})
# Simple point predictions
predictions = model_full.predict(new_houses)
print("Point predictions:")
print(predictions.values)
# Get predictions with confidence and prediction intervals
pred_summary = model_full.get_prediction(new_houses)
pred_df = pred_summary.summary_frame(alpha=0.05)
print("\nFull prediction summary:")
print(pred_df)
The output includes:
- mean: Point prediction
- mean_ci_lower/upper: Confidence interval for the mean response
- obs_ci_lower/upper: Prediction interval for individual observations
Prediction intervals are always wider than confidence intervals—they account for both uncertainty in the regression line and individual variation around it.
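You can confirm this numerically by comparing interval widths at a single point. A self-contained sketch on simple toy data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simple toy regression to compare interval widths
rng = np.random.default_rng(42)
df = pd.DataFrame({'x': rng.uniform(0, 10, 100)})
df['y'] = 1 + 2 * df['x'] + rng.normal(0, 3, 100)

model = smf.ols('y ~ x', data=df).fit()
frame = model.get_prediction(pd.DataFrame({'x': [5.0]})).summary_frame(alpha=0.05)

ci_width = frame['mean_ci_upper'].iloc[0] - frame['mean_ci_lower'].iloc[0]
pi_width = frame['obs_ci_upper'].iloc[0] - frame['obs_ci_lower'].iloc[0]
print(ci_width, pi_width)  # the prediction interval is wider
```

As the sample size grows, the confidence interval shrinks toward zero width, while the prediction interval stays roughly as wide as the residual noise dictates.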
# Visualize predictions with intervals for sqft
sqft_range = pd.DataFrame({
'sqft': np.linspace(800, 3000, 100),
'bedrooms': 3,
'age': 15,
'distance_downtown': 10
})
pred = model_full.get_prediction(sqft_range)
pred_df = pred.summary_frame(alpha=0.05)
plt.figure(figsize=(10, 6))
plt.scatter(data['sqft'], data['price'], alpha=0.3, label='Data')
plt.plot(sqft_range['sqft'], pred_df['mean'], 'b-', label='Prediction')
plt.fill_between(sqft_range['sqft'], pred_df['mean_ci_lower'], pred_df['mean_ci_upper'],
alpha=0.3, label='95% CI (mean)')
plt.fill_between(sqft_range['sqft'], pred_df['obs_ci_lower'], pred_df['obs_ci_upper'],
alpha=0.1, label='95% PI (observation)')
plt.xlabel('Square Footage')
plt.ylabel('Price')
plt.legend()
plt.title('Predictions with Confidence and Prediction Intervals')
plt.show()
Conclusion & Next Steps
statsmodels transforms linear regression from a black-box prediction tool into a proper statistical analysis. You get the full picture: coefficient significance, model fit metrics, diagnostic tests, and prediction uncertainty.
When to use statsmodels:
- You need to interpret coefficient effects and their significance
- Stakeholders want confidence intervals and p-values
- You’re validating regression assumptions formally
- The goal is understanding relationships, not just prediction accuracy
Next steps to explore:
- Robust regression (statsmodels.robust.robust_linear_model.RLM) for outlier-resistant estimation
- Generalized Linear Models for non-normal response variables (logistic, Poisson regression)
- Mixed effects models (statsmodels.regression.mixed_linear_model) for hierarchical data
- Time series regression with autocorrelated errors
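As a taste of the GLM direction, the same formula interface handles logistic regression. A minimal sketch on toy binary data (the 'bought' outcome is invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy binary outcome whose log-odds are linear in x (true slope 1.5)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({'x': rng.normal(0, 1, n)})
p = 1 / (1 + np.exp(-(0.5 + 1.5 * df['x'])))
df['bought'] = rng.binomial(1, p)

# Same R-style formula, different model: logistic instead of OLS
logit_model = smf.logit('bought ~ x', data=df).fit(disp=0)
print(logit_model.params)   # coefficients on the log-odds scale
print(logit_model.pvalues)
```

Everything covered above for OLS (p-values, confidence intervals, diagnostics) carries over, with interpretation on the log-odds scale instead of the response scale.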
The formula API and comprehensive output make statsmodels the go-to library for statistical regression in Python. Master the diagnostics, understand the assumptions, and your regression analyses will be both defensible and insightful.