How to Calculate R-Squared for Machine Learning in Python
Key Insights
- R-squared measures the proportion of variance in your target variable that your model explains. A perfect model scores 1, a model no better than predicting the mean scores 0, and a worse model scores below 0; the metric can also be misleading with complex models or when adding features
- You can calculate R-squared using scikit-learn's r2_score() function, or implement it manually with the formula R² = 1 - (SS_res / SS_tot) to understand the underlying mechanics
- Always evaluate R-squared alongside other metrics like RMSE and MAE, and use adjusted R-squared when comparing models with different numbers of features to avoid overfitting traps
Introduction to R-Squared
R-squared (R²) is the most widely used metric for evaluating regression models. It tells you what percentage of the variance in your target variable is explained by your model’s predictions. An R² of 0.85 means your model explains 85% of the variance in the data—the remaining 15% is due to factors your model doesn’t capture.
Understanding R-squared is critical because it gives you immediate feedback on whether your model is capturing meaningful patterns or just making random guesses. A model with R² close to 0 is essentially useless, while one approaching 1.0 fits your data extremely well (though this can be a red flag for overfitting).
The metric works by comparing your model’s predictions against a naive baseline—simply predicting the mean value for every observation. If your sophisticated machine learning model performs worse than this baseline, you’ll get a negative R-squared value, which is a clear signal something is fundamentally wrong.
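That baseline behavior is easy to verify directly. In this sketch (the values are made up for illustration), predicting the mean scores essentially 0, while a constant guess far from the data scores below 0:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(10, 2, 100)  # synthetic target values

# Baseline: predict the mean for every observation
r2_baseline = r2_score(y_true, np.full(100, y_true.mean()))

# A "model" that predicts a constant far from the data
r2_bad = r2_score(y_true, np.full(100, 50.0))

print(f"Mean baseline R²: {r2_baseline:.4f}")  # ~0
print(f"Bad model R²:     {r2_bad:.4f}")       # strongly negative
```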
The Mathematics Behind R-Squared
The R-squared formula breaks down into two components:
R² = 1 - (SS_res / SS_tot)
- SS_res (Residual Sum of Squares): The sum of squared differences between actual values and predicted values. This measures how much error your model makes.
- SS_tot (Total Sum of Squares): The sum of squared differences between actual values and the mean of actual values. This represents the total variance in your data.
When your model predicts perfectly, SS_res equals zero, making R² equal to 1. When your model performs as poorly as just predicting the mean, SS_res equals SS_tot, making R² equal to 0.
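Both edge cases fall out of a tiny hand calculation. With four made-up observations:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])  # each prediction off by 0.5

ss_res = np.sum((y_true - y_pred) ** 2)         # 4 × 0.25 = 1.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # mean is 6, so 9+1+1+9 = 20.0
r2 = 1 - ss_res / ss_tot                        # 1 - 1/20 = 0.95

print(f"SS_res = {ss_res}, SS_tot = {ss_tot}, R² = {r2}")
```

If y_pred matched y_true exactly, SS_res would collapse to 0 and R² to 1; if every prediction were the mean (6.0), SS_res would equal SS_tot and R² would be 0.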
Here’s a visualization showing actual vs predicted values with residuals:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Generate sample data
np.random.seed(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.normal(0, 2, 50)
# Fit model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, label='Actual values')
plt.plot(X, y_pred, 'r-', linewidth=2, label='Predicted values')
# Draw residuals
for i in range(len(X)):
    plt.plot([X[i], X[i]], [y[i], y_pred[i]], 'g--', alpha=0.3)
plt.axhline(y=np.mean(y), color='blue', linestyle='--', label='Mean baseline')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.title('Residuals: Actual vs Predicted Values')
plt.tight_layout()
plt.savefig('r_squared_visualization.png', dpi=300, bbox_inches='tight')
plt.show()
The green dashed lines represent residuals—the errors your model makes. R-squared essentially measures how much smaller these residuals are compared to just predicting the mean (blue line).
Calculating R-Squared with Scikit-Learn
Scikit-learn provides a built-in r2_score() function that makes calculating R-squared trivial. Here’s a complete example using a real dataset:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.4f}")
# Alternative: use model's built-in score method
r2_alternative = model.score(X_test, y_test)
print(f"R-squared (alternative): {r2_alternative:.4f}")
The LinearRegression object also has a .score() method that returns R-squared directly, which is convenient but less flexible than using r2_score() explicitly.
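One concrete advantage of r2_score() is its multioutput parameter, which .score() does not expose: for models predicting several targets at once you can get one R² per column instead of a single average. A small sketch with made-up two-target data:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical two-target regression results
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 28.0]])
y_pred = np.array([[1.1, 11.0], [1.9, 19.0], [3.1, 29.0]])

per_target = r2_score(y_true, y_pred, multioutput='raw_values')  # one R² per column
averaged = r2_score(y_true, y_pred)  # default 'uniform_average': mean of the above

print(f"Per-target R²: {per_target}")
print(f"Averaged R²:   {averaged:.4f}")
```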
Manual R-Squared Implementation
Understanding how R-squared is calculated helps you interpret it better. Here’s a from-scratch implementation:
import numpy as np
def calculate_r2_manual(y_true, y_pred):
    """
    Calculate R-squared manually using the formula:
    R² = 1 - (SS_res / SS_tot)
    """
    # Convert to numpy arrays
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Calculate residual sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)

    # Calculate total sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)

    # Calculate R-squared
    r2 = 1 - (ss_res / ss_tot)
    return r2
# Using our previous example
r2_manual = calculate_r2_manual(y_test, y_pred)
r2_sklearn = r2_score(y_test, y_pred)
print(f"Manual R²: {r2_manual:.6f}")
print(f"Sklearn R²: {r2_sklearn:.6f}")
print(f"Difference: {abs(r2_manual - r2_sklearn):.10f}")
This implementation reveals the mechanics: you’re comparing the squared errors of your model against the squared errors of the simplest possible model (predicting the mean). The closer these values are, the worse your R-squared.
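One edge case the bare formula misses: a constant y_true makes SS_tot zero and the division blows up. A guarded variant is worth sketching; the return values chosen for the zero-variance case are a convention for this sketch (scikit-learn handles it similarly, to my knowledge, but verify against your version):

```python
import numpy as np

def calculate_r2_safe(y_true, y_pred):
    """R² with a guard against zero total variance (constant y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    if ss_tot == 0.0:
        # Convention: perfect predictions score 1, anything else scores 0
        return 1.0 if ss_res == 0.0 else 0.0
    return 1 - ss_res / ss_tot

print(calculate_r2_safe([5, 5, 5], [5, 5, 5]))  # 1.0
print(calculate_r2_safe([5, 5, 5], [4, 5, 6]))  # 0.0
```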
R-Squared for Different Model Types
R-squared works across different regression algorithms. Here’s a comparison:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# Prepare data
X_train_small = X_train[:1000] # Use subset for faster training
y_train_small = y_train[:1000]
# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Polynomial Regression (degree=2)': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
}
# Train and evaluate
results = {}
for name, model in models.items():
    if 'Polynomial' in name:
        # Create polynomial features
        poly = PolynomialFeatures(degree=2)
        X_train_poly = poly.fit_transform(X_train_small)
        X_test_poly = poly.transform(X_test)
        model.fit(X_train_poly, y_train_small)
        y_pred = model.predict(X_test_poly)
    else:
        model.fit(X_train_small, y_train_small)
        y_pred = model.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    results[name] = r2
    print(f"{name:35} R² = {r2:.4f}")
# Find best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} (R² = {results[best_model]:.4f})")
Different algorithms capture different patterns. Tree-based models often achieve higher R-squared on complex datasets, but this doesn’t automatically make them better—they might be overfitting.
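A quick way to catch that overfitting is to score the same model on both its training data and held-out data; a large gap means the high R² reflects memorization, not signal. A self-contained sketch on synthetic data (standing in for the housing dataset above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic noisy regression data
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (500, 3))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 1.5, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained tree can memorize the training set outright
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
r2_train = r2_score(y_tr, tree.predict(X_tr))
r2_test = r2_score(y_te, tree.predict(X_te))

print(f"Train R² = {r2_train:.3f}")  # near-perfect
print(f"Test R²  = {r2_test:.3f}")   # noticeably lower
```

Capping max_depth, as the comparison above does, narrows this gap at the cost of some training fit.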
Limitations and Adjusted R-Squared
R-squared has a critical flaw: on the training data, it never decreases when you add more features, even if those features are pure random noise. This makes it unreliable for comparing models with different numbers of predictors.
Adjusted R-squared penalizes model complexity:
Adjusted R² = 1 - [(1 - R²) × (n - 1) / (n - p - 1)]
Where n is the number of observations and p is the number of features.
def adjusted_r2(y_true, y_pred, n_features):
    """Calculate adjusted R-squared"""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    adjusted = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return adjusted
# Demonstrate the problem with regular R²
np.random.seed(42)
X_original = X_train[:500, :3] # Use 3 features
y_small = y_train[:500]
# Add 10 random noise features
X_noise = np.random.randn(500, 10)
X_with_noise = np.hstack([X_original, X_noise])
# Train models
model_original = LinearRegression().fit(X_original, y_small)
model_noise = LinearRegression().fit(X_with_noise, y_small)
# Evaluate on test set
y_pred_original = model_original.predict(X_test[:, :3])
y_pred_noise = model_noise.predict(
    np.hstack([X_test[:, :3], np.random.randn(len(X_test), 10)])
)
r2_original = r2_score(y_test, y_pred_original)
r2_noise = r2_score(y_test, y_pred_noise)
adj_r2_original = adjusted_r2(y_test, y_pred_original, 3)
adj_r2_noise = adjusted_r2(y_test, y_pred_noise, 13)
print(f"Model with 3 features:")
print(f" R² = {r2_original:.4f}, Adjusted R² = {adj_r2_original:.4f}")
print(f"\nModel with 13 features (10 random):")
print(f" R² = {r2_noise:.4f}, Adjusted R² = {adj_r2_noise:.4f}")
Notice how adjusted R-squared applies a harsher penalty to the 13-feature model. On training data, plain R-squared can only stay flat or rise when columns are added, so the adjustment is what stops noise features from masquerading as an improvement.
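The same effect is visible on pure training-set fits, where plain R-squared is guaranteed never to drop as columns are added. A self-contained numpy sketch (synthetic data; OLS via lstsq rather than scikit-learn, so the whole calculation is explicit):

```python
import numpy as np

def train_r2_and_adjusted(X, y):
    """Training-set R² and adjusted R² for an OLS fit with intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(0)
n = 60
X_real = rng.normal(size=(n, 2))
y = X_real @ np.array([1.5, -2.0]) + rng.normal(0, 1.0, n)
X_padded = np.column_stack([X_real, rng.normal(size=(n, 15))])  # 15 noise columns

r2_a, adj_a = train_r2_and_adjusted(X_real, y)
r2_b, adj_b = train_r2_and_adjusted(X_padded, y)
print(f" 2 features: R² = {r2_a:.4f}, adjusted R² = {adj_a:.4f}")
print(f"17 features: R² = {r2_b:.4f}, adjusted R² = {adj_b:.4f}")
```

Plain R² for the padded model can only match or beat the 2-feature model here, while the adjusted figure charges it for the 15 extra parameters.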
Best Practices and Alternatives
Never rely on R-squared alone. Use it alongside other metrics for a complete picture:
from sklearn.metrics import mean_squared_error, mean_absolute_error
def comprehensive_evaluation(y_true, y_pred, n_features=None):
    """Complete regression model evaluation"""
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)

    print(f"R² Score: {r2:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")

    if n_features is not None:
        adj_r2 = adjusted_r2(y_true, y_pred, n_features)
        print(f"Adjusted R²: {adj_r2:.4f}")

    # Calculate relative metrics
    mean_target = np.mean(y_true)
    print(f"\nRelative to mean target ({mean_target:.2f}):")
    print(f"  RMSE as % of mean: {(rmse/mean_target)*100:.2f}%")
    print(f"  MAE as % of mean: {(mae/mean_target)*100:.2f}%")
# Evaluate our best model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
comprehensive_evaluation(y_test, y_pred, n_features=X_train.shape[1])
Key guidelines:
- Use R² for quick model comparison during development, but don’t stop there
- Prefer adjusted R² when comparing models with different feature counts
- Check RMSE/MAE to understand prediction errors in the original units
- Watch for negative R² on test data—it means your model is worse than predicting the mean
- Don’t chase perfect R²—values above 0.95 often indicate overfitting or data leakage
- Consider domain context—R² of 0.70 might be excellent in noisy domains like social sciences but poor for physics simulations
R-squared is a starting point, not the finish line. Use it to guide your modeling decisions, but always validate with multiple metrics and domain expertise.