How to Calculate R-Squared for Machine Learning in Python
Key Insights
- R-squared measures the proportion of variance in your target variable that your model explains. A perfect model scores 1, a model no better than predicting the mean scores 0, and a worse model scores below 0; the metric can also be misleading with complex models or when adding features
- You can calculate R-squared using scikit-learn's r2_score() function, or implement it manually with the formula R² = 1 - (SS_res / SS_tot) to understand the underlying mechanics
- Always evaluate R-squared alongside other metrics like RMSE and MAE, and use adjusted R-squared when comparing models with different numbers of features to avoid overfitting traps
Introduction to R-Squared
R-squared (R²) is the most widely used metric for evaluating regression models. It tells you what percentage of the variance in your target variable is explained by your model’s predictions. An R² of 0.85 means your model explains 85% of the variance in the data—the remaining 15% is due to factors your model doesn’t capture.
Understanding R-squared is critical because it gives you immediate feedback on whether your model is capturing meaningful patterns or just making random guesses. A model with R² close to 0 is essentially useless, while one approaching 1.0 fits your data extremely well (though this can be a red flag for overfitting).
The metric works by comparing your model’s predictions against a naive baseline—simply predicting the mean value for every observation. If your sophisticated machine learning model performs worse than this baseline, you’ll get a negative R-squared value, which is a clear signal something is fundamentally wrong.
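That baseline behavior is easy to verify directly. In this sketch (the values are made up for illustration), predicting the mean scores essentially 0, while a constant guess far from the data scores below 0:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(10, 2, 100)  # synthetic target values

# Baseline: predict the mean for every observation
r2_baseline = r2_score(y_true, np.full(100, y_true.mean()))

# A "model" that predicts a constant far from the data
r2_bad = r2_score(y_true, np.full(100, 50.0))

print(f"Mean baseline R²: {r2_baseline:.4f}")  # ~0
print(f"Bad model R²:     {r2_bad:.4f}")       # strongly negative
```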
The Mathematics Behind R-Squared
The R-squared formula breaks down into two components:
R² = 1 - (SS_res / SS_tot)
- SS_res (Residual Sum of Squares): The sum of squared differences between actual values and predicted values. This measures how much error your model makes.
- SS_tot (Total Sum of Squares): The sum of squared differences between actual values and the mean of actual values. This represents the total variance in your data.
When your model predicts perfectly, SS_res equals zero, making R² equal to 1. When your model performs as poorly as just predicting the mean, SS_res equals SS_tot, making R² equal to 0.
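Both edge cases fall out of a tiny hand calculation. With four made-up observations:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])  # each prediction off by 0.5

ss_res = np.sum((y_true - y_pred) ** 2)         # 4 × 0.25 = 1.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # mean is 6, so 9+1+1+9 = 20.0
r2 = 1 - ss_res / ss_tot                        # 1 - 1/20 = 0.95

print(f"SS_res = {ss_res}, SS_tot = {ss_tot}, R² = {r2}")
```

If y_pred matched y_true exactly, SS_res would collapse to 0 and R² to 1; if every prediction were the mean (6.0), SS_res would equal SS_tot and R² would be 0.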
Here’s a visualization showing actual vs predicted values with residuals:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Generate sample data
np.random.seed(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.normal(0, 2, 50)
# Fit model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, label='Actual values')
plt.plot(X, y_pred, 'r-', linewidth=2, label='Predicted values')
# Draw residuals
for i in range(len(X)):
    plt.plot([X[i], X[i]], [y[i], y_pred[i]], 'g--', alpha=0.3)
plt.axhline(y=np.mean(y), color='blue', linestyle='--', label='Mean baseline')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.title('Residuals: Actual vs Predicted Values')
plt.tight_layout()
plt.savefig('r_squared_visualization.png', dpi=300, bbox_inches='tight')
plt.show()
The green dashed lines represent residuals—the errors your model makes. R-squared essentially measures how much smaller these residuals are compared to just predicting the mean (blue line).
Calculating R-Squared with Scikit-Learn
Scikit-learn provides a built-in r2_score() function that makes calculating R-squared trivial. Here’s a complete example using a real dataset:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.4f}")
# Alternative: use model's built-in score method
r2_alternative = model.score(X_test, y_test)
print(f"R-squared (alternative): {r2_alternative:.4f}")
The LinearRegression object also has a .score() method that returns R-squared directly, which is convenient but less flexible than using r2_score() explicitly.
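One concrete advantage of r2_score() is its multioutput parameter, which .score() does not expose: for models predicting several targets at once you can get one R² per column instead of a single average. A small sketch with made-up two-target data:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical two-target regression results
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 28.0]])
y_pred = np.array([[1.1, 11.0], [1.9, 19.0], [3.1, 29.0]])

per_target = r2_score(y_true, y_pred, multioutput='raw_values')  # one R² per column
averaged = r2_score(y_true, y_pred)  # default 'uniform_average': mean of the above

print(f"Per-target R²: {per_target}")
print(f"Averaged R²:   {averaged:.4f}")
```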
Manual R-Squared Implementation
Understanding how R-squared is calculated helps you interpret it better. Here’s a from-scratch implementation:
import numpy as np
def calculate_r2_manual(y_true, y_pred):
    """
    Calculate R-squared manually using the formula:
    R² = 1 - (SS_res / SS_tot)
    """
    # Convert to numpy arrays
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Calculate residual sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)

    # Calculate total sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)

    # Calculate R-squared
    r2 = 1 - (ss_res / ss_tot)
    return r2
# Using our previous example
r2_manual = calculate_r2_manual(y_test, y_pred)
r2_sklearn = r2_score(y_test, y_pred)
print(f"Manual R²: {r2_manual:.6f}")
print(f"Sklearn R²: {r2_sklearn:.6f}")
print(f"Difference: {abs(r2_manual - r2_sklearn):.10f}")
This implementation reveals the mechanics: you’re comparing the squared errors of your model against the squared errors of the simplest possible model (predicting the mean). The closer these values are, the worse your R-squared.
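One edge case the bare formula misses: a constant y_true makes SS_tot zero and the division blows up. A guarded variant is worth sketching; the return values chosen for the zero-variance case are a convention for this sketch (scikit-learn handles it similarly, to my knowledge, but verify against your version):

```python
import numpy as np

def calculate_r2_safe(y_true, y_pred):
    """R² with a guard against zero total variance (constant y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    if ss_tot == 0.0:
        # Convention: perfect predictions score 1, anything else scores 0
        return 1.0 if ss_res == 0.0 else 0.0
    return 1 - ss_res / ss_tot

print(calculate_r2_safe([5, 5, 5], [5, 5, 5]))  # 1.0
print(calculate_r2_safe([5, 5, 5], [4, 5, 6]))  # 0.0
```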
R-Squared for Different Model Types
R-squared works across different regression algorithms. Here’s a comparison:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# Prepare data
X_train_small = X_train[:1000] # Use subset for faster training
y_train_small = y_train[:1000]
# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Polynomial Regression (degree=2)': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(max_depth=10, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
}
# Train and evaluate
results = {}
for name, model in models.items():
    if 'Polynomial' in name:
        # Create polynomial features
        poly = PolynomialFeatures(degree=2)
        X_train_poly = poly.fit_transform(X_train_small)
        X_test_poly = poly.transform(X_test)
        model.fit(X_train_poly, y_train_small)
        y_pred = model.predict(X_test_poly)
    else:
        model.fit(X_train_small, y_train_small)
        y_pred = model.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    results[name] = r2
    print(f"{name:35} R² = {r2:.4f}")
# Find best model
best_model = max(results, key=results.get)
print(f"\nBest model: {best_model} (R² = {results[best_model]:.4f})")
Different algorithms capture different patterns. Tree-based models often achieve higher R-squared on complex datasets, but this doesn’t automatically make them better—they might be overfitting.
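A quick way to catch that overfitting is to score the same model on both its training data and held-out data; a large gap means the high R² reflects memorization, not signal. A self-contained sketch on synthetic data (standing in for the housing dataset above):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic noisy regression data
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (500, 3))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 1.5, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained tree can memorize the training set outright
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
r2_train = r2_score(y_tr, tree.predict(X_tr))
r2_test = r2_score(y_te, tree.predict(X_te))

print(f"Train R² = {r2_train:.3f}")  # near-perfect
print(f"Test R²  = {r2_test:.3f}")   # noticeably lower
```

Capping max_depth, as the comparison above does, narrows this gap at the cost of some training fit.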
Limitations and Adjusted R-Squared
R-squared has a critical flaw: on the training data, it never decreases when you add more features, even if those features are pure random noise. This makes it unreliable for comparing models with different numbers of predictors.
Adjusted R-squared penalizes model complexity:
Adjusted R² = 1 - [(1 - R²) × (n - 1) / (n - p - 1)]
Where n is the number of observations and p is the number of features.
def adjusted_r2(y_true, y_pred, n_features):
    """Calculate adjusted R-squared"""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    adjusted = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return adjusted
# Demonstrate the problem with regular R²
np.random.seed(42)
X_original = X_train[:500, :3] # Use 3 features
y_small = y_train[:500]
# Add 10 random noise features
X_noise = np.random.randn(500, 10)
X_with_noise = np.hstack([X_original, X_noise])
# Train models
model_original = LinearRegression().fit(X_original, y_small)
model_noise = LinearRegression().fit(X_with_noise, y_small)
# Evaluate on test set
y_pred_original = model_original.predict(X_test[:, :3])
y_pred_noise = model_noise.predict(
    np.hstack([X_test[:, :3], np.random.randn(len(X_test), 10)])
)
r2_original = r2_score(y_test, y_pred_original)
r2_noise = r2_score(y_test, y_pred_noise)
adj_r2_original = adjusted_r2(y_test, y_pred_original, 3)
adj_r2_noise = adjusted_r2(y_test, y_pred_noise, 13)
print(f"Model with 3 features:")
print(f" R² = {r2_original:.4f}, Adjusted R² = {adj_r2_original:.4f}")
print(f"\nModel with 13 features (10 random):")
print(f" R² = {r2_noise:.4f}, Adjusted R² = {adj_r2_noise:.4f}")
Notice how adjusted R-squared applies a harsher penalty to the 13-feature model. On training data, plain R-squared can only stay flat or rise when columns are added, so the adjustment is what stops noise features from masquerading as an improvement.
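The same effect is visible on pure training-set fits, where plain R-squared is guaranteed never to drop as columns are added. A self-contained numpy sketch (synthetic data; OLS via lstsq rather than scikit-learn, so the whole calculation is explicit):

```python
import numpy as np

def train_r2_and_adjusted(X, y):
    """Training-set R² and adjusted R² for an OLS fit with intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(0)
n = 60
X_real = rng.normal(size=(n, 2))
y = X_real @ np.array([1.5, -2.0]) + rng.normal(0, 1.0, n)
X_padded = np.column_stack([X_real, rng.normal(size=(n, 15))])  # 15 noise columns

r2_a, adj_a = train_r2_and_adjusted(X_real, y)
r2_b, adj_b = train_r2_and_adjusted(X_padded, y)
print(f" 2 features: R² = {r2_a:.4f}, adjusted R² = {adj_a:.4f}")
print(f"17 features: R² = {r2_b:.4f}, adjusted R² = {adj_b:.4f}")
```

Plain R² for the padded model can only match or beat the 2-feature model here, while the adjusted figure charges it for the 15 extra parameters.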
Best Practices and Alternatives
Never rely on R-squared alone. Use it alongside other metrics for a complete picture:
from sklearn.metrics import mean_squared_error, mean_absolute_error
def comprehensive_evaluation(y_true, y_pred, n_features=None):
    """Complete regression model evaluation"""
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)

    print(f"R² Score: {r2:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"MAE: {mae:.4f}")

    if n_features is not None:
        adj_r2 = adjusted_r2(y_true, y_pred, n_features)
        print(f"Adjusted R²: {adj_r2:.4f}")

    # Calculate relative metrics
    mean_target = np.mean(y_true)
    print(f"\nRelative to mean target ({mean_target:.2f}):")
    print(f"  RMSE as % of mean: {(rmse/mean_target)*100:.2f}%")
    print(f"  MAE as % of mean: {(mae/mean_target)*100:.2f}%")
# Evaluate our best model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
comprehensive_evaluation(y_test, y_pred, n_features=X_train.shape[1])
Key guidelines:
- Use R² for quick model comparison during development, but don’t stop there
- Prefer adjusted R² when comparing models with different feature counts
- Check RMSE/MAE to understand prediction errors in the original units
- Watch for negative R² on test data—it means your model is worse than predicting the mean
- Don’t chase perfect R²—values above 0.95 often indicate overfitting or data leakage
- Consider domain context—R² of 0.70 might be excellent in noisy domains like social sciences but poor for physics simulations
R-squared is a starting point, not the finish line. Use it to guide your modeling decisions, but always validate with multiple metrics and domain expertise.