R - Logistic Regression (glm)
Key Insights
- Logistic regression uses the glm() function with family = binomial(link = "logit") to model binary outcomes, transforming linear predictions through the logistic function to produce probabilities between 0 and 1
- Model evaluation requires examining coefficients, odds ratios, confusion matrices, and ROC curves rather than the R-squared values used in linear regression
- Handling class imbalance, multicollinearity, and proper train-test splitting are critical for building production-ready logistic regression models
Understanding Logistic Regression in R
Logistic regression models the probability of a binary outcome using a logistic function. Unlike linear regression, which predicts continuous values, logistic regression outputs probabilities constrained between 0 and 1. The model uses maximum likelihood estimation rather than ordinary least squares.
The logistic function, p = 1 / (1 + e^(-η)), transforms the linear combination of predictors η into a probability:
# Basic logistic regression syntax
model <- glm(outcome ~ predictor1 + predictor2,
             data = dataset,
             family = binomial(link = "logit"))
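To see the transform itself, here is a minimal sketch. The `logistic` helper is defined here purely for illustration; base R's built-in `plogis()` computes the same thing.

```r
# Hand-rolled logistic (inverse-logit) function; plogis() is the built-in equivalent
logistic <- function(x) 1 / (1 + exp(-x))

eta <- c(-5, 0, 2.5)           # example values of the linear predictor
probs <- logistic(eta)
round(probs, 4)                # 0.0067 0.5000 0.9241

# Confirm agreement with base R
all.equal(probs, plogis(eta))  # TRUE
```

Every real-valued linear predictor maps into (0, 1), which is why the model's outputs can be read as probabilities.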
Building Your First Model
Start with a practical example using credit default prediction:
# Load required libraries
library(tidyverse)
library(caret)
# Create sample dataset
set.seed(123)
n <- 1000
credit_data <- data.frame(
  age = rnorm(n, 45, 12),
  income = rnorm(n, 50000, 15000),
  debt_ratio = runif(n, 0, 1),
  credit_score = rnorm(n, 650, 100),
  num_late_payments = rpois(n, 2)
)
# Generate binary outcome with realistic probabilities
credit_data$default <- rbinom(n, 1,
  prob = plogis(-5 +
    0.02 * credit_data$age -
    0.00003 * credit_data$income +
    3 * credit_data$debt_ratio -
    0.01 * credit_data$credit_score +
    0.3 * credit_data$num_late_payments))
# Split data
train_index <- createDataPartition(credit_data$default, p = 0.7, list = FALSE)
train_data <- credit_data[train_index, ]
test_data <- credit_data[-train_index, ]
# Fit logistic regression
model <- glm(default ~ age + income + debt_ratio + credit_score + num_late_payments,
             data = train_data,
             family = binomial(link = "logit"))
# View summary
summary(model)
Interpreting Coefficients and Odds Ratios
Coefficients in logistic regression represent log-odds. Convert them to odds ratios for interpretability:
# Extract coefficients
coefficients <- coef(model)
print(coefficients)
# Calculate odds ratios
odds_ratios <- exp(coefficients)
print(odds_ratios)
# Create interpretable summary
coef_summary <- data.frame(
  Variable = names(coefficients),
  Coefficient = coefficients,
  OddsRatio = odds_ratios,
  Interpretation = c(
    "Baseline odds",
    sprintf("%.2f%% change per year", (odds_ratios[2] - 1) * 100),
    sprintf("%.4f%% change per dollar", (odds_ratios[3] - 1) * 100),
    sprintf("%.1fx odds per unit increase", odds_ratios[4]),
    sprintf("%.2f%% change per point", (odds_ratios[5] - 1) * 100),
    sprintf("%.1fx odds per late payment", odds_ratios[6])
  )
)
print(coef_summary)
# Confidence intervals for odds ratios
conf_intervals <- exp(confint(model))
print(conf_intervals)
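As a quick worked example of the log-odds-to-odds-ratio conversion: the simulation above used a true coefficient of 0.3 on num_late_payments, so the fitted coefficient should land near that value.

```r
# A coefficient of 0.3 on the log-odds scale...
beta <- 0.3
exp(beta)              # ≈ 1.35: each additional late payment multiplies the odds by ~1.35
(exp(beta) - 1) * 100  # ≈ 35: roughly a 35% increase in the odds of default
```

Note that odds ratios multiply: two extra late payments multiply the odds by exp(0.3)^2 ≈ 1.82, not by 2 × 1.35.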
Making Predictions
Generate predictions on test data and convert log-odds to probabilities:
# Predict probabilities
test_data$predicted_prob <- predict(model,
                                    newdata = test_data,
                                    type = "response")
# Predict classes using 0.5 threshold
test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, 1, 0)
# View predictions
head(test_data[, c("default", "predicted_prob", "predicted_class")], 10)
# Custom threshold based on business requirements
threshold <- 0.3 # More conservative for credit risk
test_data$predicted_class_conservative <- ifelse(test_data$predicted_prob > threshold, 1, 0)
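The effect of lowering the threshold is easiest to see on a toy set of predicted probabilities (the values below are invented for illustration, not from the credit model):

```r
# Toy predicted probabilities and true labels
p     <- c(0.10, 0.25, 0.35, 0.60, 0.80)
truth <- c(0,    0,    1,    1,    1)

table(pred = as.integer(p > 0.5), truth)  # the borrower at 0.35 is missed
table(pred = as.integer(p > 0.3), truth)  # lowering the threshold recovers that default
```

A lower threshold raises recall (fewer missed defaults) at the cost of more false positives on realistic data, which is why the cutoff should reflect the relative cost of each error.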
Model Evaluation Metrics
Assess model performance using multiple metrics:
# Confusion matrix
conf_matrix <- table(Predicted = test_data$predicted_class,
                     Actual = test_data$default)
print(conf_matrix)
# Calculate metrics manually
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
precision <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
recall <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
f1_score <- 2 * (precision * recall) / (precision + recall)
metrics <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
  Value = c(accuracy, precision, recall, f1_score)
)
print(metrics)
# Using caret for comprehensive metrics
confusionMatrix(factor(test_data$predicted_class, levels = c(0, 1)),
                factor(test_data$default, levels = c(0, 1)),
                positive = "1")
ROC Curve and AUC
ROC curves visualize the trade-off between sensitivity and specificity:
library(pROC)
# Create ROC object
roc_obj <- roc(test_data$default, test_data$predicted_prob)
# Plot ROC curve
plot(roc_obj,
     main = "ROC Curve for Credit Default Model",
     col = "blue",
     lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red")
# Calculate AUC
auc_value <- auc(roc_obj)
text(0.6, 0.2, paste("AUC =", round(auc_value, 3)), cex = 1.2)
# Find optimal threshold
coords_obj <- coords(roc_obj, "best", best.method = "youden")
print(coords_obj)
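The "youden" criterion used above picks the threshold maximizing Youden's J statistic, J = sensitivity + specificity − 1. A hand-rolled sketch with hypothetical sensitivity/specificity pairs at three candidate thresholds (pROC does this search across every threshold internally):

```r
# Hypothetical operating points at three candidate thresholds
sens <- c(0.95, 0.85, 0.60)
spec <- c(0.40, 0.70, 0.90)

j <- sens + spec - 1  # Youden's J per threshold
j                     # 0.35 0.55 0.50
which.max(j)          # 2: the middle threshold balances the two error rates best
```

J = 0 corresponds to a classifier no better than chance along that threshold, and J = 1 to perfect separation.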
Handling Class Imbalance
Real-world datasets often have imbalanced classes. Address this with sampling or weights:
# Check class distribution
table(train_data$default)
# Method 1: Downsampling majority class
library(caret)
balanced_data <- downSample(x = train_data[, -which(names(train_data) == "default")],
                            y = factor(train_data$default))
names(balanced_data)[names(balanced_data) == "Class"] <- "default"
model_balanced <- glm(default ~ age + income + debt_ratio + credit_score + num_late_payments,
                      data = balanced_data,
                      family = binomial(link = "logit"))
# Method 2: Class weights
weights <- ifelse(train_data$default == 1,
                  sum(train_data$default == 0) / sum(train_data$default == 1),
                  1)
# Note: non-integer weights with family = binomial produce a
# "non-integer #successes" warning; the fit is still usable
model_weighted <- glm(default ~ age + income + debt_ratio + credit_score + num_late_payments,
                      data = train_data,
                      family = binomial(link = "logit"),
                      weights = weights)
Checking Model Assumptions
Validate logistic regression assumptions:
# 1. Linearity of logit (continuous predictors)
# Create logit of predicted probabilities
train_data$logit_pred <- qlogis(predict(model, type = "response"))
# Plot continuous predictors vs logit
par(mfrow = c(2, 2))
plot(train_data$age, train_data$logit_pred, main = "Age vs Logit")
plot(train_data$income, train_data$logit_pred, main = "Income vs Logit")
plot(train_data$debt_ratio, train_data$logit_pred, main = "Debt Ratio vs Logit")
plot(train_data$credit_score, train_data$logit_pred, main = "Credit Score vs Logit")
# 2. Multicollinearity check
library(car)
vif_values <- vif(model)
print(vif_values)
# VIF > 10 indicates problematic multicollinearity
if (any(vif_values > 10)) {
  warning("High multicollinearity detected")
}
# 3. Influential observations
cooksd <- cooks.distance(model)
influential <- which(cooksd > 4/nrow(train_data))
print(paste("Number of influential points:", length(influential)))
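A common follow-up is to refit without the flagged rows and compare coefficients; large shifts suggest the fit is being driven by a handful of observations. A self-contained sketch on small simulated data (with the credit model, `fit` and the index vector would be `model` and `influential` from above):

```r
set.seed(42)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

# Flag rows exceeding the 4/n Cook's distance cutoff, then refit without them
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / length(y))
fit_refit <- glm(y ~ x, family = binomial,
                 subset = setdiff(seq_along(y), flagged))

# Side-by-side coefficients; large shifts indicate influence problems
round(rbind(original = coef(fit), refit = coef(fit_refit)), 3)
```

Dropping points should be a diagnostic step, not a default: investigate whether flagged rows are data errors before excluding them from a production model.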
Feature Engineering and Interactions
Improve model performance with engineered features:
# Create interaction terms
model_interaction <- glm(default ~ age + income + debt_ratio + credit_score +
                           num_late_payments + debt_ratio:credit_score +
                           age:income,
                         data = train_data,
                         family = binomial(link = "logit"))
# Polynomial terms for non-linear relationships
model_poly <- glm(default ~ poly(age, 2) + poly(income, 2) + debt_ratio +
                    credit_score + num_late_payments,
                  data = train_data,
                  family = binomial(link = "logit"))
# Compare models using AIC
aic_comparison <- data.frame(
  Model = c("Base", "Interaction", "Polynomial"),
  AIC = c(AIC(model), AIC(model_interaction), AIC(model_poly))
)
print(aic_comparison)
# Likelihood ratio test
anova(model, model_interaction, test = "Chisq")
Production Deployment
Prepare models for production use:
# Save model
saveRDS(model, "credit_default_model.rds")
# Load and predict function
predict_default <- function(new_data, model_path = "credit_default_model.rds") {
  model <- readRDS(model_path)
  predictions <- predict(model, newdata = new_data, type = "response")
  return(data.frame(
    probability = predictions,
    prediction = ifelse(predictions > 0.5, "Default", "No Default")
  ))
}
# Test prediction function
new_customer <- data.frame(
  age = 35,
  income = 60000,
  debt_ratio = 0.4,
  credit_score = 680,
  num_late_payments = 1
)
result <- predict_default(new_customer)
print(result)
Logistic regression in R provides a robust foundation for binary classification. Master coefficient interpretation, proper evaluation metrics, and assumption checking to build reliable models for production systems.