R - Logistic Regression (glm)

Key Insights

  • Logistic regression uses the glm() function with family = binomial(link = "logit") to model binary outcomes, transforming linear predictions through the logistic function to produce probabilities between 0 and 1
  • Model evaluation requires examining coefficients, odds ratios, confusion matrices, and ROC curves rather than R-squared values used in linear regression
  • Handling class imbalance, multicollinearity, and proper train-test splitting are critical for building production-ready logistic regression models

Understanding Logistic Regression in R

Logistic regression models the probability of a binary outcome using a logistic function. Unlike linear regression, which predicts continuous values, logistic regression outputs probabilities constrained between 0 and 1. The model uses maximum likelihood estimation rather than ordinary least squares.

The logistic function transforms the linear combination of predictors:

# Basic logistic regression syntax
model <- glm(outcome ~ predictor1 + predictor2, 
             data = dataset, 
             family = binomial(link = "logit"))
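Under the hood, the logit link maps any real-valued linear predictor onto the (0, 1) interval. A quick sketch of that transformation (the input values here are arbitrary illustrations, not model output):

```r
# The logistic (inverse-logit) function: p = 1 / (1 + exp(-x))
x <- c(-4, -1, 0, 1, 4)          # example linear predictor values
manual <- 1 / (1 + exp(-x))      # computed by hand
builtin <- plogis(x)             # base R equivalent
all.equal(manual, builtin)       # TRUE: the two agree
plogis(0)                        # a linear predictor of 0 maps to probability 0.5
```

This is why the fitted coefficients live on the log-odds scale: glm() estimates the linear predictor, and plogis() (the inverse link) turns it into a probability.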

Building Your First Model

Start with a practical example using credit default prediction:

# Load required libraries
library(tidyverse)
library(caret)

# Create sample dataset
set.seed(123)
n <- 1000
credit_data <- data.frame(
  age = rnorm(n, 45, 12),
  income = rnorm(n, 50000, 15000),
  debt_ratio = runif(n, 0, 1),
  credit_score = rnorm(n, 650, 100),
  num_late_payments = rpois(n, 2)
)

# Generate binary outcome with realistic probabilities
credit_data$default <- rbinom(n, 1, 
  prob = plogis(-5 + 
    0.02 * credit_data$age +
    -0.00003 * credit_data$income +
    3 * credit_data$debt_ratio +
    -0.01 * credit_data$credit_score +
    0.3 * credit_data$num_late_payments))

# Split data
train_index <- createDataPartition(credit_data$default, p = 0.7, list = FALSE)
train_data <- credit_data[train_index, ]
test_data <- credit_data[-train_index, ]

# Fit logistic regression
model <- glm(default ~ age + income + debt_ratio + credit_score + num_late_payments,
             data = train_data,
             family = binomial(link = "logit"))

# View summary
summary(model)

Interpreting Coefficients and Odds Ratios

Coefficients in logistic regression represent log-odds. Convert them to odds ratios for interpretability:

# Extract coefficients
coefficients <- coef(model)
print(coefficients)

# Calculate odds ratios
odds_ratios <- exp(coefficients)
print(odds_ratios)

# Create interpretable summary
coef_summary <- data.frame(
  Variable = names(coefficients),
  Coefficient = coefficients,
  OddsRatio = odds_ratios,
  Interpretation = c(
    "Baseline odds",
    sprintf("%.2f%% change per year", (odds_ratios[2] - 1) * 100),
    sprintf("%.4f%% change per dollar", (odds_ratios[3] - 1) * 100),
    sprintf("%.1fx odds per unit increase", odds_ratios[4]),
    sprintf("%.2f%% change per point", (odds_ratios[5] - 1) * 100),
    sprintf("%.1fx odds per late payment", odds_ratios[6])
  )
)
print(coef_summary)

# Confidence intervals for odds ratios
conf_intervals <- exp(confint(model))
print(conf_intervals)
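A quick way to flag predictors whose odds ratios are statistically distinguishable from 1 is to check whether each interval excludes 1. This sketch assumes the conf_intervals matrix computed above, indexed by column position (lower bound first, as confint() returns it):

```r
# An odds ratio CI that excludes 1 suggests a significant effect at the 5% level
significant <- conf_intervals[, 1] > 1 | conf_intervals[, 2] < 1
print(names(which(significant)))
```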

Making Predictions

Generate predictions on test data and convert log-odds to probabilities:

# Predict probabilities
test_data$predicted_prob <- predict(model, 
                                    newdata = test_data, 
                                    type = "response")

# Predict classes using 0.5 threshold
test_data$predicted_class <- ifelse(test_data$predicted_prob > 0.5, 1, 0)

# View predictions
head(test_data[, c("default", "predicted_prob", "predicted_class")], 10)

# Custom threshold based on business requirements
threshold <- 0.3  # More conservative for credit risk
test_data$predicted_class_conservative <- ifelse(test_data$predicted_prob > threshold, 1, 0)

Model Evaluation Metrics

Assess model performance using multiple metrics:

# Confusion matrix
conf_matrix <- table(Predicted = test_data$predicted_class, 
                     Actual = test_data$default)
print(conf_matrix)

# Calculate metrics manually
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
precision <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
recall <- conf_matrix[2, 2] / sum(conf_matrix[, 2])
f1_score <- 2 * (precision * recall) / (precision + recall)

metrics <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
  Value = c(accuracy, precision, recall, f1_score)
)
print(metrics)

# Using caret for comprehensive metrics
confusionMatrix(factor(test_data$predicted_class, levels = c(0, 1)),
                factor(test_data$default, levels = c(0, 1)),
                positive = "1")

ROC Curve and AUC

ROC curves visualize the trade-off between sensitivity and specificity:

library(pROC)

# Create ROC object
roc_obj <- roc(test_data$default, test_data$predicted_prob)

# Plot ROC curve
plot(roc_obj, 
     main = "ROC Curve for Credit Default Model",
     col = "blue", 
     lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red")

# Calculate AUC
auc_value <- auc(roc_obj)
text(0.6, 0.2, paste("AUC =", round(auc_value, 3)), cex = 1.2)

# Find optimal threshold
coords_obj <- coords(roc_obj, "best", best.method = "youden")
print(coords_obj)
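The Youden-optimal cutoff can then replace the default 0.5 threshold. The snippet below assumes coords_obj is the data frame returned above; recent pROC versions name its cutoff column `threshold`:

```r
# Reclassify using the Youden-optimal threshold instead of 0.5
best_threshold <- coords_obj$threshold
test_data$predicted_class_opt <- ifelse(test_data$predicted_prob > best_threshold, 1, 0)
table(Predicted = test_data$predicted_class_opt, Actual = test_data$default)
```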

Handling Class Imbalance

Real-world datasets often have imbalanced classes. Address this with sampling or weights:

# Check class distribution
table(train_data$default)

# Method 1: Downsampling majority class
library(caret)
balanced_data <- downSample(x = train_data[, -which(names(train_data) == "default")],
                            y = factor(train_data$default))
names(balanced_data)[names(balanced_data) == "Class"] <- "default"

model_balanced <- glm(default ~ age + income + debt_ratio + credit_score + num_late_payments,
                      data = balanced_data,
                      family = binomial(link = "logit"))

# Method 2: Class weights
weights <- ifelse(train_data$default == 1, 
                  sum(train_data$default == 0) / sum(train_data$default == 1), 
                  1)

model_weighted <- glm(default ~ age + income + debt_ratio + credit_score + num_late_payments,
                      data = train_data,
                      family = binomial(link = "logit"),
                      weights = weights)

Checking Model Assumptions

Validate logistic regression assumptions:

# 1. Linearity of logit (continuous predictors)
# Create logit of predicted probabilities
train_data$logit_pred <- predict(model, type = "link")  # predict() on the link scale returns the logit directly

# Plot continuous predictors vs logit
par(mfrow = c(2, 2))
plot(train_data$age, train_data$logit_pred, main = "Age vs Logit")
plot(train_data$income, train_data$logit_pred, main = "Income vs Logit")
plot(train_data$debt_ratio, train_data$logit_pred, main = "Debt Ratio vs Logit")
plot(train_data$credit_score, train_data$logit_pred, main = "Credit Score vs Logit")

# 2. Multicollinearity check
library(car)
vif_values <- vif(model)
print(vif_values)

# VIF > 10 indicates problematic multicollinearity
if(any(vif_values > 10)) {
  warning("High multicollinearity detected")
}

# 3. Influential observations
cooksd <- cooks.distance(model)
influential <- which(cooksd > 4/nrow(train_data))
print(paste("Number of influential points:", length(influential)))
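One common follow-up, sketched here as a diagnostic rather than a recommendation to drop points automatically, is to refit without the flagged observations and compare coefficient estimates:

```r
# Refit excluding influential points and compare coefficients side by side
if (length(influential) > 0) {
  model_trimmed <- update(model, data = train_data[-influential, ])
  print(round(cbind(Full = coef(model), Trimmed = coef(model_trimmed)), 4))
}
```

Large shifts between the two columns suggest the flagged observations are driving the fit and deserve closer inspection.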

Feature Engineering and Interactions

Improve model performance with engineered features:

# Create interaction terms
model_interaction <- glm(default ~ age + income + debt_ratio + credit_score + 
                          num_late_payments + debt_ratio:credit_score + 
                          age:income,
                        data = train_data,
                        family = binomial(link = "logit"))

# Polynomial terms for non-linear relationships
model_poly <- glm(default ~ poly(age, 2) + poly(income, 2) + debt_ratio + 
                   credit_score + num_late_payments,
                  data = train_data,
                  family = binomial(link = "logit"))

# Compare models using AIC
aic_comparison <- data.frame(
  Model = c("Base", "Interaction", "Polynomial"),
  AIC = c(AIC(model), AIC(model_interaction), AIC(model_poly))
)
print(aic_comparison)

# Likelihood ratio test
anova(model, model_interaction, test = "Chisq")

Production Deployment

Prepare models for production use:

# Save model
saveRDS(model, "credit_default_model.rds")

# Load and predict function
predict_default <- function(new_data, model_path = "credit_default_model.rds") {
  model <- readRDS(model_path)
  predictions <- predict(model, newdata = new_data, type = "response")
  return(data.frame(
    probability = predictions,
    prediction = ifelse(predictions > 0.5, "Default", "No Default")
  ))
}

# Test prediction function
new_customer <- data.frame(
  age = 35,
  income = 60000,
  debt_ratio = 0.4,
  credit_score = 680,
  num_late_payments = 1
)

result <- predict_default(new_customer)
print(result)

Logistic regression in R provides a robust foundation for binary classification. Master coefficient interpretation, proper evaluation metrics, and assumption checking to build reliable models for production systems.
