How to Perform Logistic Regression in R

Key Insights

  • Logistic regression uses glm() with family = binomial to model binary outcomes, and interpreting results requires exponentiating coefficients to get meaningful odds ratios.
  • Model evaluation goes beyond accuracy—always examine the confusion matrix, precision, recall, and ROC-AUC to understand performance across different probability thresholds.
  • The default 0.5 probability threshold is rarely optimal; tune it based on your specific use case and the relative costs of false positives versus false negatives.

Introduction to Logistic Regression

Logistic regression is your go-to tool when predicting binary outcomes. Will a customer churn? Is this email spam? Does a patient have a disease? These yes/no questions demand a different approach than linear regression, which assumes continuous outcomes and can produce nonsensical predictions outside the 0-1 range.

The key difference lies in what we’re modeling. Linear regression predicts values directly. Logistic regression predicts the probability of an event occurring, constrained between 0 and 1 through the logistic function. This makes it interpretable, fast, and surprisingly effective for many real-world classification problems.
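The constraining function is the logistic (sigmoid) curve, p = 1 / (1 + e^(-x)), which maps any real-valued linear predictor into (0, 1). A quick base-R sketch:

```r
# The logistic function maps any real number into (0, 1)
x <- c(-5, 0, 5)
p <- plogis(x)              # base R's logistic: 1 / (1 + exp(-x))
print(round(p, 4))          # 0.0067 0.5000 0.9933

# qlogis() is its inverse (the logit, i.e. the log-odds)
all.equal(qlogis(p), x)     # TRUE
```

This is why logistic regression coefficients live on the log-odds scale: the model is linear in x, and plogis() turns that linear predictor into a probability.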

Despite the rise of complex machine learning algorithms, logistic regression remains a workhorse in production systems. It’s interpretable (you can explain why a prediction was made), computationally efficient, and often performs comparably to more complex models on structured data. Master it before reaching for gradient boosting.

Prerequisites and Data Preparation

Let’s work with the Titanic dataset—a classic for binary classification. Our goal: predict passenger survival based on available features.

# Load required packages
library(tidyverse)
library(caret)
library(pROC)

# Load the Titanic dataset
# Using the built-in Titanic data, converted to a usable format
data("Titanic")
titanic_df <- as.data.frame(Titanic)

# Expand the frequency table into individual observations
titanic <- titanic_df[rep(seq_len(nrow(titanic_df)), titanic_df$Freq), 1:4]
rownames(titanic) <- NULL

# Initial exploration
str(titanic)
summary(titanic)
# Check for missing values
colSums(is.na(titanic))

# Convert Survived to binary numeric (required for some operations)
titanic$Survived_num <- ifelse(titanic$Survived == "Yes", 1, 0)

# Check class balance
table(titanic$Survived)
prop.table(table(titanic$Survived))

Data preparation matters. Logistic regression handles categorical variables automatically in R through dummy encoding, but you need to be aware of the reference level. Check factor levels and set them explicitly if needed:

# Set reference levels explicitly
titanic$Class <- relevel(titanic$Class, ref = "Crew")
titanic$Sex <- relevel(titanic$Sex, ref = "Male")
titanic$Age <- relevel(titanic$Age, ref = "Adult")
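To see the dummy columns glm() will create behind the scenes, you can inspect model.matrix() on a small example (an illustrative factor, not the full dataset):

```r
# R encodes a factor as one 0/1 column per non-reference level
cls <- factor(c("Crew", "1st", "2nd", "3rd"),
              levels = c("Crew", "1st", "2nd", "3rd"))
model.matrix(~ cls)
# The "Crew" row (the reference level) is all zeros apart from the intercept,
# so every coefficient is a contrast against Crew
```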

Building the Model with glm()

R’s glm() function handles logistic regression through the family = binomial argument. The formula syntax follows the standard R convention: outcome ~ predictor1 + predictor2.

# Build the logistic regression model
model <- glm(Survived ~ Class + Sex + Age, 
             data = titanic, 
             family = binomial(link = "logit"))

# View the model summary
summary(model)

The output looks like this:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.1820     0.1441  -8.203 2.34e-16 ***
Class1st      2.2902     0.1679  13.643  < 2e-16 ***
Class2nd      1.0181     0.1454   7.002 2.53e-12 ***
Class3rd     -0.3577     0.1247  -2.868  0.00413 ** 
SexFemale     2.4201     0.1404  17.236  < 2e-16 ***
AgeChild      1.0615     0.2440   4.351 1.36e-05 ***
---
AIC: 2210.1

Key elements to examine:

  1. Coefficients (Estimate): These are log-odds. Positive values increase the probability of survival; negative values decrease it.
  2. Standard Error: Indicates coefficient precision. Smaller is better.
  3. z value and Pr(>|z|): The z-statistic and p-value for testing whether the coefficient differs from zero. Stars indicate significance levels.
  4. AIC: Akaike Information Criterion—lower values indicate better model fit when comparing models.

Interpreting Coefficients and Odds Ratios

Raw coefficients are log-odds, which aren’t intuitive. Exponentiate them to get odds ratios—a much more interpretable metric.

# Get odds ratios
odds_ratios <- exp(coef(model))
print(odds_ratios)

# Get confidence intervals for odds ratios
conf_int <- exp(confint(model))
print(conf_int)

# Combine into a clean table
or_table <- data.frame(
  OddsRatio = odds_ratios,
  CI_Lower = conf_int[, 1],
  CI_Upper = conf_int[, 2]
)
print(round(or_table, 3))

Interpretation example: If SexFemale has an odds ratio of 11.25, it means females had 11.25 times the odds of survival compared to males, holding other variables constant.

Critical interpretation rules:

  • Odds ratio = 1: No effect
  • Odds ratio > 1: Increased odds of the outcome
  • Odds ratio < 1: Decreased odds of the outcome
  • Confidence interval crossing 1: Effect not statistically significant

Don’t confuse odds ratios with probability multipliers. An odds ratio of 2 doesn’t mean “twice as likely”—it means twice the odds. For rare events, these are approximately equal. For common events, they diverge substantially.
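A worked example makes the distinction concrete (the numbers here are illustrative, not taken from the model):

```r
# Start from a baseline probability and apply an odds ratio of 2
p0 <- 0.5
odds0 <- p0 / (1 - p0)            # odds = 1
odds1 <- odds0 * 2                # "twice the odds"
p1 <- odds1 / (1 + odds1)
print(round(p1, 3))               # 0.667 -- not 1.0, i.e. not "twice as likely"

# For a rare event, the odds ratio approximates the risk ratio
p0 <- 0.01
odds1 <- (p0 / (1 - p0)) * 2
p1 <- odds1 / (1 + odds1)
print(round(p1 / p0, 2))          # 1.98 -- close to 2
```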

Model Evaluation and Diagnostics

Accuracy alone is misleading, especially with imbalanced classes. Use multiple metrics.

# Generate predictions (probabilities)
titanic$pred_prob <- predict(model, type = "response")

# Convert to class predictions using 0.5 threshold
titanic$pred_class <- ifelse(titanic$pred_prob > 0.5, "Yes", "No")
titanic$pred_class <- factor(titanic$pred_class, levels = c("No", "Yes"))

# Confusion matrix with caret
conf_matrix <- confusionMatrix(titanic$pred_class, titanic$Survived, 
                                positive = "Yes")
print(conf_matrix)

The confusion matrix output includes accuracy, sensitivity (recall), specificity, and precision. Pay attention to all of them:

# Extract specific metrics
accuracy <- conf_matrix$overall["Accuracy"]
precision <- conf_matrix$byClass["Precision"]
recall <- conf_matrix$byClass["Sensitivity"]
f1_score <- conf_matrix$byClass["F1"]

cat("Accuracy:", round(accuracy, 3), "\n")
cat("Precision:", round(precision, 3), "\n")
cat("Recall:", round(recall, 3), "\n")
cat("F1 Score:", round(f1_score, 3), "\n")

ROC curves visualize the trade-off between true positive rate and false positive rate across all thresholds:

# Create ROC curve
roc_obj <- roc(titanic$Survived, titanic$pred_prob)

# Plot ROC curve
plot(roc_obj, main = "ROC Curve for Titanic Survival Model",
     col = "blue", lwd = 2)

# Add AUC to plot
auc_value <- auc(roc_obj)
legend("bottomright", legend = paste("AUC =", round(auc_value, 3)),
       col = "blue", lwd = 2)

# Print AUC
cat("AUC:", round(auc_value, 3), "\n")

An AUC of 0.5 means the model is no better than random guessing. An AUC of 1.0 means perfect discrimination. In practice, 0.7-0.8 is acceptable, 0.8-0.9 is good, and above 0.9 is excellent.
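A quick sanity check on that 0.5 baseline: scoring labels with pure noise should give an AUC near 0.5 (this assumes the pROC package loaded earlier):

```r
# Random scores carry no information, so AUC should hover around 0.5
set.seed(1)
labels <- rbinom(2000, 1, 0.5)
scores <- runif(2000)
auc_random <- as.numeric(pROC::auc(labels, scores))
cat("AUC for random scores:", round(auc_random, 3), "\n")  # approximately 0.5
```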

Making Predictions on New Data

Applying your model to new observations is straightforward with predict():

# Create new data for prediction
new_passengers <- data.frame(
  Class = factor(c("1st", "3rd", "2nd"), 
                 levels = levels(titanic$Class)),
  Sex = factor(c("Female", "Male", "Female"), 
               levels = levels(titanic$Sex)),
  Age = factor(c("Adult", "Adult", "Child"), 
               levels = levels(titanic$Age))
)

# Predict probabilities
new_passengers$survival_prob <- predict(model, newdata = new_passengers, 
                                         type = "response")
print(new_passengers)

The default 0.5 threshold often isn’t optimal. Tune it based on your use case:

# Find optimal threshold using Youden's J statistic
coords_result <- coords(roc_obj, "best", best.method = "youden")
optimal_threshold <- coords_result$threshold  # coords() returns a data frame in recent pROC
cat("Optimal threshold:", round(optimal_threshold, 3), "\n")

# Or find threshold that maximizes F1 score
thresholds <- seq(0.1, 0.9, by = 0.05)
f1_scores <- sapply(thresholds, function(t) {
  pred <- ifelse(titanic$pred_prob > t, "Yes", "No")
  pred <- factor(pred, levels = c("No", "Yes"))
  cm <- confusionMatrix(pred, titanic$Survived, positive = "Yes")
  cm$byClass["F1"]
})

best_threshold <- thresholds[which.max(f1_scores)]
cat("Best threshold for F1:", best_threshold, "\n")

Consider the costs of errors. In medical diagnosis, false negatives (missing a disease) might be far worse than false positives. Lower your threshold accordingly. In spam detection, false positives (legitimate email marked as spam) might be more costly. Raise your threshold.
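One way to make those costs explicit is to score each candidate threshold by its total misclassification cost. A sketch on simulated data, with illustrative costs (here a false negative is assumed five times worse than a false positive):

```r
# Sketch: pick the threshold that minimizes total misclassification cost
set.seed(42)
prob  <- runif(1000)                 # stand-in predicted probabilities
truth <- rbinom(1000, 1, prob)       # stand-in true labels

cost_fn <- 5                         # assumed cost of a false negative
cost_fp <- 1                         # assumed cost of a false positive
thresholds <- seq(0.05, 0.95, by = 0.05)
costs <- sapply(thresholds, function(t) {
  pred <- as.integer(prob > t)
  fn <- sum(pred == 0 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn * cost_fn + fp * cost_fp
})
best <- thresholds[which.min(costs)]
cat("Cost-minimizing threshold:", best, "\n")  # well below 0.5 when FNs cost more
```

With calibrated probabilities, the cost-minimizing threshold is approximately cost_fp / (cost_fp + cost_fn), which is why expensive false negatives push the threshold down.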

Conclusion and Next Steps

Logistic regression in R follows a clear workflow: prepare your data, fit with glm(), interpret coefficients as odds ratios, evaluate with multiple metrics, and tune your threshold for deployment.

Key takeaways:

  1. Always exponentiate coefficients for interpretable odds ratios
  2. Check confidence intervals—if they cross 1, the effect isn’t significant
  3. Use ROC-AUC for overall model assessment, but choose your threshold based on business requirements
  4. Validate on held-out data before deploying
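The held-out check in takeaway 4 can be as simple as a random split. A minimal sketch for the model above (the data is rebuilt so the snippet stands alone; the 70/30 split and seed are arbitrary choices):

```r
# Held-out validation sketch: fit on a training split, score on the rest
data("Titanic")
df <- as.data.frame(Titanic)
titanic <- df[rep(seq_len(nrow(df)), df$Freq), 1:4]

set.seed(123)
idx   <- sample(seq_len(nrow(titanic)), size = floor(0.7 * nrow(titanic)))
train <- titanic[idx, ]
test  <- titanic[-idx, ]

fit  <- glm(Survived ~ Class + Sex + Age, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, "Yes", "No")
cat("Held-out accuracy:", round(mean(pred == test$Survived), 3), "\n")
```

If the held-out metrics are much worse than the in-sample ones, the model is overfitting; for small datasets, k-fold cross-validation gives a steadier estimate than a single split.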

For next steps, explore multinomial logistic regression with nnet::multinom() for outcomes with more than two categories. When you have many predictors, consider regularized logistic regression with glmnet, which adds L1 or L2 penalties to prevent overfitting and perform automatic feature selection.
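A minimal lasso sketch with glmnet, assuming the package is installed (install.packages("glmnet")); note that glmnet wants a numeric model matrix rather than a formula:

```r
# Sketch: L1-regularized (lasso) logistic regression with glmnet
library(glmnet)

data("Titanic")
df <- as.data.frame(Titanic)
titanic <- df[rep(seq_len(nrow(df)), df$Freq), 1:4]

# Build the predictor matrix (dummy-encoded, intercept column dropped)
x <- model.matrix(Survived ~ Class + Sex + Age, data = titanic)[, -1]
y <- titanic$Survived

# Cross-validated lasso: alpha = 1 is L1, alpha = 0 would be ridge
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.min")  # coefficients at the best cross-validated lambda
```

Coefficients shrunk exactly to zero have been dropped by the penalty, which is the "automatic feature selection" mentioned above.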

Logistic regression isn’t glamorous, but it’s reliable, interpretable, and often good enough. Start here, and only reach for more complex models when you have evidence they’ll perform meaningfully better.
