How to Implement Logistic Regression in R
Key Insights
- Logistic regression uses the sigmoid function to transform linear combinations of features into probabilities between 0 and 1, making it ideal for binary classification problems where you need interpretable results
- The glm() function with family = "binomial" is R's standard implementation, producing coefficients on the log-odds scale; exponentiate them to get odds ratios that quantify each feature's impact
- Model evaluation requires more than accuracy: use confusion matrices, ROC curves, and AUC to assess performance, especially with imbalanced datasets where accuracy can be misleading
Introduction to Logistic Regression
Logistic regression is a statistical method for binary classification that predicts the probability of an outcome belonging to one of two classes. Despite its name, it’s a classification algorithm, not a regression technique. Unlike linear regression which predicts continuous values, logistic regression outputs probabilities constrained between 0 and 1.
The core mechanism is the sigmoid (logistic) function that transforms any real-valued number into this probability range:
P(Y=1) = 1 / (1 + e^(-z))
where z is a linear combination of your features (β₀ + β₁X₁ + β₂X₂ + …).
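The transformation is easy to verify directly in R; the helper below is a one-line sketch for illustration, not part of any package:

```r
# Sigmoid (logistic) function: maps any real-valued z into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

sigmoid(0)   # 0.5, the decision boundary
sigmoid(4)   # ~0.982, strongly positive z -> probability near 1
sigmoid(-4)  # ~0.018, strongly negative z -> probability near 0
```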
Use logistic regression when you need interpretable results for binary outcomes: customer churn (yes/no), email spam detection (spam/not spam), disease diagnosis (positive/negative), or loan default prediction (default/no default). The coefficients tell you how each feature affects the log-odds of the outcome, making it valuable when stakeholders need to understand why predictions are made.
Dataset Preparation
Let’s work with a practical example using a diabetes dataset. First, install and load necessary packages:
# Install required packages
install.packages(c("caret", "pROC", "MASS"))
# Load libraries
library(caret)
library(pROC)
library(MASS)
# Load the Pima Indians Diabetes dataset
data(Pima.tr)
data(Pima.te)
# Combine for demonstration
diabetes <- rbind(Pima.tr, Pima.te)
Always explore your data before modeling:
# Check structure
str(diabetes)
# Statistical summary
summary(diabetes)
# Check for missing values
sum(is.na(diabetes))
# View class distribution
table(diabetes$type)
prop.table(table(diabetes$type))
The dataset contains 532 observations with 7 predictor variables (glucose level, BMI, age, etc.) and a binary outcome (diabetes type: Yes/No).
Split your data into training and test sets. Use stratified sampling to maintain class proportions:
# Set seed for reproducibility
set.seed(123)
# Create 80-20 split with stratification
trainIndex <- createDataPartition(diabetes$type, p = 0.8,
                                  list = FALSE, times = 1)
train_data <- diabetes[trainIndex, ]
test_data <- diabetes[-trainIndex, ]
# Verify split maintained class balance
prop.table(table(train_data$type))
prop.table(table(test_data$type))
Building the Logistic Regression Model
The glm() function (Generalized Linear Model) fits logistic regression when you specify family = binomial:
# Fit logistic regression model
model <- glm(type ~ ., data = train_data,
             family = binomial(link = "logit"))
# Display model summary
summary(model)
The output shows coefficients, standard errors, z-values, and p-values for each predictor. Coefficients represent log-odds ratios. A positive coefficient means higher values of that predictor increase the probability of the positive class (diabetes = Yes).
Interpret coefficients by exponentiating them:
# Get odds ratios
exp(coef(model))
# Get confidence intervals for odds ratios
exp(confint(model))
For example, if the glucose coefficient is 0.038, the odds ratio is exp(0.038) = 1.039. This means each one-unit increase in glucose increases the odds of diabetes by 3.9%, holding other variables constant.
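The arithmetic is worth seeing once explicitly; the coefficient value below is the illustrative 0.038 from the example above, not a fitted result:

```r
# Illustrative glucose coefficient on the log-odds scale
beta_glucose <- 0.038

# A one-unit increase in glucose multiplies the odds by this factor
exp(beta_glucose)       # ~1.039, i.e. odds increase by about 3.9%

# Effects compound multiplicatively: a 10-unit increase
exp(10 * beta_glucose)  # ~1.46, roughly a 46% increase in the odds
```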
Predictors with p < 0.05 show statistically significant associations with the outcome. In this dataset, glucose, BMI, and the diabetes pedigree function typically show strong significance.
Model Evaluation and Diagnostics
Accuracy alone is insufficient for evaluating classification models. Generate comprehensive metrics:
# Make predictions on test set (probabilities)
test_probs <- predict(model, newdata = test_data, type = "response")
# Convert probabilities to class predictions (threshold = 0.5)
test_pred <- ifelse(test_probs > 0.5, "Yes", "No")
test_pred <- factor(test_pred, levels = c("No", "Yes"))
# Create confusion matrix
conf_matrix <- confusionMatrix(test_pred, test_data$type, positive = "Yes")
print(conf_matrix)
The confusion matrix shows:
- True Positives (TP): Correctly predicted diabetes cases
- True Negatives (TN): Correctly predicted non-diabetes cases
- False Positives (FP): Type I errors (predicted diabetes incorrectly)
- False Negatives (FN): Type II errors (missed diabetes cases)
Key metrics:
- Accuracy: (TP + TN) / Total
- Sensitivity (Recall): TP / (TP + FN) - ability to find positive cases
- Specificity: TN / (TN + FP) - ability to find negative cases
- Precision: TP / (TP + FP) - accuracy of positive predictions
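These formulas are simple to compute by hand; the counts below are hypothetical, chosen only to show the arithmetic:

```r
# Hypothetical confusion-matrix counts
TP <- 30; TN <- 60; FP <- 10; FN <- 6

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # recall: share of actual positives found
specificity <- TN / (TN + FP)   # share of actual negatives found
precision   <- TP / (TP + FP)   # share of positive calls that are right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
# accuracy 0.849, sensitivity 0.833, specificity 0.857, precision 0.750
```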
Plot the ROC curve and calculate AUC (Area Under Curve):
# Generate ROC curve
roc_obj <- roc(test_data$type, test_probs)
# Plot ROC curve
plot(roc_obj, main = "ROC Curve for Diabetes Prediction",
     col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red")
# Calculate AUC
auc_value <- auc(roc_obj)
print(paste("AUC:", round(auc_value, 3)))
# Add AUC to plot
legend("bottomright", legend = paste("AUC =", round(auc_value, 3)),
       col = "blue", lwd = 2)
AUC ranges from 0.5 (random guessing) to 1.0 (perfect classification). Values above 0.7 indicate acceptable performance, above 0.8 is good, and above 0.9 is excellent.
Making Predictions on New Data
Once validated, use your model for predictions on new observations:
# Create a sample new patient
new_patient <- data.frame(
  npreg = 2,
  glu = 120,
  bp = 70,
  skin = 30,
  bmi = 28.5,
  ped = 0.45,
  age = 35
)
# Predict probability
pred_prob <- predict(model, newdata = new_patient, type = "response")
print(paste("Probability of diabetes:", round(pred_prob, 3)))
# Classify based on threshold
pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")
print(paste("Prediction:", pred_class))
You can adjust the classification threshold based on your use case. For medical diagnoses where false negatives are costly, use a lower threshold (e.g., 0.3) to increase sensitivity:
# Lower threshold for higher sensitivity
threshold <- 0.3
test_pred_sensitive <- ifelse(test_probs > threshold, "Yes", "No")
test_pred_sensitive <- factor(test_pred_sensitive, levels = c("No", "Yes"))
confusionMatrix(test_pred_sensitive, test_data$type, positive = "Yes")
Model Improvement Techniques
Improve model performance through feature selection. Stepwise selection searches for a subset of predictors that minimizes AIC, dropping variables that add little explanatory value:
# Stepwise selection (both directions)
step_model <- step(model, direction = "both", trace = 0)
summary(step_model)
# Compare AIC values (lower is better)
print(paste("Original AIC:", AIC(model)))
print(paste("Stepwise AIC:", AIC(step_model)))
For imbalanced datasets where one class dominates, adjust class weights or use sampling techniques:
# Check class imbalance
table(train_data$type)
# Downsample majority class
train_balanced <- downSample(x = train_data[, -8],
                             y = train_data$type,
                             yname = "type")
# Fit model on balanced data
model_balanced <- glm(type ~ ., data = train_balanced,
                      family = binomial)
Cross-validation provides robust performance estimates:
library(boot)
# Define cost function (misclassification error)
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
# Perform 10-fold cross-validation
cv_error <- cv.glm(train_data, model, cost, K = 10)
print(paste("CV Error:", round(cv_error$delta[1], 3)))
Logistic regression in R is straightforward yet powerful. The glm() function provides everything needed for binary classification, from model fitting to coefficient interpretation. Focus on proper evaluation metrics beyond accuracy, especially when working with imbalanced data or high-stakes predictions. The interpretability of logistic regression makes it invaluable when you need to explain model decisions to non-technical stakeholders—a critical requirement in fields like healthcare, finance, and marketing.