How to Perform Logistic Regression in Python with statsmodels
Key Insights
- statsmodels provides detailed statistical summaries including p-values, confidence intervals, and pseudo R-squared that scikit-learn doesn’t offer, making it ideal for inference-focused logistic regression
- The formula API (logit("y ~ x1 + x2", data)) handles categorical encoding automatically, while the array API (Logit(y, X)) requires manual preprocessing but offers more control
- Always interpret coefficients as log-odds; exponentiate them with np.exp() to get odds ratios, which are far more intuitive for communicating results to stakeholders
Introduction to Logistic Regression
Logistic regression is the workhorse of binary classification. When your target variable has two outcomes—customer churns or stays, email is spam or not, patient has disease or doesn’t—logistic regression should be your first stop.
Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities bounded between 0 and 1. It does this by applying the logistic (sigmoid) function to a linear combination of features, ensuring outputs make sense as probabilities.
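To see why the outputs always behave as probabilities, here is a minimal standalone sketch of the logistic (sigmoid) function; it is an illustration, not part of the churn example:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Even extreme linear predictors map to valid probabilities
print(sigmoid(-10))  # very close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # very close to 1
```

However large the linear combination of features gets, the output never escapes (0, 1).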
Why statsmodels over scikit-learn? If your goal is prediction at scale, use scikit-learn. But if you need to understand why your model makes predictions—which features matter, how confident you are in those estimates, whether relationships are statistically significant—statsmodels is the right tool. It gives you the full statistical inference toolkit: p-values, confidence intervals, likelihood ratio tests, and diagnostic measures that scikit-learn simply doesn’t provide.
Setting Up Your Environment
First, install the necessary packages:
```shell
pip install statsmodels pandas numpy scikit-learn
```
Now let’s set up our imports and load a dataset. We’ll use a synthetic dataset that mimics a common business scenario: predicting customer churn.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

# Set random seed for reproducibility
np.random.seed(42)

# Create synthetic customer churn dataset
n_samples = 1000
data = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n_samples),
    'monthly_charges': np.random.uniform(20, 100, n_samples),
    'total_charges': np.random.uniform(100, 5000, n_samples),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples),
    'internet_service': np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples),
})

# Create target with realistic relationships
churn_prob = 1 / (1 + np.exp(-(
    -2
    - 0.05 * data['tenure_months']
    + 0.02 * data['monthly_charges']
    + 1.5 * (data['contract_type'] == 'Month-to-month')
)))
data['churn'] = (np.random.random(n_samples) < churn_prob).astype(int)

print(data.head())
print(f"\nChurn rate: {data['churn'].mean():.2%}")
```
Preparing Data for Logistic Regression
Before fitting a model, you need clean, properly encoded data. Logistic regression requires numeric inputs, so categorical variables need encoding.
```python
# Basic exploration
print(data.describe())
print(f"\nMissing values:\n{data.isnull().sum()}")

# Check class balance
print(f"\nTarget distribution:\n{data['churn'].value_counts(normalize=True)}")

# Encode categorical variables for the array-based API.
# dtype=int ensures numeric dummies; newer pandas versions return bool
# columns, which sm.Logit rejects as object-dtype data.
data_encoded = pd.get_dummies(data, columns=['contract_type', 'internet_service'],
                              drop_first=True,  # drop_first avoids multicollinearity
                              dtype=int)
print(f"\nEncoded columns: {list(data_encoded.columns)}")

# Split data
X = data_encoded.drop('churn', axis=1)
y = data_encoded['churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"\nTraining set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
```
A quick note on multicollinearity: logistic regression assumes features aren’t perfectly correlated. Check this with a correlation matrix:
```python
# Check for multicollinearity
correlation_matrix = X_train.corr()
high_corr = np.where(np.abs(correlation_matrix) > 0.8)
# x < y keeps each pair once and skips the diagonal
high_corr_pairs = [(correlation_matrix.index[x], correlation_matrix.columns[y],
                    correlation_matrix.iloc[x, y])
                   for x, y in zip(*high_corr) if x < y]

if high_corr_pairs:
    print("High correlation pairs:")
    for pair in high_corr_pairs:
        print(f"  {pair[0]} - {pair[1]}: {pair[2]:.3f}")
else:
    print("No highly correlated feature pairs found.")
```
Building the Model with statsmodels
statsmodels offers two APIs for logistic regression. Choose based on your workflow preferences.
Array-Based API (statsmodels.api)
This approach requires you to manually prepare your feature matrix and add a constant term for the intercept.
```python
# Add constant (intercept) - statsmodels doesn't add this automatically
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)

# Fit the model
model_array = sm.Logit(y_train, X_train_const)
result_array = model_array.fit()
print(result_array.summary())
```
Formula-Based API (statsmodels.formula.api)
The formula API is more intuitive if you’re coming from R. It automatically handles categorical encoding and adds the intercept.
```python
# Prepare training data for formula API (needs original categorical columns).
# Use .loc, not .iloc: X_train.index holds index labels, not positions.
train_data = data.loc[X_train.index].copy()

# Fit using formula notation
model_formula = smf.logit(
    'churn ~ tenure_months + monthly_charges + total_charges + C(contract_type) + C(internet_service)',
    data=train_data
)
result_formula = model_formula.fit()
print(result_formula.summary())
```
The C() wrapper explicitly tells statsmodels to treat variables as categorical. The formula API creates dummy variables automatically, using the first category as the reference level.
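If you want a different reference level than the default first category, the formula syntax accepts Treatment coding with an explicit reference. A minimal sketch on a toy frame (the data here is illustrative, not the article's churn dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(42)
df = pd.DataFrame({
    'churn': np.random.randint(0, 2, 200),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], 200),
})

# Treatment(reference=...) picks which category the dummies are compared against
m = smf.logit("churn ~ C(contract_type, Treatment(reference='Two year'))",
              data=df).fit(disp=0)
print(m.params.index.tolist())  # 'Two year' is now the omitted baseline
```

Choosing the most common or most meaningful category as the baseline often makes the resulting odds ratios easier to explain.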
Interpreting Model Output
The summary output is dense. Here’s what matters:
```python
# Let's work with the array-based result
print("=" * 60)
print("MODEL INTERPRETATION")
print("=" * 60)

# Key model statistics
print(f"\nPseudo R-squared: {result_array.prsquared:.4f}")
print(f"Log-Likelihood: {result_array.llf:.2f}")
print(f"AIC: {result_array.aic:.2f}")
print(f"BIC: {result_array.bic:.2f}")

# Coefficients and significance
print("\n--- Coefficients and P-values ---")
coef_table = pd.DataFrame({
    'Coefficient': result_array.params,
    'Std Error': result_array.bse,
    'z-value': result_array.tvalues,
    'P-value': result_array.pvalues,
    'Significant': result_array.pvalues < 0.05
})
print(coef_table.round(4))

# Confidence intervals
print("\n--- 95% Confidence Intervals ---")
print(result_array.conf_int().round(4))
```
Understanding the metrics:
- Pseudo R-squared: McFadden's pseudo R², not directly comparable to linear regression R². Values from 0.2 to 0.4 often indicate good fit.
- Log-Likelihood: Higher (less negative) is better. Useful for comparing nested models.
- AIC/BIC: Lower is better. Use these to compare models with different features.
- P-values: Features with p < 0.05 are statistically significant at the 95% confidence level.
Calculating Odds Ratios
Raw coefficients are log-odds, which are hard to interpret. Odds ratios are much more intuitive:
```python
# Calculate odds ratios
odds_ratios = np.exp(result_array.params)
odds_ratio_ci = np.exp(result_array.conf_int())

odds_table = pd.DataFrame({
    'Odds Ratio': odds_ratios,
    'CI Lower': odds_ratio_ci[0],
    'CI Upper': odds_ratio_ci[1]
})
print("\n--- Odds Ratios ---")
print(odds_table.round(4))

# Interpretation example
tenure_or = odds_ratios['tenure_months']
print(f"\nInterpretation: Each additional month of tenure multiplies the odds of churn by {tenure_or:.4f}")
print(f"In other words, each month reduces churn odds by {(1 - tenure_or) * 100:.1f}%")
```
An odds ratio of 0.95 for tenure means each additional month reduces the odds of churn by 5%. An odds ratio of 2.0 for a categorical variable means that category has twice the odds of the outcome compared to the reference category.
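Because odds ratios act multiplicatively, per-unit effects compound over multiple units. A quick arithmetic check using the hypothetical 0.95 odds ratio from the sentence above:

```python
or_per_month = 0.95               # hypothetical odds ratio for one month of tenure
or_per_year = or_per_month ** 12  # effect of 12 additional months compounds multiplicatively
print(f"12-month odds multiplier: {or_per_year:.3f}")
```

So a seemingly small 5% reduction per month cuts the odds of churn roughly in half over a year.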
Model Evaluation and Diagnostics
Now let’s see how well the model actually performs.
```python
# Generate predictions
y_pred_prob = result_array.predict(X_test_const)
y_pred = (y_pred_prob >= 0.5).astype(int)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_prob)
print(f"\nAccuracy: {accuracy:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")

# Threshold tuning - 0.5 isn't always optimal
print("\n--- Threshold Analysis ---")
for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    y_pred_thresh = (y_pred_prob >= threshold).astype(int)
    acc = accuracy_score(y_test, y_pred_thresh)
    cm_thresh = confusion_matrix(y_test, y_pred_thresh)
    sensitivity = cm_thresh[1, 1] / (cm_thresh[1, 0] + cm_thresh[1, 1])
    specificity = cm_thresh[0, 0] / (cm_thresh[0, 0] + cm_thresh[0, 1])
    print(f"Threshold {threshold}: Accuracy={acc:.3f}, Sensitivity={sensitivity:.3f}, Specificity={specificity:.3f}")
```
For model diagnostics, check the Pearson residuals and look for influential observations:
```python
# Residual analysis: Pearson residuals are available directly on the results object
standardized_residuals = result_array.resid_pearson

# Check for outliers (residuals > 3 standard deviations)
outliers = np.abs(standardized_residuals) > 3
print(f"\nNumber of potential outliers: {outliers.sum()}")

# Leverage (influence) check via the influence object
influence = result_array.get_influence()
leverage = influence.hat_matrix_diag
high_leverage = leverage > (2 * len(result_array.params) / len(y_train))
print(f"High leverage points: {high_leverage.sum()}")
```
Conclusion and Next Steps
You now have a complete workflow for logistic regression with statsmodels:
- Prepare data with proper encoding and train/test splits
- Choose between formula API (convenient) or array API (flexible)
- Interpret coefficients, p-values, and odds ratios
- Evaluate with appropriate metrics and tune thresholds
When to use alternatives:
- scikit-learn: When you need pipelines, cross-validation, or don’t care about statistical inference
- Regularized regression: When you have many features or multicollinearity issues (use sklearn.linear_model.LogisticRegression with penalty='l1' or 'l2')
- Other models: When relationships are non-linear or you need higher predictive accuracy (gradient boosting, neural networks)
statsmodels excels when you need to explain your model to stakeholders, publish results in academic contexts, or make decisions based on statistical significance. For production ML systems focused purely on prediction, scikit-learn’s API is more practical. Often, the best approach is using statsmodels for exploration and inference, then switching to scikit-learn for deployment.