How to Implement Logistic Regression in Python
Key Insights
- Logistic regression is a linear model for binary classification that uses the sigmoid function to map predictions to probabilities between 0 and 1, making it interpretable and efficient for many real-world problems.
- Building logistic regression from scratch with NumPy reveals the underlying gradient descent optimization and helps you understand how regularization prevents overfitting better than just using sklearn as a black box.
- Model evaluation goes beyond accuracy—use precision, recall, F1-score, and ROC-AUC to properly assess performance, especially with imbalanced datasets where accuracy can be misleading.
Introduction to Logistic Regression
Logistic regression is fundamentally different from linear regression despite the similar name. While linear regression predicts continuous values, logistic regression is designed for binary classification—predicting whether an email is spam, if a tumor is malignant, or whether a customer will churn.
The key innovation is the sigmoid function, which transforms any real-valued number into a probability between 0 and 1:
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Visualize the sigmoid function
z = np.linspace(-10, 10, 100)
plt.figure(figsize=(10, 6))
plt.plot(z, sigmoid(z), linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', label='Decision boundary')
plt.axvline(x=0, color='r', linestyle='--')
plt.xlabel('z (linear combination of features)')
plt.ylabel('σ(z) - Probability')
plt.title('Sigmoid Function')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
The sigmoid function has elegant mathematical properties: it’s differentiable everywhere, asymptotically approaches 0 and 1, and its derivative is simple to compute—crucial for gradient descent optimization.
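The derivative property is worth seeing concretely: σ'(z) = σ(z)(1 − σ(z)). A quick numerical check with a central difference confirms the identity (a self-contained sketch, redefining `sigmoid` locally):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Numerically verify sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
z = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.allclose(numeric, analytic, atol=1e-6))  # True
```

Because the derivative is expressed in terms of the sigmoid's own output, the gradient computations later in this article never need to recompute the exponential.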
Understanding the Mathematics
Logistic regression works by computing a weighted sum of input features (just like linear regression), then passing this through the sigmoid function to get a probability. For a binary classification with threshold 0.5, we predict class 1 if the probability exceeds 0.5, otherwise class 0.
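Since σ(0) = 0.5, thresholding the probability at 0.5 is equivalent to thresholding the linear score z at 0. A small sketch with hypothetical weights shows the two rules agree:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights and bias for a 2-feature model (illustration only)
w = np.array([0.8, -0.5])
b = 0.1
X = np.array([[1.0, 0.2], [0.1, 2.0]])

z = X @ w + b                       # linear combination of features
proba = sigmoid(z)                  # mapped to probabilities
pred_from_proba = (proba >= 0.5).astype(int)
pred_from_z = (z >= 0).astype(int)  # equivalent rule, no sigmoid needed
print(pred_from_proba, pred_from_z)  # identical predictions
```

This is why the decision boundary of logistic regression is linear: it is exactly the hyperplane z = 0.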
The cost function for logistic regression is binary cross-entropy (log loss):
def binary_cross_entropy(y_true, y_pred):
    """
    Calculate binary cross-entropy loss
    y_true: actual labels (0 or 1)
    y_pred: predicted probabilities
    """
    epsilon = 1e-15  # Prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# Example
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
loss = binary_cross_entropy(y_true, y_pred)
print(f"Loss: {loss:.4f}") # Lower is better
This loss function penalizes confident wrong predictions heavily. If the true label is 1 but you predict 0.01, the loss explodes. This encourages the model to be both accurate and appropriately confident.
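You can see the explosion directly by evaluating the loss term for a single positive example at increasingly wrong predicted probabilities:

```python
import numpy as np

# How log loss punishes confident mistakes: the true label here is 1,
# so the per-example loss reduces to -log(p)
for p in [0.9, 0.5, 0.1, 0.01, 0.001]:
    loss = -np.log(p)
    print(f"predicted {p:>6}: loss = {loss:.2f}")
```

Halving the predicted probability only adds a constant to the loss, but pushing it toward zero sends the loss toward infinity, so the optimizer strongly avoids confident errors.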
Implementation from Scratch
Building logistic regression from scratch solidifies your understanding. Here’s a complete implementation using only NumPy:
import numpy as np
class LogisticRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.losses = []

    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # Clip to prevent overflow

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = self._sigmoid(linear_pred)
            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (predictions - y))
            db = (1 / n_samples) * np.sum(predictions - y)
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
            # Track loss
            if i % 100 == 0:
                loss = -np.mean(y * np.log(predictions + 1e-15) +
                                (1 - y) * np.log(1 - predictions + 1e-15))
                self.losses.append(loss)
        return self

    def predict_proba(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        return self._sigmoid(linear_pred)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
# Test the implementation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = np.mean(predictions == y_test)
print(f"Custom Implementation Accuracy: {accuracy:.4f}")
This implementation demonstrates the core algorithm: iteratively adjusting weights based on the gradient of the loss function. The learning rate controls step size, and more iterations generally improve convergence (up to a point).
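The effect of the learning rate can be isolated with a stripped-down version of the same gradient descent loop on hypothetical toy data (weights only, no bias, for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def final_loss(X, y, lr, n_iter):
    # Minimal gradient descent on log loss; a sketch to compare learning rates
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    p = np.clip(sigmoid(X @ w), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy 1-D data: class 0 on the left, class 1 on the right
X = np.array([[-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5]])
y = np.array([0, 0, 0, 1, 1, 1])

for lr in (0.01, 0.1, 1.0):
    print(f"lr={lr}: loss after 200 iterations = {final_loss(X, y, lr, 200):.4f}")
```

On this well-behaved problem, larger step sizes reach a lower loss in the same number of iterations; with a learning rate that is too large for a given dataset, the updates can instead overshoot and diverge.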
Using scikit-learn’s LogisticRegression
For production code, use scikit-learn. It’s optimized, handles edge cases, and includes regularization:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)
# Make predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"Accuracy: {model.score(X_test_scaled, y_test):.4f}")
Feature scaling is critical for logistic regression. Without it, features with larger magnitudes dominate the optimization, leading to slower convergence and potentially worse performance.
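Under the hood, StandardScaler z-scores each feature: subtract the per-feature mean and divide by the per-feature standard deviation, both computed on the training set only. A NumPy sketch of the same transform, on synthetic data with deliberately mismatched scales:

```python
import numpy as np

# Two features on very different scales (e.g., ~100 +/- 20 vs ~0.5 +/- 0.1)
rng = np.random.default_rng(42)
X_train = rng.normal(loc=[100.0, 0.5], scale=[20.0, 0.1], size=(200, 2))
X_test = rng.normal(loc=[100.0, 0.5], scale=[20.0, 0.1], size=(50, 2))

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma   # reuse TRAIN statistics: no leakage

print(X_train_scaled.mean(axis=0).round(6))  # ~[0, 0]
print(X_train_scaled.std(axis=0).round(6))   # ~[1, 1]
```

Note that the test set is transformed with the training statistics, which is exactly what the `fit_transform`/`transform` split above enforces.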
Evaluating Model Performance
Accuracy alone is insufficient, especially with imbalanced datasets. Use multiple metrics:
from sklearn.metrics import (
    confusion_matrix, classification_report,
    roc_curve, auc
)
import matplotlib.pyplot as plt
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Detailed metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred,
                            target_names=['Malignant', 'Benign']))
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Feature importance
feature_importance = np.abs(model.coef_[0])
top_features = np.argsort(feature_importance)[-10:]
plt.figure(figsize=(10, 6))
plt.barh(range(len(top_features)), feature_importance[top_features])
plt.yticks(range(len(top_features)), data.feature_names[top_features])
plt.xlabel('Absolute Coefficient Value')
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.show()
The ROC curve shows the trade-off between true positive rate and false positive rate across different thresholds. AUC (Area Under Curve) summarizes this in a single number—0.5 is random guessing, 1.0 is perfect.
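AUC also has a useful ranking interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. That can be computed directly by counting pairs, shown here on hypothetical scores:

```python
import numpy as np

# Hypothetical labels and model scores (illustration only)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]
# Fraction of positive/negative pairs where the positive is ranked higher
# (ties count as half a pair)
pairs = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
auc_value = pairs / (len(pos) * len(neg))
print(f"AUC = {auc_value:.3f}")  # 0.875
```

This pairwise count agrees with the area under the ROC curve, which is why AUC is insensitive to the choice of a single classification threshold.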
Regularization and Hyperparameter Tuning
Regularization prevents overfitting by penalizing large coefficients. Scikit-learn’s C parameter is the inverse of regularization strength:
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # Required for L1
}
# Grid search with cross-validation
grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
# Evaluate on test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_score:.4f}")
L1 regularization (Lasso) can drive coefficients to exactly zero, performing feature selection. L2 regularization (Ridge) shrinks coefficients but rarely eliminates them entirely. For most problems, L2 works well, but try both.
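The sparsity difference is easy to observe on synthetic data where most features are irrelevant by construction. A sketch comparing the number of exactly-zero coefficients under each penalty (the parameter values here are illustrative, not tuned):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 20 features, only 3 of which carry signal
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=42)
X = StandardScaler().fit_transform(X)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

print("L1 zero coefficients:", int(np.sum(l1.coef_[0] == 0)))
print("L2 zero coefficients:", int(np.sum(l2.coef_[0] == 0)))
```

With a strong enough penalty, the L1 model discards most of the uninformative features outright, while the L2 model merely shrinks them.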
Conclusion and Best Practices
Logistic regression remains relevant despite newer algorithms because it’s fast, interpretable, and works well when the relationship between features and log-odds is approximately linear. Use it as your baseline before trying complex models.
Key best practices:
Always scale your features. Use StandardScaler or MinMaxScaler. Unscaled features cause convergence issues and make coefficient interpretation difficult.
Handle imbalanced data properly. Use class_weight='balanced' in scikit-learn, or oversample/undersample your training data. Don’t rely solely on accuracy.
Start simple, then add complexity. Begin with no regularization, then add L2, then try L1 if you suspect many irrelevant features.
Validate with cross-validation. A single train-test split can be misleading. Use k-fold cross-validation for robust performance estimates.
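The idea behind k-fold cross-validation is simple to sketch by hand: partition the sample indices into k folds, and let each fold serve as the test set exactly once (this is essentially what scikit-learn's KFold does, here without shuffling):

```python
import numpy as np

def kfold_indices(n, k=5):
    # Split indices 0..n-1 into k folds; yield (train, test) index arrays
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, k=5))
print(len(splits))        # 5 train/test splits
print(splits[0][1])       # first test fold
```

Averaging a metric over the k test folds gives a far more stable estimate than any single train-test split, at the cost of fitting the model k times.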
For next steps, explore polynomial features to capture non-linear relationships, study multinomial logistic regression for multi-class problems, and compare against tree-based models like Random Forests to understand when linear models fall short. The interpretability of logistic regression makes it invaluable for domains where you need to explain predictions to stakeholders.