Logistic Regression: Complete Guide with Examples
Key Insights
- Logistic regression transforms linear combinations of features into probabilities using the sigmoid function, making it ideal for classification despite its misleading name
- The algorithm optimizes log loss rather than mean squared error because we’re predicting probabilities, not continuous values—MSE would create non-convex optimization landscapes
- Feature scaling is critical for logistic regression since the algorithm uses gradient descent, and unscaled features can dominate the learning process and slow convergence
Introduction to Logistic Regression
Despite its name, logistic regression is a classification algorithm, not a regression technique. It predicts the probability that an instance belongs to a particular class, making it one of the most widely used algorithms for binary classification problems like spam detection, disease diagnosis, or customer churn prediction.
The key difference from linear regression is the output. Linear regression predicts continuous values directly, while logistic regression squashes predictions into a probability range between 0 and 1 using the sigmoid function. This makes it suitable for answering yes/no questions: Will this customer churn? Is this email spam? Does this patient have the disease?
Use logistic regression when you need interpretable results with probabilistic outputs. It works well with linearly separable data and provides clear insights into feature importance through coefficient weights. However, if your classes have complex, non-linear boundaries, consider tree-based methods or neural networks instead.
Mathematical Foundation
The magic of logistic regression lies in the sigmoid (logistic) function, which transforms any real-valued number into a probability:
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Visualize the sigmoid function
z = np.linspace(-10, 10, 100)
plt.figure(figsize=(10, 6))
plt.plot(z, sigmoid(z), linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', label='Decision boundary')
plt.axvline(x=0, color='r', linestyle='--')
plt.xlabel('z (linear combination of features)', fontsize=12)
plt.ylabel('σ(z) - Probability', fontsize=12)
plt.title('Sigmoid Function', fontsize=14)
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
```
The sigmoid function takes a linear combination of features (z = w₀ + w₁x₁ + w₂x₂ + …) and outputs a probability. When z = 0, the probability is exactly 0.5, creating our decision boundary. Positive z values push probabilities toward 1, while negative values push toward 0.
We optimize using log loss (binary cross-entropy) rather than mean squared error. MSE creates a non-convex cost surface with multiple local minima when combined with the sigmoid function, making optimization difficult. Log loss produces a convex surface that gradient descent can reliably minimize:
Cost = -1/m * Σ[y*log(ŷ) + (1-y)*log(1-ŷ)]
This cost function heavily penalizes confident wrong predictions, which is exactly what we want for classification.
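To see that asymmetric penalty in action, here's a quick sketch (illustrative, not from the text above) that evaluates the log loss of a single prediction at several confidence levels; the loss grows without bound as the model becomes confidently wrong:

```python
import numpy as np

def log_loss_single(y_true, y_prob):
    """Binary cross-entropy for a single prediction."""
    y_prob = np.clip(y_prob, 1e-15, 1 - 1e-15)  # avoid log(0)
    return -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# True label is 1; watch the loss explode as the predicted
# probability moves toward 0 (confidently wrong)
for p in [0.9, 0.6, 0.5, 0.1, 0.01]:
    print(f"p = {p:.2f}  loss = {log_loss_single(1, p):.3f}")
```

At p = 0.5 the loss is ln 2 ≈ 0.693, and it climbs steeply as the prediction approaches the wrong extreme.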
Building Logistic Regression from Scratch
Understanding the internals helps you troubleshoot issues and appreciate what libraries do under the hood. Here’s a complete implementation:
```python
class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # Clip to prevent overflow

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Gradient descent
        for i in range(self.iterations):
            # Forward pass
            linear_pred = np.dot(X, self.weights) + self.bias
            predictions = self.sigmoid(linear_pred)

            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (predictions - y))
            db = (1 / n_samples) * np.sum(predictions - y)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

            # Optional: print cost every 100 iterations
            if i % 100 == 0:
                # Clip probabilities so log(0) can't produce -inf
                p = np.clip(predictions, 1e-15, 1 - 1e-15)
                cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
                print(f"Iteration {i}, Cost: {cost:.4f}")

    def predict(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        y_pred = self.sigmoid(linear_pred)
        return (y_pred >= 0.5).astype(int)

    def predict_proba(self, X):
        linear_pred = np.dot(X, self.weights) + self.bias
        return self.sigmoid(linear_pred)

# Generate synthetic dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train model
model = LogisticRegressionScratch(learning_rate=0.1, iterations=1000)
model.fit(X_train, y_train)

# Visualize decision boundary
def plot_decision_boundary(model, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='black')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Decision Boundary')
    plt.show()

plot_decision_boundary(model, X_test, y_test)
```
Using Scikit-learn for Logistic Regression
For production use, sklearn provides a robust, optimized implementation with many solver options:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model with L2 regularization
model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)

print(f"Training accuracy: {model.score(X_train_scaled, y_train):.3f}")
print(f"Test accuracy: {model.score(X_test_scaled, y_test):.3f}")

# Hyperparameter tuning
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                           cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
```
The C parameter is the inverse of regularization strength: smaller values mean stronger regularization, which helps prevent overfitting. The solver parameter selects the optimization algorithm; the grid above fixes it to liblinear because that solver supports both l1 and l2 penalties, whereas the default lbfgs handles only l2. liblinear works well for small datasets, while lbfgs scales better to larger ones.
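A quick sketch can make the effect of C concrete: fitting the same model with different C values and comparing the average coefficient magnitude shows how a stronger L2 penalty (smaller C) shrinks the weights. The exact numbers depend on the dataset, but the ordering should hold:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Smaller C => stronger L2 penalty => smaller coefficient magnitudes
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_scaled, y)
    print(f"C = {C:>6}: mean |coefficient| = {np.abs(clf.coef_).mean():.3f}")
```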
Model Evaluation and Interpretation
Accuracy alone is misleading for classification, especially with imbalanced classes. Use a comprehensive evaluation strategy:
```python
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc

# Get predictions
y_pred = grid_search.predict(X_test_scaled)
y_pred_proba = grid_search.predict_proba(X_test_scaled)[:, 1]

# Classification report
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Feature importance, ranked by absolute coefficient size
feature_importance = pd.DataFrame({
    'feature': data.feature_names,
    'coefficient': grid_search.best_estimator_.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

plt.figure(figsize=(10, 8))
plt.barh(feature_importance['feature'][:10], feature_importance['coefficient'][:10])
plt.xlabel('Coefficient Value')
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.show()
```
Positive coefficients increase the probability of the positive class, while negative coefficients decrease it. The magnitude indicates strength of influence.
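Because the features were standardized, each coefficient can also be read as an odds ratio: exponentiating it gives the multiplicative change in the odds of the positive class per one-standard-deviation increase in that feature. A minimal sketch of that conversion (fitting a fresh model on the same dataset for self-containment):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
clf = LogisticRegression(max_iter=1000).fit(X_scaled, data.target)

# exp(coefficient) = odds ratio: how the odds of the positive class
# multiply per one-standard-deviation increase in the feature
odds_ratios = np.exp(clf.coef_[0])
ranked = sorted(zip(data.feature_names, odds_ratios),
                key=lambda t: abs(np.log(t[1])), reverse=True)
for name, orat in ranked[:5]:
    print(f"{name:25s} odds ratio = {orat:.3f}")
```

An odds ratio above 1 raises the odds of the positive class; below 1 lowers them.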
Multi-class Logistic Regression
Logistic regression extends to multiple classes using one-vs-rest (OvR) or multinomial approaches:
```python
from sklearn.datasets import load_iris

# Load iris dataset (3 classes)
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Multinomial logistic regression
multi_model = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                                 max_iter=1000)
multi_model.fit(X_train_scaled, y_train)

print(f"Multi-class accuracy: {multi_model.score(X_test_scaled, y_test):.3f}")

# Inspect probabilities for each class
proba = multi_model.predict_proba(X_test_scaled)
print("\nProbabilities for first 5 test samples:")
print(proba[:5])
```
The multinomial option trains a single model that outputs probabilities for all classes simultaneously, while ovr trains separate binary classifiers for each class.
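A quick way to see the difference is to fit both on the same data. This sketch uses sklearn's `OneVsRestClassifier` wrapper to build the OvR version explicitly (training-set accuracy is shown just for comparison; the two often land close together on easy datasets like iris):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# One joint model: softmax probabilities over all 3 classes
multi = LogisticRegression(max_iter=1000).fit(X_scaled, y)

# Three independent binary classifiers, one per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_scaled, y)

print("multinomial accuracy:", multi.score(X_scaled, y))
print("one-vs-rest accuracy:", ovr.score(X_scaled, y))
```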
Common Pitfalls and Best Practices
Class imbalance is one of the most common issues. When one class dominates, the model can achieve high accuracy by simply predicting the majority class while learning little about the minority class. Handle this with class weights:
```python
from sklearn.utils.class_weight import compute_class_weight

# Compute class weights (inverse to class frequency)
class_weights = compute_class_weight('balanced', classes=np.unique(y_train),
                                     y=y_train)
class_weight_dict = dict(enumerate(class_weights))

# Train with balanced weights
balanced_model = LogisticRegression(class_weight='balanced', max_iter=1000)
balanced_model.fit(X_train_scaled, y_train)
```
Feature scaling is non-negotiable. Without it, features with larger ranges dominate gradient updates, slowing convergence and potentially causing poor performance.
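The convergence cost is easy to measure: fit the same model on raw and standardized versions of the same dataset and compare the solver's iteration counts via the `n_iter_` attribute. On most datasets with features on very different scales, the unscaled fit needs far more iterations (a sketch, using the breast cancer data for self-containment):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same solver, same data -- only the scaling differs
raw = LogisticRegression(max_iter=5000).fit(X, y)
scaled = LogisticRegression(max_iter=5000).fit(
    StandardScaler().fit_transform(X), y)

print("iterations without scaling:", raw.n_iter_[0])
print("iterations with scaling:   ", scaled.n_iter_[0])
```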
Multicollinearity inflates coefficient variance. Check correlation matrices and consider removing highly correlated features or using regularization.
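A correlation check takes only a few lines. This sketch flags feature pairs in the breast cancer data whose absolute correlation exceeds 0.95 (an arbitrary illustrative cutoff, not a universal rule); radius, perimeter, and area features are near-duplicates of each other:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Upper triangle of the absolute correlation matrix (each pair once)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

pairs = [(r, c, upper.loc[r, c])
         for r in upper.index for c in upper.columns
         if upper.loc[r, c] > 0.95]
for r, c, v in pairs[:5]:
    print(f"{r} <-> {c}: {v:.3f}")
```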
Logistic regression assumes linear decision boundaries. If your data has complex, non-linear patterns, you’ll need polynomial features or a different algorithm entirely. Always visualize your decision boundaries when possible to verify the model makes sense for your problem.
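The polynomial-features escape hatch is worth a quick sketch. On concentric-circle data, where no straight line can separate the classes, adding degree-2 features lets logistic regression learn a circular boundary (the dataset and pipeline here are illustrative, not from the text above):

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Concentric circles: linearly inseparable by construction
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=42)

linear = LogisticRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(),
                     LogisticRegression(max_iter=1000)).fit(X, y)

print("linear features accuracy:    ", linear.score(X, y))
print("polynomial features accuracy:", poly.score(X, y))
```

The squared terms x₁² and x₂² are what let a linear decision rule express "distance from the center," which is exactly the boundary this data needs.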