Deep Learning: Learning Rate Scheduling Explained

Key Insights

  • Learning rate scheduling can reduce training time by 30-50% compared to fixed learning rates by allowing aggressive early learning followed by fine-grained convergence
  • The optimal schedule depends heavily on your architecture and dataset—transformers benefit from warm-up followed by decay, while CNNs often perform best with step decay or cosine annealing
  • Always use a learning rate finder before training to establish your initial learning rate and maximum bounds, rather than relying on default values from papers

Introduction to Learning Rate Scheduling

The learning rate is the single most important hyperparameter in neural network training. It controls how much we adjust weights in response to the estimated error gradient. Set it too high, and your model oscillates wildly or diverges. Set it too low, and training crawls along, potentially getting stuck in poor local minima.
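This trade-off is easy to see on a toy one-dimensional loss L(w) = w², where each gradient step multiplies w by (1 - 2*lr). A quick sketch with illustrative values:

```python
def gd_on_quadratic(lr, steps=50, w=1.0):
    # Gradient descent on L(w) = w**2, whose gradient is 2*w;
    # each update multiplies w by (1 - 2*lr)
    for _ in range(steps):
        w = w - lr * (2 * w)
    return abs(w)

print(gd_on_quadratic(0.05))  # too low: shrinks slowly
print(gd_on_quadratic(0.45))  # well-chosen: essentially converged
print(gd_on_quadratic(1.10))  # too high: |w| grows every step and diverges
```

With lr above 1.0 the multiplier's magnitude exceeds 1, so every step moves the weight further from the minimum.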

The problem with fixed learning rates is that what works at the beginning of training doesn’t work at the end. Early in training, you want large steps to quickly navigate the loss landscape. Later, you need smaller steps to settle into a good minimum without overshooting. This is where learning rate scheduling comes in—systematically adjusting the learning rate during training to get the best of both worlds.

Here’s a simple demonstration of the impact:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Simple model and data
model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))
criterion = nn.MSELoss()
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)

def train(lr_schedule=None, epochs=100):
    # Re-create the model each run so both comparisons start from identical weights
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    losses = []
    
    for epoch in range(epochs):
        if lr_schedule:
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr_schedule(epoch)
        
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    
    return losses

# Fixed learning rate
fixed_losses = train()

# Scheduled learning rate (simple decay)
scheduled_losses = train(lr_schedule=lambda epoch: 0.1 * (0.95 ** epoch))

plt.plot(fixed_losses, label='Fixed LR')
plt.plot(scheduled_losses, label='Scheduled LR')
plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Loss')

The scheduled version typically converges faster and to a lower final loss.

Common Learning Rate Scheduling Strategies

PyTorch provides several built-in schedulers that cover most use cases. Let’s examine the most effective ones.

Step Decay reduces the learning rate by a factor every N epochs. This is the simplest approach and works surprisingly well for CNNs.

Exponential Decay smoothly decreases the learning rate by a constant factor each epoch, providing more gradual adjustment than step decay.

Cosine Annealing follows a cosine curve, providing smooth decay that’s become popular for training transformers and ResNets.

ReduceLROnPlateau is reactive rather than proactive—it monitors a metric and reduces learning rate when progress stalls.

import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, ExponentialLR, CosineAnnealingLR, ReduceLROnPlateau
import numpy as np

optimizer = optim.Adam(model.parameters(), lr=0.1)
epochs = 100

# Step Decay: reduce LR by 0.1 every 30 epochs
step_scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Exponential Decay: multiply LR by 0.95 each epoch
exp_scheduler = ExponentialLR(optimizer, gamma=0.95)

# Cosine Annealing: smooth decay following cosine curve
cosine_scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-6)

# Reduce on Plateau: halve the LR when the monitored metric stalls
# (must be stepped with that metric: plateau_scheduler.step(val_loss))
plateau_scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5,
                                      patience=10)

# Visualize the schedules: give each a fresh optimizer so the curves stay independent
schedulers = {
    'Step': StepLR(optim.Adam(model.parameters(), lr=0.1), step_size=30, gamma=0.1),
    'Exponential': ExponentialLR(optim.Adam(model.parameters(), lr=0.1), gamma=0.95),
    'Cosine': CosineAnnealingLR(optim.Adam(model.parameters(), lr=0.1), T_max=epochs)
}

for name, sched in schedulers.items():
    opt = sched.optimizer
    lrs = [opt.param_groups[0]['lr']]
    for _ in range(epochs - 1):
        opt.step()  # no real gradient step; just satisfies PyTorch's step-order check
        sched.step()
        lrs.append(opt.param_groups[0]['lr'])
    plt.plot(lrs, label=name)

plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.yscale('log')

For most computer vision tasks, I recommend starting with cosine annealing. It’s smooth, predictable, and widely validated across architectures.
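ReduceLROnPlateau is missing from the visualization because, unlike the others, it must be stepped with the metric it monitors. A minimal sketch with made-up validation losses:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Halve the LR after 2 consecutive epochs without improvement
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)

for val_loss in [1.0, 0.9, 0.9, 0.9, 0.9]:  # validation loss stalls at 0.9
    scheduler.step(val_loss)                # pass the metric, unlike other schedulers

print(optimizer.param_groups[0]['lr'])      # → 0.05
```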

Warm-up Strategies

Large batch training and transformer architectures often fail with high initial learning rates. The solution is warm-up: gradually increasing the learning rate from a small value to your target over the first few epochs.

The theory is that early in training, the loss landscape is poorly understood. Large updates based on initial gradients can send the model in completely wrong directions. Warm-up gives the model time to stabilize before aggressive learning begins.

class WarmupCosineSchedule:
    def __init__(self, optimizer, warmup_epochs, total_epochs, 
                 base_lr, min_lr=1e-6):
        self.optimizer = optimizer
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.base_lr = base_lr
        self.min_lr = min_lr
        self.current_epoch = 0
    
    def step(self):
        if self.current_epoch < self.warmup_epochs:
            # Linear warm-up
            lr = self.base_lr * (self.current_epoch + 1) / self.warmup_epochs
        else:
            # Cosine annealing
            progress = (self.current_epoch - self.warmup_epochs) / \
                       (self.total_epochs - self.warmup_epochs)
            lr = self.min_lr + (self.base_lr - self.min_lr) * \
                 0.5 * (1 + np.cos(np.pi * progress))
        
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        
        self.current_epoch += 1
        return lr

# Usage
optimizer = optim.AdamW(model.parameters(), lr=1e-4)
scheduler = WarmupCosineSchedule(optimizer, warmup_epochs=10, 
                                 total_epochs=100, base_lr=1e-3)

# Visualize
lrs = [scheduler.step() for _ in range(100)]
plt.plot(lrs)
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Warm-up + Cosine Annealing')

For transformers, use 5-10% of total epochs for warm-up. For CNNs, warm-up is usually unnecessary unless you’re using very large batches (>1024).
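If you'd rather not maintain a custom class, recent PyTorch versions can compose the same warm-up-then-cosine shape from built-in schedulers via SequentialLR (the model and hyperparameters below are illustrative):

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = nn.Linear(10, 1)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)  # lr here is the post-warm-up peak

warmup_epochs, total_epochs = 10, 100
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # Ramp from 10% of the base LR up to the base LR, then cosine-decay
        LinearLR(optimizer, start_factor=0.1, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs, eta_min=1e-6),
    ],
    milestones=[warmup_epochs],
)

lrs = []
for _ in range(total_epochs):
    optimizer.step()  # the actual gradient update would happen here
    lrs.append(optimizer.param_groups[0]['lr'])
    scheduler.step()
```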

Cyclical Learning Rates

Cyclical learning rates take a counterintuitive approach: instead of monotonically decreasing, the learning rate cycles between bounds. The 1cycle policy, popularized by fast.ai, uses a single cycle with warm-up, peak, and decay phases.

The benefits are twofold. First, periodic high learning rates help escape sharp local minima that generalize poorly. Second, the approach reduces the need for extensive hyperparameter tuning—you primarily need to find the maximum learning rate.

class OneCycleScheduler:
    def __init__(self, optimizer, max_lr, total_steps, pct_start=0.3, 
                 div_factor=25, final_div_factor=1e4):
        self.optimizer = optimizer
        self.max_lr = max_lr
        self.total_steps = total_steps
        self.step_num = 0
        
        self.warmup_steps = int(total_steps * pct_start)
        self.start_lr = max_lr / div_factor
        self.final_lr = max_lr / final_div_factor
    
    def step(self):
        if self.step_num < self.warmup_steps:
            # Increase to max_lr
            lr = self.start_lr + (self.max_lr - self.start_lr) * \
                 self.step_num / self.warmup_steps
        else:
            # Cosine decay to final_lr
            progress = (self.step_num - self.warmup_steps) / \
                       (self.total_steps - self.warmup_steps)
            lr = self.final_lr + (self.max_lr - self.final_lr) * \
                 0.5 * (1 + np.cos(np.pi * progress))
        
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        
        self.step_num += 1
        return lr

# Compare 1cycle vs standard decay
optimizer_1cycle = optim.SGD(model.parameters(), lr=0.1)
optimizer_standard = optim.SGD(model.parameters(), lr=0.1)

scheduler_1cycle = OneCycleScheduler(optimizer_1cycle, max_lr=0.5, 
                                     total_steps=1000)
scheduler_standard = CosineAnnealingLR(optimizer_standard, T_max=1000)

lrs_1cycle = [scheduler_1cycle.step() for _ in range(1000)]
lrs_standard = [optimizer_standard.param_groups[0]['lr']]
for _ in range(999):
    scheduler_standard.step()
    lrs_standard.append(optimizer_standard.param_groups[0]['lr'])

plt.plot(lrs_1cycle, label='1cycle')
plt.plot(lrs_standard, label='Cosine Annealing')
plt.legend()
plt.xlabel('Step')
plt.ylabel('Learning Rate')

The 1cycle policy often trains 2-3x faster than traditional schedules and frequently achieves better final accuracy. It’s particularly effective for CNNs and should be your first choice when training from scratch.
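Note that PyTorch ships this policy as the built-in OneCycleLR, which is generally preferable to a hand-rolled version; it is stepped once per batch, not per epoch. A minimal sketch with illustrative values:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR

model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# total_steps = epochs * batches_per_epoch; the scheduler overrides the optimizer's lr
scheduler = OneCycleLR(optimizer, max_lr=0.5, total_steps=1000,
                       pct_start=0.3, div_factor=25, final_div_factor=1e4)

lrs = []
for _ in range(1000):
    optimizer.step()   # one gradient step per batch in real training
    lrs.append(optimizer.param_groups[0]['lr'])
    scheduler.step()   # called per batch, not per epoch
```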

Choosing and Tuning Schedules

The right schedule depends on your architecture and training regime:

  • CNNs from scratch: 1cycle or cosine annealing with warm-up
  • Transfer learning: Step decay or ReduceLROnPlateau with small initial LR
  • Transformers: Warm-up + cosine/linear decay (warm-up is critical)
  • Small datasets (<10k samples): Conservative schedules, ReduceLROnPlateau
  • Large datasets: Aggressive schedules, 1cycle

Before choosing schedule parameters, use a learning rate finder to determine the optimal range:

class LRFinder:
    def __init__(self, model, optimizer, criterion, device):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.device = device
    
    def find(self, train_loader, start_lr=1e-7, end_lr=10, num_iter=100):
        # Snapshot the weights so the exploratory run doesn't corrupt the model
        init_state = {k: v.detach().clone() for k, v in self.model.state_dict().items()}
        self.model.train()
        lrs = []
        losses = []
        
        lr_mult = (end_lr / start_lr) ** (1 / num_iter)
        lr = start_lr
        
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        
        iterator = iter(train_loader)
        for iteration in range(num_iter):
            try:
                inputs, targets = next(iterator)
            except StopIteration:
                iterator = iter(train_loader)
                inputs, targets = next(iterator)
            
            inputs, targets = inputs.to(self.device), targets.to(self.device)
            
            self.optimizer.zero_grad()
            outputs = self.model(inputs)
            loss = self.criterion(outputs, targets)
            loss.backward()
            self.optimizer.step()
            
            lrs.append(lr)
            losses.append(loss.item())
            
            lr *= lr_mult
            for param_group in self.optimizer.param_groups:
                param_group['lr'] = lr
            
            # Stop early once the loss clearly diverges
            if loss.item() > 4 * min(losses):
                break
        
        # Restore the original weights before real training begins
        self.model.load_state_dict(init_state)
        return lrs, losses

# Usage
lr_finder = LRFinder(model, optimizer, criterion, device='cuda')
lrs, losses = lr_finder.find(train_loader)

plt.plot(lrs, losses)
plt.xscale('log')
plt.xlabel('Learning Rate')
plt.ylabel('Loss')
plt.title('LR Finder - Choose LR at steepest decline')

Look for the steepest downward slope in the loss curve. Your maximum learning rate should be about 10x smaller than where the loss starts increasing.
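Both heuristics can be automated. The curve below is synthetic (the formula is purely illustrative), but the selection logic is what you would apply to real finder output: take the steepest negative slope of loss against log-LR, and back off one decade from the divergence point.

```python
import numpy as np

# Synthetic finder curve: loss declines, flattens near 1e-3.5, diverges past ~1e-1.5
lrs = np.logspace(-6, 0, 120)
log_lr = np.log10(lrs)
losses = 2.0 - 1.5 / (1 + np.exp(-3 * (log_lr + 3.5)))  # smooth decline
losses = losses + np.clip(log_lr + 1.5, 0, None) ** 2   # divergence term

# Heuristic 1: LR at the steepest decline of loss vs. log-LR
steepest_lr = lrs[np.argmin(np.gradient(losses, log_lr))]

# Heuristic 2: one decade below the point where loss turns back up
i_min = int(np.argmin(losses))
i_div = i_min + int(np.argmax(losses[i_min:] > losses[i_min] + 0.5))
max_lr = lrs[i_div] / 10
```

On real, noisy loss curves it usually helps to smooth the losses (e.g. a running average) before taking the gradient.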

Best Practices and Common Pitfalls

Do:

  • Always use a learning rate finder before training
  • Monitor both training and validation metrics when using ReduceLROnPlateau
  • Save checkpoints before major learning rate changes
  • Use warm-up for transformers and large batch training
  • Log learning rate alongside loss metrics

Don’t:

  • Don’t decay learning rate too early—let the model learn at high rates initially
  • Don’t use the same schedule for fine-tuning that you use for training from scratch
  • Don’t forget to call scheduler.step() in your training loop
  • Don’t use aggressive schedules on small datasets (high risk of overfitting)

Here’s a complete training pipeline incorporating these principles:

def train_with_scheduling(model, train_loader, val_loader, epochs=100):
    optimizer = optim.AdamW(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    
    # Use the LR finder to pick a maximum LR (steepest decline, scaled down for safety)
    lr_finder = LRFinder(model, optimizer, criterion, 'cuda')
    lrs, losses = lr_finder.find(train_loader)
    optimal_lr = lrs[int(np.argmin(np.gradient(losses)))] / 10
    
    # Reinitialize optimizer with found LR
    optimizer = optim.AdamW(model.parameters(), lr=optimal_lr)
    scheduler = OneCycleScheduler(optimizer, max_lr=optimal_lr, 
                                  total_steps=epochs * len(train_loader))
    
    best_val_loss = float('inf')
    
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to('cuda'), target.to('cuda')
            
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            scheduler.step()
            
            train_loss += loss.item()
        
        # Validation
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for data, target in val_loader:
                data, target = data.to('cuda'), target.to('cuda')
                output = model(data)
                val_loss += criterion(output, target).item()
        
        val_loss /= len(val_loader)
        current_lr = optimizer.param_groups[0]['lr']
        
        print(f'Epoch {epoch}: Train Loss: {train_loss/len(train_loader):.4f}, '
              f'Val Loss: {val_loss:.4f}, LR: {current_lr:.6f}')
        
        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'val_loss': val_loss,
            }, 'best_model.pth')

Learning rate scheduling is not optional—it’s essential for achieving state-of-the-art results. Start with a learning rate finder, choose a schedule appropriate for your architecture, and monitor your metrics closely. The investment in proper scheduling pays dividends in faster training and better final performance.
