How to Use GPU Training in PyTorch

Key Insights

  • Moving PyTorch models to GPU requires explicitly transferring both the model and data using .to(device), and mixing CPU/GPU tensors is the most common source of runtime errors
  • GPU training can provide 10-50x speedups for deep learning workloads, but only becomes beneficial when batch sizes and model complexity are large enough to saturate GPU cores
  • Modern PyTorch applications should use DistributedDataParallel over DataParallel for multi-GPU training, and leverage automatic mixed precision (AMP) to maximize throughput on recent GPU architectures

Introduction to GPU Training in PyTorch

GPUs accelerate deep learning training by orders of magnitude because neural networks are fundamentally matrix multiplication operations executed repeatedly. While CPUs excel at sequential tasks with complex logic, GPUs contain thousands of smaller cores optimized for parallel arithmetic operations. A single modern GPU can perform trillions of floating-point operations per second, making it ideal for the massive matrix computations required during forward and backward passes through neural networks.

The performance difference is substantial. Training a ResNet-50 on ImageNet might take weeks on a CPU but only hours on a high-end GPU. For small models with minimal data, the overhead of GPU memory transfers can negate the benefits, but once your model exceeds a few million parameters or your dataset requires more than a few minutes of CPU training time, GPU acceleration becomes essential.

Here’s a concrete comparison showing the performance difference:

import torch
import torch.nn as nn
import time

# Simple neural network
model = nn.Sequential(
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, 512),
    nn.ReLU(),
    nn.Linear(512, 10)
)

# Generate random data
data = torch.randn(128, 1024)
target = torch.randint(0, 10, (128,))

# CPU training
import copy
model_cpu = copy.deepcopy(model)  # independent copy: .cuda() below moves `model` in place
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model_cpu.parameters())

start = time.time()
for _ in range(100):
    optimizer.zero_grad()
    output = model_cpu(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
cpu_time = time.time() - start

# GPU training
if torch.cuda.is_available():
    model_gpu = model.cuda()
    data_gpu = data.cuda()
    target_gpu = target.cuda()
    optimizer = torch.optim.Adam(model_gpu.parameters())
    
    torch.cuda.synchronize()  # CUDA ops run asynchronously; sync before timing
    start = time.time()
    for _ in range(100):
        optimizer.zero_grad()
        output = model_gpu(data_gpu)
        loss = criterion(output, target_gpu)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()  # wait for queued kernels to finish
    gpu_time = time.time() - start
    
    print(f"CPU time: {cpu_time:.2f}s")
    print(f"GPU time: {gpu_time:.2f}s")
    print(f"Speedup: {cpu_time/gpu_time:.2f}x")

Checking GPU Availability and Setup

Before using GPU training, verify that PyTorch can access your CUDA-enabled GPU. This requires installing the CUDA toolkit and the GPU-enabled version of PyTorch. The installation command varies by CUDA version, so check the official PyTorch website for the correct pip or conda command.

Check GPU availability and properties with these commands:

import torch

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")

# Number of GPUs
print(f"Number of GPUs: {torch.cuda.device_count()}")

# Current GPU
if torch.cuda.is_available():
    print(f"Current device: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(0)}")
    
    # Detailed GPU properties
    props = torch.cuda.get_device_properties(0)
    print(f"Total memory: {props.total_memory / 1e9:.2f} GB")
    print(f"Compute capability: {props.major}.{props.minor}")
    print(f"Multi-processor count: {props.multi_processor_count}")

If torch.cuda.is_available() returns False, you either don’t have a compatible GPU, haven’t installed CUDA drivers, or installed the CPU-only version of PyTorch. Reinstall PyTorch with the appropriate CUDA version for your system.

Moving Models and Data to GPU

PyTorch doesn’t automatically use the GPU. You must explicitly move tensors and models to GPU memory using the .to() or .cuda() method. The .to() method is more flexible and recommended for production code because it accepts device objects, making your code device-agnostic.

import torch
import torch.nn as nn

# Define device (works whether GPU is available or not)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Create a model
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

model = SimpleNet().to(device)  # Move model to GPU

# Create tensors on GPU
x = torch.randn(32, 784).to(device)
y = torch.randint(0, 10, (32,)).to(device)

# Alternative: create directly on GPU
x_gpu = torch.randn(32, 784, device=device)

# For multi-GPU systems, specify GPU index
if torch.cuda.device_count() > 1:
    model_gpu1 = SimpleNet().to('cuda:1')  # Use second GPU

The critical rule: all tensors in an operation must be on the same device. Attempting to multiply a CPU tensor with a GPU tensor will raise a runtime error. This is the most common mistake when starting with GPU training.
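One way to catch mismatches early is a small assertion helper. This is an illustrative sketch (`check_same_device` is a hypothetical name, not a PyTorch API):

```python
import torch

def check_same_device(*tensors):
    """Raise a clear error if the given tensors live on different devices."""
    devices = {t.device for t in tensors}
    if len(devices) > 1:
        raise RuntimeError(f"Tensors span multiple devices: {devices}")
    return devices.pop()

a = torch.randn(4, 4)   # CPU tensor
b = torch.randn(4, 4)   # CPU tensor
print(check_same_device(a, b))  # cpu

if torch.cuda.is_available():
    c = b.cuda()
    try:
        a @ c  # CPU x GPU: PyTorch raises a RuntimeError
    except RuntimeError as err:
        print(f"Caught: {err}")
```

Calling the helper at the top of a training step turns a confusing mid-backprop failure into an immediate, readable error.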

Complete GPU Training Example

Here’s a full training loop demonstrating proper GPU usage patterns:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create synthetic dataset
X_train = torch.randn(1000, 784)
y_train = torch.randint(0, 10, (1000,))
train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Define model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 10)
        )
    
    def forward(self, x):
        return self.layers(x)

model = MLP().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    
    for batch_idx, (data, target) in enumerate(train_loader):
        # Move batch to GPU
        data, target = data.to(device), target.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
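The same device discipline applies at evaluation time: move each batch to the device, and copy results back with `.cpu()` for Python-side bookkeeping. A minimal sketch with a stand-in model and a synthetic validation set:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-in model and synthetic validation data for illustration
model = nn.Linear(784, 10).to(device)
val_loader = DataLoader(
    TensorDataset(torch.randn(200, 784), torch.randint(0, 10, (200,))),
    batch_size=64,
)

model.eval()           # disable dropout, freeze batch-norm statistics
correct = total = 0
with torch.no_grad():  # no autograd graph: lower memory use, faster
    for data, target in val_loader:
        data, target = data.to(device), target.to(device)
        preds = model(data).argmax(dim=1)
        # .cpu() copies results back for Python-side bookkeeping
        correct += (preds == target).cpu().sum().item()
        total += target.size(0)

print(f"Accuracy: {correct / total:.2%}")
```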

Multi-GPU Training with DataParallel and DistributedDataParallel

When you have multiple GPUs, PyTorch offers two approaches for parallel training. DataParallel is simpler but has limitations. DistributedDataParallel is more complex but significantly faster and the recommended approach for serious multi-GPU training.

# DataParallel (simple but not optimal)
if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs")
    model = nn.DataParallel(model)
model.to(device)

# DistributedDataParallel (recommended)
import os

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(rank, world_size):
    # init_process_group finds its peers via these environment variables
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

# One process per GPU, launched e.g. with
# torch.multiprocessing.spawn(train_ddp, args=(world_size,), nprocs=world_size)
def train_ddp(rank, world_size):
    setup_ddp(rank, world_size)
    
    model = MLP().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    
    # Training loop with ddp_model instead of model
    # ... rest of training code
    
    dist.destroy_process_group()

DataParallel replicates the model on all GPUs but performs gradient aggregation on GPU 0, creating a bottleneck. DistributedDataParallel uses a ring-allreduce algorithm to distribute gradient synchronization across all GPUs, providing much better scaling efficiency.
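Each DDP process should also see a disjoint shard of the dataset, which is what DistributedSampler provides. The sketch below initializes a single-process "gloo" group purely so the API can be demonstrated on any machine, including CPU-only ones; a real multi-GPU job uses "nccl" with one process per GPU:

```python
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Single-process "gloo" group just to demo the API; real jobs use "nccl"
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

dataset = TensorDataset(torch.randn(100, 784), torch.randint(0, 10, (100,)))
sampler = DistributedSampler(dataset, shuffle=True)  # disjoint shard per rank
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for data, target in loader:
        pass  # training step goes here

print(f"Samples on this rank: {len(sampler)}")
dist.destroy_process_group()
```

Without `set_epoch`, every epoch replays the same shuffled order on each rank.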

Common Pitfalls and Best Practices

Mixed CPU/GPU Tensors: Always verify tensors are on the correct device. Add assertions during debugging:

assert data.device == next(model.parameters()).device, "Device mismatch!"

Memory Management: GPU memory isn’t automatically released. Clear cache when needed:

# Clear unused cached memory
torch.cuda.empty_cache()

# Monitor memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

Faster Data Loading: Use pinned memory for faster CPU-to-GPU transfers:

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,  # Faster GPU transfer
    num_workers=4     # Parallel data loading
)
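`pin_memory` pairs with `non_blocking=True` on the transfer itself: with a pinned source buffer, the host-to-device copy can overlap with kernels already queued on the GPU. A minimal sketch of the pattern:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

dataset = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))
# Pinning only matters (and only applies) when a GPU is present
loader = DataLoader(dataset, batch_size=64,
                    pin_memory=torch.cuda.is_available())

for data, target in loader:
    # With a pinned source, non_blocking=True lets the host-to-device copy
    # overlap with compute already queued on the GPU
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    # ... forward/backward step here

print(f"Last batch is on: {data.device}")
```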

Debugging CUDA OOM Errors: Reduce batch size, use gradient accumulation, or enable gradient checkpointing for large models:

# Gradient accumulation (effective batch size = batch_size * accumulation_steps)
accumulation_steps = 4
optimizer.zero_grad()  # start from clean gradients
for i, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    output = model(data)
    loss = criterion(output, target) / accumulation_steps
    loss.backward()
    
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
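Gradient checkpointing, the third option mentioned above, trades compute for memory: intermediate activations are recomputed during the backward pass instead of being stored. For an `nn.Sequential` stack, `torch.utils.checkpoint.checkpoint_sequential` applies this directly:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# A deep stack of blocks; without checkpointing, every intermediate
# activation would be kept alive until the backward pass
model = nn.Sequential(*[
    nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)
]).to(device)

x = torch.randn(64, 512, device=device, requires_grad=True)

# Split the stack into 2 segments: only segment-boundary activations are
# stored; everything else is recomputed on backward
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # torch.Size([64, 512])
```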

Performance Optimization Tips

Batch Size Tuning: Larger batches better utilize GPU parallelism. Increase batch size until you hit memory limits, then reduce slightly for stability.
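One way to locate that memory limit is a doubling search that catches the out-of-memory error. This is an illustrative sketch (`largest_batch_size` is a hypothetical helper, not a PyTorch utility); on a CPU-only machine the CUDA error never fires and the search simply runs to its cap:

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(784, 10).to(device)

def largest_batch_size(model, start=32, cap=4096):
    """Double the batch size until a training step runs out of GPU memory."""
    batch_size = start
    while batch_size <= cap:
        try:
            x = torch.randn(batch_size, 784, device=device)
            model(x).sum().backward()  # exercise forward + backward memory
            model.zero_grad()
            batch_size *= 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return batch_size // 2  # last size that fit
    return cap

print(largest_batch_size(model))
```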

Automatic Mixed Precision: Use FP16 for 2-3x speedup on modern GPUs:

from torch.amp import autocast, GradScaler

scaler = GradScaler('cuda')

for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        
        optimizer.zero_grad()
        
        # Forward pass with autocast
        with autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)
        
        # Backward pass with gradient scaling
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

Profile GPU Utilization: Use nvidia-smi to monitor GPU usage. If utilization is below 80%, you likely have a data loading bottleneck. Increase num_workers in DataLoader or optimize preprocessing.
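For a finer-grained view than nvidia-smi, `torch.profiler` breaks time down per operator; a minimal sketch:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(1024, 1024).to(device)
x = torch.randn(64, 1024, device=device)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(10):
        model(x).sum().backward()

# The table ranks operators by time; on GPU, a large gap between CPU and
# CUDA totals often points at a data-loading or launch-overhead bottleneck
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```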

GPU training transforms deep learning from impractical to production-ready. Master these patterns, avoid common pitfalls, and you’ll efficiently train models that would be impossible on CPU alone.
