How to Implement a Neural Network in PyTorch
Key Insights
- PyTorch’s dynamic computation graph makes debugging neural networks significantly easier than static frameworks, allowing you to use standard Python debugging tools and modify architectures on the fly
- The training loop in PyTorch requires explicit control over forward pass, loss computation, and backpropagation—this verbosity trades convenience for transparency and flexibility
- Always use torch.no_grad() during evaluation to avoid the unnecessary memory cost of gradient tracking, and move both model and data to the same device (CPU or GPU) to avoid runtime errors
Introduction to PyTorch and Neural Networks
PyTorch has become the dominant framework for deep learning research and increasingly for production systems. Unlike TensorFlow’s historically static computation graphs, PyTorch builds graphs dynamically at runtime. This means you can use standard Python control flow—if statements, loops, breakpoints—without special syntax. When your model throws an error, you get a normal Python stack trace pointing to the actual line of code that failed.
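As a small self-contained sketch (the DynamicNet class here is illustrative, not part of the classifier we build below), a forward pass can branch on the input itself using plain Python control flow:

```python
import torch
import torch.nn as nn

class DynamicNet(nn.Module):
    """Toy module whose forward pass branches on a runtime property of the batch."""
    def __init__(self):
        super().__init__()
        self.small = nn.Linear(4, 2)
        self.large = nn.Linear(4, 2)

    def forward(self, x):
        # Ordinary Python `if`: route through a different layer depending
        # on the data. The graph is rebuilt on every call, so this is legal.
        if x.abs().mean() > 1.0:
            return self.large(x)
        return self.small(x)

model = DynamicNet()
out = model(torch.randn(2, 4))
print(out.shape)  # torch.Size([2, 2])
```

Because the graph is constructed as the code runs, a breakpoint or print statement inside forward() behaves exactly as it would in any Python function.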
Choose PyTorch when you need flexibility and rapid experimentation. It excels at research, custom architectures, and situations where your network structure depends on input data. TensorFlow/JAX might edge ahead for massive-scale production deployments, but PyTorch’s TorchScript and torch.compile optimizations have closed that gap considerably.
We’ll build an image classifier for CIFAR-10, a dataset of 60,000 32x32 color images across 10 classes (airplanes, cars, birds, etc.). This is complex enough to demonstrate real neural network concepts but simple enough to train on a laptop in minutes.
Setting Up the Environment and Data Loading
Install PyTorch with CUDA support if you have an NVIDIA GPU. Visit pytorch.org and use their configurator, or run:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
For CPU-only systems:
pip install torch torchvision torchaudio
PyTorch’s torchvision package includes common datasets and transforms. Here’s how to load CIFAR-10 with proper preprocessing:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms
# Define transformations: convert to tensor and normalize
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Download and load training data
train_dataset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

# Download and load test data
test_dataset = torchvision.datasets.CIFAR10(
    root='./data',
    train=False,
    download=True,
    transform=transform
)

# Create data loaders with batching
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=2
)

test_loader = DataLoader(
    test_dataset,
    batch_size=64,
    shuffle=False,
    num_workers=2
)
The DataLoader handles batching, shuffling, and parallel data loading. Set num_workers > 0 to load data in background processes while your GPU trains. Shuffling training data prevents the model from learning spurious patterns based on data order.
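The same batching mechanics can be demonstrated without downloading anything by wrapping random tensors in a TensorDataset; the shapes below mirror CIFAR-10, but the data is synthetic:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 fake "images" with the same shape as CIFAR-10 samples, 10 classes
images = torch.randn(100, 3, 32, 32)
labels = torch.randint(0, 10, (100,))
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Each iteration yields one batch: a stacked image tensor and its labels
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape)  # torch.Size([64, 3, 32, 32])
print(len(loader))         # 2 batches: 64 + 36
```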
Defining the Neural Network Architecture
Every PyTorch neural network inherits from nn.Module. You define layers in __init__() and specify how data flows through them in forward(). PyTorch automatically handles backpropagation through any operations you perform in forward().
Here’s a convolutional neural network for CIFAR-10:
class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        # Pooling layer
        self.pool = nn.MaxPool2d(2, 2)
        # Fully connected layers
        self.fc1 = nn.Linear(64 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, 10)
        # Dropout for regularization
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # Conv block 1
        x = self.pool(torch.relu(self.conv1(x)))
        # Conv block 2
        x = self.pool(torch.relu(self.conv2(x)))
        # Conv block 3
        x = self.pool(torch.relu(self.conv3(x)))
        # Flatten for fully connected layers
        x = x.view(-1, 64 * 4 * 4)
        # Fully connected layers with dropout
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
For simpler architectures, use nn.Sequential:
simple_model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10)
)
The custom class approach provides more control when you need conditional logic, multiple outputs, or complex forward passes.
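To see where the 64 * 4 * 4 input size of fc1 comes from, trace a dummy batch through the conv/pool stack: each Conv2d with kernel_size=3, padding=1 preserves spatial size, and each MaxPool2d(2, 2) halves it, so 32 → 16 → 8 → 4:

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
conv3 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2, 2)

x = torch.randn(1, 3, 32, 32)  # one fake CIFAR-10 image
x = pool(conv1(x))             # -> (1, 32, 16, 16)
x = pool(conv2(x))             # -> (1, 64, 8, 8)
x = pool(conv3(x))             # -> (1, 64, 4, 4)
print(x.shape)
print(x.flatten(1).shape)      # (1, 1024) == (1, 64 * 4 * 4)
```

Running a dummy tensor like this is also a quick sanity check when you change kernel sizes or add layers: a shape mismatch will surface immediately instead of at training time.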
Training the Model
Training requires four components: a model, a loss function, an optimizer, and a training loop. The loop repeatedly feeds data through the model, calculates loss, computes gradients, and updates weights.
# Set device (GPU if available)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

# Initialize model, loss, and optimizer
model = ConvNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0

    for i, (images, labels) in enumerate(train_loader):
        # Move data to device
        images = images.to(device)
        labels = labels.to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        # Print statistics every 100 batches
        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], '
                  f'Step [{i+1}/{len(train_loader)}], '
                  f'Loss: {running_loss/100:.4f}')
            running_loss = 0.0
Critical details: Call optimizer.zero_grad() before each backward pass because PyTorch accumulates gradients by default. Use .to(device) on both model and data—mismatched devices cause cryptic errors. The loss.backward() computes gradients, and optimizer.step() updates weights using those gradients.
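You can verify the accumulation behavior directly. In this small sketch, running backward() twice without an intervening zero_grad() doubles the stored gradient:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

# First backward: d(w*w)/dw = 2w = 6
(w * w).backward()
first = w.grad.item()

# Second backward WITHOUT zeroing: the new gradient is ADDED to the old one
(w * w).backward()
second = w.grad.item()

print(first, second)  # 6.0 12.0
```

This is why every iteration of the training loop above starts with optimizer.zero_grad(); the same additive behavior is occasionally exploited deliberately for gradient accumulation across micro-batches.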
Use Adam optimizer as your default—it adapts learning rates per parameter and works well out of the box. Switch to SGD with momentum only if you have specific reasons or are replicating research.
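If you do switch, a common pairing is SGD with momentum plus a step-decay learning-rate schedule. The sketch below uses a small stand-in model, and the hyperparameter values are conventional starting points, not tuned for this classifier:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # stand-in for ConvNet()
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=5e-4)

# Multiply the learning rate by 0.1 every 5 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(10):
    # ... run one epoch of the training loop here ...
    scheduler.step()  # call once per epoch, after the optimizer updates

print(optimizer.param_groups[0]['lr'])  # decayed twice: 0.1 -> 0.01 -> 0.001
```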
Evaluating and Testing
Evaluation differs from training in two ways: disable gradient computation to save memory and speed up inference, and call model.eval() to switch dropout and batch normalization layers to their inference behavior.
def evaluate_model(model, test_loader, device):
    model.eval()  # Set model to evaluation mode
    correct = 0
    total = 0

    with torch.no_grad():  # Disable gradient computation
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    return accuracy

# Evaluate after training
test_accuracy = evaluate_model(model, test_loader, device)
print(f'Test Accuracy: {test_accuracy:.2f}%')
The torch.no_grad() context manager prevents PyTorch from building the computation graph and storing intermediate activations for backpropagation. Without it, every batch wastes memory and time on bookkeeping you never use, and accumulating un-detached outputs across a large test set can exhaust memory entirely.
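A quick check of what the context manager actually changes:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
x = torch.randn(1, 4)

tracked = layer(x)
print(tracked.requires_grad)    # True: a graph is being built behind this tensor

with torch.no_grad():
    untracked = layer(x)
print(untracked.requires_grad)  # False: no graph, no stored activations
```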
For single image prediction:
def predict_single_image(model, image, device):
    model.eval()
    with torch.no_grad():
        image = image.unsqueeze(0).to(device)  # Add batch dimension
        output = model(image)
        _, predicted = torch.max(output, 1)
        return predicted.item()
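To turn the predicted index into a human-readable label with a confidence score, apply softmax to the logits. The describe_prediction helper below is illustrative, and the class list matches torchvision's CIFAR-10 label order:

```python
import torch
import torch.nn.functional as F

CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

def describe_prediction(logits):
    """Map raw model outputs of shape [1, 10] to (label, confidence)."""
    probs = F.softmax(logits, dim=1)       # normalize logits to probabilities
    confidence, index = torch.max(probs, dim=1)
    return CIFAR10_CLASSES[index.item()], confidence.item()

# Fake logits strongly favoring class index 3 ('cat')
logits = torch.full((1, 10), -2.0)
logits[0, 3] = 5.0
label, conf = describe_prediction(logits)
print(label, round(conf, 3))  # cat 0.992
```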
Saving and Loading Models
Save models after training to avoid retraining. PyTorch offers two approaches: saving the entire model or just the state dictionary (weights). Always prefer saving state dictionaries—they’re more portable and version-stable.
# Save model weights (recommended)
torch.save(model.state_dict(), 'cifar_model.pth')

# Save entire model (less portable)
torch.save(model, 'cifar_model_full.pth')

# Save checkpoint with optimizer state for resuming training
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss,
}
torch.save(checkpoint, 'checkpoint.pth')
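A state dictionary is just an ordered mapping from parameter names to tensors, which is why it survives code refactoring better than a pickled model class. You can inspect it directly (shown here on a small stand-in model):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# Keys are derived from module names; parameter-free layers like ReLU
# contribute no entries.
for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape))
# 0.weight (3, 4)
# 0.bias (3,)
# 2.weight (2, 3)
# 2.bias (2,)
```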
Loading models requires recreating the architecture first:
# Load weights into new model instance
model = ConvNet()
model.load_state_dict(torch.load('cifar_model.pth'))
model.to(device)
model.eval()
# Load full checkpoint
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']
When loading on CPU from GPU-trained models, use:
model.load_state_dict(torch.load('model.pth', map_location=torch.device('cpu')))
This complete implementation gives you a production-ready training pipeline. The pattern—data loading, model definition, training loop, evaluation, and persistence—applies to virtually any neural network architecture. Start here, then customize based on your specific problem domain.