How to Implement Object Detection in PyTorch


Key Insights

  • Object detection combines classification and localization, requiring models to predict both what objects are present and where they’re located using bounding boxes; Faster R-CNN provides an excellent starting point with its two-stage detection approach
  • Transfer learning from COCO pre-trained weights dramatically reduces training time and data requirements, letting you fine-tune on custom datasets with as few as hundreds of labeled images
  • The key implementation challenge lies in properly formatting your data pipeline: bounding boxes must use the (x_min, y_min, x_max, y_max) corner format and be paired with labels in the dictionaries PyTorch’s detection models expect

Introduction to Object Detection

Object detection goes beyond image classification by answering two questions simultaneously: “What objects are in this image?” and “Where are they located?” While a classifier outputs a single label per image, an object detector produces multiple bounding boxes with associated class labels and confidence scores.

Common applications include autonomous driving (detecting pedestrians and vehicles), retail analytics (counting products on shelves), medical imaging (identifying tumors), and surveillance systems. The field has evolved through several architectural paradigms: two-stage detectors like Faster R-CNN that first propose regions then classify them, single-stage detectors like YOLO and SSD that predict everything in one pass, and modern transformer-based approaches.

For this guide, we’ll implement Faster R-CNN because it offers an excellent balance of accuracy and interpretability. By the end, you’ll have a working object detector that you can train on custom datasets.

Setting Up the Environment

Install the required dependencies first. PyTorch with torchvision provides pre-built detection models and utilities:

pip install torch torchvision opencv-python matplotlib pycocotools

For GPU acceleration, ensure you install the CUDA-enabled PyTorch version matching your system. Verify your installation:

import torch
import torchvision
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Object detection datasets follow the COCO format: images paired with annotations containing bounding boxes in [x_min, y_min, width, height] format, though PyTorch expects [x_min, y_min, x_max, y_max]. Each annotation includes a category ID and optionally a segmentation mask.

Here’s how to load a sample dataset:

from torchvision.datasets import CocoDetection
from torch.utils.data import DataLoader

# Load COCO dataset (or your custom dataset in COCO format)
dataset = CocoDetection(
    root='path/to/images',
    annFile='path/to/annotations.json'
)

# Basic dataset inspection
img, target = dataset[0]
print(f"Image size: {img.size}")
print(f"Number of objects: {len(target)}")
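
Since raw COCO annotations store boxes as [x_min, y_min, width, height] while torchvision's detection models expect corner coordinates, a small conversion helper is useful. This is a minimal sketch; the function name `xywh_to_xyxy` is my own, not part of any library:

```python
import torch

def xywh_to_xyxy(boxes):
    """Convert COCO-style [x_min, y_min, width, height] boxes to the
    [x_min, y_min, x_max, y_max] corner format torchvision expects."""
    boxes = torch.as_tensor(boxes, dtype=torch.float32)
    converted = boxes.clone()
    converted[:, 2] = boxes[:, 0] + boxes[:, 2]  # x_max = x_min + width
    converted[:, 3] = boxes[:, 1] + boxes[:, 3]  # y_max = y_min + height
    return converted

print(xywh_to_xyxy([[10, 20, 30, 40]]))  # → tensor([[10., 20., 40., 60.]])
```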

Understanding the Model Architecture

Faster R-CNN operates in two stages. The backbone network (typically ResNet-50 or ResNet-101) extracts feature maps from the input image. The Region Proposal Network (RPN) then scans these features to propose regions likely to contain objects. Finally, the detection head classifies each proposal and refines its bounding box.

This architecture achieves high accuracy because it focuses computational resources on promising regions rather than exhaustively scanning every possible location and scale.

Load a pre-trained model from torchvision:

import torchvision.models.detection as detection

# Load Faster R-CNN with ResNet-50 backbone, pre-trained on COCO
# (torchvision >= 0.13; older versions use pretrained=True instead)
model = detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Inspect the model structure
print(model)

# Modify the classifier for custom number of classes (background + your classes)
num_classes = 5  # background + 4 custom classes
in_features = model.roi_heads.box_predictor.cls_score.in_features

from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

The Feature Pyramid Network (FPN), referenced by the “fpn” suffix in the model name, enables multi-scale detection by building a pyramid of feature maps at different resolutions.

Preparing Data and Transforms

Data preparation is critical. Your custom Dataset class must return images as tensors and targets as dictionaries containing ‘boxes’, ‘labels’, and optionally ‘image_id’ and ‘area’.

import torch
from torch.utils.data import Dataset
from PIL import Image
import json

class CustomObjectDetectionDataset(Dataset):
    def __init__(self, img_dir, annotation_file, transforms=None):
        self.img_dir = img_dir
        self.transforms = transforms
        
        with open(annotation_file, 'r') as f:
            self.annotations = json.load(f)
        
        self.imgs = list(self.annotations.keys())
    
    def __len__(self):
        return len(self.imgs)
    
    def __getitem__(self, idx):
        img_name = self.imgs[idx]
        img_path = f"{self.img_dir}/{img_name}"
        img = Image.open(img_path).convert("RGB")
        
        # Get annotations for this image
        anns = self.annotations[img_name]
        
        boxes = []
        labels = []
        
        for ann in anns:
            # Convert [x, y, w, h] to [x_min, y_min, x_max, y_max]
            x, y, w, h = ann['bbox']
            boxes.append([x, y, x + w, y + h])
            labels.append(ann['category_id'])
        
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        labels = torch.as_tensor(labels, dtype=torch.int64)
        
        target = {
            'boxes': boxes,
            'labels': labels,
            'image_id': torch.tensor([idx])
        }
        
        if self.transforms:
            img = self.transforms(img)
        
        return img, target

# Transform images to tensors
import torchvision.transforms as T

def get_transform():
    transforms = []
    transforms.append(T.ToTensor())
    return T.Compose(transforms)

dataset = CustomObjectDetectionDataset(
    img_dir='data/images',
    annotation_file='data/annotations.json',
    transforms=get_transform()
)

For batching, use a custom collate function since targets are dictionaries:

def collate_fn(batch):
    return tuple(zip(*batch))

dataloader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    collate_fn=collate_fn,
    num_workers=4
)
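
To see why the custom collate function is needed, note that detection images usually have different sizes, so the default collation (which stacks tensors into one batch tensor) would fail. A small demonstration with dummy data:

```python
import torch

def collate_fn(batch):
    # Transpose a list of (image, target) pairs into (images, targets) tuples
    return tuple(zip(*batch))

# Two images with different spatial sizes cannot be stacked into one tensor
batch = [
    (torch.rand(3, 300, 400), {'labels': torch.tensor([1])}),
    (torch.rand(3, 500, 350), {'labels': torch.tensor([2])}),
]
images, targets = collate_fn(batch)
print(len(images), images[0].shape, images[1].shape)
```

The model itself accepts a plain list of variable-sized image tensors, so keeping them in a tuple is exactly what it needs.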

Training the Object Detector

Training involves iterating through batches, computing losses for both the RPN and detection head, and updating weights. PyTorch’s detection models return losses during training automatically.

import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Use SGD with momentum (standard for object detection)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)

# Learning rate scheduler
lr_scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

num_epochs = 10

for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    
    for images, targets in dataloader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        
        # Forward pass - model returns loss dict during training
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        
        # Backward pass
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        
        epoch_loss += losses.item()
    
    lr_scheduler.step()
    
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(dataloader):.4f}")

# Save the trained model
torch.save(model.state_dict(), 'object_detector.pth')

The loss combines classification loss, bounding box regression loss, and RPN losses. Transfer learning from COCO weights means you’re fine-tuning rather than training from scratch, which is why relatively few epochs suffice.

Inference and Visualization

For inference, switch the model to evaluation mode. The model returns predictions without computing losses.

import cv2
import numpy as np

def predict(image_path, model, device, threshold=0.5):
    model.eval()
    
    img = Image.open(image_path).convert("RGB")
    img_tensor = T.ToTensor()(img).unsqueeze(0).to(device)
    
    with torch.no_grad():
        predictions = model(img_tensor)[0]
    
    # Filter by confidence threshold
    keep = predictions['scores'] > threshold
    boxes = predictions['boxes'][keep].cpu().numpy()
    labels = predictions['labels'][keep].cpu().numpy()
    scores = predictions['scores'][keep].cpu().numpy()
    
    return boxes, labels, scores

def visualize_predictions(image_path, boxes, labels, scores, class_names):
    img = cv2.imread(image_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
    for box, label, score in zip(boxes, labels, scores):
        x1, y1, x2, y2 = box.astype(int)
        cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
        
        text = f"{class_names[label]}: {score:.2f}"
        cv2.putText(img, text, (x1, y1-10), 
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    
    return img

# Run inference
boxes, labels, scores = predict('test_image.jpg', model, device)
class_names = ['background', 'cat', 'dog', 'car', 'person']
result_img = visualize_predictions('test_image.jpg', boxes, labels, scores, class_names)

Non-Maximum Suppression (NMS) is already applied internally by the model to remove duplicate detections.

Evaluation and Next Steps

Evaluate your detector using mean Average Precision (mAP), the standard metric for object detection. A full mAP computation is involved, so the snippet below reports a simpler proxy: the fraction of predictions that match a ground-truth box at a fixed IoU threshold.

from torchvision.ops import box_iou

def evaluate_precision(model, dataloader, device, iou_threshold=0.5):
    """Per-image precision at a fixed IoU threshold.

    A simplified stand-in for mAP - use pycocotools for a proper
    COCO-style evaluation.
    """
    model.eval()
    per_image_precision = []

    with torch.no_grad():
        for images, targets in dataloader:
            images = [img.to(device) for img in images]
            predictions = model(images)

            for pred, target in zip(predictions, targets):
                pred_boxes = pred['boxes'].cpu()
                true_boxes = target['boxes']

                if len(pred_boxes) == 0:
                    continue
                if len(true_boxes) == 0:
                    per_image_precision.append(0.0)
                    continue

                # A prediction counts as correct if its best IoU with
                # any ground-truth box clears the threshold
                ious = box_iou(pred_boxes, true_boxes)
                matched = (ious.max(dim=1).values > iou_threshold).float()
                per_image_precision.append(matched.mean().item())

    return sum(per_image_precision) / max(len(per_image_precision), 1)

For production use, leverage pycocotools for comprehensive evaluation metrics.

To improve performance: experiment with data augmentation (random flips, color jittering), try different backbones (ResNet-101, MobileNet for speed), adjust anchor sizes for the RPN to match your object scales, and increase training data through synthetic generation or additional labeling.
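
One augmentation gotcha worth calling out: geometric transforms must be applied to the boxes as well as the image. A minimal paired-flip sketch (the class name `RandomHorizontalFlipWithBoxes` is my own, not a torchvision API):

```python
import random
import torch

class RandomHorizontalFlipWithBoxes:
    """Flip an image tensor (C, H, W) and mirror its boxes to match."""
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, img, target):
        if random.random() < self.p:
            _, _, width = img.shape
            img = img.flip(-1)  # mirror along the width axis
            boxes = target['boxes'].clone()
            # x_min and x_max swap roles when mirrored: new_x = width - old_x
            boxes[:, [0, 2]] = width - target['boxes'][:, [2, 0]]
            target = {**target, 'boxes': boxes}
        return img, target
```

To use it, your dataset's `__getitem__` would call the transform with both the image and the target, rather than the image alone.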

For deployment, consider model quantization for mobile devices, ONNX export for cross-platform inference, or TorchScript for production serving. The architecture you’ve built provides a solid foundation for real-world object detection applications.
