Deep Learning: Activation Functions Explained

Key Insights

  • Activation functions introduce non-linearity into neural networks—without them, stacking multiple layers provides no advantage over a single linear transformation, regardless of network depth.
  • ReLU revolutionized deep learning by solving the vanishing gradient problem that plagued sigmoid and tanh, enabling training of networks with dozens or hundreds of layers through simple thresholding at zero.
  • Modern activation functions like GELU and Swish sacrifice ReLU’s computational simplicity for smoother gradients and better performance in transformer architectures, where their probabilistic gating mechanisms align well with attention patterns.

Introduction to Activation Functions

Neural networks transform inputs through layers of weighted sums followed by activation functions. The activation function determines whether and how strongly a neuron should “fire” based on its input. This seems like a minor implementation detail, but it’s fundamental to why deep learning works at all.

Consider what happens without activation functions. Each layer performs a linear transformation: output = weights × input + bias. Stack multiple layers together, and you’re just composing linear transformations. The mathematical reality is harsh: the composition of linear functions is itself linear. A 100-layer network without activation functions is mathematically equivalent to a single-layer network. You gain nothing from depth.
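You can check this collapse directly: by associativity of matrix multiplication, two stacked linear layers equal a single layer whose weight is the product of the two. A quick sketch with random matrices (names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 2))    # a batch of 8 two-dimensional inputs
W1 = rng.standard_normal((2, 4))   # "layer 1" weights
W2 = rng.standard_normal((4, 1))   # "layer 2" weights

two_layers = (x @ W1) @ W2         # stacked linear layers, no activation
one_layer = x @ (W1 @ W2)          # one layer with the collapsed weight matrix

print(np.allclose(two_layers, one_layer))  # True: the extra depth bought nothing
```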

Activation functions break this linearity. They introduce non-linear transformations that allow networks to approximate complex, non-linear functions. The XOR problem illustrates this perfectly—it’s a simple non-linear problem that linear models cannot solve:

import numpy as np

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Linear model (without activation)
class LinearNetwork:
    def __init__(self):
        self.W1 = np.random.randn(2, 4) * 0.01
        self.W2 = np.random.randn(4, 1) * 0.01
    
    def forward(self, x):
        h = x @ self.W1  # No activation
        out = h @ self.W2  # No activation
        return out

# Network with ReLU activation
class NonLinearNetwork:
    def __init__(self):
        self.W1 = np.random.randn(2, 4) * 0.01
        self.W2 = np.random.randn(4, 1) * 0.01
    
    def forward(self, x):
        h = np.maximum(0, x @ self.W1)  # ReLU activation
        out = h @ self.W2
        return out

# The linear network will fail on XOR regardless of training
# The non-linear network can learn the XOR function

This fundamental limitation drove the search for effective activation functions, leading to decades of research and experimentation.

Classic Activation Functions

The earliest neural networks used sigmoid and tanh functions. Sigmoid squashes inputs to the range (0, 1), while tanh maps to (-1, 1). Both are smooth, differentiable, and historically made sense because they loosely mimic biological neurons: output rises with input and saturates at the extremes, a soft version of fire-or-don't-fire.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x) ** 2

# Visualize functions and derivatives
x = np.linspace(-6, 6, 1000)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))

axes[0, 0].plot(x, sigmoid(x))
axes[0, 0].set_title('Sigmoid')
axes[0, 0].grid(True)

axes[0, 1].plot(x, sigmoid_derivative(x))
axes[0, 1].set_title('Sigmoid Derivative')
axes[0, 1].grid(True)

axes[1, 0].plot(x, tanh(x))
axes[1, 0].set_title('Tanh')
axes[1, 0].grid(True)

axes[1, 1].plot(x, tanh_derivative(x))
axes[1, 1].set_title('Tanh Derivative')
axes[1, 1].grid(True)

plt.tight_layout()

The problem becomes apparent when you examine the derivatives. For inputs with large absolute values, the gradients approach zero. During backpropagation in deep networks, these small gradients multiply together, exponentially shrinking as they propagate backward. This is the vanishing gradient problem, and it made training deep networks nearly impossible before 2010.

# Demonstrate vanishing gradients
def deep_sigmoid_network(x, depth=10):
    """Simulate gradient flow through a deep sigmoid network"""
    gradients = []
    activation = x
    
    for layer in range(depth):
        # Local gradient: the derivative evaluated at this layer's input
        grad = sigmoid_derivative(activation)
        gradients.append(grad)
        # Forward pass
        activation = sigmoid(activation)
    
    # Backprop: the chain rule multiplies the per-layer gradients together
    total_gradient = np.prod(gradients)
    return total_gradient

x = 2.0
for depth in [5, 10, 20, 50]:
    grad = deep_sigmoid_network(x, depth)
    print(f"Depth {depth}: gradient = {grad:.2e}")

# The printed gradient shrinks exponentially with depth, reaching
# effectively zero within a few dozen layers

ReLU and Its Variants

ReLU (Rectified Linear Unit) changed everything. The function is embarrassingly simple: f(x) = max(0, x). If the input is positive, pass it through unchanged. If negative, output zero. This simplicity is its strength.

ReLU solves vanishing gradients for positive inputs—the derivative is exactly 1, so gradients flow backward without attenuation. It’s also computationally cheap: just a comparison and a selection, no expensive exponentials. Training became faster and deeper networks became feasible.

But ReLU isn’t perfect. “Dying ReLU” occurs when neurons get stuck outputting zero for all inputs, with zero gradients preventing any recovery. Variants address this:

import torch
import torch.nn as nn

class ActivationComparison(nn.Module):
    def __init__(self):
        super().__init__()
        
    def relu(self, x):
        """Standard ReLU: max(0, x)"""
        return torch.maximum(torch.zeros_like(x), x)
    
    def leaky_relu(self, x, alpha=0.01):
        """Leaky ReLU: allows small negative gradient"""
        return torch.where(x > 0, x, alpha * x)
    
    def prelu(self, x, alpha):
        """Parametric ReLU: alpha is learned"""
        return torch.where(x > 0, x, alpha * x)
    
    def elu(self, x, alpha=1.0):
        """ELU: smooth curve for negative values"""
        return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

# Benchmark on MNIST
import torchvision
from torch.utils.data import DataLoader

def train_with_activation(activation_fn, epochs=5):
    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 256),
        activation_fn,
        nn.Linear(256, 128),
        activation_fn,
        nn.Linear(128, 10)
    )
    
    train_data = torchvision.datasets.MNIST(
        './data', train=True, download=True,
        transform=torchvision.transforms.ToTensor()
    )
    loader = DataLoader(train_data, batch_size=128, shuffle=True)
    
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(epochs):
        for batch_idx, (data, target) in enumerate(loader):
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            
            if batch_idx % 100 == 0:
                print(f'Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}')
    
    return model

# Compare training speeds
relu_model = train_with_activation(nn.ReLU())
leaky_relu_model = train_with_activation(nn.LeakyReLU(0.01))
elu_model = train_with_activation(nn.ELU())

Leaky ReLU allows a small gradient (typically 0.01) for negative inputs, preventing neurons from dying completely. Parametric ReLU makes this slope learnable. ELU provides a smooth curve for negative values, which can improve learning dynamics but reintroduces computational cost through the exponential function.

Modern Activation Functions

Recent architectures, especially transformers, have moved beyond ReLU toward smoother, non-monotonic functions. GELU (Gaussian Error Linear Unit) and Swish (also called SiLU) incorporate probabilistic elements and provide smooth gradients everywhere.

GELU is defined as x * Φ(x), where Φ is the cumulative distribution function of the standard normal distribution; in practice it is usually computed with a fast tanh-based approximation. Swish is simply x * sigmoid(x). Both are smooth, non-monotonic, and have shown empirical improvements in transformer models.

import math

import torch
import torch.nn.functional as F

def gelu(x):
    """GELU activation (tanh approximation)"""
    return 0.5 * x * (1 + torch.tanh(
        math.sqrt(2.0 / math.pi) *
        (x + 0.044715 * torch.pow(x, 3))
    ))

def swish(x, beta=1.0):
    """Swish/SiLU activation"""
    return x * torch.sigmoid(beta * x)

def mish(x):
    """Mish activation: x * tanh(softplus(x))"""
    return x * torch.tanh(F.softplus(x))

# Simple transformer block comparison
class TransformerFFN(nn.Module):
    def __init__(self, d_model, d_ff, activation):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.activation = activation
        
    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))

# Test different activations
d_model, d_ff = 512, 2048
x = torch.randn(32, 128, d_model)  # batch, seq_len, features

gelu_ffn = TransformerFFN(d_model, d_ff, gelu)
swish_ffn = TransformerFFN(d_model, d_ff, swish)
relu_ffn = TransformerFFN(d_model, d_ff, F.relu)

# Forward pass timing
import time

for name, ffn in [('GELU', gelu_ffn), ('Swish', swish_ffn), ('ReLU', relu_ffn)]:
    start = time.time()
    for _ in range(100):
        output = ffn(x)
    elapsed = time.time() - start
    print(f'{name}: {elapsed:.4f}s')

Why do these work better in transformers? The smooth, non-monotonic nature provides richer gradient information, and the self-gating mechanism (multiplying input by a sigmoid-transformed version) creates a soft selection that aligns well with attention mechanisms.
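The self-gating idea is visible in a few lines: the sigmoid of the input acts as a soft per-element "keep" gate. A minimal sketch (F.silu is PyTorch's built-in Swish with beta fixed at 1):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-6.0, -1.0, 0.0, 1.0, 6.0])
gate = torch.sigmoid(x)   # soft "keep" probability per element
swish = x * gate          # self-gated output

# Large positive inputs pass through nearly unchanged (gate near 1);
# large negative inputs are squashed toward zero (gate near 0)
print(gate)
print(swish)

# Matches PyTorch's built-in SiLU
print(torch.allclose(swish, F.silu(x)))  # True
```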

Choosing the Right Activation Function

Here’s my decision framework:

For CNNs (computer vision): Start with ReLU. It’s fast, effective, and well-understood. Use Leaky ReLU or ELU if you encounter dying ReLU problems (monitor the percentage of dead neurons during training).

For Transformers (NLP, vision transformers): Use GELU. It’s the standard in BERT, GPT, and most modern transformers. The smoothness helps with optimization in these deep architectures.

For RNNs/LSTMs: Tanh in the hidden state transformations (often built-in), sigmoid for gates. These are mathematically motivated by the architecture design.

For output layers: Depends on your task. Sigmoid for binary classification, softmax for multi-class, linear for regression, tanh for outputs in [-1, 1] range.
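As a sketch, those output-layer choices map onto heads like these (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

features = torch.randn(4, 128)  # a batch of penultimate-layer features

# Binary classification: sigmoid turns a single logit into a probability
binary_head = nn.Sequential(nn.Linear(128, 1), nn.Sigmoid())
p = binary_head(features)  # values in (0, 1)

# Multi-class: emit raw logits; nn.CrossEntropyLoss applies
# log-softmax internally, so no activation on the layer itself
logits = nn.Linear(128, 10)(features)

# Regression: a bare linear output, no activation
value = nn.Linear(128, 1)(features)

# Outputs constrained to [-1, 1]: tanh on the final layer
bounded = torch.tanh(nn.Linear(128, 1)(features))
```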

# Practical comparison on same CNN architecture
class CNN(nn.Module):
    def __init__(self, activation):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.fc1 = nn.Linear(64 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)
        self.activation = activation
        self.pool = nn.MaxPool2d(2, 2)
        
    def forward(self, x):
        x = self.pool(self.activation(self.conv1(x)))
        x = self.pool(self.activation(self.conv2(x)))
        x = x.view(-1, 64 * 8 * 8)
        x = self.activation(self.fc1(x))
        x = self.fc2(x)
        return x

# Train with different activations and compare
activations = {
    'ReLU': nn.ReLU(),
    'LeakyReLU': nn.LeakyReLU(0.01),
    'ELU': nn.ELU(),
    'GELU': nn.GELU()
}

results = {}
for name, act in activations.items():
    model = CNN(act)
    # Training loop here...
    # results[name] = final_accuracy

Implementation Best Practices

When implementing custom activation functions, handle gradients correctly and watch for numerical stability:

class CustomActivation(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        # Your activation logic
        output = torch.where(input > 0, input, 0.01 * input)  # Leaky ReLU example
        return output
    
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        # Gradient computation
        grad_input[input <= 0] *= 0.01
        return grad_input

# Apply with: y = CustomActivation.apply(x)

# Numerical stability for sigmoid-based activations
def stable_sigmoid(x):
    """Numerically stable sigmoid.

    torch.where evaluates BOTH branches, so each exp() must be clamped
    to avoid overflow (and NaN gradients) in the branch that gets
    discarded. In practice, prefer the built-in torch.sigmoid.
    """
    pos = torch.exp(-x.clamp(min=0))  # in (0, 1], cannot overflow
    neg = torch.exp(x.clamp(max=0))   # in (0, 1], cannot overflow
    return torch.where(x >= 0, 1 / (1 + pos), neg / (1 + neg))

Profile your implementations. Framework-native functions are heavily optimized:

import torch.utils.benchmark as benchmark

x = torch.randn(1000, 1000, device='cuda' if torch.cuda.is_available() else 'cpu')

# Compare custom vs native
t_custom = benchmark.Timer(
    stmt='custom_relu(x)',
    globals={'custom_relu': lambda x: torch.maximum(torch.zeros_like(x), x), 'x': x}
)

t_native = benchmark.Timer(
    stmt='F.relu(x)',
    globals={'F': F, 'x': x}
)

print(f"Custom: {t_custom.timeit(100)}")
print(f"Native: {t_native.timeit(100)}")
# Native is typically 2-5x faster due to kernel fusion

Use native implementations when available. Only implement custom activations when experimenting with novel functions for research purposes. The performance difference matters when training large models for days or weeks.

Activation functions are a solved problem for most applications. ReLU for CNNs, GELU for transformers, and you’re 95% of the way there. The remaining 5% is where experimentation and domain knowledge come into play.
