Linux Networking: TCP/IP Stack Configuration
Key Insights
- Linux TCP/IP stack tuning can dramatically improve application performance, but defaults are conservative and rarely optimal for production workloads; understanding kernel parameters like tcp_congestion_control and connection tracking limits is essential for any infrastructure engineer.
- The sysctl interface provides runtime control over hundreds of networking parameters, but changes must be made persistent through /etc/sysctl.conf or /etc/sysctl.d/ and validated after reboot to avoid configuration drift.
- Modern congestion control algorithms like BBR can increase throughput by 2-4x over legacy CUBIC in high-latency networks, yet most distributions still ship with suboptimal defaults that prioritize compatibility over performance.
TCP/IP Stack Fundamentals in Linux
The Linux kernel implements the full TCP/IP protocol stack in kernel space, handling everything from link layer operations through application-level socket interfaces. This implementation spans multiple layers: the network device drivers, IP routing and forwarding, TCP/UDP protocol handling, and socket buffer management. Understanding this architecture matters because every packet traversing your system touches multiple subsystems, each with configurable parameters that affect performance, reliability, and security.
The kernel exposes these configuration knobs through the sysctl interface, which provides runtime access to kernel parameters without recompilation. For networking specifically, parameters live under the net.* namespace, with IPv4 settings under net.ipv4.* and core networking under net.core.*.
View current TCP/IP parameters:
sysctl -a | grep net.ipv4 | head -20
This dumps hundreds of parameters. Key categories include TCP behavior (net.ipv4.tcp_*), IP routing (net.ipv4.conf.*), and connection tracking (net.netfilter.*).
Check network stack statistics to understand current behavior:
# Overview of protocol statistics
netstat -s
# Socket summary
ss -s
The netstat -s output shows counters for segments sent, retransmissions, connection resets, and other vital metrics. These counters help identify whether your tuning efforts are working or if you’re hitting specific bottlenecks.
Kernel Parameters and sysctl Configuration
The sysctl interface allows both temporary and persistent configuration changes. Temporary changes help with testing, while persistent changes ensure your tuning survives reboots.
Set a parameter temporarily (lost on reboot):
sysctl -w net.ipv4.tcp_fin_timeout=30
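The sysctl command is a thin wrapper over the /proc/sys filesystem: each parameter maps to a file with the dots replaced by slashes. This is handy in minimal containers or initramfs environments where the sysctl binary may be missing. A quick sketch:

```shell
# Every sysctl is also a file under /proc/sys (dots become slashes)
cat /proc/sys/net/ipv4/tcp_fin_timeout          # same value sysctl -n reports
# Writing the file (as root) is equivalent to sysctl -w:
# echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
```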
Make changes persistent by editing /etc/sysctl.conf or creating files in /etc/sysctl.d/:
# /etc/sysctl.d/99-network-tuning.conf
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_syn_backlog = 8192
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 5000
Apply changes without rebooting:
sysctl -p /etc/sysctl.d/99-network-tuning.conf
Critical parameters you should understand:
tcp_fin_timeout: Time to hold socket in FIN-WAIT-2 state. Default is 60 seconds, but 30 is usually sufficient and reduces socket exhaustion on busy servers.
tcp_tw_reuse: Allows reusing TIME-WAIT sockets for new outgoing connections. Enable this (set to 1) on clients making many outbound connections.
tcp_max_syn_backlog: Maximum number of queued connection requests. Increase this on high-traffic servers to prevent SYN flooding from causing legitimate connection drops.
somaxconn: Maximum length of the socket listen (accept) queue. Applications call listen() with a backlog parameter, but this value silently caps it. The historical default of 128 is absurdly low for production; kernels 5.4 and later raised it to 4096.
netdev_max_backlog: Maximum packets queued on the input side when interface receives packets faster than kernel can process them.
View current values:
sysctl net.ipv4.tcp_fin_timeout net.ipv4.tcp_tw_reuse net.core.somaxconn
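Because somaxconn caps whatever backlog an application requests, the effective accept-queue length is the minimum of the two. A minimal sketch of that interaction (app_backlog is a hypothetical application-side listen() argument, not a kernel parameter):

```shell
# Effective accept queue = min(listen() backlog, net.core.somaxconn)
somaxconn=$(cat /proc/sys/net/core/somaxconn)
app_backlog=4096                                  # hypothetical listen() argument
effective=$(( app_backlog < somaxconn ? app_backlog : somaxconn ))
echo "effective accept queue: $effective"
```

Raising somaxconn alone is not enough: the application must also pass a large backlog to listen(), or the smaller value still wins.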
TCP Tuning for Performance
TCP performance depends heavily on congestion control algorithms, window sizes, and buffer management. Linux supports multiple congestion control algorithms, each optimized for different network conditions.
Check current congestion control algorithm:
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control
The default is typically CUBIC, which works well for most scenarios. However, BBR (Bottleneck Bandwidth and RTT) from Google often performs better, especially on networks with packet loss or variable latency.
Enable BBR (requires kernel 4.9+):
# Load the module
modprobe tcp_bbr
# Set as default
sysctl -w net.ipv4.tcp_congestion_control=bbr
# Verify
sysctl net.ipv4.tcp_congestion_control
Make BBR persistent:
# /etc/modules-load.d/bbr.conf
tcp_bbr
# /etc/sysctl.d/99-network-tuning.conf
net.ipv4.tcp_congestion_control = bbr
Configure TCP buffer sizes for high-throughput applications:
# /etc/sysctl.d/99-network-tuning.conf
# min, default, max in bytes
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
These settings allow TCP windows to scale up to 16MB, critical for high-bandwidth, high-latency networks (think cross-datacenter replication). The kernel auto-tunes within these bounds based on connection characteristics.
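How large the maximum actually needs to be follows from the bandwidth-delay product: bytes required = (bandwidth in bits/s ÷ 8) × RTT in seconds. A quick shell check against the 16MB cap above:

```shell
# Bandwidth-delay product: bytes needed in flight to keep a path full
bdp() {  # bdp <bandwidth_mbit> <rtt_ms> -> bytes
    echo $(( $1 * 1000000 / 8 * $2 / 1000 ))
}
bdp 1000 100    # 1 Gbit/s at 100 ms RTT: 12500000 bytes, fits the 16MB cap
bdp 10000 100   # 10 Gbit/s at 100 ms RTT: 125000000 bytes, far beyond it
```

If the path's BDP exceeds your configured maximum, the window cannot open wide enough and throughput is capped at roughly max_buffer / RTT regardless of link speed.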
Enable TCP window scaling (usually on by default):
sysctl net.ipv4.tcp_window_scaling
Network Interface Configuration
Beyond kernel parameters, network interfaces themselves require configuration. The modern ip command from iproute2 has replaced legacy ifconfig.
View interface configuration:
ip addr show
ip link show
Configure MTU (Maximum Transmission Unit) for jumbo frames on 10GbE networks:
ip link set dev eth0 mtu 9000
Jumbo frames reduce CPU overhead by allowing larger packets, but only enable them if your entire network path supports them. Mismatched MTUs cause fragmentation and performance degradation.
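Before committing to jumbo frames, you can probe whether a path actually passes 9000-byte packets using ping with the don't-fragment flag (remote-host below is a placeholder for a real peer):

```shell
# probe_mtu: send don't-fragment pings sized for a target MTU
# ICMP payload = MTU - 20 (IP header) - 8 (ICMP header)
probe_mtu() {
    local mtu=$1 host=$2
    local payload=$(( mtu - 28 ))
    ping -c 3 -M do -s "$payload" "$host"
}
# Example: probe_mtu 9000 remote-host   # fails fast if any hop has a smaller MTU
```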
Make MTU changes persistent (Ubuntu/Debian with netplan):
# /etc/netplan/01-netcfg.yaml
network:
version: 2
ethernets:
eth0:
mtu: 9000
Check hardware offload features:
ethtool -k eth0
This shows features like TCP segmentation offload (TSO), generic receive offload (GRO), and checksum offloading. These offloads move work from CPU to NIC, improving performance. Modern NICs enable these by default, but verify them:
# Enable GRO if disabled
ethtool -K eth0 gro on
# Enable TSO
ethtool -K eth0 tso on
Firewall and Connection Tracking
The netfilter connection tracking system maintains state for all connections, consuming memory and CPU. On high-traffic servers, default limits cause dropped connections.
Check connection tracking limits:
sysctl net.netfilter.nf_conntrack_max
sysctl net.netfilter.nf_conntrack_count
If nf_conntrack_count approaches nf_conntrack_max, new connections will be dropped and the kernel logs "nf_conntrack: table full, dropping packet". Increase the limit:
# /etc/sysctl.d/99-network-tuning.conf
net.netfilter.nf_conntrack_max = 262144
net.netfilter.nf_conntrack_tcp_timeout_established = 600
The timeout setting controls how long established connections stay in the tracking table. Default is 432000 seconds (5 days), which wastes memory.
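A simple way to watch table pressure is to express usage as a percentage of the limit. The helper below is a sketch (the 80% threshold is an arbitrary choice, and the fallbacks cover hosts where the conntrack module is not loaded):

```shell
# conntrack_pct: tracking-table usage as an integer percentage
conntrack_pct() {  # conntrack_pct <count> <max>
    echo $(( $1 * 100 / $2 ))
}
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null || echo 0)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null || echo 262144)
pct=$(conntrack_pct "$count" "$max")
if [ "$pct" -ge 80 ]; then
    echo "WARNING: conntrack table at ${pct}%"
fi
```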
View current tracked connections:
conntrack -L | head -20
conntrack -S # Statistics
Basic iptables rules affecting TCP behavior:
# Allow established connections
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Rate limit new connections to prevent SYN floods
iptables -A INPUT -p tcp --syn -m limit --limit 100/s --limit-burst 200 -j ACCEPT
Monitoring and Troubleshooting
Effective monitoring reveals whether your tuning is working. Start with socket state analysis:
# Show all TCP sockets with numeric addresses
ss -tan
# Count sockets by state
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c
Look for excessive TIME-WAIT or CLOSE-WAIT states, which indicate application or tuning issues.
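TIME-WAIT pressure matters most relative to the ephemeral port range, since each outbound connection to the same destination holds one local port until the socket expires. A rough check:

```shell
# Compare TIME-WAIT socket count against usable ephemeral ports
tw=$(ss -tan state time-wait | tail -n +2 | wc -l)
read -r lo hi < /proc/sys/net/ipv4/ip_local_port_range
echo "TIME-WAIT sockets: $tw (ephemeral ports available: $(( hi - lo + 1 )))"
```

When the first number approaches the second, outbound connections to a given destination start failing; this is the situation tcp_tw_reuse and a wider ip_local_port_range address.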
Capture packets for deep analysis:
# Capture TCP traffic on port 80
tcpdump -i eth0 -n 'tcp port 80' -w capture.pcap
# Read captured file
tcpdump -r capture.pcap -n | less
Examine kernel network statistics:
cat /proc/net/snmp | grep Tcp
cat /proc/net/netstat | grep TcpExt
Key metrics to watch:
- RetransSegs: TCP retransmissions indicate packet loss
- ListenOverflows / ListenDrops: the socket accept queue overflowed; increase somaxconn and the application's listen() backlog
- TCPReqQFullDrop: the SYN queue overflowed with syncookies disabled; increase tcp_max_syn_backlog
- TCPBacklogDrop: packets dropped because a socket's backlog queue was full, usually a sign the application is draining the socket too slowly
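Raw counters are easier to reason about as ratios. For example, the retransmission rate can be derived from the OutSegs and RetransSegs columns of /proc/net/snmp, whose first Tcp: line names each field (as a rough rule of thumb, sustained rates above about 1% are worth investigating):

```shell
# Retransmission rate: RetransSegs as a percentage of OutSegs
rate=$(awk '/^Tcp:/ {
    if (!header++) { for (i = 1; i <= NF; i++) col[$i] = i }   # header row names the fields
    else if ($col["OutSegs"] > 0)
        printf "%.3f", 100 * $col["RetransSegs"] / $col["OutSegs"]
    else printf "0.000"                                        # no traffic yet
}' /proc/net/snmp)
echo "TCP retransmission rate: ${rate}%"
```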
Production Best Practices
Here’s a complete configuration for a high-performance web server handling thousands of concurrent connections:
# /etc/sysctl.d/99-webserver-tuning.conf
# TCP congestion control
net.ipv4.tcp_congestion_control = bbr
net.core.default_qdisc = fq
# Connection handling
net.core.somaxconn = 8192
net.ipv4.tcp_max_syn_backlog = 8192
net.core.netdev_max_backlog = 16384
# TCP socket reuse
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
# Buffer sizes (16MB max)
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# Connection tracking
net.netfilter.nf_conntrack_max = 524288
net.netfilter.nf_conntrack_tcp_timeout_established = 600
# IP local port range
net.ipv4.ip_local_port_range = 10000 65535
Validation script to verify critical parameters:
#!/bin/bash
# check-network-tuning.sh
check_param() {
local param=$1
local expected=$2
local actual=$(sysctl -n "$param")
if [ "$actual" == "$expected" ]; then
echo "✓ $param = $actual"
else
echo "✗ $param = $actual (expected: $expected)"
fi
}
check_param "net.ipv4.tcp_congestion_control" "bbr"
check_param "net.core.somaxconn" "8192"
check_param "net.ipv4.tcp_max_syn_backlog" "8192"
check_param "net.ipv4.tcp_tw_reuse" "1"
Test performance before and after tuning:
# On server
iperf3 -s
# On client
iperf3 -c server-ip -t 60 -P 10
The -P 10 flag creates 10 parallel streams, stressing your tuning. Compare throughput, retransmissions, and CPU usage before and after applying your configuration. Properly tuned systems often see 2-4x throughput improvements with lower CPU utilization.
Remember: always test configuration changes in staging before production, and monitor for unintended consequences. Network tuning is workload-specific—what optimizes a web server may harm a database server.