Unrolled Linked List: Cache-Friendly Linked Structure

Key Insights

  • Unrolled linked lists store multiple elements per node, dramatically reducing cache misses during traversal by exploiting spatial locality—often achieving 2-5x faster iteration than traditional linked lists.
  • The optimal block size typically aligns with cache line sizes (64 bytes on most modern CPUs), but must be balanced against insertion/deletion frequency since larger blocks increase the cost of these operations.
  • This data structure excels in read-heavy scenarios with sequential access patterns, making it ideal for text editors, database cursors, and any application iterating over large ordered datasets.

The Cache Problem with Traditional Linked Lists

Every computer science student learns linked lists as a fundamental data structure. They offer O(1) insertion and deletion at known positions, dynamic sizing, and conceptual simplicity. What textbooks often gloss over is their catastrophic real-world performance.

The culprit is pointer chasing. Each node in a traditional linked list sits at an arbitrary memory location. When you traverse the list, every single element access potentially triggers a cache miss. Modern CPUs fetch data in cache lines—typically 64 bytes. With a standard linked list, you’re fetching an entire cache line just to read one pointer and one element, then immediately jumping to a completely different memory region.

This matters more than algorithmic complexity in practice. A cache miss to main memory costs 100+ CPU cycles. When your “O(n) traversal” triggers n cache misses, you’re burning thousands of cycles that a contiguous array would avoid entirely.

Unrolled linked lists solve this by storing multiple elements in each node, giving you the flexibility of linked structures with much of the cache efficiency of arrays.

What is an Unrolled Linked List?

An unrolled linked list is a hybrid data structure where each node contains a small array of elements instead of a single element. You maintain a linked list of these array-containing nodes, combining the benefits of both approaches.

The structure looks like this:

template <typename T, size_t BlockSize = 16>
struct UnrolledNode {
    T elements[BlockSize];
    size_t count;              // Number of valid elements in this node
    UnrolledNode* next;
    UnrolledNode* prev;        // Optional: for doubly-linked variant
    
    UnrolledNode() : count(0), next(nullptr), prev(nullptr) {}
};

Each node stores up to BlockSize elements contiguously in memory. The count field tracks how many slots are actually used. When you traverse the list, you iterate through all elements in a node before following the pointer to the next node.

This creates an interesting trade-off space. Pure arrays give you perfect cache locality but expensive insertions. Pure linked lists give you cheap insertions but terrible cache behavior. Unrolled linked lists sit in the middle—you get good cache locality within each block, and insertions only require shifting elements within a single block rather than the entire collection.

How Cache Locality Improves Performance

Modern CPUs don’t fetch individual bytes from memory. They fetch cache lines—64-byte chunks on x86-64 and most ARM processors. When you access one byte, the surrounding 63 bytes come along for free.

With a traditional linked list storing 8-byte integers, each node also carries at least one 8-byte pointer (two for a doubly-linked list such as std::list), so a node occupies 16-24 bytes before allocator overhead. Worse, those nodes aren't adjacent in memory; they're scattered across the heap, so every node access is a potential cache miss.

With an unrolled linked list using 16-element blocks of 8-byte integers, each node holds 128 bytes of data. When you access the first element, the next 7 elements are already in L1 cache. You’ve reduced cache misses by nearly 8x for sequential traversal.

Here’s a benchmark demonstrating the difference:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <list>

template <typename T, size_t BlockSize>
class UnrolledLinkedList; // Forward declaration

void benchmark_traversal(size_t element_count) {
    // Standard linked list
    std::list<int> standard_list;
    for (size_t i = 0; i < element_count; ++i) {
        standard_list.push_back(static_cast<int>(i));
    }
    
    auto start = std::chrono::high_resolution_clock::now();
    long long standard_sum = 0;
    for (int val : standard_list) {
        standard_sum += val;
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto standard_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    
    // Unrolled linked list (using our implementation)
    UnrolledLinkedList<int, 64> unrolled_list;
    for (size_t i = 0; i < element_count; ++i) {
        unrolled_list.push_back(static_cast<int>(i));
    }
    
    start = std::chrono::high_resolution_clock::now();
    long long unrolled_sum = 0;
    for (int val : unrolled_list) {
        unrolled_sum += val;
    }
    end = std::chrono::high_resolution_clock::now();
    auto unrolled_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    
    // Print the sums so the compiler cannot optimize the traversal loops away
    std::cout << "Checksums: " << standard_sum << ", " << unrolled_sum << "\n";
    std::cout << "Standard list: " << standard_time.count() << " μs\n";
    std::cout << "Unrolled list: " << unrolled_time.count() << " μs\n";
    std::cout << "Speedup: " << static_cast<double>(standard_time.count()) / unrolled_time.count() << "x\n";
}

On typical hardware with 1 million elements, expect the unrolled variant to be 3-5x faster for pure traversal. The gap widens as element count grows and the standard list’s nodes become increasingly scattered across memory.

The key complexity in unrolled linked lists comes from maintaining reasonable block occupancy. When a block overflows, you split it. When blocks become too sparse, you merge them.

Insertion with Block Splitting

template <typename T, size_t BlockSize>
void UnrolledLinkedList<T, BlockSize>::insert(size_t index, const T& value) {
    if (index > size_) {
        throw std::out_of_range("Index out of bounds");
    }
    
    // Find the node and local index
    UnrolledNode<T, BlockSize>* node = head_;
    size_t remaining = index;
    
    while (node && remaining >= node->count) {
        remaining -= node->count;
        node = node->next;
    }
    
    if (!node) {
        // Inserting at the end: append to the tail, or create the first node
        node = tail_ ? tail_ : create_node();
        remaining = node->count;   // Local position just past the last element
    }
    
    // Check if node is full
    if (node->count == BlockSize) {
        // Split the node
        auto* new_node = create_node();
        size_t mid = BlockSize / 2;
        
        // Move second half to new node
        for (size_t i = mid; i < BlockSize; ++i) {
            new_node->elements[i - mid] = std::move(node->elements[i]);
        }
        new_node->count = BlockSize - mid;
        node->count = mid;
        
        // Link new node
        new_node->next = node->next;
        new_node->prev = node;
        if (node->next) node->next->prev = new_node;
        node->next = new_node;
        if (node == tail_) tail_ = new_node;
        
        // Determine which node gets the insertion
        if (remaining > mid) {
            node = new_node;
            remaining -= mid;
        }
    }
    
    // Shift elements and insert
    for (size_t i = node->count; i > remaining; --i) {
        node->elements[i] = std::move(node->elements[i - 1]);
    }
    node->elements[remaining] = value;
    node->count++;
    size_++;
}

Deletion with Merging

template <typename T, size_t BlockSize>
void UnrolledLinkedList<T, BlockSize>::erase(size_t index) {
    if (index >= size_) {
        throw std::out_of_range("Index out of bounds");
    }
    
    UnrolledNode<T, BlockSize>* node = head_;
    size_t remaining = index;
    
    while (remaining >= node->count) {
        remaining -= node->count;
        node = node->next;
    }
    
    // Shift elements left
    for (size_t i = remaining; i < node->count - 1; ++i) {
        node->elements[i] = std::move(node->elements[i + 1]);
    }
    node->count--;
    size_--;
    
    // Merge with next node if too sparse
    constexpr size_t merge_threshold = BlockSize / 4;
    if (node->count < merge_threshold && node->next) {
        auto* next = node->next;
        
        if (node->count + next->count <= BlockSize) {
            // Merge next into current
            for (size_t i = 0; i < next->count; ++i) {
                node->elements[node->count + i] = std::move(next->elements[i]);
            }
            node->count += next->count;
            
            // Unlink and delete next
            node->next = next->next;
            if (next->next) next->next->prev = node;
            if (next == tail_) tail_ = node;
            delete next;
        }
    }
    
    // Remove empty nodes
    if (node->count == 0 && node != head_) {
        if (node->prev) node->prev->next = node->next;
        if (node->next) node->next->prev = node->prev;
        if (node == tail_) tail_ = node->prev;
        delete node;
    }
}

Choosing the Right Block Size

Block size selection significantly impacts performance. Consider these factors:

Cache line alignment: On x86-64, cache lines are 64 bytes. If your elements are 8 bytes each, 8 elements per block fits perfectly. However, you also need space for the count and pointers, so practical block sizes are often slightly larger.

Element size: Larger elements mean fewer should fit per block. For 64-byte structs, even 4-8 elements per block provides substantial benefit over standard linked lists.

Operation mix: Read-heavy workloads benefit from larger blocks (more cache hits per traversal). Write-heavy workloads prefer smaller blocks (less shifting on insert/delete).

General guidelines:

  • For small elements (≤8 bytes): 32-64 elements per block
  • For medium elements (8-64 bytes): 8-16 elements per block
  • For large elements (>64 bytes): 4-8 elements per block

Trade-offs and When to Use

Memory overhead: Each node amortizes its pointer and bookkeeping overhead across multiple elements. With 16 elements per node, two 8-byte pointers, and an 8-byte count, you're adding roughly 1.5 bytes of overhead per element (plus any unused slots in partially filled blocks), versus ~16 bytes of pointer overhead per element for a standard doubly-linked list.

Insertion complexity: O(n) in the worst case for indexed insertion (finding the node), but only O(BlockSize) for the actual insertion within a node.

Best use cases:

  • Text editors (storing lines or characters)
  • Database result cursors
  • Undo/redo stacks with frequent iteration
  • Any ordered collection with sequential access patterns

Avoid when:

  • Random access dominates (use a vector)
  • Insertions always happen at known endpoints (use a deque)
  • Elements are huge and blocks would be tiny anyway

Complete Implementation

#include <stdexcept>
#include <iterator>

template <typename T, size_t BlockSize = 16>
class UnrolledLinkedList {
private:
    struct Node {
        T elements[BlockSize];
        size_t count = 0;
        Node* next = nullptr;
        Node* prev = nullptr;
    };
    
    Node* head_ = nullptr;
    Node* tail_ = nullptr;
    size_t size_ = 0;
    
    Node* create_node() {
        auto* node = new Node();
        if (!head_) {
            head_ = tail_ = node;
        }
        return node;
    }

public:
    class iterator {
        Node* node_;
        size_t index_;
    public:
        using iterator_category = std::forward_iterator_tag;
        using value_type = T;
        using difference_type = std::ptrdiff_t;
        using pointer = T*;
        using reference = T&;
        
        iterator(Node* n, size_t i) : node_(n), index_(i) {}
        
        reference operator*() { return node_->elements[index_]; }
        pointer operator->() { return &node_->elements[index_]; }
        
        iterator& operator++() {
            if (++index_ >= node_->count) {
                node_ = node_->next;
                index_ = 0;
            }
            return *this;
        }
        
        iterator operator++(int) {  // Post-increment, required for forward iterators
            iterator tmp = *this;
            ++(*this);
            return tmp;
        }
        
        bool operator==(const iterator& other) const {
            return node_ == other.node_ && index_ == other.index_;
        }
        bool operator!=(const iterator& other) const {
            return !(*this == other);
        }
    };
    
    iterator begin() { return iterator(head_, 0); }
    iterator end() { return iterator(nullptr, 0); }
    
    void push_back(const T& value) {
        if (!tail_ || tail_->count == BlockSize) {
            auto* node = create_node();
            if (tail_ && tail_ != node) {
                tail_->next = node;
                node->prev = tail_;
                tail_ = node;
            }
        }
        tail_->elements[tail_->count++] = value;
        size_++;
    }
    
    size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }
    
    ~UnrolledLinkedList() {
        while (head_) {
            Node* next = head_->next;
            delete head_;
            head_ = next;
        }
    }
};

Unrolled linked lists won’t replace vectors or standard linked lists entirely. But when you need the flexibility of a linked structure without sacrificing cache performance, they’re an excellent tool to have in your arsenal.
