Square Root Decomposition: Block-Based Queries
Key Insights
- Square root decomposition divides arrays into √n blocks, achieving O(√n) query and update times with dramatically simpler implementation than segment trees
- The technique shines when you need a quick, maintainable solution and can tolerate slightly worse asymptotic complexity than logarithmic alternatives
- Mo’s algorithm extends this concept to answer offline range queries in O((n + q)√n) time by cleverly ordering queries to minimize pointer movement
Introduction to Square Root Decomposition
Square root decomposition is one of those techniques that feels almost too simple to be useful—until you realize it solves a surprisingly wide range of problems with minimal implementation overhead. The core idea is deceptively straightforward: divide your data into √n blocks, precompute aggregate values for each block, and leverage this structure to answer queries efficiently.
Why √n? This block size creates an optimal balance. With √n blocks of √n elements each, you’ll never traverse more than √n complete blocks, and you’ll never process more than 2√n individual elements at block boundaries. This gives you O(√n) operations per query—not as fast as segment trees’ O(log n), but often fast enough and considerably easier to implement correctly.
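This balance is easy to check numerically. Here is a tiny sketch of the per-query cost model (up to n/B complete blocks plus up to 2B boundary elements; the exact constants are assumptions and don't change the shape of the curve):

```cpp
#include <cassert>
#include <cmath>

// Per-query work under the model above: up to n/B complete blocks
// plus up to 2B boundary elements. Any B within a constant factor
// of sqrt(n) keeps the total in O(sqrt(n)).
double query_cost(int n, int B) {
    return (double)n / B + 2.0 * B;
}
```

For n = 10,000, a block size near 100 beats both much smaller and much larger choices, which is exactly the √n trade-off described above.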
You’ll find this technique particularly useful for range sum queries, range minimum/maximum queries, frequency counting problems, and as the foundation for Mo’s algorithm. When you’re in a competitive programming contest or prototyping a solution, sqrt decomposition often gets you to a working answer faster than more sophisticated alternatives.
The Core Concept: Dividing Data into Blocks
The fundamental operation is partitioning your array into contiguous blocks. Each block stores a precomputed aggregate value—a sum, minimum, maximum, or whatever operation your problem requires. When you need to answer a query spanning multiple blocks, you combine the precomputed block values with individual element processing at the boundaries.
Here’s how to set up the basic structure for range sum queries:
```cpp
#include <algorithm>  // std::max, std::min
#include <cmath>
#include <vector>

class SqrtDecomposition {
private:
    std::vector<int> arr;
    std::vector<long long> block_sum;
    int block_size;
    int num_blocks;

public:
    SqrtDecomposition(const std::vector<int>& input) : arr(input) {
        int n = arr.size();
        block_size = std::max(1, (int)std::sqrt(n));
        num_blocks = (n + block_size - 1) / block_size;  // ceiling division
        block_sum.assign(num_blocks, 0);
        // Precompute block sums
        for (int i = 0; i < n; i++) {
            block_sum[i / block_size] += arr[i];
        }
    }

    int get_block(int index) const {
        return index / block_size;
    }

    int block_start(int block) const {
        return block * block_size;
    }

    // One past the last index of the block (the last block may be short)
    int block_end(int block) const {
        return std::min((block + 1) * block_size, (int)arr.size());
    }
};
```
The key insight here is the index-to-block mapping: block_index = element_index / block_size. This simple division operation lets you instantly determine which block any element belongs to. The precomputation loop runs in O(n) time, and you store O(√n) additional block values.
Notice the ceiling division for num_blocks—this handles arrays whose size isn’t a perfect square. The last block may contain fewer than block_size elements, which is perfectly fine.
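As a quick sanity check on that mapping, here is a standalone sketch for n = 10 and block_size = 3 (free-function versions of the helpers above, for illustration only):

```cpp
#include <algorithm>
#include <cassert>

// For n = 10 and block_size = 3, blocks cover indices
// [0,2], [3,5], [6,8], [9,9]; the last block holds one element.
int get_block(int index, int block_size) { return index / block_size; }
int block_start(int block, int block_size) { return block * block_size; }
int block_end(int block, int block_size, int n) {
    return std::min((block + 1) * block_size, n);
}
```

Ceiling division gives (10 + 2) / 3 = 4 blocks, and the last one is clipped to the array's end.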
Query Processing: Combining Blocks and Elements
Answering a range query requires handling three cases: the partial left block, complete middle blocks, and the partial right block. The algorithm identifies which blocks are fully contained in the query range and uses their precomputed values directly, while iterating through individual elements only at the boundaries.
```cpp
long long range_sum(int left, int right) const {
    long long result = 0;
    int left_block = get_block(left);
    int right_block = get_block(right);

    if (left_block == right_block) {
        // Query falls within a single block
        for (int i = left; i <= right; i++) {
            result += arr[i];
        }
        return result;
    }

    // Process partial left block
    for (int i = left; i < block_end(left_block); i++) {
        result += arr[i];
    }
    // Process complete middle blocks
    for (int b = left_block + 1; b < right_block; b++) {
        result += block_sum[b];
    }
    // Process partial right block
    for (int i = block_start(right_block); i <= right; i++) {
        result += arr[i];
    }
    return result;
}
```
The single-block case is important to handle separately: it prevents off-by-one errors when both endpoints fall in the same block. For the general case, you process at most block_size elements on the left, at most block_size elements on the right, and at most num_blocks - 2 complete blocks in the middle. This guarantees O(√n) time per query.
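To see the three cases in action, here is a compact, self-contained version of the same query, written as free functions (block_range_sum and build_block_sums are illustrative names, not members of the class above):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Precompute one sum per block of size bs.
std::vector<long long> build_block_sums(const std::vector<int>& arr, int bs) {
    std::vector<long long> sums(((int)arr.size() + bs - 1) / bs, 0);
    for (int i = 0; i < (int)arr.size(); i++) sums[i / bs] += arr[i];
    return sums;
}

// Sum of arr[left..right] using block sums for complete middle blocks.
long long block_range_sum(const std::vector<int>& arr,
                          const std::vector<long long>& block_sum,
                          int bs, int left, int right) {
    int lb = left / bs, rb = right / bs;
    long long result = 0;
    if (lb == rb) {                        // single-block case
        for (int i = left; i <= right; i++) result += arr[i];
        return result;
    }
    int left_end = std::min((lb + 1) * bs, (int)arr.size());
    for (int i = left; i < left_end; i++) result += arr[i];    // partial left
    for (int b = lb + 1; b < rb; b++) result += block_sum[b];  // full middles
    for (int i = rb * bs; i <= right; i++) result += arr[i];   // partial right
    return result;
}
```

A query like (2, 8) on a ten-element array with bs = 3 exercises all three cases at once: one boundary element on the left, one complete middle block, and three boundary elements on the right.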
Handling Updates: Point vs. Range Modifications
Point updates are straightforward: modify the element and update the affected block's aggregate. For invertible aggregates like sums, the adjustment is O(1), as below; for operations like minimum or maximum, you may need to rescan the whole block, which costs O(√n) in the worst case.
```cpp
void point_update(int index, int new_value) {
    int block = get_block(index);
    // Update the block sum
    block_sum[block] -= arr[index];
    block_sum[block] += new_value;
    // Update the array
    arr[index] = new_value;
}
```
For operations where you can compute the delta (like addition), the update becomes O(1):
```cpp
void point_add(int index, int delta) {
    arr[index] += delta;
    block_sum[get_block(index)] += delta;
}
```
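A quick standalone check that the delta form keeps the array and its block aggregate consistent (free-function variant for illustration; block size and data are arbitrary):

```cpp
#include <cassert>
#include <vector>

// Delta-style point update on a flat array plus its block sums.
void point_add(std::vector<int>& arr, std::vector<long long>& block_sum,
               int bs, int index, int delta) {
    arr[index] += delta;
    block_sum[index / bs] += delta;
}
```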
Range updates require more thought. One approach uses lazy propagation at the block level—store a pending update value for each block and apply it during queries:
```cpp
// Pending per-block addition (initialize to num_blocks zeros in the
// constructor); individual element reads must add this value back in.
std::vector<int> block_lazy;

void range_add(int left, int right, int delta) {
    int left_block = get_block(left);
    int right_block = get_block(right);

    if (left_block == right_block) {
        for (int i = left; i <= right; i++) {
            arr[i] += delta;
        }
        block_sum[left_block] += (long long)delta * (right - left + 1);
        return;
    }

    // Partial left block - update individually
    for (int i = left; i < block_end(left_block); i++) {
        arr[i] += delta;
    }
    block_sum[left_block] += (long long)delta * (block_end(left_block) - left);

    // Complete middle blocks - use lazy propagation
    for (int b = left_block + 1; b < right_block; b++) {
        block_lazy[b] += delta;
        block_sum[b] += (long long)delta * block_size;
    }

    // Partial right block - update individually
    for (int i = block_start(right_block); i <= right; i++) {
        arr[i] += delta;
    }
    block_sum[right_block] += (long long)delta * (right - block_start(right_block) + 1);
}
```
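One consequence worth spelling out: once a block carries a nonzero lazy tag, arr[i] alone is stale for elements of that block, so every individual-element read (the single-block case and the boundary loops of range_sum, as well as point operations) must add block_lazy[get_block(i)]. Here is a self-contained sketch of the adjusted query side (the LazyBlocks name and element_at helper are illustrative, not part of the class above):

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct LazyBlocks {
    std::vector<int> arr;
    std::vector<long long> block_sum;
    std::vector<int> block_lazy;
    int block_size;

    LazyBlocks(std::vector<int> input, int bs)
        : arr(std::move(input)), block_size(bs) {
        int num_blocks = ((int)arr.size() + bs - 1) / bs;
        block_sum.assign(num_blocks, 0);
        block_lazy.assign(num_blocks, 0);
        for (int i = 0; i < (int)arr.size(); i++) block_sum[i / bs] += arr[i];
    }

    int get_block(int i) const { return i / block_size; }
    int block_start(int b) const { return b * block_size; }
    int block_end(int b) const {
        return std::min((b + 1) * block_size, (int)arr.size());
    }

    // True value of an element: stored value plus its block's pending delta.
    long long element_at(int i) const {
        return arr[i] + block_lazy[get_block(i)];
    }

    void range_add(int left, int right, int delta) {
        int lb = get_block(left), rb = get_block(right);
        if (lb == rb) {
            for (int i = left; i <= right; i++) arr[i] += delta;
            block_sum[lb] += (long long)delta * (right - left + 1);
            return;
        }
        for (int i = left; i < block_end(lb); i++) arr[i] += delta;
        block_sum[lb] += (long long)delta * (block_end(lb) - left);
        for (int b = lb + 1; b < rb; b++) {
            block_lazy[b] += delta;
            block_sum[b] += (long long)delta * block_size;
        }
        for (int i = block_start(rb); i <= right; i++) arr[i] += delta;
        block_sum[rb] += (long long)delta * (right - block_start(rb) + 1);
    }

    // Boundary elements are read through element_at, not arr[] directly,
    // so lazy-tagged blocks never return stale values.
    long long range_sum(int left, int right) const {
        int lb = get_block(left), rb = get_block(right);
        long long result = 0;
        if (lb == rb) {
            for (int i = left; i <= right; i++) result += element_at(i);
            return result;
        }
        for (int i = left; i < block_end(lb); i++) result += element_at(i);
        for (int b = lb + 1; b < rb; b++) result += block_sum[b];
        for (int i = block_start(rb); i <= right; i++) result += element_at(i);
        return result;
    }
};
```

An alternative design pushes the lazy value down into arr whenever a partial update touches a tagged block; reading through element_at is simpler but adds one lookup per boundary element.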
Classic Application: Mo’s Algorithm for Offline Queries
Mo’s algorithm is perhaps the most elegant application of sqrt decomposition. Given multiple range queries that can be answered offline (all at once, not necessarily in order), you sort them cleverly to minimize the total work of maintaining a sliding window.
The sorting criterion: order queries by their left endpoint’s block, breaking ties by right endpoint. This ensures that as you process queries, the right pointer moves at most O(n) times per block of left endpoints, giving O(n√n) total right pointer movement.
```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Query {
    int left, right, index;
};

std::vector<int> mo_distinct_count(const std::vector<int>& arr,
                                   std::vector<Query>& queries) {
    int n = arr.size();
    int q = queries.size();
    int block_size = std::max(1, (int)std::sqrt(n));

    // Sort queries using Mo's ordering
    std::sort(queries.begin(), queries.end(),
              [block_size](const Query& a, const Query& b) {
                  int block_a = a.left / block_size;
                  int block_b = b.left / block_size;
                  if (block_a != block_b) return block_a < block_b;
                  // Optimization: alternate direction for even/odd blocks
                  return (block_a & 1) ? (a.right > b.right)
                                       : (a.right < b.right);
              });

    std::vector<int> freq(100001, 0);  // Frequency array; assumes values <= 100000
    std::vector<int> answers(q);
    int distinct = 0;
    int curr_left = 0, curr_right = -1;

    auto add = [&](int idx) {
        if (freq[arr[idx]]++ == 0) distinct++;
    };
    auto remove = [&](int idx) {
        if (--freq[arr[idx]] == 0) distinct--;
    };

    for (const auto& query : queries) {
        while (curr_right < query.right) add(++curr_right);
        while (curr_right > query.right) remove(curr_right--);
        while (curr_left < query.left) remove(curr_left++);
        while (curr_left > query.left) add(--curr_left);
        answers[query.index] = distinct;
    }
    return answers;
}
```
The alternating direction optimization (sorting odd blocks by decreasing right endpoint) reduces constant factors significantly by avoiding the right pointer’s return journey.
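A minimal end-to-end run makes the pointer movement concrete. Below is a compact, self-contained variant of the routine above (smaller frequency array, values assumed below 100; distinct_counts is an illustrative name):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Query { int left, right, index; };

// Compact Mo's-algorithm driver for "number of distinct values in [l, r]".
std::vector<int> distinct_counts(const std::vector<int>& arr,
                                 std::vector<Query> queries) {
    int n = arr.size(), q = queries.size();
    int bs = std::max(1, (int)std::sqrt(n));
    std::sort(queries.begin(), queries.end(),
              [bs](const Query& a, const Query& b) {
                  int ba = a.left / bs, bb = b.left / bs;
                  if (ba != bb) return ba < bb;
                  return (ba & 1) ? (a.right > b.right) : (a.right < b.right);
              });
    std::vector<int> freq(100, 0), answers(q);
    int distinct = 0, l = 0, r = -1;
    auto add = [&](int i) { if (freq[arr[i]]++ == 0) distinct++; };
    auto rem = [&](int i) { if (--freq[arr[i]] == 0) distinct--; };
    for (const auto& query : queries) {
        while (r < query.right) add(++r);
        while (r > query.right) rem(r--);
        while (l < query.left) rem(l++);
        while (l > query.left) add(--l);
        answers[query.index] = distinct;
    }
    return answers;
}
```

Each answer lands in its original slot via the stored index, so callers see results in submission order even though processing order was reshuffled.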
Complexity Analysis and Trade-offs
Let’s be precise about the complexity:
- Preprocessing: O(n) time, O(√n) space for block aggregates
- Query: O(1) for a single element, O(√n) for a range
- Point update: O(1) to O(√n) depending on whether you can compute deltas
- Range update: O(√n) with lazy propagation
- Mo’s algorithm: O((n + q)√n) for q queries
Compared to segment trees (O(log n) per operation) or Fenwick trees (O(log n) for prefix operations), sqrt decomposition is asymptotically slower. However, it wins in several scenarios:
- Implementation speed: You can code sqrt decomposition correctly in 5 minutes; segment trees take longer and have more edge cases
- Memory locality: Blocks are contiguous, which can improve cache performance
- Certain update patterns: When updates affect entire blocks uniformly, lazy propagation at the block level is simpler
- Mo’s algorithm problems: No tree structure handles offline range queries as elegantly
Practical Considerations and Optimizations
The theoretically optimal block size of √n isn't always best in practice. If queries far outnumber updates and each update must rescan its block, slightly larger blocks cut the number of block sums a query combines; if updates dominate, smaller blocks minimize recomputation. One heuristic under those cost assumptions:
```cpp
int optimal_block_size(int n, int num_queries, int num_updates) {
    if (num_updates == 0) {
        // No updates at all: prefix sums (O(1) queries) beat any block
        // scheme; within this scheme, sqrt(n) remains the best choice.
        return std::max(1, (int)std::sqrt((double)n));
    }
    // Model: each query scans ~n/block_size block sums, each update
    // rescans one block (~block_size work). Minimizing
    // num_queries * n / B + num_updates * B gives B = sqrt(n * ratio).
    double ratio = (double)num_queries / num_updates;
    int block_size = (int)std::sqrt(n * ratio);
    // Clamp to reasonable bounds
    return std::max(1, std::min(block_size, n));
}
```
For cache efficiency, ensure your block size aligns with cache line sizes when possible. On modern CPUs, 64-byte cache lines mean blocks of sixteen 4-byte integers fit exactly. If √n is close to a cache-friendly size, round to it.
Edge cases to watch: empty arrays, single-element arrays, and query ranges that span the entire array. Always validate that left <= right and both are within bounds before processing.
Square root decomposition won’t win any asymptotic complexity contests, but it’s a reliable tool that solves real problems with minimal fuss. Master it, and you’ll find yourself reaching for it whenever you need a quick, correct solution to range query problems.