Persistent Segment Tree: Versioned Range Queries
Consider building a collaborative text editor where users can undo to any previous state. Or a database that answers queries like 'what was the sum of values in range [l, r] at timestamp T?' Or a...
Key Insights
- Persistent segment trees achieve O(log n) updates while preserving all historical versions by copying only the O(log n) nodes along the modified path, sharing unchanged subtrees between versions.
- The key implementation shift from standard segment trees is replacing implicit array indexing with explicit node pointers, enabling structural sharing across versions.
- This data structure unlocks powerful query patterns like “k-th smallest in range” and time-travel queries that would otherwise require O(n) space per version.
Introduction & Motivation
Consider building a collaborative text editor where users can undo to any previous state. Or a database that answers queries like “what was the sum of values in range [l, r] at timestamp T?” Or a competitive programming problem asking for the k-th smallest element in any subarray.
The naive approach—copying the entire data structure for each modification—costs O(n) time and space per update. With millions of operations, this becomes untenable.
Persistent data structures solve this elegantly. They preserve all previous versions while maintaining efficient operations. The persistent segment tree, specifically, gives us O(log n) updates and queries across any version, with only O(log n) additional space per update.
This isn’t theoretical elegance for its own sake. Persistent segment trees appear constantly in competitive programming, power features in production databases, and provide the foundation for understanding functional data structures more broadly.
Segment Tree Refresher
A segment tree is a binary tree where each node represents a range of the underlying array. The root covers [0, n-1], its children cover [0, mid] and [mid+1, n-1], and leaves represent individual elements.
This structure supports range queries and point updates in O(log n) time. Here’s a standard implementation for range sum queries:
class SegmentTree {
vector<long long> tree;
int n;
void build(const vector<int>& arr, int node, int start, int end) {
if (start == end) {
tree[node] = arr[start];
return;
}
int mid = (start + end) / 2;
build(arr, 2 * node, start, mid);
build(arr, 2 * node + 1, mid + 1, end);
tree[node] = tree[2 * node] + tree[2 * node + 1];
}
void update(int node, int start, int end, int idx, int val) {
if (start == end) {
tree[node] = val;
return;
}
int mid = (start + end) / 2;
if (idx <= mid)
update(2 * node, start, mid, idx, val);
else
update(2 * node + 1, mid + 1, end, idx, val);
tree[node] = tree[2 * node] + tree[2 * node + 1];
}
long long query(int node, int start, int end, int l, int r) {
if (r < start || end < l) return 0;
if (l <= start && end <= r) return tree[node];
int mid = (start + end) / 2;
return query(2 * node, start, mid, l, r) +
query(2 * node + 1, mid + 1, end, l, r);
}
public:
SegmentTree(const vector<int>& arr) : n(arr.size()), tree(4 * arr.size()) {
build(arr, 1, 0, n - 1);
}
void update(int idx, int val) { update(1, 0, n - 1, idx, val); }
long long query(int l, int r) { return query(1, 0, n - 1, l, r); }
};
This uses implicit indexing: node i has children at 2*i and 2*i+1. Simple and cache-friendly, but it doesn’t support persistence because we’re modifying nodes in place.
The Persistence Concept: Path Copying
Here’s the critical observation: when we update a single element, only O(log n) nodes change—those along the path from root to the target leaf. Every other node remains identical.
Path copying exploits this. Instead of modifying nodes in place, we create new nodes for the entire root-to-leaf path. Crucially, these new nodes point to the unchanged children from the previous version.
Visualize updating index 2 in a tree covering [0, 7]:
Version 0: Version 1 (after update):
[A] [A'] (new root)
/ \ / \
[B] [C] [B'] [C] (C shared!)
/ \ / \ / \ / \
[D][E][F][G] [D][E'] [F][G] (D shared!)
|
(new leaf for index 2)
Version 1 creates only 3 new nodes (A’, B’, E’). Nodes C, D, F, G are shared with Version 0. Both versions remain fully accessible—query Version 0 starting from A, query Version 1 starting from A'.
Implementation Deep Dive
The shift to persistence requires explicit pointers instead of implicit array indexing. Each node stores pointers to its children, and we maintain an array of root pointers for each version.
class PersistentSegmentTree {
struct Node {
long long sum;
Node* left;
Node* right;
Node(long long s = 0, Node* l = nullptr, Node* r = nullptr)
: sum(s), left(l), right(r) {}
};
vector<Node*> roots; // roots[v] = root of version v
int n;
Node* build(const vector<int>& arr, int start, int end) {
if (start == end) {
return new Node(arr[start]);
}
int mid = (start + end) / 2;
Node* leftChild = build(arr, start, mid);
Node* rightChild = build(arr, mid + 1, end);
return new Node(leftChild->sum + rightChild->sum, leftChild, rightChild);
}
Node* update(Node* node, int start, int end, int idx, int val) {
if (start == end) {
return new Node(val); // New leaf with updated value
}
int mid = (start + end) / 2;
if (idx <= mid) {
// Update left subtree, reuse right subtree
Node* newLeft = update(node->left, start, mid, idx, val);
return new Node(newLeft->sum + node->right->sum, newLeft, node->right);
} else {
// Reuse left subtree, update right subtree
Node* newRight = update(node->right, mid + 1, end, idx, val);
return new Node(node->left->sum + newRight->sum, node->left, newRight);
}
}
long long query(Node* node, int start, int end, int l, int r) {
if (!node || r < start || end < l) return 0;
if (l <= start && end <= r) return node->sum;
int mid = (start + end) / 2;
return query(node->left, start, mid, l, r) +
query(node->right, mid + 1, end, l, r);
}
public:
PersistentSegmentTree(const vector<int>& arr) : n(arr.size()) {
roots.push_back(build(arr, 0, n - 1));
}
// Creates new version, returns version number
int update(int version, int idx, int val) {
Node* newRoot = update(roots[version], 0, n - 1, idx, val);
roots.push_back(newRoot);
return roots.size() - 1;
}
long long query(int version, int l, int r) {
return query(roots[version], 0, n - 1, l, r);
}
int latestVersion() { return roots.size() - 1; }
};
Notice how update returns a new Node* at each level. The unchanged subtree pointer is simply copied to the new node—no deep copying required.
Complexity Analysis
Time complexity remains O(log n) for both updates and queries. We traverse the same path as the ephemeral version; we just allocate nodes instead of modifying them.
Space complexity is where persistence shines:
- Initial build: O(n) nodes
- Each update: O(log n) new nodes
- After k updates: O(n + k log n) total nodes
Compare this to naive full copying:
- Each update: O(n) new nodes
- After k updates: O(kn) total nodes
For 100,000 updates on an array of size 100,000, persistence uses roughly 1.7 million nodes versus 10 billion for naive copying. That’s a 6000x improvement.
Practical Applications
K-th Smallest Element in Range
This classic problem asks: given an array, answer queries “what is the k-th smallest element in arr[l..r]?”
The solution uses a persistent segment tree over values (not indices). We process the array left to right, and version i represents a segment tree counting occurrences of each value in arr[0..i-1].
To query range [l, r], we subtract the tree at version l from the tree at version r+1 and binary search for the k-th element:
class KthSmallest {
struct Node {
int count;
Node *left, *right;
Node(int c = 0, Node* l = nullptr, Node* r = nullptr)
: count(c), left(l), right(r) {}
};
vector<Node*> roots;
vector<int> sorted; // Coordinate-compressed values
Node* build(int start, int end) {
if (start == end) return new Node(0);
int mid = (start + end) / 2;
return new Node(0, build(start, mid), build(mid + 1, end));
}
Node* update(Node* node, int start, int end, int idx) {
if (start == end) return new Node(node->count + 1);
int mid = (start + end) / 2;
if (idx <= mid)
return new Node(node->count + 1,
update(node->left, start, mid, idx), node->right);
else
return new Node(node->count + 1,
node->left, update(node->right, mid + 1, end, idx));
}
int query(Node* lo, Node* hi, int start, int end, int k) {
if (start == end) return start;
int mid = (start + end) / 2;
int leftCount = hi->left->count - lo->left->count;
if (k <= leftCount)
return query(lo->left, hi->left, start, mid, k);
else
return query(lo->right, hi->right, mid + 1, end, k - leftCount);
}
public:
KthSmallest(const vector<int>& arr) {
sorted = arr;
sort(sorted.begin(), sorted.end());
sorted.erase(unique(sorted.begin(), sorted.end()), sorted.end());
int m = sorted.size();
roots.push_back(build(0, m - 1));
for (int x : arr) {
int idx = lower_bound(sorted.begin(), sorted.end(), x) - sorted.begin();
roots.push_back(update(roots.back(), 0, m - 1, idx));
}
}
int query(int l, int r, int k) { // 0-indexed, k is 1-indexed
int idx = query(roots[l], roots[r + 1], 0, sorted.size() - 1, k);
return sorted[idx];
}
};
This achieves O(n log n) preprocessing and O(log n) per query—optimal for this problem.
Trade-offs & Alternatives
Persistent segment trees aren’t free. Consider these costs:
Memory overhead: Each node requires two pointers (16 bytes on 64-bit systems) plus the stored value. This 3-4x overhead versus array-based trees matters for memory-constrained environments.
Garbage collection: In languages with GC, old versions accumulate. In C++, you’re responsible for cleanup—which is tricky when nodes are shared across versions. Often, you simply accept the memory leak for competitive programming or implement reference counting for production code.
Cache performance: Pointer-chasing is inherently less cache-friendly than array traversal. For small trees or when you don’t need persistence, stick with the implicit array representation.
Alternatives to consider:
- Rope data structures: Better for string manipulation with version history
- Functional arrays: When you need random access but not range queries
- Partial persistence: If you only need to query old versions (not update them), simpler techniques exist
The distinction between partial and full persistence matters: partial persistence allows queries on any version but updates only on the latest. Full persistence allows updates on any version, creating a version tree rather than a version list. The implementation above is fully persistent.
Persistent segment trees occupy a sweet spot: they’re complex enough to be powerful but simple enough to implement reliably under time pressure. Master this technique, and you’ll find applications everywhere—from competitive programming contests to building your own version-controlled data systems.