R-Tree: Spatial Indexing Structure
Key Insights
- R-trees organize spatial data using hierarchical Minimum Bounding Rectangles (MBRs), enabling efficient multi-dimensional queries that would require full table scans with traditional B-tree indexes.
- The critical performance factor in R-trees is minimizing MBR overlap between sibling nodes—the R*-tree variant addresses this through forced reinsertions and improved split algorithms.
- For production systems, leverage existing implementations in PostGIS, MongoDB, or SQLite rather than building custom R-trees; understand the internals to optimize query patterns and bulk loading strategies.
Why Traditional Indexes Fail for Spatial Data
B-trees excel at one-dimensional ordering. They can efficiently answer “find all records where created_at is between January and March” because dates have a natural linear order. But ask a B-tree to find “all restaurants within 2 kilometers of my location” and it falls apart.
The fundamental problem: spatial data is inherently multi-dimensional. A location has both latitude and longitude. A rectangle has four coordinates. You can’t meaningfully sort these into a single sequence that preserves spatial proximity. Two points adjacent in a sorted list might be on opposite sides of a city.
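A tiny illustration of the problem (points and coordinates chosen for the example): sorting by one coordinate says nothing about two-dimensional proximity.

```python
import math

def dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Three points sorted by x-coordinate
points = sorted([(0.0, 0.0), (0.1, 9.0), (0.2, 0.1)])

# (0.0, 0.0) and (0.1, 9.0) are adjacent in the sorted order,
# yet (0.2, 0.1) is far closer to (0.0, 0.0) in the plane.
print(dist(points[0], points[1]) > dist(points[0], points[2]))  # True
```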
This matters for countless applications. Maps need to render only visible features. Games must detect collisions between thousands of objects. Ride-sharing apps match drivers to nearby passengers. GIS systems query land parcels intersecting a proposed pipeline route. All of these require indexes designed for spatial relationships.
The R-tree, introduced by Antonin Guttman in 1984, solved this problem by extending B-tree concepts to multiple dimensions. Since then, variants have emerged—R*-trees improve query performance through better node organization, R+-trees eliminate overlap at the cost of duplicating entries, and Hilbert R-trees use space-filling curves for improved packing. The core R-tree concept remains foundational to all of them.
R-Tree Structure and Properties
The R-tree’s central concept is the Minimum Bounding Rectangle (MBR)—the smallest axis-aligned rectangle that completely contains a spatial object. A winding river becomes a simple box. A complex polygon becomes four coordinates. This simplification enables efficient comparisons.
The tree structure mirrors B-trees. Leaf nodes store actual spatial objects (or references to them) along with their MBRs. Internal nodes store MBRs that bound all objects in their subtrees. The root bounds everything in the tree.
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MBR:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)

    def enlargement_needed(self, other: 'MBR') -> float:
        """How much this MBR's area would grow to absorb `other`."""
        merged = self.merge(other)
        return merged.area() - self.area()

    def merge(self, other: 'MBR') -> 'MBR':
        return MBR(
            min(self.min_x, other.min_x),
            min(self.min_y, other.min_y),
            max(self.max_x, other.max_x),
            max(self.max_y, other.max_y),
        )

    def intersects(self, other: 'MBR') -> bool:
        return not (self.max_x < other.min_x or other.max_x < self.min_x or
                    self.max_y < other.min_y or other.max_y < self.min_y)


@dataclass
class RTreeNode:
    entries: List[tuple] = field(default_factory=list)  # (MBR, child_or_data)
    is_leaf: bool = True
    max_entries: int = 4
    min_entries: int = 2

    def is_full(self) -> bool:
        return len(self.entries) >= self.max_entries

    def compute_mbr(self) -> Optional[MBR]:
        if not self.entries:
            return None
        result = self.entries[0][0]
        for mbr, _ in self.entries[1:]:
            result = result.merge(mbr)
        return result
```
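A quick numeric check of the bounding-rectangle arithmetic (restated on plain `(min_x, min_y, max_x, max_y)` tuples so the snippet stands alone):

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def merge(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

a, b = (0, 0, 2, 2), (1, 1, 3, 3)
merged = merge(a, b)                  # (0, 0, 3, 3)
enlargement = area(merged) - area(a)  # 9 - 4 = 5
print(merged, enlargement)
```

This enlargement value is exactly what insertion minimizes when choosing a subtree.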
R-trees maintain balance—all leaves sit at the same depth. Nodes must contain between m and M entries (typically m = M/2), except the root which can have as few as two children. This guarantees logarithmic height relative to the number of entries.
Search Operations
Search exploits the hierarchical MBR structure. For a range query, start at the root and recursively descend into any child whose MBR intersects the search region. Unlike B-trees, multiple paths may need exploration—overlapping MBRs mean the target could exist in several subtrees.
```python
import heapq
import itertools


class RTree:
    def __init__(self, max_entries: int = 4):
        self.root = RTreeNode(max_entries=max_entries)
        self.max_entries = max_entries
        self.min_entries = max_entries // 2

    def search(self, query_mbr: MBR) -> List:
        """Find all entries intersecting the query region."""
        results = []
        self._search_recursive(self.root, query_mbr, results)
        return results

    def _search_recursive(self, node: RTreeNode, query: MBR, results: List):
        for mbr, child_or_data in node.entries:
            if mbr.intersects(query):
                if node.is_leaf:
                    results.append(child_or_data)
                else:
                    self._search_recursive(child_or_data, query, results)

    def nearest_neighbor(self, point_x: float, point_y: float, k: int = 1) -> List:
        """Find k nearest neighbors with a best-first priority queue."""
        # Tie-break equal distances with a counter so the heap never
        # has to compare nodes or data objects directly
        counter = itertools.count()
        # Heap entries: (distance, tiebreak, is_data, node_or_data)
        heap = [(0.0, next(counter), False, self.root)]
        results = []
        while heap and len(results) < k:
            dist, _, is_data, item = heapq.heappop(heap)
            if is_data:
                results.append((dist, item))
            else:
                for mbr, child_or_data in item.entries:
                    min_dist = self._min_distance(point_x, point_y, mbr)
                    heapq.heappush(
                        heap,
                        (min_dist, next(counter), item.is_leaf, child_or_data))
        return results

    def _min_distance(self, px: float, py: float, mbr: MBR) -> float:
        """Smallest possible distance from a point to anything in the MBR."""
        dx = max(mbr.min_x - px, 0, px - mbr.max_x)
        dy = max(mbr.min_y - py, 0, py - mbr.max_y)
        return (dx * dx + dy * dy) ** 0.5
```
Nearest-neighbor search uses a priority queue ordered by minimum possible distance to each MBR. This prunes branches that can’t contain closer results than already found.
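The pruning bound is the point-to-rectangle distance that `_min_distance` computes; a standalone version (tuples for the rectangle, so the snippet runs on its own) shows the three cases: point inside, beside, and diagonal to the box.

```python
def min_distance(px, py, r):
    """Lower bound on the distance from (px, py) to any object
    inside the rectangle r = (min_x, min_y, max_x, max_y)."""
    dx = max(r[0] - px, 0, px - r[2])
    dy = max(r[1] - py, 0, py - r[3])
    return (dx * dx + dy * dy) ** 0.5

box = (0, 0, 2, 2)
print(min_distance(1, 1, box))  # 0.0  (point inside the box)
print(min_distance(5, 1, box))  # 3.0  (directly to the right)
print(min_distance(5, 6, box))  # 5.0  (diagonal: a 3-4-5 triangle)
```

Because this is a lower bound, popping a branch whose bound exceeds the current k-th best distance can never discard a true neighbor.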
Insertion Algorithm
Insertion follows the path of least resistance—specifically, the path requiring minimum MBR enlargement. When a leaf overflows, it splits, potentially propagating splits upward.
```python
def insert(self, mbr: MBR, data) -> None:
    """Insert a new entry into the R-tree."""
    leaf = self._choose_leaf(self.root, mbr)
    leaf.entries.append((mbr, data))
    if leaf.is_full():
        new_node = self._split_node(leaf)
        self._adjust_tree(leaf, new_node)
    else:
        self._adjust_tree(leaf, None)

def _choose_leaf(self, node: RTreeNode, mbr: MBR) -> RTreeNode:
    """Descend to the leaf that needs least enlargement."""
    if node.is_leaf:
        return node
    best_child = None
    best_enlargement = float('inf')
    best_area = float('inf')
    for child_mbr, child_node in node.entries:
        enlargement = child_mbr.enlargement_needed(mbr)
        area = child_mbr.area()
        # Prefer least enlargement; break ties with smaller area
        if enlargement < best_enlargement or \
           (enlargement == best_enlargement and area < best_area):
            best_enlargement = enlargement
            best_area = area
            best_child = child_node
    return self._choose_leaf(best_child, mbr)

def _adjust_tree(self, node: RTreeNode, new_node: Optional[RTreeNode]) -> None:
    """Simplified: grow a new root when the old one splits.
    A full implementation also walks the insertion path, tightening
    ancestor MBRs and propagating splits upward."""
    if new_node is not None and node is self.root:
        new_root = RTreeNode(is_leaf=False, max_entries=self.max_entries)
        new_root.entries = [(node.compute_mbr(), node),
                            (new_node.compute_mbr(), new_node)]
        self.root = new_root
```
```python
def _split_node(self, node: RTreeNode) -> RTreeNode:
    """Quadratic split: pick seeds that waste most area together."""
    entries = node.entries[:]
    # Find the two entries that would waste the most space together
    worst_waste = -float('inf')
    seed1_idx, seed2_idx = 0, 1
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            merged = entries[i][0].merge(entries[j][0])
            waste = merged.area() - entries[i][0].area() - entries[j][0].area()
            if waste > worst_waste:
                worst_waste = waste
                seed1_idx, seed2_idx = i, j
    # Initialize two groups
    group1 = [entries[seed1_idx]]
    group2 = [entries[seed2_idx]]
    remaining = [e for k, e in enumerate(entries)
                 if k not in (seed1_idx, seed2_idx)]
    # Assign remaining entries
    while remaining:
        # If one group needs every remaining entry to reach minimum
        # occupancy, give them all to it
        if len(group1) + len(remaining) <= self.min_entries:
            group1.extend(remaining)
            break
        if len(group2) + len(remaining) <= self.min_entries:
            group2.extend(remaining)
            break
        # Otherwise pick the entry with the strongest preference
        mbr1 = self._compute_group_mbr(group1)
        mbr2 = self._compute_group_mbr(group2)
        best_entry = None
        best_diff = -float('inf')
        best_group = None
        for entry in remaining:
            d1 = mbr1.enlargement_needed(entry[0])
            d2 = mbr2.enlargement_needed(entry[0])
            diff = abs(d1 - d2)
            if diff > best_diff:
                best_diff = diff
                best_entry = entry
                best_group = group1 if d1 < d2 else group2
        best_group.append(best_entry)
        remaining.remove(best_entry)
    node.entries = group1
    new_node = RTreeNode(is_leaf=node.is_leaf, max_entries=self.max_entries)
    new_node.entries = group2
    return new_node

def _compute_group_mbr(self, group: List[tuple]) -> MBR:
    """Bounding rectangle covering every entry in a group."""
    result = group[0][0]
    for mbr, _ in group[1:]:
        result = result.merge(mbr)
    return result
```
The quadratic split algorithm picks two seed entries that would waste the most area if placed together, then assigns remaining entries based on which group they'd enlarge least. Linear split examines each pair only along one dimension, costing O(M) per split instead of O(M²), but produces trees with more overlap and worse query performance.
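The seed-picking metric is easy to see on concrete rectangles (plain tuples here, so the snippet stands alone): distant entries waste far more area when forced under one MBR, so they anchor opposite groups.

```python
def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def waste(a, b):
    """Dead area created by covering both rectangles with one MBR."""
    merged = (min(a[0], b[0]), min(a[1], b[1]),
              max(a[2], b[2]), max(a[3], b[3]))
    return area(merged) - area(a) - area(b)

near = waste((0, 0, 1, 1), (1, 0, 2, 1))    # 2 - 1 - 1 = 0
far = waste((0, 0, 1, 1), (9, 9, 10, 10))   # 100 - 1 - 1 = 98
print(near, far)  # the far pair would be chosen as split seeds
```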
Deletion and Tree Maintenance
Deletion finds the target entry, removes it, and handles underflow by reinserting orphaned entries rather than merging nodes.
```python
def delete(self, mbr: MBR, data) -> bool:
    """Delete an entry and reinsert orphans if underflow occurs."""
    leaf = self._find_leaf(self.root, mbr, data)
    if leaf is None:
        return False
    # Remove the entry
    leaf.entries = [(m, d) for m, d in leaf.entries
                    if not (m == mbr and d == data)]
    orphans = []
    self._condense_tree(leaf, orphans)
    # Reinsert orphaned entries
    for orphan_mbr, orphan_data in orphans:
        self.insert(orphan_mbr, orphan_data)
    return True

def _find_leaf(self, node: RTreeNode, mbr: MBR, data) -> Optional[RTreeNode]:
    """Locate the leaf holding the target entry, exploring every
    subtree whose MBR intersects it."""
    if node.is_leaf:
        for m, d in node.entries:
            if m == mbr and d == data:
                return node
        return None
    for m, child in node.entries:
        if m.intersects(mbr):
            found = self._find_leaf(child, mbr, data)
            if found is not None:
                return found
    return None

def _condense_tree(self, node: RTreeNode, orphans: List):
    """Propagate underflow handling up the tree."""
    # Simplified: a full version tracks parent pointers (or the search
    # path) so the emptied node can be unlinked and ancestor MBRs
    # tightened on the way back up
    if len(node.entries) < self.min_entries and node is not self.root:
        orphans.extend(node.entries)
        node.entries = []  # Mark for removal from parent
```
Reinsertion often produces better tree organization than merging because entries get placed in optimal locations based on current tree state.
Performance and Bulk Loading
Building an R-tree one insertion at a time produces suboptimal structure. Bulk loading techniques construct better trees from known datasets.
```python
import math

def bulk_load_str(entries: List[tuple], max_entries: int = 4) -> RTree:
    """Sort-Tile-Recursive (STR) bulk loading."""
    if not entries:
        return RTree(max_entries=max_entries)
    n = len(entries)
    leaf_count = math.ceil(n / max_entries)
    slice_count = math.ceil(math.sqrt(leaf_count))
    # Sort by x-coordinate of each MBR's center
    entries = sorted(entries, key=lambda e: (e[0].min_x + e[0].max_x) / 2)
    # Divide into vertical slices, sort each by y, pack into leaves
    slice_size = math.ceil(n / slice_count)
    leaves = []
    for i in range(0, n, slice_size):
        slice_entries = sorted(entries[i:i + slice_size],
                               key=lambda e: (e[0].min_y + e[0].max_y) / 2)
        for j in range(0, len(slice_entries), max_entries):
            node = RTreeNode(max_entries=max_entries)
            node.entries = slice_entries[j:j + max_entries]
            leaves.append(node)
    # Build the tree bottom-up
    return _build_levels(leaves, max_entries)

def _build_levels(nodes: List[RTreeNode], max_entries: int) -> RTree:
    """Pack each level of nodes under parents until one root remains."""
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), max_entries):
            parent = RTreeNode(is_leaf=False, max_entries=max_entries)
            parent.entries = [(child.compute_mbr(), child)
                              for child in nodes[i:i + max_entries]]
            parents.append(parent)
        nodes = parents
    tree = RTree(max_entries=max_entries)
    tree.root = nodes[0]
    return tree
```
Sort-Tile-Recursive (STR) sorts entries by one dimension, partitions into slices, sorts each slice by the other dimension, then packs into leaves. This minimizes overlap and produces trees with excellent query performance.
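The slice arithmetic works out like this for, say, 1,000 entries with 50-entry leaves (numbers chosen purely for illustration):

```python
import math

n, M = 1000, 50
leaf_count = math.ceil(n / M)                   # 20 leaves needed
slice_count = math.ceil(math.sqrt(leaf_count))  # 5 vertical slices
slice_size = math.ceil(n / slice_count)         # 200 entries per slice
# Each slice packs into ceil(200 / 50) = 4 leaves,
# giving a roughly square 5 x 4 tiling of the plane.
print(leaf_count, slice_count, slice_size)
```

The near-square tiling is the point: leaves cover compact, similar-sized regions, which keeps MBR overlap low.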
Practical Usage with PostGIS
In production, use battle-tested implementations. PostGIS provides R-tree indexes via GiST:
```sql
-- Create spatial table
CREATE TABLE restaurants (
    id SERIAL PRIMARY KEY,
    name TEXT,
    location GEOGRAPHY(POINT, 4326)
);

-- Create spatial index (uses R-tree internally)
CREATE INDEX idx_restaurants_location
    ON restaurants USING GIST (location);

-- Find restaurants within 2 km
SELECT name,
       ST_Distance(location,
                   ST_MakePoint(-122.4194, 37.7749)::geography) AS dist
FROM restaurants
WHERE ST_DWithin(
    location,
    ST_MakePoint(-122.4194, 37.7749)::geography,
    2000  -- meters
)
ORDER BY dist;

-- Analyze index usage
EXPLAIN ANALYZE SELECT * FROM restaurants
WHERE ST_DWithin(location, ST_MakePoint(-122.4194, 37.7749)::geography, 2000);
```
Choose R-trees when you need range queries on bounding boxes or complex geometries. Use quadtrees for point-heavy datasets with frequent updates. Consider geohashing when you need simple proximity queries and can tolerate approximate results. K-d trees work well for in-memory nearest-neighbor searches on static datasets.
R-trees remain the workhorse of spatial databases because they handle the general case well: mixed geometry types, efficient disk access patterns, and robust query performance across diverse spatial predicates.