R-Tree: Spatial Data Indexing
Key Insights
- R-Trees organize spatial data using hierarchical minimum bounding rectangles (MBRs), enabling efficient range queries, nearest-neighbor searches, and collision detection in O(log n) average time
- The critical performance factor is minimizing MBR overlap during insertion—poorly structured R-Trees can degrade to linear search performance
- For most production use cases, use an existing library (RBush, libspatialindex, PostGIS) rather than implementing from scratch, but understanding the internals helps you tune performance and choose the right variant
The Spatial Indexing Problem
Traditional B-trees excel at one-dimensional data. Finding all users with IDs between 1000 and 2000 is straightforward—the data has a natural ordering. But what about finding all restaurants within 5 miles of your location? Or detecting which game objects collide with a player’s hitbox?
Spatial queries operate on multiple dimensions simultaneously. You can’t sort points by latitude and expect longitude queries to work efficiently. A naive approach—checking every object against your query region—gives you O(n) performance. For a database with millions of geographic points, that’s unacceptable.
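The naive baseline is worth making concrete. A minimal sketch (function and tuple layout are illustrative, not from any library):

```python
def linear_range_query(points, query):
    """Naive O(n) range query: test every point against the box.

    points: list of (x, y, data) tuples
    query: (min_x, min_y, max_x, max_y)
    """
    min_x, min_y, max_x, max_y = query
    return [data for x, y, data in points
            if min_x <= x <= max_x and min_y <= y <= max_y]
```

Every query touches every point, no matter how small the query box is. That per-query cost is exactly what an R-Tree's hierarchical pruning avoids.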
R-Trees solve this by organizing spatial objects into hierarchical bounding boxes. They’re the backbone of PostGIS, SQLite’s spatial extension, and countless game engines. If you’ve ever used a map application that instantly shows nearby restaurants, you’ve benefited from R-Tree indexing.
R-Tree Fundamentals
An R-Tree is a balanced tree where each node contains a set of entries. Leaf nodes store actual spatial objects (points, rectangles, polygons) along with their minimum bounding rectangles (MBRs). Internal nodes store MBRs that completely enclose all entries in their child nodes.
The key insight: if a query region doesn’t intersect a node’s MBR, it can’t intersect any objects in that subtree. This allows aggressive pruning during search.
from dataclasses import dataclass, field
from typing import List, Optional, Any

@dataclass
class Rectangle:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)

    def intersects(self, other: 'Rectangle') -> bool:
        return not (self.max_x < other.min_x or
                    self.min_x > other.max_x or
                    self.max_y < other.min_y or
                    self.min_y > other.max_y)

    def enlarge_to_include(self, other: 'Rectangle') -> 'Rectangle':
        return Rectangle(
            min(self.min_x, other.min_x),
            min(self.min_y, other.min_y),
            max(self.max_x, other.max_x),
            max(self.max_y, other.max_y)
        )

@dataclass
class RTreeNode:
    mbr: Optional[Rectangle] = None
    entries: List['RTreeEntry'] = field(default_factory=list)
    is_leaf: bool = True
    max_entries: int = 4  # M parameter
    min_entries: int = 2  # m parameter (typically M/2)

@dataclass
class RTreeEntry:
    mbr: Rectangle
    child: Optional[RTreeNode] = None  # For internal nodes
    data: Any = None                   # For leaf nodes (the actual object)
R-Trees maintain balance through two invariants: every leaf is at the same depth, and every node (except the root) contains between m and M entries. Typical values are M=50 for disk-based trees and M=4-16 for in-memory trees.
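Both invariants are easy to verify mechanically. A hedged sketch, using a simplified nested-dict representation rather than the dataclasses above (the `{'leaf': ..., 'children': ...}` shape is illustrative):

```python
def check_invariants(node, M, m, is_root=True):
    """Verify the two R-Tree invariants on a simplified node representation:
    each node is {'leaf': bool, 'children': [...]}, where a leaf's children
    are data items and an internal node's children are nodes.
    Returns the subtree height; raises AssertionError on a violation."""
    k = len(node['children'])
    assert k <= M, "node exceeds M entries"
    assert is_root or k >= m, "non-root node below m entries"
    if node['leaf']:
        return 1
    # Every subtree must report the same height, or leaves differ in depth
    heights = {check_invariants(c, M, m, is_root=False)
               for c in node['children']}
    assert len(heights) == 1, "leaves are not all at the same depth"
    return heights.pop() + 1
```

A checker like this is handy in unit tests for any hand-rolled implementation, since subtle split bugs tend to surface as occupancy or depth violations.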
Core Operations: Search
Range queries are the R-Tree’s bread and butter. The algorithm recursively descends the tree, visiting only nodes whose MBRs intersect the query region.
def range_search(node: RTreeNode, query: Rectangle) -> List[Any]:
    results = []
    for entry in node.entries:
        if not entry.mbr.intersects(query):
            continue
        if node.is_leaf:
            # Entry contains actual data
            results.append(entry.data)
        else:
            # Recurse into child node
            results.extend(range_search(entry.child, query))
    return results

def point_search(node: RTreeNode, x: float, y: float) -> List[Any]:
    """Find all objects containing the given point."""
    point_rect = Rectangle(x, y, x, y)
    return range_search(node, point_rect)
Performance depends heavily on MBR overlap. In the ideal case—no overlap between sibling nodes—search follows a single path and runs in O(log n). With significant overlap, multiple paths must be explored, potentially degrading to O(n).
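Overlap can be quantified directly, which is useful when diagnosing a slow tree. A small sketch over plain coordinate tuples (not the classes above; names are illustrative):

```python
def overlap_area(a, b):
    """Area of intersection of two MBRs given as (min_x, min_y, max_x, max_y)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def total_sibling_overlap(mbrs):
    """Sum of pairwise overlap among a node's child MBRs; 0 is the ideal."""
    return sum(overlap_area(mbrs[i], mbrs[j])
               for i in range(len(mbrs))
               for j in range(i + 1, len(mbrs)))
```

Summing this metric over every internal node gives a rough health score: the closer to zero, the closer search is to following a single root-to-leaf path.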
Nearest-neighbor search requires a priority queue approach. You maintain candidates sorted by minimum possible distance, pruning branches that can’t contain closer objects than your current best.
import heapq
from itertools import count
from math import sqrt

def min_distance_to_rect(x: float, y: float, rect: Rectangle) -> float:
    dx = max(rect.min_x - x, 0, x - rect.max_x)
    dy = max(rect.min_y - y, 0, y - rect.max_y)
    return sqrt(dx*dx + dy*dy)

def nearest_neighbor(root: RTreeNode, x: float, y: float) -> Any:
    # Priority queue: (distance, tiebreaker, is_leaf_entry, node_or_data).
    # The monotonic counter breaks distance ties so heapq never has to
    # compare nodes or data objects, which aren't orderable.
    tiebreak = count()
    heap = [(0.0, next(tiebreak), False, root)]
    best_dist = float('inf')
    best_result = None
    while heap:
        dist, _, is_data, item = heapq.heappop(heap)
        if dist >= best_dist:
            continue
        if is_data:
            best_dist = dist
            best_result = item
            continue
        for entry in item.entries:
            entry_dist = min_distance_to_rect(x, y, entry.mbr)
            if entry_dist < best_dist:
                if item.is_leaf:
                    heapq.heappush(heap, (entry_dist, next(tiebreak), True, entry.data))
                else:
                    heapq.heappush(heap, (entry_dist, next(tiebreak), False, entry.child))
    return best_result
Insertion and Tree Building
Insertion follows a top-down approach: find the best leaf node, insert the entry, and split if necessary.
The critical decision is choosing which subtree to descend. The standard heuristic: pick the child whose MBR requires the least enlargement to accommodate the new entry. Ties are broken by choosing the smaller MBR.
def choose_subtree(node: RTreeNode, entry_mbr: Rectangle) -> RTreeEntry:
    """Choose the best child node for insertion."""
    best_entry = None
    best_enlargement = float('inf')
    best_area = float('inf')
    for entry in node.entries:
        enlarged = entry.mbr.enlarge_to_include(entry_mbr)
        enlargement = enlarged.area() - entry.mbr.area()
        if (enlargement < best_enlargement or
                (enlargement == best_enlargement and entry.mbr.area() < best_area)):
            best_entry = entry
            best_enlargement = enlargement
            best_area = entry.mbr.area()
    return best_entry
def compute_mbr(node: RTreeNode) -> Rectangle:
    """MBR enclosing all entries of a node."""
    mbr = node.entries[0].mbr
    for entry in node.entries[1:]:
        mbr = mbr.enlarge_to_include(entry.mbr)
    return mbr

def update_mbr(node: RTreeNode) -> None:
    node.mbr = compute_mbr(node)

def insert(root: RTreeNode, mbr: Rectangle, data: Any) -> RTreeNode:
    """Insert a new entry, returning the (possibly new) root."""
    new_entry = RTreeEntry(mbr=mbr, data=data)
    # Find leaf and collect path
    path = []
    node = root
    while not node.is_leaf:
        best = choose_subtree(node, mbr)
        path.append((node, best))
        node = best.child
    # Insert into leaf
    node.entries.append(new_entry)
    update_mbr(node)
    # Handle overflow by splitting
    split_node = quadratic_split(node) if len(node.entries) > node.max_entries else None
    # Walk back up the path: refresh MBRs and propagate splits
    while path:
        parent, parent_entry = path.pop()
        # The child may have grown (new entry) or shrunk (split)
        parent_entry.mbr = compute_mbr(parent_entry.child)
        if split_node:
            parent.entries.append(
                RTreeEntry(mbr=compute_mbr(split_node), child=split_node))
            split_node = (quadratic_split(parent)
                          if len(parent.entries) > parent.max_entries else None)
        update_mbr(parent)
    # Root split: create new root
    if split_node:
        new_root = RTreeNode(is_leaf=False)
        new_root.entries = [
            RTreeEntry(mbr=compute_mbr(root), child=root),
            RTreeEntry(mbr=compute_mbr(split_node), child=split_node)
        ]
        update_mbr(new_root)
        return new_root
    return root
The quadratic split algorithm picks two seed entries that would waste the most area if placed together, then greedily assigns remaining entries to minimize total MBR area.
def quadratic_split(node: RTreeNode) -> RTreeNode:
    """Split node using the quadratic algorithm, return the new sibling."""
    entries = node.entries
    # Pick seeds: the pair of entries wasting the most area if combined
    worst_waste = -float('inf')
    seed1, seed2 = 0, 1
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            combined = entries[i].mbr.enlarge_to_include(entries[j].mbr)
            waste = combined.area() - entries[i].mbr.area() - entries[j].mbr.area()
            if waste > worst_waste:
                worst_waste = waste
                seed1, seed2 = i, j
    # Create two groups
    group1 = [entries[seed1]]
    group2 = [entries[seed2]]
    remaining = [e for i, e in enumerate(entries) if i not in (seed1, seed2)]
    mbr1 = entries[seed1].mbr
    mbr2 = entries[seed2].mbr
    # Assign remaining entries one at a time
    while remaining:
        # Minimum fill: if a group needs every remaining entry to reach
        # min_entries, hand it all of them
        if len(group1) + len(remaining) == node.min_entries:
            group1.extend(remaining)
            break
        if len(group2) + len(remaining) == node.min_entries:
            group2.extend(remaining)
            break
        # Otherwise assign the next entry to the group needing less enlargement
        entry = remaining.pop()
        enlarge1 = mbr1.enlarge_to_include(entry.mbr).area() - mbr1.area()
        enlarge2 = mbr2.enlarge_to_include(entry.mbr).area() - mbr2.area()
        if enlarge1 < enlarge2:
            group1.append(entry)
            mbr1 = mbr1.enlarge_to_include(entry.mbr)
        else:
            group2.append(entry)
            mbr2 = mbr2.enlarge_to_include(entry.mbr)
    node.entries = group1
    sibling = RTreeNode(is_leaf=node.is_leaf,
                        max_entries=node.max_entries,
                        min_entries=node.min_entries)
    sibling.entries = group2
    return sibling
Deletion and Tree Maintenance
Deletion requires finding the entry, removing it, and handling potential underflow. When a node drops below minimum capacity, its entries are reinserted rather than merged—this often improves tree quality.
def delete(root: RTreeNode, mbr: Rectangle, data: Any) -> RTreeNode:
    """Delete an entry, returning the (possibly new) root."""
    orphans = []

    def delete_recursive(node: RTreeNode, target_mbr: Rectangle) -> bool:
        for i, entry in enumerate(node.entries):
            if not entry.mbr.intersects(target_mbr):
                continue
            if node.is_leaf:
                if entry.data == data:
                    node.entries.pop(i)
                    return True
            else:
                if delete_recursive(entry.child, target_mbr):
                    # Check for underflow
                    if len(entry.child.entries) < entry.child.min_entries:
                        orphans.extend(entry.child.entries)
                        node.entries.pop(i)
                    else:
                        entry.mbr = compute_mbr(entry.child)
                    return True
        return False

    def leaf_entries(entry: RTreeEntry):
        """Flatten an orphaned subtree down to its leaf-level entries.
        (The classic algorithm reinserts internal entries at their original
        level; reinserting individual leaf entries is a simpler variant.)"""
        if entry.child is None:
            yield entry
        else:
            for e in entry.child.entries:
                yield from leaf_entries(e)

    delete_recursive(root, mbr)
    # Reinsert orphaned entries
    for orphan in orphans:
        for leaf in leaf_entries(orphan):
            root = insert(root, leaf.mbr, leaf.data)
    # Shrink the tree if the root has a single child
    while not root.is_leaf and len(root.entries) == 1:
        root = root.entries[0].child
    return root
R-Tree Variants and Practical Considerations
The R*-tree improves on the original with better split algorithms and forced reinsertion. When a node overflows, instead of immediately splitting, it reinserts a portion of entries. This reorganization often reduces overlap significantly.
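The candidate selection for forced reinsertion can be sketched in isolation. In the R*-tree, roughly 30% of an overflowing node's entries are removed and reinserted, chosen as those whose centers lie farthest from the node MBR's center. A simplified sketch over plain center tuples (function name and representation are illustrative):

```python
def pick_reinsert_indices(centers, node_center, fraction=0.3):
    """Indices of entries to remove and reinsert on overflow: the ones
    whose centers are farthest from the node MBR's center."""
    p = max(1, int(fraction * len(centers)))

    def dist2(c):
        return (c[0] - node_center[0]) ** 2 + (c[1] - node_center[1]) ** 2

    # Farthest-first ordering; take the top p
    order = sorted(range(len(centers)),
                   key=lambda i: dist2(centers[i]), reverse=True)
    return order[:p]
```

Reinserting these outliers gives choose_subtree a second chance to place them somewhere that enlarges MBRs less, which is where the overlap reduction comes from.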
For bulk loading, the Sort-Tile-Recursive (STR) algorithm sorts objects by one dimension, partitions them into tiles, then recursively processes each tile. This produces trees with minimal overlap and near-optimal packing.
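One level of STR packing can be sketched like this, assuming 2-D points as (x, y) tuples (the leaf pages come back as lists; building parent levels would repeat the same packing over the leaf MBRs):

```python
from math import ceil, sqrt

def str_pack(points, capacity):
    """Sort-Tile-Recursive leaf packing for 2-D points.

    Sort by x, cut into ~sqrt(P) vertical slices (P = number of leaves),
    sort each slice by y, then cut into capacity-sized runs."""
    n = len(points)
    num_leaves = ceil(n / capacity)
    num_slices = ceil(sqrt(num_leaves))
    slice_len = ceil(n / num_slices)
    by_x = sorted(points)  # sorts by x, then y
    leaves = []
    for i in range(0, n, slice_len):
        strip = sorted(by_x[i:i + slice_len],
                       key=lambda p: (p[1], p[0]))  # sort slice by y
        for j in range(0, len(strip), capacity):
            leaves.append(strip[j:j + capacity])
    return leaves
```

Because each leaf covers a compact tile, sibling MBRs barely overlap, which is why bulk-loaded trees typically outperform trees built by repeated insertion.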
When to use R-Trees vs alternatives:
- Quad-trees: Simpler, good for uniformly distributed points, but struggle with clustered data
- K-d trees: Excellent for nearest-neighbor in moderate dimensions, but don’t handle rectangles well
- R-Trees: Best general choice for rectangles, polygons, and range queries
For production use, leverage existing libraries:
# Using the rtree library (pip install rtree)
from rtree import index

# Create index
idx = index.Index()

# Insert points (id, (minx, miny, maxx, maxy), object)
restaurants = [
    (1, (-73.99, 40.73, -73.99, 40.73), "Joe's Pizza"),
    (2, (-73.98, 40.75, -73.98, 40.75), "Shake Shack"),
    (3, (-74.00, 40.72, -74.00, 40.72), "Katz's Deli"),
]
for rid, coords, name in restaurants:
    idx.insert(rid, coords, obj=name)

# Range query: find restaurants in a bounding box
nearby = list(idx.intersection((-74.0, 40.72, -73.98, 40.74), objects=True))
for item in nearby:
    print(f"{item.object} at {item.bbox}")

# Nearest neighbor
nearest = list(idx.nearest((-73.99, 40.74, -73.99, 40.74), 1, objects=True))
Choose node sizes based on your storage medium. For disk-based trees, match your page size (typically 4KB). For in-memory trees, smaller nodes (4-16 entries) often perform better due to cache efficiency.
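As a back-of-envelope check (the entry layout here is an assumption: four 8-byte doubles for the MBR plus an 8-byte child pointer):

```python
PAGE_SIZE = 4096          # typical disk page, in bytes
ENTRY_BYTES = 4 * 8 + 8   # MBR (4 doubles) + child pointer, assumed layout
MAX_ENTRIES = PAGE_SIZE // ENTRY_BYTES
print(MAX_ENTRIES)  # → 102
```

That is where commonly cited M values around 50-100 for disk-based trees come from: one node per page, packed as full as the entry size allows.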
R-Trees aren’t magic—they work best when your data has spatial locality and your queries are selective. For queries that match most of the dataset, a sequential scan might actually win.