R-Tree: Spatial Data Indexing
Key Insights
- R-Trees organize spatial data using hierarchical minimum bounding rectangles (MBRs), enabling efficient range queries, nearest-neighbor searches, and collision detection in O(log n) average time
- The critical performance factor is minimizing MBR overlap during insertion—poorly structured R-Trees can degrade to linear search performance
- For most production use cases, use an existing library (RBush, libspatialindex, PostGIS) rather than implementing from scratch, but understanding the internals helps you tune performance and choose the right variant
The Spatial Indexing Problem
Traditional B-trees excel at one-dimensional data. Finding all users with IDs between 1000 and 2000 is straightforward—the data has a natural ordering. But what about finding all restaurants within 5 miles of your location? Or detecting which game objects collide with a player’s hitbox?
Spatial queries operate on multiple dimensions simultaneously. You can’t sort points by latitude and expect longitude queries to work efficiently. A naive approach—checking every object against your query region—gives you O(n) performance. For a database with millions of geographic points, that’s unacceptable.
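The naive baseline is worth making concrete. A minimal sketch (function and tuple layout are illustrative, not from any library):

```python
def linear_range_query(points, query):
    """Naive O(n) range query: test every point against the box.

    points: list of (x, y, data) tuples
    query: (min_x, min_y, max_x, max_y)
    """
    min_x, min_y, max_x, max_y = query
    return [data for x, y, data in points
            if min_x <= x <= max_x and min_y <= y <= max_y]
```

Every query touches every point, no matter how small the query box is. That per-query cost is exactly what an R-Tree's hierarchical pruning avoids.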
R-Trees solve this by organizing spatial objects into hierarchical bounding boxes. They’re the backbone of PostGIS, SQLite’s spatial extension, and countless game engines. If you’ve ever used a map application that instantly shows nearby restaurants, you’ve benefited from R-Tree indexing.
R-Tree Fundamentals
An R-Tree is a balanced tree where each node contains a set of entries. Leaf nodes store actual spatial objects (points, rectangles, polygons) along with their minimum bounding rectangles (MBRs). Internal nodes store MBRs that completely enclose all entries in their child nodes.
The key insight: if a query region doesn’t intersect a node’s MBR, it can’t intersect any objects in that subtree. This allows aggressive pruning during search.
from dataclasses import dataclass, field
from typing import List, Optional, Any

@dataclass
class Rectangle:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)

    def intersects(self, other: 'Rectangle') -> bool:
        return not (self.max_x < other.min_x or
                    self.min_x > other.max_x or
                    self.max_y < other.min_y or
                    self.min_y > other.max_y)

    def enlarge_to_include(self, other: 'Rectangle') -> 'Rectangle':
        return Rectangle(
            min(self.min_x, other.min_x),
            min(self.min_y, other.min_y),
            max(self.max_x, other.max_x),
            max(self.max_y, other.max_y)
        )

@dataclass
class RTreeNode:
    mbr: Optional[Rectangle] = None
    entries: List['RTreeEntry'] = field(default_factory=list)
    is_leaf: bool = True
    max_entries: int = 4  # M parameter
    min_entries: int = 2  # m parameter (typically M/2)

@dataclass
class RTreeEntry:
    mbr: Rectangle
    child: Optional[RTreeNode] = None  # For internal nodes
    data: Any = None                   # For leaf nodes (the actual object)
R-Trees maintain balance through two invariants: every leaf is at the same depth, and every node (except the root) contains between m and M entries. Typical values are M=50 for disk-based trees and M=4-16 for in-memory trees.
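Both invariants are easy to verify mechanically. A hedged sketch, using a simplified nested-dict representation rather than the dataclasses above (the `{'leaf': ..., 'children': ...}` shape is illustrative):

```python
def check_invariants(node, M, m, is_root=True):
    """Verify the two R-Tree invariants on a simplified node representation:
    each node is {'leaf': bool, 'children': [...]}, where a leaf's children
    are data items and an internal node's children are nodes.
    Returns the subtree height; raises AssertionError on a violation."""
    k = len(node['children'])
    assert k <= M, "node exceeds M entries"
    assert is_root or k >= m, "non-root node below m entries"
    if node['leaf']:
        return 1
    # Every subtree must report the same height, or leaves differ in depth
    heights = {check_invariants(c, M, m, is_root=False)
               for c in node['children']}
    assert len(heights) == 1, "leaves are not all at the same depth"
    return heights.pop() + 1
```

A checker like this is handy in unit tests for any hand-rolled implementation, since subtle split bugs tend to surface as occupancy or depth violations.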
Core Operations: Search
Range queries are the R-Tree’s bread and butter. The algorithm recursively descends the tree, visiting only nodes whose MBRs intersect the query region.
def range_search(node: RTreeNode, query: Rectangle) -> List[Any]:
    results = []
    for entry in node.entries:
        if not entry.mbr.intersects(query):
            continue
        if node.is_leaf:
            # Entry contains actual data
            results.append(entry.data)
        else:
            # Recurse into child node
            results.extend(range_search(entry.child, query))
    return results

def point_search(node: RTreeNode, x: float, y: float) -> List[Any]:
    """Find all objects containing the given point."""
    point_rect = Rectangle(x, y, x, y)
    return range_search(node, point_rect)
Performance depends heavily on MBR overlap. In the ideal case—no overlap between sibling nodes—search follows a single path and runs in O(log n). With significant overlap, multiple paths must be explored, potentially degrading to O(n).
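Overlap can be quantified directly, which is useful when diagnosing a slow tree. A small sketch over plain coordinate tuples (not the classes above; names are illustrative):

```python
def overlap_area(a, b):
    """Area of intersection of two MBRs given as (min_x, min_y, max_x, max_y)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def total_sibling_overlap(mbrs):
    """Sum of pairwise overlap among a node's child MBRs; 0 is the ideal."""
    return sum(overlap_area(mbrs[i], mbrs[j])
               for i in range(len(mbrs))
               for j in range(i + 1, len(mbrs)))
```

Summing this metric over every internal node gives a rough health score: the closer to zero, the closer search is to following a single root-to-leaf path.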
Nearest-neighbor search requires a priority queue approach. You maintain candidates sorted by minimum possible distance, pruning branches that can’t contain closer objects than your current best.
import heapq
from itertools import count
from math import sqrt

def min_distance_to_rect(x: float, y: float, rect: Rectangle) -> float:
    dx = max(rect.min_x - x, 0, x - rect.max_x)
    dy = max(rect.min_y - y, 0, y - rect.max_y)
    return sqrt(dx*dx + dy*dy)

def nearest_neighbor(root: RTreeNode, x: float, y: float) -> Any:
    # Priority queue: (distance, tiebreaker, is_leaf_entry, node_or_data).
    # The monotonic counter breaks distance ties so heapq never has to
    # compare nodes or data objects, which aren't orderable.
    tiebreak = count()
    heap = [(0.0, next(tiebreak), False, root)]
    best_dist = float('inf')
    best_result = None
    while heap:
        dist, _, is_data, item = heapq.heappop(heap)
        if dist >= best_dist:
            continue
        if is_data:
            best_dist = dist
            best_result = item
            continue
        for entry in item.entries:
            entry_dist = min_distance_to_rect(x, y, entry.mbr)
            if entry_dist < best_dist:
                if item.is_leaf:
                    heapq.heappush(heap, (entry_dist, next(tiebreak), True, entry.data))
                else:
                    heapq.heappush(heap, (entry_dist, next(tiebreak), False, entry.child))
    return best_result
Insertion and Tree Building
Insertion follows a top-down approach: find the best leaf node, insert the entry, and split if necessary.
The critical decision is choosing which subtree to descend. The standard heuristic: pick the child whose MBR requires the least enlargement to accommodate the new entry. Ties are broken by choosing the smaller MBR.
def choose_subtree(node: RTreeNode, entry_mbr: Rectangle) -> RTreeEntry:
    """Choose the best child node for insertion."""
    best_entry = None
    best_enlargement = float('inf')
    best_area = float('inf')
    for entry in node.entries:
        enlarged = entry.mbr.enlarge_to_include(entry_mbr)
        enlargement = enlarged.area() - entry.mbr.area()
        if (enlargement < best_enlargement or
                (enlargement == best_enlargement and entry.mbr.area() < best_area)):
            best_entry = entry
            best_enlargement = enlargement
            best_area = entry.mbr.area()
    return best_entry
def compute_mbr(node: RTreeNode) -> Rectangle:
    """MBR enclosing all entries of a node."""
    mbr = node.entries[0].mbr
    for entry in node.entries[1:]:
        mbr = mbr.enlarge_to_include(entry.mbr)
    return mbr

def update_mbr(node: RTreeNode) -> None:
    node.mbr = compute_mbr(node)

def insert(root: RTreeNode, mbr: Rectangle, data: Any) -> RTreeNode:
    """Insert a new entry, returning the (possibly new) root."""
    new_entry = RTreeEntry(mbr=mbr, data=data)
    # Find leaf and collect path
    path = []
    node = root
    while not node.is_leaf:
        best = choose_subtree(node, mbr)
        path.append((node, best))
        node = best.child
    # Insert into leaf
    node.entries.append(new_entry)
    update_mbr(node)
    # Handle overflow by splitting
    split_node = quadratic_split(node) if len(node.entries) > node.max_entries else None
    # Walk back up the path: refresh MBRs and propagate splits
    while path:
        parent, parent_entry = path.pop()
        # The child may have grown (new entry) or shrunk (split)
        parent_entry.mbr = compute_mbr(parent_entry.child)
        if split_node:
            parent.entries.append(
                RTreeEntry(mbr=compute_mbr(split_node), child=split_node))
            split_node = (quadratic_split(parent)
                          if len(parent.entries) > parent.max_entries else None)
        update_mbr(parent)
    # Root split: create new root
    if split_node:
        new_root = RTreeNode(is_leaf=False)
        new_root.entries = [
            RTreeEntry(mbr=compute_mbr(root), child=root),
            RTreeEntry(mbr=compute_mbr(split_node), child=split_node)
        ]
        update_mbr(new_root)
        return new_root
    return root
The quadratic split algorithm picks two seed entries that would waste the most area if placed together, then greedily assigns remaining entries to minimize total MBR area.
def quadratic_split(node: RTreeNode) -> RTreeNode:
    """Split node using the quadratic algorithm, return the new sibling."""
    entries = node.entries
    # Pick seeds: the pair of entries wasting the most area if combined
    worst_waste = -float('inf')
    seed1, seed2 = 0, 1
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            combined = entries[i].mbr.enlarge_to_include(entries[j].mbr)
            waste = combined.area() - entries[i].mbr.area() - entries[j].mbr.area()
            if waste > worst_waste:
                worst_waste = waste
                seed1, seed2 = i, j
    # Create two groups
    group1 = [entries[seed1]]
    group2 = [entries[seed2]]
    remaining = [e for i, e in enumerate(entries) if i not in (seed1, seed2)]
    mbr1 = entries[seed1].mbr
    mbr2 = entries[seed2].mbr
    # Assign remaining entries one at a time
    while remaining:
        # Minimum fill: if a group needs every remaining entry to reach
        # min_entries, hand it all of them
        if len(group1) + len(remaining) == node.min_entries:
            group1.extend(remaining)
            break
        if len(group2) + len(remaining) == node.min_entries:
            group2.extend(remaining)
            break
        # Otherwise assign the next entry to the group needing less enlargement
        entry = remaining.pop()
        enlarge1 = mbr1.enlarge_to_include(entry.mbr).area() - mbr1.area()
        enlarge2 = mbr2.enlarge_to_include(entry.mbr).area() - mbr2.area()
        if enlarge1 < enlarge2:
            group1.append(entry)
            mbr1 = mbr1.enlarge_to_include(entry.mbr)
        else:
            group2.append(entry)
            mbr2 = mbr2.enlarge_to_include(entry.mbr)
    node.entries = group1
    sibling = RTreeNode(is_leaf=node.is_leaf,
                        max_entries=node.max_entries,
                        min_entries=node.min_entries)
    sibling.entries = group2
    return sibling
Deletion and Tree Maintenance
Deletion requires finding the entry, removing it, and handling potential underflow. When a node drops below minimum capacity, its entries are reinserted rather than merged—this often improves tree quality.
def delete(root: RTreeNode, mbr: Rectangle, data: Any) -> RTreeNode:
    """Delete an entry, returning the (possibly new) root."""
    orphans = []

    def delete_recursive(node: RTreeNode, target_mbr: Rectangle) -> bool:
        for i, entry in enumerate(node.entries):
            if not entry.mbr.intersects(target_mbr):
                continue
            if node.is_leaf:
                if entry.data == data:
                    node.entries.pop(i)
                    return True
            else:
                if delete_recursive(entry.child, target_mbr):
                    # Check for underflow
                    if len(entry.child.entries) < entry.child.min_entries:
                        orphans.extend(entry.child.entries)
                        node.entries.pop(i)
                    else:
                        entry.mbr = compute_mbr(entry.child)
                    return True
        return False

    def leaf_entries(entry: RTreeEntry):
        """Flatten an orphaned subtree down to its leaf-level entries.
        (The classic algorithm reinserts internal entries at their original
        level; reinserting individual leaf entries is a simpler variant.)"""
        if entry.child is None:
            yield entry
        else:
            for e in entry.child.entries:
                yield from leaf_entries(e)

    delete_recursive(root, mbr)
    # Reinsert orphaned entries
    for orphan in orphans:
        for leaf in leaf_entries(orphan):
            root = insert(root, leaf.mbr, leaf.data)
    # Shrink the tree if the root has a single child
    while not root.is_leaf and len(root.entries) == 1:
        root = root.entries[0].child
    return root
R-Tree Variants and Practical Considerations
The R*-tree improves on the original with better split algorithms and forced reinsertion. When a node overflows, instead of immediately splitting, it reinserts a portion of entries. This reorganization often reduces overlap significantly.
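The candidate selection for forced reinsertion can be sketched in isolation. In the R*-tree, roughly 30% of an overflowing node's entries are removed and reinserted, chosen as those whose centers lie farthest from the node MBR's center. A simplified sketch over plain center tuples (function name and representation are illustrative):

```python
def pick_reinsert_indices(centers, node_center, fraction=0.3):
    """Indices of entries to remove and reinsert on overflow: the ones
    whose centers are farthest from the node MBR's center."""
    p = max(1, int(fraction * len(centers)))

    def dist2(c):
        return (c[0] - node_center[0]) ** 2 + (c[1] - node_center[1]) ** 2

    # Farthest-first ordering; take the top p
    order = sorted(range(len(centers)),
                   key=lambda i: dist2(centers[i]), reverse=True)
    return order[:p]
```

Reinserting these outliers gives choose_subtree a second chance to place them somewhere that enlarges MBRs less, which is where the overlap reduction comes from.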
For bulk loading, the Sort-Tile-Recursive (STR) algorithm sorts objects by one dimension, partitions them into tiles, then recursively processes each tile. This produces trees with minimal overlap and near-optimal packing.
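One level of STR packing can be sketched like this, assuming 2-D points as (x, y) tuples (the leaf pages come back as lists; building parent levels would repeat the same packing over the leaf MBRs):

```python
from math import ceil, sqrt

def str_pack(points, capacity):
    """Sort-Tile-Recursive leaf packing for 2-D points.

    Sort by x, cut into ~sqrt(P) vertical slices (P = number of leaves),
    sort each slice by y, then cut into capacity-sized runs."""
    n = len(points)
    num_leaves = ceil(n / capacity)
    num_slices = ceil(sqrt(num_leaves))
    slice_len = ceil(n / num_slices)
    by_x = sorted(points)  # sorts by x, then y
    leaves = []
    for i in range(0, n, slice_len):
        strip = sorted(by_x[i:i + slice_len],
                       key=lambda p: (p[1], p[0]))  # sort slice by y
        for j in range(0, len(strip), capacity):
            leaves.append(strip[j:j + capacity])
    return leaves
```

Because each leaf covers a compact tile, sibling MBRs barely overlap, which is why bulk-loaded trees typically outperform trees built by repeated insertion.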
When to use R-Trees vs alternatives:
- Quad-trees: Simpler, good for uniformly distributed points, but struggle with clustered data
- K-d trees: Excellent for nearest-neighbor in moderate dimensions, but don’t handle rectangles well
- R-Trees: Best general choice for rectangles, polygons, and range queries
For production use, leverage existing libraries:
# Using the rtree library (pip install rtree)
from rtree import index

# Create index
idx = index.Index()

# Insert points (id, (minx, miny, maxx, maxy), object)
restaurants = [
    (1, (-73.99, 40.73, -73.99, 40.73), "Joe's Pizza"),
    (2, (-73.98, 40.75, -73.98, 40.75), "Shake Shack"),
    (3, (-74.00, 40.72, -74.00, 40.72), "Katz's Deli"),
]
for rid, coords, name in restaurants:
    idx.insert(rid, coords, obj=name)

# Range query: find restaurants in a bounding box
nearby = list(idx.intersection((-74.0, 40.72, -73.98, 40.74), objects=True))
for item in nearby:
    print(f"{item.object} at {item.bbox}")

# Nearest neighbor
nearest = list(idx.nearest((-73.99, 40.74, -73.99, 40.74), 1, objects=True))
Choose node sizes based on your storage medium. For disk-based trees, match your page size (typically 4KB). For in-memory trees, smaller nodes (4-16 entries) often perform better due to cache efficiency.
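As a back-of-envelope check (the entry layout here is an assumption: four 8-byte doubles for the MBR plus an 8-byte child pointer):

```python
PAGE_SIZE = 4096          # typical disk page, in bytes
ENTRY_BYTES = 4 * 8 + 8   # MBR (4 doubles) + child pointer, assumed layout
MAX_ENTRIES = PAGE_SIZE // ENTRY_BYTES
print(MAX_ENTRIES)  # → 102
```

That is where commonly cited M values around 50-100 for disk-based trees come from: one node per page, packed as full as the entry size allows.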
R-Trees aren’t magic—they work best when your data has spatial locality and your queries are selective. For queries that match most of the dataset, a sequential scan might actually win.