R-Tree: Spatial Indexing Structure

Key Insights

  • R-trees organize spatial data using hierarchical Minimum Bounding Rectangles (MBRs), enabling efficient multi-dimensional queries that would require full table scans with traditional B-tree indexes.
  • The critical performance factor in R-trees is minimizing MBR overlap between sibling nodes—the R*-tree variant addresses this through forced reinsertions and improved split algorithms.
  • For production systems, leverage existing implementations in PostGIS, MongoDB, or SQLite rather than building custom R-trees; understand the internals to optimize query patterns and bulk loading strategies.

Why Traditional Indexes Fail for Spatial Data

B-trees excel at one-dimensional ordering. They can efficiently answer “find all records where created_at is between January and March” because dates have a natural linear order. But ask a B-tree to find “all restaurants within 2 kilometers of my location” and it falls apart.

The fundamental problem: spatial data is inherently multi-dimensional. A location has both latitude and longitude. A rectangle has four coordinates. You can’t meaningfully sort these into a single sequence that preserves spatial proximity. Two points adjacent in a sorted list might be on opposite sides of a city.
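A quick illustration of that failure mode, using made-up coordinates around the San Francisco Bay (the point names and the rough distance helper are ours, not from any library):

```python
import math

def approx_km(p, q):
    # Rough equirectangular distance; adequate at city scale.
    dx = (p[2] - q[2]) * 111.32 * math.cos(math.radians((p[1] + q[1]) / 2))
    dy = (p[1] - q[1]) * 110.57
    return math.hypot(dx, dy)

# Hypothetical (name, lat, lon) points.
points = [
    ("downtown_sf", 37.7749, -122.4194),
    ("oakland", 37.7750, -122.2711),      # ~13 km east of downtown_sf
    ("north_beach", 37.8044, -122.4080),  # ~3 km from downtown_sf
]

# One-dimensional ordering by latitude: downtown_sf and oakland become
# index neighbors even though north_beach is far closer in space.
ordered = sorted(points, key=lambda p: p[1])
```

A B-tree on latitude would scan `downtown_sf` and `oakland` as adjacent keys while the genuinely nearby `north_beach` sits elsewhere in the order.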

This matters for countless applications. Maps need to render only visible features. Games must detect collisions between thousands of objects. Ride-sharing apps match drivers to nearby passengers. GIS systems query land parcels intersecting a proposed pipeline route. All of these require indexes designed for spatial relationships.

The R-tree, introduced by Antonin Guttman in 1984, solved this problem by extending B-tree concepts to multiple dimensions. Since then, variants have emerged—R*-trees improve query performance through better node organization, R+-trees eliminate overlap at the cost of duplicating entries, and Hilbert R-trees use space-filling curves for improved packing. The core R-tree concept remains foundational to all of them.

R-Tree Structure and Properties

The R-tree’s central concept is the Minimum Bounding Rectangle (MBR)—the smallest axis-aligned rectangle that completely contains a spatial object. A winding river becomes a simple box. A complex polygon becomes four coordinates. This simplification enables efficient comparisons.

The tree structure mirrors B-trees. Leaf nodes store actual spatial objects (or references to them) along with their MBRs. Internal nodes store MBRs that bound all objects in their subtrees. The root bounds everything in the tree.

from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class MBR:
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    
    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)
    
    def enlargement_needed(self, other: 'MBR') -> float:
        merged = self.merge(other)
        return merged.area() - self.area()
    
    def merge(self, other: 'MBR') -> 'MBR':
        return MBR(
            min(self.min_x, other.min_x),
            min(self.min_y, other.min_y),
            max(self.max_x, other.max_x),
            max(self.max_y, other.max_y)
        )
    
    def intersects(self, other: 'MBR') -> bool:
        return not (self.max_x < other.min_x or other.max_x < self.min_x or
                    self.max_y < other.min_y or other.max_y < self.min_y)

@dataclass
class RTreeNode:
    entries: List[tuple] = field(default_factory=list)  # (MBR, child_or_data)
    is_leaf: bool = True
    max_entries: int = 4
    min_entries: int = 2
    
    def is_full(self) -> bool:
        return len(self.entries) >= self.max_entries
    
    def compute_mbr(self) -> Optional[MBR]:
        if not self.entries:
            return None
        result = self.entries[0][0]
        for mbr, _ in self.entries[1:]:
            result = result.merge(mbr)
        return result

R-trees maintain balance—all leaves sit at the same depth. Nodes must contain between m and M entries (typically m = M/2), except the root which can have as few as two children. This guarantees logarithmic height relative to the number of entries.
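Because every non-root node holds at least m entries, the height is bounded by roughly log base m of the entry count. A back-of-the-envelope helper (an illustration of the arithmetic, not part of the structure):

```python
import math

def worst_case_height(n_entries: int, min_fill: int) -> int:
    # Pessimistic bound: assume every node holds only the minimum m
    # entries, so each level up divides the entry count by m.
    return max(1, math.ceil(math.log(n_entries, min_fill)))

# With M = 50 (so m = 25), a million entries need a tree only about
# five levels deep, so a lookup touches a handful of disk pages.
```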

Search Operations

Search exploits the hierarchical MBR structure. For a range query, start at the root and recursively descend into any child whose MBR intersects the search region. Unlike B-trees, multiple paths may need exploration—overlapping MBRs mean the target could exist in several subtrees.

class RTree:
    def __init__(self, max_entries: int = 4):
        self.root = RTreeNode(max_entries=max_entries)
        self.max_entries = max_entries
        self.min_entries = max_entries // 2
    
    def search(self, query_mbr: MBR) -> List:
        """Find all entries intersecting the query region."""
        results = []
        self._search_recursive(self.root, query_mbr, results)
        return results
    
    def _search_recursive(self, node: RTreeNode, query: MBR, results: List):
        for mbr, child_or_data in node.entries:
            if mbr.intersects(query):
                if node.is_leaf:
                    results.append(child_or_data)
                else:
                    self._search_recursive(child_or_data, query, results)
    
    def nearest_neighbor(self, point_x: float, point_y: float, k: int = 1) -> List:
        """Find k nearest neighbors via best-first search over MBR distances."""
        import heapq
        import itertools
        
        # Tie-breaking counter: nodes and data objects are not mutually
        # comparable, so ties on distance must never fall through to the
        # last tuple element.
        counter = itertools.count()
        
        # Priority queue entries: (min distance, tie-breaker, is_data, item)
        heap = [(0.0, next(counter), False, self.root)]
        results = []
        
        while heap and len(results) < k:
            dist, _, is_data, item = heapq.heappop(heap)
            
            if is_data:
                results.append((dist, item))
            else:
                for mbr, child_or_data in item.entries:
                    min_dist = self._min_distance(point_x, point_y, mbr)
                    heapq.heappush(
                        heap, (min_dist, next(counter), item.is_leaf, child_or_data))
        
        return results
    
    def _min_distance(self, px: float, py: float, mbr: MBR) -> float:
        dx = max(mbr.min_x - px, 0, px - mbr.max_x)
        dy = max(mbr.min_y - py, 0, py - mbr.max_y)
        return (dx * dx + dy * dy) ** 0.5

Nearest-neighbor search uses a priority queue ordered by minimum possible distance to each MBR. This prunes branches that can’t contain closer results than already found.

Insertion Algorithm

Insertion follows the path of least resistance—specifically, the path requiring minimum MBR enlargement. When a leaf overflows, it splits, potentially propagating splits upward.

def insert(self, mbr: MBR, data) -> None:
    """Insert a new entry into the R-tree."""
    leaf = self._choose_leaf(self.root, mbr)
    leaf.entries.append((mbr, data))
    
    if leaf.is_full():
        new_node = self._split_node(leaf)
        # _adjust_tree (not shown) walks back to the root, tightening
        # ancestor MBRs and, when the root itself splits, creating a new
        # root with the two halves as children.
        self._adjust_tree(leaf, new_node)
    else:
        self._adjust_tree(leaf, None)

def _choose_leaf(self, node: RTreeNode, mbr: MBR) -> RTreeNode:
    """Descend to the leaf that needs least enlargement."""
    if node.is_leaf:
        return node
    
    best_child = None
    best_enlargement = float('inf')
    best_area = float('inf')
    
    for child_mbr, child_node in node.entries:
        enlargement = child_mbr.enlargement_needed(mbr)
        area = child_mbr.area()
        
        if enlargement < best_enlargement or \
           (enlargement == best_enlargement and area < best_area):
            best_enlargement = enlargement
            best_area = area
            best_child = child_node
    
    return self._choose_leaf(best_child, mbr)

def _split_node(self, node: RTreeNode) -> RTreeNode:
    """Quadratic split: pick seeds that waste most area together."""
    entries = node.entries[:]
    
    # Find two entries that would waste most space together
    worst_waste = -float('inf')
    seed1_idx, seed2_idx = 0, 1
    
    for i in range(len(entries)):
        for j in range(i + 1, len(entries)):
            merged = entries[i][0].merge(entries[j][0])
            waste = merged.area() - entries[i][0].area() - entries[j][0].area()
            if waste > worst_waste:
                worst_waste = waste
                seed1_idx, seed2_idx = i, j
    
    # Initialize two groups
    group1 = [entries[seed1_idx]]
    group2 = [entries[seed2_idx]]
    remaining = [e for k, e in enumerate(entries) 
                 if k not in (seed1_idx, seed2_idx)]
    
    # Assign remaining entries
    while remaining:
        # Check if one group needs all remaining to meet minimum
        if len(group1) + len(remaining) == self.min_entries:
            group1.extend(remaining)
            break
        if len(group2) + len(remaining) == self.min_entries:
            group2.extend(remaining)
            break
        
        # Pick entry with greatest preference for one group
        mbr1 = self._compute_group_mbr(group1)
        mbr2 = self._compute_group_mbr(group2)
        
        best_entry = None
        best_diff = -float('inf')
        best_group = None
        
        for entry in remaining:
            d1 = mbr1.enlargement_needed(entry[0])
            d2 = mbr2.enlargement_needed(entry[0])
            diff = abs(d1 - d2)
            
            if diff > best_diff:
                best_diff = diff
                best_entry = entry
                best_group = group1 if d1 < d2 else group2
        
        best_group.append(best_entry)
        remaining.remove(best_entry)
    
    node.entries = group1
    new_node = RTreeNode(is_leaf=node.is_leaf, max_entries=self.max_entries)
    new_node.entries = group2
    return new_node

The quadratic split algorithm picks the two seed entries that would waste the most area if placed together, then assigns each remaining entry to the group it would enlarge least. Linear split chooses seeds in O(M) time per node instead of O(M²), but it tends to produce groups with more overlap and therefore slower queries.
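For contrast, here is a sketch of how linear split picks its seeds (Guttman's "greatest normalized separation" rule). The plain `(min_x, min_y, max_x, max_y)` tuples are a simplification for brevity, not the MBR class above:

```python
def linear_pick_seeds(mbrs):
    """Return indices of two seed rectangles for a linear split."""
    best_norm, best_pair = -float("inf"), (0, 1)
    for lo, hi in ((0, 2), (1, 3)):  # (min, max) tuple index per dimension
        # Entry with the highest low side, and entry with the lowest high side.
        highest_low = max(range(len(mbrs)), key=lambda i: mbrs[i][lo])
        lowest_high = min(range(len(mbrs)), key=lambda i: mbrs[i][hi])
        if highest_low == lowest_high:
            continue  # degenerate: same entry at both extremes
        separation = mbrs[highest_low][lo] - mbrs[lowest_high][hi]
        extent = max(m[hi] for m in mbrs) - min(m[lo] for m in mbrs)
        norm = separation / extent if extent else 0.0
        if norm > best_norm:
            best_norm, best_pair = norm, (lowest_high, highest_low)
    return best_pair
```

One linear pass per dimension replaces the quadratic all-pairs search, which is exactly where the speed difference comes from.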

Deletion and Tree Maintenance

Deletion finds the target entry, removes it, and handles underflow by reinserting orphaned entries rather than merging nodes.

def delete(self, mbr: MBR, data) -> bool:
    """Delete an entry and reinsert orphans if underflow occurs."""
    leaf = self._find_leaf(self.root, mbr, data)
    if leaf is None:
        return False
    
    # Remove entry
    leaf.entries = [(m, d) for m, d in leaf.entries 
                    if not (m == mbr and d == data)]
    
    orphans = []
    self._condense_tree(leaf, orphans)
    
    # Reinsert orphaned entries
    for orphan_mbr, orphan_data in orphans:
        self.insert(orphan_mbr, orphan_data)
    
    return True

def _condense_tree(self, node: RTreeNode, orphans: List):
    """Propagate underflow handling up the tree."""
    # Simplified: a full implementation tracks the root-to-leaf path,
    # removes the underflowing node from its parent, recurses upward, and
    # reinserts internal-node orphans at their original level.
    # Identity comparison matters here: dataclass == compares field values,
    # so an empty node would wrongly compare equal to an empty root.
    if len(node.entries) < self.min_entries and node is not self.root:
        # Collect all entries from this node as orphans
        orphans.extend(node.entries)
        node.entries = []  # Mark for removal from parent

Reinsertion often produces better tree organization than merging because entries get placed in optimal locations based on current tree state.

Performance and Bulk Loading

Building an R-tree one insertion at a time produces suboptimal structure. Bulk loading techniques construct better trees from known datasets.

def bulk_load_str(entries: List[tuple], max_entries: int = 4) -> RTree:
    """Sort-Tile-Recursive bulk loading."""
    import math
    
    n = len(entries)
    leaf_count = math.ceil(n / max_entries)
    slice_count = math.ceil(math.sqrt(leaf_count))
    
    # Sort by x-coordinate center
    entries.sort(key=lambda e: (e[0].min_x + e[0].max_x) / 2)
    
    # Divide into vertical slices, sort each by y
    slice_size = math.ceil(n / slice_count)
    leaves = []
    
    for i in range(0, n, slice_size):
        slice_entries = entries[i:i + slice_size]
        slice_entries.sort(key=lambda e: (e[0].min_y + e[0].max_y) / 2)
        
        for j in range(0, len(slice_entries), max_entries):
            node = RTreeNode(max_entries=max_entries)
            node.entries = slice_entries[j:j + max_entries]
            leaves.append(node)
    
    # Build tree bottom-up: _build_levels packs each level of nodes into
    # parents, repeating until a single root remains.
    return _build_levels(leaves, max_entries)

Sort-Tile-Recursive (STR) sorts entries by one dimension, partitions into slices, sorts each slice by the other dimension, then packs into leaves. This minimizes overlap and produces trees with excellent query performance.
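The bottom-up packing helper is omitted above; here is a minimal sketch of the idea, using a stand-in `Node` type and tuple MBRs rather than the classes above so it runs on its own:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:  # stand-in for RTreeNode; entries hold (mbr_tuple, child_or_data)
    entries: List[tuple] = field(default_factory=list)
    is_leaf: bool = True

def mbr_of(node: Node) -> tuple:
    # Tight (min_x, min_y, max_x, max_y) box around a node's entries.
    boxes = [m for m, _ in node.entries]
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def build_levels(nodes: List[Node], max_entries: int) -> Node:
    # Pack the current level's nodes into parents, repeating level by
    # level until a single node (the root) remains.
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), max_entries):
            parent = Node(is_leaf=False)
            parent.entries = [(mbr_of(child), child)
                              for child in nodes[i:i + max_entries]]
            parents.append(parent)
        nodes = parents
    return nodes[0]
```

Because STR has already packed the leaves, each pass is a simple chunking of consecutive nodes; no enlargement or split heuristics are needed.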

Practical Usage with PostGIS

In production, use battle-tested implementations. PostGIS provides R-tree indexes via GiST:

-- Create spatial table
CREATE TABLE restaurants (
    id SERIAL PRIMARY KEY,
    name TEXT,
    location GEOGRAPHY(POINT, 4326)
);

-- Create spatial index (uses R-tree internally)
CREATE INDEX idx_restaurants_location 
ON restaurants USING GIST(location);

-- Find restaurants within 2km
SELECT name, ST_Distance(location, ST_MakePoint(-122.4194, 37.7749)::geography) AS dist
FROM restaurants
WHERE ST_DWithin(
    location, 
    ST_MakePoint(-122.4194, 37.7749)::geography,
    2000  -- meters
)
ORDER BY dist;

-- Analyze index usage
EXPLAIN ANALYZE SELECT * FROM restaurants
WHERE ST_DWithin(location, ST_MakePoint(-122.4194, 37.7749)::geography, 2000);

Choose R-trees when you need range queries on bounding boxes or complex geometries. Use quadtrees for point-heavy datasets with frequent updates. Consider geohashing when you need simple proximity queries and can tolerate approximate results. K-d trees work well for in-memory nearest-neighbor searches on static datasets.

R-trees remain the workhorse of spatial databases because they handle the general case well: mixed geometry types, efficient disk access patterns, and robust query performance across diverse spatial predicates.
