Pigeonhole Sort: Integer Sorting for Small Ranges
Pigeonhole sort is a non-comparison sorting algorithm based on the pigeonhole principle: if you have n items and k containers, and n > k, at least one container must hold more than one item. The...
Key Insights
- Pigeonhole sort achieves O(n + range) time complexity, making it faster than comparison-based sorts when the range of values is small relative to the number of elements.
- The algorithm trades space for speed, requiring O(range) auxiliary memory—a worthwhile tradeoff only when your data falls within a known, bounded integer range.
- Choose pigeonhole sort over counting sort when you need to preserve the original objects (not just counts), and over bucket sort when dealing with discrete integers rather than continuous distributions.
Introduction to Pigeonhole Sort
Pigeonhole sort is a non-comparison sorting algorithm based on the pigeonhole principle: if you have n items and k containers, and n > k, at least one container must hold more than one item. The algorithm exploits this by creating a “hole” (bucket) for each possible value in your data’s range, then distributing elements into their corresponding holes.
This sounds simple because it is. The elegance lies in knowing when to use it.
General-purpose sorts like quicksort and mergesort guarantee O(n log n) performance regardless of your data’s characteristics. But when you know your integers fall within a small, bounded range—say, sorting 10,000 exam scores between 0 and 100—pigeonhole sort completes in linear time. That’s not a marginal improvement; it’s a fundamental complexity class difference.
The catch? Your data must cooperate. Pigeonhole sort demands integer keys (or values mappable to integers) within a range that fits comfortably in memory. Step outside these constraints, and you’re better off with conventional algorithms.
How Pigeonhole Sort Works
The algorithm follows four distinct phases:
- Find the range: Scan the input to determine minimum and maximum values
- Create holes: Allocate an array of empty lists, one for each possible value in the range
- Distribute: Place each element into its corresponding hole based on its value
- Collect: Traverse the holes in order, extracting elements back into the result array
Here’s a clean implementation in Python:
def pigeonhole_sort(arr):
if not arr:
return arr
# Phase 1: Find the range
min_val = min(arr)
max_val = max(arr)
range_size = max_val - min_val + 1
# Phase 2: Create holes (list of lists for stability)
holes = [[] for _ in range(range_size)]
# Phase 3: Distribute elements into holes
for item in arr:
hole_index = item - min_val
holes[hole_index].append(item)
# Phase 4: Collect elements from holes
result = []
for hole in holes:
result.extend(hole)
return result
# Example usage
data = [8, 3, 2, 7, 4, 6, 8, 3, 5]
sorted_data = pigeonhole_sort(data)
print(sorted_data) # [2, 3, 3, 4, 5, 6, 7, 8, 8]
Notice that each hole is a list, not a counter. This preserves duplicate values and maintains stability—elements with equal keys appear in their original relative order. If you only need to sort primitive integers without caring about stability, you could simplify to counting occurrences, but the list approach handles objects with integer keys.
Time and Space Complexity Analysis
Understanding pigeonhole sort’s complexity requires separating two variables: n (number of elements) and k (range size, where k = max - min + 1).
Time Complexity: O(n + k)
- Finding min/max: O(n)
- Creating holes array: O(k)
- Distributing elements: O(n)
- Collecting results: O(n + k) — we traverse all holes, some empty
The total is O(n + k). When k is O(n) or smaller, this simplifies to O(n)—linear time sorting.
Space Complexity: O(k)
The holes array dominates space usage. Each hole is a list that collectively holds n elements, but the array structure itself requires k slots.
Here’s where the critical insight emerges: the algorithm’s efficiency depends entirely on the relationship between n and k.
| Scenario | Range (k) | Elements (n) | Effective Complexity |
|---|---|---|---|
| Ideal | 100 | 10,000 | O(n) |
| Acceptable | 10,000 | 10,000 | O(n) |
| Problematic | 1,000,000 | 10,000 | O(k) — worse than O(n log n) |
When k » n, you’re allocating massive arrays with mostly empty slots, and the collection phase wastes cycles traversing them. At some threshold, quicksort’s O(n log n) becomes faster despite the higher complexity class.
Ideal Use Cases
Pigeonhole sort shines in specific domains where data characteristics align with its requirements.
Sorting ages in HR systems: Human ages cluster between 18 and 70 for most employee datasets. With a range of ~52 values, pigeonhole sort handles millions of records efficiently.
from dataclasses import dataclass
from typing import List
@dataclass
class Employee:
id: int
name: str
age: int
department: str
def sort_employees_by_age(employees: List[Employee]) -> List[Employee]:
if not employees:
return employees
# Age range: 18-70 for typical workforce
MIN_AGE = 18
MAX_AGE = 70
range_size = MAX_AGE - MIN_AGE + 1
# Create holes for each possible age
holes = [[] for _ in range(range_size)]
# Distribute employees into age buckets
for emp in employees:
if MIN_AGE <= emp.age <= MAX_AGE:
hole_index = emp.age - MIN_AGE
holes[hole_index].append(emp)
# Collect sorted employees
sorted_employees = []
for hole in holes:
sorted_employees.extend(hole)
return sorted_employees
# Usage
employees = [
Employee(1, "Alice", 32, "Engineering"),
Employee(2, "Bob", 28, "Sales"),
Employee(3, "Carol", 32, "Engineering"),
Employee(4, "Dave", 45, "Management"),
Employee(5, "Eve", 28, "Marketing"),
]
sorted_emps = sort_employees_by_age(employees)
for emp in sorted_emps:
print(f"{emp.name}: {emp.age}")
# Bob: 28
# Eve: 28
# Alice: 32
# Carol: 32
# Dave: 45
Other ideal scenarios include:
- Exam scores: 0-100 range, potentially thousands of students
- Sensor readings: Discretized temperature, humidity, or pressure values within known bounds
- Priority queues: When priorities are small integers (1-10 priority levels)
- Histogram generation: Counting occurrences within fixed bins
Comparison with Related Algorithms
Pigeonhole sort belongs to a family of distribution-based sorting algorithms. Understanding the distinctions helps you choose correctly.
Counting Sort counts occurrences of each value rather than storing elements in buckets. It’s more memory-efficient for primitives but loses the ability to sort objects with keys. Use counting sort when you only need sorted primitive integers.
Bucket Sort divides elements into buckets based on value ranges, then sorts each bucket individually (often with insertion sort). It handles floating-point numbers and continuous distributions. Use bucket sort when values aren’t discrete integers.
Radix Sort processes integers digit by digit, using a stable sort (often counting sort) at each position. It handles larger ranges efficiently by breaking them into smaller digit-based passes. Use radix sort when the range is large but values have bounded digit counts.
Here’s a benchmark comparing these approaches:
import time
import random
def benchmark_sorts(n: int, value_range: int, iterations: int = 5):
"""Compare sorting algorithms on random integer data."""
def counting_sort(arr):
if not arr:
return arr
min_val, max_val = min(arr), max(arr)
counts = [0] * (max_val - min_val + 1)
for x in arr:
counts[x - min_val] += 1
result = []
for i, count in enumerate(counts):
result.extend([i + min_val] * count)
return result
def pigeonhole_sort(arr):
if not arr:
return arr
min_val, max_val = min(arr), max(arr)
holes = [[] for _ in range(max_val - min_val + 1)]
for x in arr:
holes[x - min_val].append(x)
result = []
for hole in holes:
result.extend(hole)
return result
results = {}
for name, sort_func in [
("Pigeonhole", pigeonhole_sort),
("Counting", counting_sort),
("Built-in (Timsort)", sorted),
]:
times = []
for _ in range(iterations):
data = [random.randint(0, value_range) for _ in range(n)]
start = time.perf_counter()
sort_func(data)
times.append(time.perf_counter() - start)
results[name] = sum(times) / len(times)
print(f"n={n:,}, range={value_range:,}")
for name, avg_time in results.items():
print(f" {name}: {avg_time*1000:.2f}ms")
# Small range (pigeonhole/counting excel)
benchmark_sorts(100_000, 100)
# Large range (built-in wins)
benchmark_sorts(100_000, 10_000_000)
Typical output shows pigeonhole and counting sort dominating at small ranges, while Timsort catches up as ranges grow.
Limitations and Pitfalls
Pigeonhole sort fails—sometimes spectacularly—in several scenarios.
Sparse distributions: Sorting 1,000 elements with values between 1 and 1,000,000 creates a million-slot array with 99.9% empty holes. You’re paying O(k) to traverse empty buckets.
Large ranges: Even with dense data, sorting 32-bit integers requires 4 billion holes. That’s 16+ GB just for the array structure before storing any elements.
Floating-point data: The algorithm requires discrete integer keys. You could multiply floats to convert them, but precision loss and range explosion make this impractical.
Memory-constrained environments: Embedded systems or serverless functions with tight memory limits can’t afford O(range) auxiliary space.
Unknown or unbounded ranges: If you can’t guarantee the range at compile time or startup, you risk runtime failures or degraded performance.
Negative value handling: The implementation must offset values correctly. A common bug is forgetting to subtract min_val when indexing into holes.
Conclusion
Pigeonhole sort occupies a narrow but valuable niche in your algorithmic toolkit. It delivers linear-time sorting when three conditions align: integer keys, small ranges, and sufficient memory for the holes array.
Use it when you’re sorting bounded integer data—ages, scores, priorities, discretized measurements—and the range is at most a small multiple of your element count. Avoid it when ranges are large, distributions are sparse, or memory is constrained.
The decision framework is straightforward:
- Are keys integers (or cleanly mappable to integers)?
- Is the range known and bounded?
- Is range ≤ n, or at least range ≤ n log n?
- Can you afford O(range) memory?
Four yes answers mean pigeonhole sort is worth considering. Any no, and you should reach for counting sort, radix sort, or a comparison-based algorithm instead.
Like most specialized algorithms, pigeonhole sort’s value lies not in frequency of use but in dramatic performance gains when conditions are right. Know when to deploy it, and you’ll have a powerful tool for the right problems.