NumPy - np.searchsorted() - Binary Search


Key Insights

• np.searchsorted() performs binary search on sorted arrays in O(log n) time, returning insertion indices that maintain sorted order—dramatically faster than linear search for large datasets
• The function supports both scalar and array inputs, with side='left' (default) finding the leftmost insertion point and side='right' finding the rightmost, critical for handling duplicate values
• Common applications include bucketing continuous data, finding range queries, and implementing efficient lookups in time-series or sorted reference data

Understanding Binary Search in NumPy

np.searchsorted() implements binary search to find indices where elements should be inserted into a sorted array to maintain order. Unlike Python’s bisect module which works on lists, NumPy’s implementation operates on arrays and supports vectorized operations.

import numpy as np

# Basic usage
arr = np.array([1, 3, 5, 7, 9])
index = np.searchsorted(arr, 6)
print(f"Insert 6 at index: {index}")  # Output: 3

# Verify insertion maintains order
result = np.insert(arr, index, 6)
print(result)  # [1 3 5 6 7 9]

The function assumes the input array is sorted in ascending order. For a descending array, either reverse it first or supply the sorter parameter with indices that put it into ascending order; sorter does not flip the comparison direction by itself.
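Both workarounds can be sketched like this (illustrative values, not from the text above):

```python
import numpy as np

desc = np.array([9, 7, 5, 3, 1])  # sorted descending

# Option 1: reverse to ascending, search, then map the index back
asc = desc[::-1]                       # [1 3 5 7 9]
idx_asc = np.searchsorted(asc, 6)      # insertion point in the ascending view
idx_desc = len(desc) - idx_asc         # corresponding point in the descending array
print(idx_desc)                        # 2 -> [9 7 6 5 3 1] stays descending

# Option 2: pass a sorter -- indices that put the array in ascending order
sorter = np.argsort(desc)              # [4 3 2 1 0] here
idx = np.searchsorted(desc, 6, sorter=sorter)
print(idx)                             # 3, a position in ascending order
```

Note that with a sorter, the returned index refers to the sorted order, not to positions in the descending array itself.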

Left vs Right Side Behavior

The side parameter controls behavior with duplicate values. side='left' returns the first position, while side='right' returns the position after the last occurrence.

arr = np.array([1, 3, 3, 3, 5, 7, 9])

# Find insertion points for value 3
left_idx = np.searchsorted(arr, 3, side='left')
right_idx = np.searchsorted(arr, 3, side='right')

print(f"Left side: {left_idx}")   # Output: 1
print(f"Right side: {right_idx}")  # Output: 4

# This gives us the range of 3's in the array
count = right_idx - left_idx
print(f"Count of 3's: {count}")  # Output: 3

This behavior is essential for range queries and counting occurrences in sorted data.

Vectorized Search Operations

Unlike scalar searches, np.searchsorted() accepts arrays as the search values, performing multiple binary searches efficiently.

sorted_arr = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
values_to_find = np.array([15, 35, 55, 95])

indices = np.searchsorted(sorted_arr, values_to_find)
print(indices)  # [1 3 5 9]

# Practical example: classify values into ranges
for val, idx in zip(values_to_find, indices):
    if idx == 0:
        print(f"{val}: below minimum")
    elif idx == len(sorted_arr):
        print(f"{val}: above maximum")
    else:
        print(f"{val}: between {sorted_arr[idx-1]} and {sorted_arr[idx]}")

Output:

15: between 10 and 20
35: between 30 and 40
55: between 50 and 60
95: above maximum

Binning Continuous Data

A powerful application is discretizing continuous data into bins—similar to pd.cut() but with more control.

# Define bin edges
bins = np.array([0, 10, 20, 30, 40, 50])
data = np.array([5, 15, 25, 35, 45, 3, 18, 42])

# Find which bin each value falls into
bin_indices = np.searchsorted(bins, data, side='right') - 1

print("Data:", data)
print("Bin indices:", bin_indices)

# Create labels for bins
labels = ['0-10', '10-20', '20-30', '30-40', '40-50']
for val, bin_idx in zip(data, bin_indices):
    if 0 <= bin_idx < len(labels):
        print(f"{val} -> {labels[bin_idx]}")

This approach makes the edge handling explicit: with side='right', a value that lands exactly on a bin edge falls into the higher bin. It is also the same operation np.digitize performs internally.
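As a cross-check, NumPy documents that np.digitize is implemented in terms of searchsorted, so for increasing bins the two produce identical indices:

```python
import numpy as np

bins = np.array([0, 10, 20, 30, 40, 50])
data = np.array([5, 15, 25, 35, 45, 3, 18, 42])

# For increasing bins, np.digitize(x, bins) (default right=False)
# equals np.searchsorted(bins, x, side='right')
via_digitize = np.digitize(data, bins) - 1
via_search = np.searchsorted(bins, data, side='right') - 1

print(np.array_equal(via_digitize, via_search))  # True
```

Reaching for searchsorted directly just makes the side= choice, and therefore the edge behavior, visible at the call site.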

Time Series Range Queries

When working with sorted timestamps, np.searchsorted() enables efficient range queries without scanning the entire array.

# Simulated timestamp data (sorted)
timestamps = np.array([
    1609459200, 1609545600, 1609632000, 1609718400,
    1609804800, 1609891200, 1609977600, 1610064000
])

# Find data between two timestamps
start_time = 1609632000
end_time = 1609891200

start_idx = np.searchsorted(timestamps, start_time, side='left')
end_idx = np.searchsorted(timestamps, end_time, side='right')

print(f"Range indices: [{start_idx}:{end_idx}]")
print(f"Timestamps in range: {timestamps[start_idx:end_idx]}")

# This is O(log n) instead of O(n) for filtering

For millions of records, this difference is substantial—binary search scales logarithmically while linear filtering scales linearly.

Working with Unsorted Data

If your array isn’t sorted, pass the sorter parameter: an array of indices, typically from np.argsort(), that puts the array into ascending order.

unsorted = np.array([30, 10, 50, 20, 40])
sorter = np.argsort(unsorted)  # indices that sort the array: [1 3 0 4 2]

print("Sorted view:", unsorted[sorter])  # [10 20 30 40 50]

# Search the unsorted array directly via the sorter parameter
values = np.array([15, 35])
insert_pos = np.searchsorted(unsorted, values, sorter=sorter)
print("Positions in sorted order:", insert_pos)  # [1 3]

# sorter[i] maps a position in sorted order back to an original index;
# here it yields the original index of the next-larger element
# (only valid while insert_pos < len(unsorted))
print("Original indices:", sorter[insert_pos])  # [3 4]

However, if you frequently search the same array, sorting once and keeping the sorted version is more efficient.
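That tradeoff can be sketched as follows (illustrative data, not from the original): sort once up front, then every later query avoids the per-probe indirection through the sorter array.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.integers(0, 1_000_000, size=100_000)

# Pay the O(n log n) sort cost once...
sorted_data = np.sort(data)

# ...then every subsequent query is a plain O(log n) binary search
queries = np.array([123, 456_789])
idx = np.searchsorted(sorted_data, queries)
print(idx)
```

If you also need to recover positions in the original array, keep the argsort result alongside the sorted copy instead of recomputing it per query.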

Performance Comparison

Binary search provides significant speedup over linear search, especially for large arrays.

import time

# Create large sorted array
large_arr = np.arange(0, 1000000, 2)
search_values = np.random.randint(0, 1000000, 1000)

# Binary search with searchsorted
start = time.perf_counter()
result_binary = np.searchsorted(large_arr, search_values)
binary_time = time.perf_counter() - start

# Linear search equivalent
start = time.perf_counter()
result_linear = np.array([np.where(large_arr >= val)[0][0] 
                          if np.any(large_arr >= val) else len(large_arr)
                          for val in search_values])
linear_time = time.perf_counter() - start

print(f"Binary search: {binary_time:.6f}s")
print(f"Linear search: {linear_time:.6f}s")
print(f"Speedup: {linear_time/binary_time:.1f}x")

On typical hardware, expect orders-of-magnitude speedups for arrays with millions of elements; the exact factor depends on array size and how many values you search for.

Practical Pattern: Lookup Tables

Implement efficient lookups using sorted keys and corresponding values.

# Sorted keys and corresponding values
keys = np.array([100, 200, 300, 400, 500])
values = np.array(['A', 'B', 'C', 'D', 'E'])

def lookup(query_keys):
    indices = np.searchsorted(keys, query_keys)
    # Handle out of bounds
    indices = np.clip(indices, 0, len(keys) - 1)
    
    # Check exact matches
    exact_match = keys[indices] == query_keys
    result = np.where(exact_match, values[indices], 'Not Found')
    return result

queries = np.array([100, 250, 400, 600])
results = lookup(queries)
print(dict(zip(queries.tolist(), results.tolist())))  # convert to Python scalars for clean printing
# {100: 'A', 250: 'Not Found', 400: 'D', 600: 'Not Found'}

This pattern works well for configuration lookups, grade boundaries, or any key-value mapping with sorted keys.
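Grade boundaries, for instance, reduce to a single searchsorted call. A minimal sketch with hypothetical cutoffs (60/70/80/90, not from the original):

```python
import numpy as np

# Hypothetical cutoffs: below 60 -> F, 60-69 -> D, ..., 90+ -> A
boundaries = np.array([60, 70, 80, 90])
grades = np.array(['F', 'D', 'C', 'B', 'A'])

scores = np.array([55, 72, 90, 68, 99])
letters = grades[np.searchsorted(boundaries, scores, side='right')]
print(letters)  # ['F' 'C' 'A' 'D' 'A']
```

Note that side='right' puts a score sitting exactly on a cutoff (90 here) into the higher grade; side='left' would do the opposite.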

Edge Cases and Validation

Always validate assumptions about sorted order and handle boundary conditions.

def safe_searchsorted(arr, values):
    # Verify array is sorted
    if not np.all(arr[:-1] <= arr[1:]):
        raise ValueError("Array must be sorted")
    
    indices = np.searchsorted(arr, values)
    
    # Identify out-of-bounds insertions
    below_min = indices == 0
    above_max = indices == len(arr)
    
    return {
        'indices': indices,
        'below_min': np.sum(below_min),
        'above_max': np.sum(above_max),
        'in_range': np.sum(~below_min & ~above_max)
    }

arr = np.array([5, 10, 15, 20])
vals = np.array([3, 7, 12, 25])
result = safe_searchsorted(arr, vals)
print(result)

This defensive approach prevents silent errors in production systems where data assumptions may not hold.
