NumPy - Boolean/Mask Indexing | Application Architect

Key Insights

Boolean indexing allows you to filter NumPy arrays using conditional expressions, returning elements where the condition evaluates to True without explicit loops
Mask indexing enables complex data filtering, outlier removal, and conditional operations in a single line while maintaining performance through vectorized operations
Combining boolean conditions with the element-wise operators (&, |, ~) requires parentheses around each condition, because these operators bind more tightly than comparisons; leaving them out is a common source of errors

Understanding Boolean Indexing Fundamentals

Boolean indexing in NumPy uses arrays of True/False values to select elements from another array. When you apply a conditional expression to a NumPy array, it returns a boolean array of the same shape, which you can then use as an index.

import numpy as np

# Create a sample array
arr = np.array([10, 25, 30, 45, 50, 65, 70])

# Create a boolean mask
mask = arr > 40
print(mask)  # [False False False  True  True  True  True]

# Apply the mask to get filtered values
filtered = arr[mask]
print(filtered)  # [45 50 65 70]

# Or do it in one line
result = arr[arr > 40]
print(result)  # [45 50 65 70]

The boolean array acts as a filter, selecting only elements where the corresponding mask value is True. This approach is significantly faster than Python loops because the operation is vectorized at the C level.
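
One detail worth checking for yourself is that boolean indexing returns a new array (a copy) rather than a view of the original. The following is a minimal sketch that reuses the sample values from above; the names n_matches and filtered are only illustrative.

```python
import numpy as np

arr = np.array([10, 25, 30, 45, 50, 65, 70])
mask = arr > 40

# The mask is a boolean array with the same shape as arr
print(mask.dtype, mask.shape)  # bool (7,)

# Counting matches does not require building the filtered array
n_matches = np.count_nonzero(mask)
print(n_matches)  # 4

# Boolean indexing returns a copy: editing it does not touch arr
filtered = arr[mask]
filtered[0] = -1
print(arr[3])  # 45
print(np.shares_memory(arr, filtered))  # False
```

Because the result is a copy, filtering a large array allocates new memory for the selected elements.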

Working with Multi-Dimensional Arrays

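Boolean indexing becomes more powerful with multi-dimensional arrays. A boolean mask is not broadcast: it must either match the array’s full shape, which selects individual elements, or match the length of one or more leading axes, which selects whole rows or sub-arrays.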

# 2D array example
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# Boolean mask with same shape
mask = matrix > 6
print(mask)
# [[False False False False]
#  [False False  True  True]
#  [ True  True  True  True]]

# Returns a 1D array of matching elements
filtered = matrix[mask]
print(filtered)  # [ 7  8  9 10 11 12]

# Row-wise filtering using boolean array
row_mask = np.array([True, False, True])
selected_rows = matrix[row_mask]
print(selected_rows)
# [[ 1  2  3  4]
#  [ 9 10 11 12]]

Note that applying a 2D boolean mask to a 2D array returns a 1D array containing only the selected elements. The mask picks individual elements, and the number of matches can differ from row to row, so the result cannot keep the original 2D structure.
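
When the original shape matters, there are a few common alternatives to a full-shape mask. The sketch below reuses the matrix from above; the fill value 0 and the column mask are arbitrary choices for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])
mask = matrix > 6

# np.where keeps the 2D shape and fills non-matching positions with 0
kept_shape = np.where(mask, matrix, 0)
print(kept_shape)
# [[ 0  0  0  0]
#  [ 0  0  7  8]
#  [ 9 10 11 12]]

# A 1D mask over columns goes in the second index position
col_mask = np.array([True, False, False, True])
selected_cols = matrix[:, col_mask]
print(selected_cols)
# [[ 1  4]
#  [ 5  8]
#  [ 9 12]]

# Keep whole rows that contain at least one value above 6
rows_with_match = matrix[mask.any(axis=1)]
print(rows_with_match.shape)  # (2, 4)
```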

Combining Multiple Conditions

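Real-world filtering often requires multiple conditions. NumPy uses the bitwise operators (&, |, ~) for element-wise logical operations, because Python’s and, or, and not try to reduce a whole array to a single truth value and raise a ValueError. Parentheses around each comparison are mandatory: & and | bind more tightly than comparison operators such as > and <.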

data = np.array([15, 22, 8, 35, 42, 18, 50, 3])

# Multiple conditions with AND
result = data[(data > 10) & (data < 40)]
print(result)  # [15 22 35 18]

# Multiple conditions with OR
result = data[(data < 10) | (data > 40)]
print(result)  # [ 8 42 50  3]

# Negation with NOT
result = data[~(data > 30)]
print(result)  # [15 22  8 18  3]

# Complex conditions
result = data[((data > 10) & (data < 30)) | (data > 45)]
print(result)  # [15 22 18 50]

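Common mistake: forgetting the parentheses around individual conditions. Because & binds before the comparisons, data > 10 & data < 40 is parsed as data > (10 & data) < 40, which typically raises a ValueError about an ambiguous truth value rather than filtering the array. Always wrap each condition in parentheses when combining them.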
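
The failure mode is easy to reproduce. This short sketch uses the same data array as above and also shows np.logical_and as a function-call alternative to the parenthesized form; the flag names are illustrative.

```python
import numpy as np

data = np.array([15, 22, 8, 35, 42, 18, 50, 3])

# Without parentheses, & binds first: data > (10 & data) < 40.
# That chained comparison needs the truth value of a whole array.
try:
    data[data > 10 & data < 40]
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Python's `and` fails for the same underlying reason
try:
    data[(data > 10) and (data < 40)]
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)  # True

# np.logical_and is a function-call spelling of the parenthesized form
with_operator = data[(data > 10) & (data < 40)]
with_function = data[np.logical_and(data > 10, data < 40)]
print(np.array_equal(with_operator, with_function))  # True
```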

Modifying Values with Boolean Indexing

Boolean indexing isn’t just for reading data: you can also modify array elements in place using mask-based assignment.

temps = np.array([72, 68, 95, 101, 88, 75, 105, 82])

# Replace outliers with a threshold
temps[temps > 100] = 100
print(temps)  # [ 72  68  95 100  88  75 100  82]

# Conditional replacement with different logic
prices = np.array([25.5, 30.0, 15.75, 42.0, 18.5])
prices[prices < 20] *= 1.1  # 10% increase, applied in place
print(prices)  # [25.5   30.    17.325 42.    20.35 ]

# Multiple condition updates
scores = np.array([45, 67, 89, 23, 91, 56, 78])
scores[(scores >= 60) & (scores < 80)] += 5  # Curve the middle range
print(scores)  # [45 72 89 23 91 56 83]

This pattern is extremely efficient for data cleaning, normalization, and conditional transformations without loops.
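
One thing to watch with in-place updates is the array’s dtype. The sketch below shows what can happen when float values meet an integer array. The behavior described in the comments is what recent NumPy versions do, as far as I know, so treat it as something to verify on your own version.

```python
import numpy as np

scores = np.array([45, 67, 89, 23, 91, 56, 78])  # integer dtype

# Assigning a float into an integer array silently drops the fraction
scores[scores < 50] = 49.9
print(scores)  # [49 67 89 49 91 56 78]

# In-place arithmetic with a float factor refuses the cast instead
try:
    scores[scores > 80] *= 1.1
    inplace_failed = False
except TypeError:
    inplace_failed = True
print(inplace_failed)  # True

# Converting to float first lets the scaled values keep their fractions
scores_f = scores.astype(float)
scores_f[scores_f > 80] *= 1.1
print(np.round(scores_f, 1))
```

Converting once with astype(float) before a series of masked updates is usually simpler than reasoning about each cast.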

Practical Applications: Data Filtering

Boolean indexing excels at real-world data filtering scenarios. Here’s how to handle common data science tasks:

# Simulated dataset: [temperature, humidity, pressure]
sensor_data = np.array([
    [72.5, 45, 1013],
    [68.2, 52, 1015],
    [95.8, 78, 1008],
    [101.3, 82, 1005],
    [75.0, 48, 1012],
    [88.5, 65, 1010]
])

# Filter rows where temperature is comfortable (70-85°F)
comfortable = sensor_data[(sensor_data[:, 0] >= 70) & 
                          (sensor_data[:, 0] <= 85)]
print(comfortable)
# [[  72.5   45.  1013. ]
#  [  75.    48.  1012. ]]

# Find records with high temperature OR low pressure
alerts = sensor_data[(sensor_data[:, 0] > 90) | 
                     (sensor_data[:, 2] < 1010)]
print(alerts)
# [[  95.8   78.  1008. ]
#  [ 101.3   82.  1005. ]]

# Remove outliers using IQR method
temps = sensor_data[:, 0]
q1, q3 = np.percentile(temps, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

clean_data = sensor_data[(temps >= lower_bound) & 
                         (temps <= upper_bound)]
print(clean_data.shape)  # (6, 3): no reading falls outside the IQR bounds here
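
The IQR steps above can be wrapped in a small reusable function. This is only a sketch: iqr_mask is a hypothetical helper name, the multiplier k=1.5 follows the convention used above, and the readings are made-up numbers chosen so that one value is clearly an outlier.

```python
import numpy as np

def iqr_mask(values, k=1.5):
    """Return a boolean mask that is True for values inside the IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

# Illustrative readings with one obvious outlier (made-up numbers)
readings = np.array([70.0, 71.5, 69.8, 72.1, 70.6, 150.0])
mask = iqr_mask(readings)
print(mask)            # [ True  True  True  True  True False]
print(readings[mask])  # the five readings near 70

# ~mask selects the rejected values instead
print(readings[~mask])  # [150.]
```

Returning the mask rather than the filtered array lets the same mask select rows from a related 2D array, as the sensor example does.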

Using np.where() for Conditional Selection

The np.where() function provides an alternative approach for conditional operations, particularly useful when you need different values based on conditions.

values = np.array([10, 25, 30, 45, 50])

# Basic np.where: returns indices where condition is True
indices = np.where(values > 30)
print(indices)  # (array([3, 4]),)
print(values[indices])  # [45 50]

# Conditional replacement: np.where(condition, value_if_true, value_if_false)
result = np.where(values > 30, values, 0)
print(result)  # [ 0  0  0 45 50]

# Multiple conditions with nested np.where
categories = np.where(values < 20, 'low',
                     np.where(values < 40, 'medium', 'high'))
print(categories)  # ['low' 'medium' 'medium' 'high' 'high']

# Practical example: cap and floor values
processed = np.where(values < 20, 20,
                    np.where(values > 40, 40, values))
print(processed)  # [20 25 30 40 40]
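
Nested np.where calls get harder to read as the number of branches grows. As an alternative sketch, np.select handles multi-way categorization and np.clip handles cap-and-floor. Both are standard NumPy functions and reproduce the results above.

```python
import numpy as np

values = np.array([10, 25, 30, 45, 50])

# np.select uses the choice for the first condition that is True
conditions = [values < 20, values < 40]
choices = ['low', 'medium']
categories = np.select(conditions, choices, default='high')
print(categories)  # ['low' 'medium' 'medium' 'high' 'high']

# np.clip expresses cap-and-floor directly
processed = np.clip(values, 20, 40)
print(processed)  # [20 25 30 40 40]
```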

Performance Considerations

Boolean indexing is fast, but understanding its performance characteristics helps optimize data processing pipelines.

import time

# Large array performance test
large_array = np.random.randint(0, 100, size=10_000_000)

# Boolean indexing
start = time.perf_counter()  # perf_counter is better suited to timing than time.time
result1 = large_array[large_array > 50]
bool_time = time.perf_counter() - start

# Equivalent list comprehension (much slower)
start = time.perf_counter()
result2 = np.array([x for x in large_array if x > 50])
loop_time = time.perf_counter() - start

print(f"Boolean indexing: {bool_time:.4f}s")
print(f"List comprehension: {loop_time:.4f}s")
print(f"Speedup: {loop_time/bool_time:.1f}x")

# Memory considerations: boolean masks have overhead
mask = large_array > 50  # Creates a 10M element boolean array
print(f"Mask memory: {mask.nbytes / 1_000_000:.2f} MB")

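Boolean indexing creates an intermediate boolean array that uses one byte per element (about 10 MB for the mask above). For extremely large datasets, consider processing in chunks. Note that np.where() does not avoid this cost: the condition is still evaluated into a boolean array first, and the integer indices it returns typically take eight bytes each.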
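
A rough sketch of chunked processing follows. It assumes a 1D array; count_above is a hypothetical helper and the chunk size is an arbitrary choice. Only one chunk-sized mask exists at a time, which bounds the temporary memory.

```python
import numpy as np

def count_above(values, threshold, chunk_size=1_000_000):
    """Count elements above threshold in a 1D array, one chunk at a time."""
    total = 0
    for start in range(0, values.size, chunk_size):
        chunk = values[start:start + chunk_size]  # basic slicing gives a view
        total += int(np.count_nonzero(chunk > threshold))
    return total

rng = np.random.default_rng(0)
large_array = rng.integers(0, 100, size=2_000_000)

chunked = count_above(large_array, 50, chunk_size=250_000)
direct = int(np.count_nonzero(large_array > 50))
print(chunked == direct)  # True

# Memory comparison: one byte per element for the mask versus
# one integer (usually eight bytes) per selected index from np.where
mask = large_array > 50
indices = np.where(mask)[0]
print(mask.nbytes, indices.nbytes)
```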

Advanced Pattern: Fancy Indexing with Masks

Combine boolean indexing with fancy indexing for sophisticated data selection.

# Dataset with multiple features
data = np.random.randn(1000, 5)  # 1000 samples, 5 features

# Select specific columns where row condition is met
row_mask = data[:, 0] > 0  # Positive values in first column
selected_features = data[row_mask][:, [0, 2, 4]]  # Columns 0, 2, 4

# Equivalent spelling with an explicit row slice (still two indexing steps)
result = data[row_mask, :][:, [0, 2, 4]]

# Or use np.ix_ to select rows and columns in a single indexing step
row_indices = np.where(data[:, 0] > 0)[0]
col_indices = [0, 2, 4]
result = data[np.ix_(row_indices, col_indices)]

print(result.shape)  # (n_positive_rows, 3)
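
As far as I know, np.ix_ also accepts a boolean sequence and treats it as a mask for that dimension, so the np.where step can be skipped. The sketch below checks that the chained form and the np.ix_ form agree; the seed is arbitrary and only makes the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.standard_normal((1000, 5))  # 1000 samples, 5 features

row_mask = data[:, 0] > 0
col_indices = [0, 2, 4]

# Two-step selection: filter rows, then pick columns
two_step = data[row_mask][:, col_indices]

# np.ix_ takes the boolean mask directly and converts it to indices
one_step = data[np.ix_(row_mask, col_indices)]

print(np.array_equal(two_step, one_step))  # True
print(one_step.shape == (np.count_nonzero(row_mask), 3))  # True
```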

Boolean indexing is a cornerstone of efficient NumPy programming. Master these patterns to write concise, performant data processing code that eliminates explicit loops while maintaining readability.