Counting Bloom Filter: Deletion Support
Key Insights
- Counting Bloom filters replace single bits with counters (typically 4-bit), enabling deletion by decrementing instead of clearing bits—solving the collision problem that makes standard Bloom filters append-only.
- The 4x memory overhead compared to standard Bloom filters is justified when your use case requires dynamic set membership with deletions, such as cache invalidation or session tracking.
- Counter overflow is a real concern: once a counter saturates, you must either freeze it (losing deletion accuracy) or implement a more complex scaling strategy.
The Deletion Problem in Standard Bloom Filters
Standard Bloom filters have a fundamental limitation: they don’t support deletion. When you insert an element, multiple hash functions set several bits to 1. The problem arises because different elements can share bit positions. If element A and element B both set bit position 7, clearing that bit when deleting A would corrupt the filter’s knowledge of B.
This isn’t a bug—it’s an inherent property of the data structure. The space efficiency of Bloom filters comes precisely from this bit sharing. But when your application needs to remove elements from the set, you’re stuck.
Counting Bloom filters solve this by replacing each single bit with a counter. Instead of setting a bit to 1, you increment a counter. Instead of clearing a bit, you decrement it. A position is considered “set” when its counter is greater than zero. This simple change enables deletion while preserving the probabilistic membership testing that makes Bloom filters useful.
How Counting Bloom Filters Work
The core insight is straightforward: track how many elements have “claimed” each position. When you insert an element, increment the counters at all positions indicated by your hash functions. When you delete, decrement them. A lookup checks if all relevant counters are non-zero.
Most implementations use 4-bit counters, giving you values from 0 to 15. This choice balances memory efficiency against overflow risk. With well-distributed hash functions and appropriate sizing, counters rarely exceed 15 in practice.
Here’s the basic structure:
import hashlib
import math
from typing import List
class CountingBloomFilter:
def __init__(self, expected_items: int, false_positive_rate: float):
# Calculate optimal size and hash count
self.size = self._optimal_size(expected_items, false_positive_rate)
self.hash_count = self._optimal_hash_count(self.size, expected_items)
# 4-bit counters stored as integers (0-15 range)
self.counters: List[int] = [0] * self.size
self.max_counter = 15 # 4-bit maximum
self.item_count = 0
def _optimal_size(self, n: int, p: float) -> int:
"""Calculate optimal bit array size."""
m = -(n * math.log(p)) / (math.log(2) ** 2)
return int(math.ceil(m))
def _optimal_hash_count(self, m: int, n: int) -> int:
"""Calculate optimal number of hash functions."""
k = (m / n) * math.log(2)
return int(math.ceil(k))
def _get_positions(self, item: str) -> List[int]:
"""Generate hash positions for an item."""
positions = []
for i in range(self.hash_count):
            # Salting the input with the index simulates independent hashes
hash_input = f"{item}:{i}".encode('utf-8')
hash_value = int(hashlib.sha256(hash_input).hexdigest(), 16)
positions.append(hash_value % self.size)
return positions
The _get_positions method simulates k independent hash functions by salting the input with the position index before hashing. This keeps the implementation simple and well distributed, though it still pays for one full SHA-256 computation per position.
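A cheaper, widely used scheme is Kirsch–Mitzenmacher double hashing: compute two base hashes once and derive position i as (h1 + i·h2) mod m, so only a single digest is needed regardless of k. A minimal sketch, assuming the two halves of one SHA-256 digest serve as the base hashes:

```python
import hashlib
from typing import List

def double_hash_positions(item: str, hash_count: int, size: int) -> List[int]:
    """Derive hash_count positions from one SHA-256 digest via double hashing."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    # Split the 32-byte digest into two 16-byte halves acting as h1 and h2
    h1 = int.from_bytes(digest[:16], "big")
    h2 = int.from_bytes(digest[16:], "big") | 1  # force odd: nonzero stride
    return [(h1 + i * h2) % size for i in range(hash_count)]
```

Forcing h2 odd guarantees a nonzero stride, so the k positions don't collapse onto a single index.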
Core Operations Implementation
The three core operations—insert, delete, and lookup—map directly to counter manipulation. Insert increments, delete decrements, and lookup checks for non-zero values.
def insert(self, item: str) -> bool:
"""
Insert an item into the filter.
Returns False if any counter would overflow.
"""
positions = self._get_positions(item)
# Check for potential overflow before modifying
for pos in positions:
if self.counters[pos] >= self.max_counter:
# Counter saturated - see overflow handling section
return False
# Safe to increment all positions
for pos in positions:
self.counters[pos] += 1
self.item_count += 1
return True
def delete(self, item: str) -> bool:
"""
Delete an item from the filter.
Returns False if item probably doesn't exist (would cause negative counter).
"""
positions = self._get_positions(item)
# Verify all counters are positive before decrementing
for pos in positions:
if self.counters[pos] <= 0:
# Item definitely not in filter, or already deleted
return False
# Safe to decrement
for pos in positions:
self.counters[pos] -= 1
self.item_count -= 1
return True
def contains(self, item: str) -> bool:
"""
Check if an item is probably in the filter.
False means definitely not present.
True means probably present (with false positive rate).
"""
positions = self._get_positions(item)
return all(self.counters[pos] > 0 for pos in positions)
The delete operation includes a critical safety check: never decrement a counter below zero. A zero counter means no elements claim that position. Attempting to decrement below zero indicates either a bug (deleting something never inserted) or a false positive deletion attempt.
This check is essential. Negative counters would corrupt the filter, causing false negatives—something Bloom filters should never produce.
Choosing Counter Size and Overflow Handling
Four bits per counter is the industry standard, but it’s not the only option. Here’s the trade-off matrix:
| Counter Bits | Max Value | Memory vs. Standard BF | Overflow Risk |
|---|---|---|---|
| 2 | 3 | 2x | High |
| 4 | 15 | 4x | Low |
| 8 | 255 | 8x | Negligible |
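The memory multipliers in the table assume counters are actually bit-packed. The List[int] used in the sketch above spends a full Python int per counter for clarity; a packed 4-bit representation stores two counters per byte. A minimal sketch (the class name is illustrative):

```python
class PackedNibbleArray:
    """Stores 4-bit counters, two per byte, in a bytearray."""
    def __init__(self, size: int):
        self.size = size
        self.data = bytearray((size + 1) // 2)

    def get(self, i: int) -> int:
        byte = self.data[i // 2]
        # Even indices use the low nibble, odd indices the high nibble
        return byte & 0x0F if i % 2 == 0 else byte >> 4

    def set(self, i: int, value: int) -> None:
        if not 0 <= value <= 15:
            raise ValueError("4-bit counters hold 0-15")
        byte = self.data[i // 2]
        if i % 2 == 0:
            self.data[i // 2] = (byte & 0xF0) | value
        else:
            self.data[i // 2] = (byte & 0x0F) | (value << 4)
```

Swapping this in for the plain list changes only storage, not the insert/delete/lookup logic.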
A counter's value is approximately Poisson-distributed. For an optimally sized filter (k = (m/n) ln 2), each counter receives on average kn/m = ln 2 ≈ 0.693 increments, and the probability that any given 4-bit counter ever overflows is typically below 10^-15.
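Under the standard sizing, each counter receives on average ln 2 ≈ 0.693 increments, so the overflow probability can be estimated directly from the Poisson tail (a back-of-envelope check under that approximation, not a rigorous bound):

```python
import math

def poisson_tail(lam: float, threshold: int, terms: int = 30) -> float:
    """P(X >= threshold) for X ~ Poisson(lam), summing the tail directly
    to avoid the precision loss of computing 1 - CDF in floating point."""
    return sum(math.exp(-lam) * lam ** j / math.factorial(j)
               for j in range(threshold, threshold + terms))

# Expected increments per counter at optimal sizing: k * n / m = ln 2
p_overflow = poisson_tail(math.log(2), 16)  # ≈ 7e-17 per counter
```

Even multiplied by millions of counters, this stays far below any practical concern, which is why 4 bits is the usual choice.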
When overflow does occur, you have several strategies:
class OverflowStrategy:
FREEZE = "freeze" # Stop incrementing, never decrement
SCALE = "scale" # Double counter size when needed
REBUILD = "rebuild" # Rebuild with larger counters
def insert_with_overflow_handling(
self,
item: str,
strategy: str = OverflowStrategy.FREEZE
) -> bool:
"""Insert with configurable overflow handling."""
positions = self._get_positions(item)
overflow_detected = any(
self.counters[pos] >= self.max_counter
for pos in positions
)
if overflow_detected:
if strategy == OverflowStrategy.FREEZE:
# Increment non-saturated counters, leave saturated ones
for pos in positions:
if self.counters[pos] < self.max_counter:
self.counters[pos] += 1
# Mark that we have frozen counters
self._has_frozen_counters = True
self.item_count += 1
return True
            elif strategy == OverflowStrategy.REBUILD:
                self._rebuild_with_larger_counters()
                return self.insert(item)
            else:
                # SCALE and any unrecognized strategy are not implemented;
                # refuse the insert rather than push a saturated counter past max
                return False
# Normal insertion
for pos in positions:
self.counters[pos] += 1
self.item_count += 1
return True
def _rebuild_with_larger_counters(self):
"""Rebuild filter with 8-bit counters."""
# This requires tracking all inserted items
# which defeats some memory benefits
raise NotImplementedError(
"Rebuild requires external item tracking"
)
The freeze strategy is most common in practice. Once a counter saturates, it stays at max forever. This means you lose the ability to delete elements that hash to that position, but the filter remains correct for membership queries. You’re trading deletion capability for continued operation.
Memory and Performance Comparison
The memory overhead is significant and predictable:
def calculate_memory_usage(
expected_items: int,
false_positive_rate: float,
counter_bits: int = 4
) -> dict:
"""Compare memory usage between filter types."""
# Optimal bit array size
m = int(math.ceil(
-(expected_items * math.log(false_positive_rate)) /
(math.log(2) ** 2)
))
standard_bf_bits = m
counting_bf_bits = m * counter_bits
return {
"standard_bloom_filter": {
"bits": standard_bf_bits,
"bytes": math.ceil(standard_bf_bits / 8),
"mb": standard_bf_bits / 8 / 1024 / 1024
},
"counting_bloom_filter": {
"bits": counting_bf_bits,
"bytes": math.ceil(counting_bf_bits / 8),
"mb": counting_bf_bits / 8 / 1024 / 1024,
"overhead_factor": counter_bits
}
}
# Example: 1 million items, 1% false positive rate
usage = calculate_memory_usage(1_000_000, 0.01)
# Standard BF: ~1.14 MB
# Counting BF: ~4.56 MB (4x overhead)
For 1 million items at 1% false positive rate, you’re looking at roughly 1.14 MB for a standard Bloom filter versus 4.56 MB for a counting variant. Whether this 4x overhead is acceptable depends entirely on your use case.
Performance characteristics are nearly identical. Both structures have O(k) complexity for all operations, where k is the number of hash functions. The counting variant does slightly more work (incrementing vs. setting bits), but this difference is negligible compared to hash computation time.
Practical Use Cases
Counting Bloom filters excel in specific scenarios:
Cache Invalidation: Track which cache entries might be affected by database changes. When data updates, decrement counters for affected keys. This prevents stale cache reads without maintaining a full key index.
Session Tracking: Monitor active user sessions across distributed systems. Insert on login, delete on logout. The false positive rate is acceptable because you’re using this as a fast pre-filter before checking authoritative session storage.
Network Packet Deduplication: Track recently-seen packet hashes to detect duplicates. Old packets naturally age out as you delete them, keeping the filter size bounded.
Database Query Optimization: Track which data blocks might contain relevant rows. As data is deleted or moved, update the filter accordingly. This reduces unnecessary disk reads.
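The packet-deduplication pattern above can be sketched as a bounded sliding window: each new key is inserted, and once the window is full the oldest key is deleted so stale entries age out. The class below is a self-contained toy (sizes and names are illustrative, not tuned):

```python
import hashlib
from collections import deque

class DedupWindow:
    """Sliding-window duplicate detector backed by a tiny counting filter."""
    def __init__(self, window: int, size: int = 1024, hashes: int = 4):
        self.size, self.hashes = size, hashes
        self.counters = [0] * size
        self.recent = deque(maxlen=window)  # keys pending age-out

    def _positions(self, key: str):
        return [int(hashlib.sha256(f"{key}:{i}".encode()).hexdigest(), 16)
                % self.size for i in range(self.hashes)]

    def seen(self, key: str) -> bool:
        """Return True if key was probably seen in the window, then record it."""
        duplicate = all(self.counters[p] > 0 for p in self._positions(key))
        if len(self.recent) == self.recent.maxlen:
            # Age out the oldest key by decrementing its counters
            for p in self._positions(self.recent[0]):
                self.counters[p] -= 1
        self.recent.append(key)
        for p in self._positions(key):
            self.counters[p] += 1
        return duplicate
```

Because every insert is eventually matched by exactly one delete, the false-deletion hazard described below never arises in this pattern.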
Limitations and Alternatives
Counting Bloom filters aren’t perfect. The most dangerous limitation is false deletion. If you attempt to delete an item that was never inserted but happens to pass the membership test (a false positive), you’ll decrement counters that belong to other items. This corrupts the filter and can cause false negatives.
The only defense is discipline: never delete items without certainty they were inserted. If you can’t guarantee this, counting Bloom filters aren’t appropriate for your use case.
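The corruption mechanism is easy to demonstrate with a deliberately tiny filter. The position table below is contrived so that one item fully collides with another; real full collisions are rarer, but the effect is identical:

```python
# Contrived position table: "B" hashes to the same positions as "A",
# so "B" passes the membership test without ever being inserted
POSITIONS = {"A": [2, 5], "B": [2, 5], "C": [5, 9]}
counters = [0] * 10

def insert(item):
    for p in POSITIONS[item]:
        counters[p] += 1

def delete(item):
    for p in POSITIONS[item]:
        counters[p] -= 1

def contains(item):
    return all(counters[p] > 0 for p in POSITIONS[item])

insert("A")
insert("C")
assert contains("B")      # false positive: B was never inserted
delete("B")               # false deletion: decrements A's counters
assert not contains("A")  # A is now a false negative; the filter is corrupted
```

Note that "C" survives because it only partially overlaps "B"; the damage is confined to items sharing all of the decremented positions, which makes the corruption hard to detect after the fact.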
For applications where the 4x memory overhead is problematic, consider these alternatives:
- Cuckoo filters: Support deletion with only 2-3x overhead, and often have better lookup performance
- Quotient filters: Cache-friendly and support deletion, though more complex to implement
- d-left counting Bloom filters: Reduce memory overhead through more sophisticated hashing
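Of these, the cuckoo filter is the most common drop-in replacement. Its core idea, partial-key cuckoo hashing, can be sketched with single-slot buckets and 8-bit fingerprints (production filters use 4-slot buckets and tuned fingerprint sizes; every parameter below is illustrative):

```python
import random
import hashlib

class CuckooFilter:
    """Minimal cuckoo filter sketch: 1-entry buckets, 8-bit fingerprints."""
    def __init__(self, size: int = 1 << 10, max_kicks: int = 500):
        self.size = size            # must be a power of two
        self.max_kicks = max_kicks
        self.buckets = [None] * size

    def _fingerprint(self, item: str) -> int:
        fp = hashlib.sha256(item.encode()).digest()[0]
        return fp or 1  # reserve "empty" (None) by avoiding fingerprint 0

    def _index(self, item: str) -> int:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[1:5], "big")
        return h % self.size

    def _alt_index(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the alternate bucket depends only on
        # the current index and the fingerprint, so stored items can move
        return (index ^ int(hashlib.sha256(bytes([fp])).hexdigest(), 16)) % self.size

    def insert(self, item: str) -> bool:
        fp, i = self._fingerprint(item), self._index(item)
        for index in (i, self._alt_index(i, fp)):
            if self.buckets[index] is None:
                self.buckets[index] = fp
                return True
        # Both buckets full: evict and relocate until a slot opens
        index = random.choice((i, self._alt_index(i, fp)))
        for _ in range(self.max_kicks):
            fp, self.buckets[index] = self.buckets[index], fp
            index = self._alt_index(index, fp)
            if self.buckets[index] is None:
                self.buckets[index] = fp
                return True
        return False  # filter considered full

    def contains(self, item: str) -> bool:
        fp, i = self._fingerprint(item), self._index(item)
        return fp in (self.buckets[i], self.buckets[self._alt_index(i, fp)])

    def delete(self, item: str) -> bool:
        fp, i = self._fingerprint(item), self._index(item)
        for index in (i, self._alt_index(i, fp)):
            if self.buckets[index] == fp:
                self.buckets[index] = None
                return True
        return False
```

Deletion here removes one stored fingerprint rather than decrementing shared counters, which is where the memory savings over counting Bloom filters come from.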
Here’s a complete working example of the CountingBloomFilter:
# Complete usage demonstration
if __name__ == "__main__":
# Create filter for 10,000 items with 1% false positive rate
cbf = CountingBloomFilter(
expected_items=10000,
false_positive_rate=0.01
)
# Insert items
cbf.insert("user:1001")
cbf.insert("user:1002")
cbf.insert("user:1003")
# Check membership
print(cbf.contains("user:1001")) # True
print(cbf.contains("user:9999")) # False (probably)
# Delete an item
cbf.delete("user:1002")
print(cbf.contains("user:1002")) # False
# Remaining items still present
print(cbf.contains("user:1001")) # True
print(cbf.contains("user:1003")) # True
Counting Bloom filters are a pragmatic solution when you need deletion support and can tolerate the memory overhead. They’re not always the right choice, but when they fit, they’re simple to implement and reason about. Choose them when deletion is a hard requirement and you’re comfortable with the 4x memory cost.