MongoDB
Bloom Filters: Space-Efficient Probabilistic Data Structure
Understand Bloom filters and how they enable fast membership testing with minimal memory in systems like Cassandra, HBase, and Chrome.
S
srikanthtelkalapally888@gmail.com
Bloom Filters
A Bloom filter is a space-efficient probabilistic data structure for testing whether an element is a member of a set.
Properties
- Never false negatives: If Bloom filter says NO, element is definitely not in set
- Possible false positives: If Bloom filter says YES, element might be in set
- Size: Fixed memory (bit array)
How It Works
Bit array of m bits, k hash functions
Insert 'apple':
hash1('apple') = 3 → set bit[3] = 1
hash2('apple') = 7 → set bit[7] = 1
hash3('apple') = 12 → set bit[12] = 1
Query 'apple':
Check bits 3, 7, 12 → ALL 1? → Probably YES
Query 'banana':
hash1('banana') = 5 → bit[5] = 0 → Definitely NO
Real-World Uses
Cassandra / HBase
Before disk lookup, check Bloom filter to avoid unnecessary I/O:
Key in filter? No → Skip SSTable (save disk read)
Key in filter? Yes → Read SSTable (might be false positive)
Google Chrome
SafeBrowsing: Bloom filter of malicious URLs (local fast check before server query)
CDNs
Prevent caching one-hit-wonder items (only cache after 2nd request)
False Positive Rate
FPR ≈ (1 - e^(-kn/m))^k
n = items inserted
m = bit array size
k = hash functions
10M URLs, 100MB → <1% false positive rate.
Counting Bloom Filter
Extension that supports deletions (counters instead of bits).
Conclusion
Bloom filters trade perfect accuracy for massive space savings. They're ideal for 'definitely not' fast-path checks.