MongoDB

Bloom Filters: Space-Efficient Probabilistic Data Structure

Understand Bloom filters and how they enable fast membership testing with minimal memory in systems like Cassandra, HBase, and Chrome.

S

srikanthtelkalapally888@gmail.com

Bloom Filters

A Bloom filter is a space-efficient probabilistic data structure for testing whether an element is a member of a set.

Properties

  • Never false negatives: If Bloom filter says NO, element is definitely not in set
  • Possible false positives: If Bloom filter says YES, element might be in set
  • Size: Fixed memory (bit array)

How It Works

Bit array of m bits, k hash functions

Insert 'apple':
  hash1('apple') = 3 → set bit[3] = 1
  hash2('apple') = 7 → set bit[7] = 1
  hash3('apple') = 12 → set bit[12] = 1

Query 'apple':
  Check bits 3, 7, 12 → ALL 1? → Probably YES

Query 'banana':
  hash1('banana') = 5 → bit[5] = 0 → Definitely NO

Real-World Uses

Cassandra / HBase

Before disk lookup, check Bloom filter to avoid unnecessary I/O:

Key in filter? No → Skip SSTable (save disk read)
Key in filter? Yes → Read SSTable (might be false positive)

Google Chrome

SafeBrowsing: Bloom filter of malicious URLs (local fast check before server query)

CDNs

Prevent caching one-hit-wonder items (only cache after 2nd request)

False Positive Rate

FPR ≈ (1 - e^(-kn/m))^k

n = items inserted
m = bit array size
k = hash functions

10M URLs, 100MB → <1% false positive rate.

Counting Bloom Filter

Extension that supports deletions (counters instead of bits).

Conclusion

Bloom filters trade perfect accuracy for massive space savings. They're ideal for 'definitely not' fast-path checks.

Share this article