Bloom Filter Demo
A forty-line Bloom filter in Python with zero dependencies, written to study the structure rather than to ship one. It derives the optimal bit-array size and hash count from a target false-positive rate using the closed-form formulas, builds independent hash positions from a single SHA-256 call via the Kirsch-Mitzenmacher double-hashing trick, and validates the math against a million-URL blocklist: the predicted 1.0039% false-positive rate measures at 1.0260% on the first run, at 1.14 MB of memory versus 116.64 MB for the equivalent Python set. A 102× memory ratio, on a structure that fits in less code than the README that explains it.
1. What I built and why
The project exists to make the math behind Bloom filters readable as code. Every popular tutorial points at the formula as if it were self-evident; my goal was the opposite — implement the structure, derive the formula from first principles in a companion post, then put both under measurement at the scale a real system would use.
Four files, no dependencies past the Python standard library:
| File | What it shows |
|---|---|
bloom.py | The whole filter. ~40 lines: bit array, double-hashing, for_capacity(n, p) constructor that derives and from the formulas. |
demo_one_hash.py | The smallest possible setup — one hash, 128 bits, 20 items — to make the failure mode visible: a 14% false-positive rate before any optimization. |
demo_tuning.py | Drives the tuned constructor at its design point (, ) and compares the measured false-positive rate against the theoretical prediction. |
demo_blocklist.py | The main result: one million URLs in a Bloom filter sized for , with a memory comparison against a plain Python set. |
The choice to write it in Python with hashlib from the standard library, rather than in Rust with mmh3 for speed, was deliberate. The point of the project is the mapping from the formula to the bits, not the lookup throughput; Python’s overhead would dominate any benchmark anyway, and a faster implementation would obscure the very thing the project exists to show. The post explains why the same project written for production would swap SHA-256 for MurmurHash3, but that’s a different project.
2. The forty lines that earn the name
The whole filter:
import hashlib
import math
class BloomFilter:
def __init__(self, m_bits: int, k: int):
self.m = m_bits
self.k = k
self.bits = bytearray(m_bits // 8 + 1)
@classmethod
def for_capacity(cls, n: int, p: float) -> "BloomFilter":
# m = -(n * ln p) / (ln 2)^2; k = (m / n) * ln 2
m = math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))
k = max(1, round((m / n) * math.log(2)))
return cls(m, k)
def _indices(self, item: str):
digest = hashlib.sha256(item.encode("utf-8")).digest()
h1 = int.from_bytes(digest[:8], "big")
h2 = int.from_bytes(digest[8:16], "big")
for i in range(self.k):
yield (h1 + i * h2) % self.m
def add(self, item: str) -> None:
for i in self._indices(item):
self.bits[i // 8] |= 1 << (i % 8)
def contains(self, item: str) -> bool:
return all(
self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(item)
)Three things in those forty lines pull most of the weight.
The bit array is a plain bytearray with manual shift-and-mask. Python has a bitarray package that wraps the same primitives, but the moment you pip install it, the “what is a bit array” idea hides behind a dependency. The shifts (1 << (i % 8)) are how a bit array works; the project’s whole point is to show that, so they stay visible.
The for_capacity(n, p) classmethod is the part that turns the formulas into a usable API. Given an expected item count and a target false-positive rate, it computes and via the closed-form expressions and . The raw BloomFilter(m, k) constructor still exists alongside it because the one-hash demo needs to force to make the failure mode visible — two ways into the same object, each for a different audience.
The _indices generator is the Kirsch-Mitzenmacher double-hashing trick in five lines. One SHA-256 call per item, split into two 64-bit halves, combined as to produce positions. The paper proves the false-positive rate is asymptotically identical to using truly independent hashes, so a reader’s first instinct (” hashes means hash functions”) is wrong, and the project demonstrates why with one hash call instead of seven.
3. The shortcut I refused to take
The shortest path to “a Bloom filter in Python” is pip install pybloom-live and a five-line script. That’s a different artifact (a tutorial on how to call a library) and it teaches nothing about why bits and are the right answers for a million URLs at .
A subtler shortcut would have been to write the implementation but skip the derivation. Most “Bloom filters from scratch” tutorials do exactly this: they list the two formulas, hand-wave at “you can derive this from the union bound,” and move on. The companion post derives both expressions in two paragraphs of algebra, starting from and solving for the optimal at fixed and . The project then validates the derivation by measurement, not by reciting the result.
The third shortcut would have been to not measure. Theoretical false-positive rates are exactly the kind of claim that survives unchallenged in tutorials because verifying them requires building the structure and running it at scale. The blocklist demo does that:
building bloom filter for n=1,000,000, target p=0.01
derived m = 9,585,059 bits (1.14 MB), k = 7
inserts: 7.61s
lookups: 0.32s
false positives: 1026 / 100,000
measured FPR: 1.0260%
predicted FPR: 1.0039%
memory cost of holding the same URLs:
python set: 116.64 MB
bloom filter: 1.14 MB
ratio: 102xThe gap between measured (1.0260%) and predicted (1.0039%) is about 0.02 percentage points on a single run, within the standard error of querying 100,000 strangers from a binomial proportion of 1% (roughly 0.03 percentage points). The math doesn’t just hold at the toy scale of 10,000 items; it holds at the scale a real blocklist would actually use. SHA-256 stays in the implementation specifically because it’s deterministic across machines and Python versions; without that, every run would produce different numbers and the measured-vs-predicted comparison would be noise.
The 102× memory ratio is approximate but conservative. The set measurement counts each URL string’s sys.getsizeof plus the hash table’s own bytes, ignoring object header sharing, string interning, and slab quantization, all of which push the set’s true cost slightly higher. A more careful measurement would land closer to 70-80×. The point survives either way: a forty-line structure, with the formula derived from first principles, replaces a structure two orders of magnitude larger and runs in O(1) lookups while doing it.