$ emrebener
home topics systems & infrastructure understanding bloom filters

Understanding Bloom Filters

author: emre bener read time: 10 min about: bloom filter, probabilistic data structure repository: https://github.com/Emrebener/Bloom-Filter-Demo
published: updated: mentions: hash function, sha-2, murmurhash, count-min sketch, log-structured merge-tree, python project: bloom filter demo

1. A Bloom filter in one hash and one bit array

A Bloom filter is a bit array with one or more hash functions that answers set-membership queries with no false negatives and a tuneable false-positive rate. To add an item, hash it, pick the bit at that index, set it to 1. To check membership, hash it, look at the same bit. If the bit is 0, the item is definitely absent. If the bit is 1, the item is probably present, with some chance of being a false positive.

That second sentence is the whole point. The structure trades certainty on the positive side for tiny memory and constant-time lookup. The interesting question is how much certainty you trade. Before any math, it helps to see the failure mode happen.

Here is the smallest version that still earns the name. One bit array, one hash function, two methods:

import hashlib


class BloomFilter:
    def __init__(self, m_bits: int):
        self.m = m_bits
        self.bits = bytearray(m_bits // 8 + 1)

    def _index(self, item: str) -> int:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        return h % self.m

    def add(self, item: str) -> None:
        i = self._index(item)
        self.bits[i // 8] |= 1 << (i % 8)

    def contains(self, item: str) -> bool:
        i = self._index(item)
        return bool(self.bits[i // 8] & (1 << (i % 8)))

A few choices in that 20-line block are worth naming, because they affect everything that follows.

The bit array is a plain bytearray. Python has a bitarray package, but pulling in a dependency for “an array of bits” hides the very thing this post is about. The shifts (1 << (i % 8)) are how a bit array works; they shouldn’t be wrapped.

The hash is SHA-256, which is dramatic overkill. Bloom filters do not need cryptographic strength. The reason is determinism, not security. Python’s built-in hash() is randomized per interpreter run (PEP 456), which would make every run of the demo produce different numbers. SHA-256 always returns the same digest for the same input, so the measurements in section 3 can be reproduced exactly. A faster non-cryptographic hash like mmh3 would be a more honest production choice; section 2 explains why one good hash is enough to derive many.

There is no remove method, and there cannot be. Two items with overlapping bits share state, so unsetting a bit for one would also remove the other. This is the first hint that the structure isn’t a regular set; it’s designed to forget less than it remembers, in a specific way.

1.1. Watching the failure mode

With m = 128 bits, insert 20 items and ask the filter about 10,000 strangers it has never seen:

from bloom import BloomFilter

bf = BloomFilter(128)
for i in range(20):
    bf.add(f"member-{i}")

false_positives = sum(
    1 for j in range(10_000) if bf.contains(f"stranger-{j}")
)
print(f"measured false-positive rate: {false_positives / 10_000:.3%}")

Output:

m = 128 bits, inserted = 20 items
queried 10000 strangers, 1429 false positives
measured false-positive rate: 14.290%
naive expectation (N / m):    15.625%

Roughly one in seven absent items comes back as present. That’s catastrophic for almost any application, and it is the expected behavior. With one hash and a small bit array, ~20/128 of the bits are set after the inserts, so any stranger whose hash lands on one of those bits collides.

The measured rate is slightly below the naive N/m estimate, not above it. The naive formula assumes every insert sets a fresh bit, but some of the 20 inserts collide with each other, so the number of set bits is less than 20. Fewer set bits means fewer false positives. The exact derivation, which accounts for collisions among the inserts, lands in section 2 as a one-line expression involving e.

The failure mode here is not “Bloom filters don’t work.” It is “one hash is the wrong number of hashes.” With one hash, the only way to reduce false positives is to grow m until almost no bits are set, which destroys the memory advantage. The next section adds more hashes and shows that, for any target false-positive rate, there is an optimal k and m that you can compute from first principles.

128-bit array after 20 inserts (single-hash, m=128)XXunset (0)set (1)Xcollision (2nd insert hit same bit)18 distinct bits set out of 128 (~14%); 2 of the 20 inserts collided128-bit array after 20 inserts (single-hash, m=128)XXunset (0)set (1)Xcollision (2nd insert hit same bit)18 distinct bits set out of 128 (~14%); 2 of the 20 inserts collided

2. Sizing the bit array and hash count for a target false-positive rate

A Bloom filter has two dials. The bit array size mm controls how much memory the structure uses, and the hash count kk controls how many bits each item touches. Given an expected item count nn and a target false-positive rate pp, both dials are fixed by closed-form expressions:

m=nlnp(ln2)2,k=mnln2m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n} \ln 2

For n=10,000n = 10{,}000 items and p=0.01p = 0.01 (one false positive in a hundred), this comes out to m95,851m \approx 95{,}851 bits (about 12 KB) and k=7k = 7 hashes per item. That’s the ten-bits-per-item, seven-hashes-for-one-percent rule, and it falls out of two short derivations.

2.1. Where the formula comes from

After inserting nn items, each setting kk bits, the probability that a specific bit is still zero is the probability that none of the knkn bit-setting operations picked it:

P(bit still zero)=(11m)knekn/mP(\text{bit still zero}) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}

A false positive happens when all kk bits checked for an unseen item are set. Assuming independence, the false-positive rate is:

p=(1ekn/m)kp = \left(1 - e^{-kn/m}\right)^k

That is the only formula in this post worth memorizing. Everything else is algebra. Solving for the kk that minimizes pp at fixed mm and nn gives k=(m/n)ln2k^* = (m/n)\ln 2, and substituting back gives m=nlnp/(ln2)2m^* = -n \ln p / (\ln 2)^2. Both match the formulas in the previous subsection.

2.2. Deriving k indices from one hash (Kirsch-Mitzenmacher)

The obvious way to get kk hash functions is to pick kk of them: SHA-256, SHA-1, BLAKE2, MD5, MurmurHash, and so on. That works, but it’s wasteful. A 2006 result by Kirsch and Mitzenmacher (Less Hashing, Same Performance) showed that two good hashes are enough. Given h1(x)h_1(x) and h2(x)h_2(x), you can synthesize kk indices as:

gi(x)=(h1(x)+ih2(x))modm,i=0,1,,k1g_i(x) = (h_1(x) + i \cdot h_2(x)) \bmod m, \quad i = 0, 1, \ldots, k-1

The paper proves the false-positive rate is asymptotically identical to using kk truly independent hashes. In practice you can get both halves from a single hash call by splitting a wide digest. SHA-256 produces 32 bytes, so the first eight bytes become h1h_1 and the next eight become h2h_2. One hash call per item, kk index computations:

class BloomFilter:
    def __init__(self, m_bits: int, k: int):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8 + 1)

    @classmethod
    def for_capacity(cls, n: int, p: float) -> "BloomFilter":
        m = math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))
        k = max(1, round((m / n) * math.log(2)))
        return cls(m, k)

    def _indices(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        for i in self._indices(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def contains(self, item: str) -> bool:
        return all(
            self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(item)
        )

Forty-odd lines of Python with no dependencies, with the same asymptotic guarantees as a textbook Bloom filter.

A few notes on choices that don’t show up in the line count.

SHA-256 is overkill here, the same way it was overkill in section 1. A non-cryptographic hash like MurmurHash3 or xxHash would be several times faster, and a real implementation would use one. SHA-256 stays because it makes the demo deterministic across machines and Python versions. The set of measurements that the rest of the post relies on are easier to interpret when the hash isn’t adding its own noise.

The for_capacity classmethod coexists with the raw BloomFilter(m, k) constructor on purpose. The raw constructor is what section 1’s broken-on-purpose demo uses to force k=1k = 1; the classmethod is what you would actually call in real code.

One SHA-256 digest -> two halves -> k bit-array positionsURL stringSHA-256h1 (bytes 0..8)h2 (bytes 8..16)(unused for k=7)g_i = (h1 + i * h2) mod mbit array(size m)g0g1g2g3g4g5g6k = 7: one hash call, two 8-byte halves, seven indices via linear combinationsOne SHA-256 digest -> two halves -> k bit-array positionsURL stringSHA-256h1 (bytes 0..8)h2 (bytes 8..16)(unused for k=7)g_i = (h1 + i * h2) mod mbit array(size m)g0g1g2g3g4g5g6k = 7: one hash call, two 8-byte halves, seven indices via linear combinations

2.3. Does the math actually predict reality?

Drive the tuned constructor at its design point and measure what comes out:

N, P = 10_000, 0.01
bf = BloomFilter.for_capacity(N, P)
for i in range(N):
    bf.add(f"member-{i}")

false_positives = sum(
    1 for j in range(10_000) if bf.contains(f"stranger-{j}")
)
measured = false_positives / 10_000
predicted = (1 - math.exp(-bf.k * N / bf.m)) ** bf.k

Output:

target FPR p = 0.01
derived m = 95851 bits (11.70 KB), k = 7
measured FPR over 10000 strangers: 1.0100%
predicted FPR (1 - e^(-kn/m))^k:         1.0039%

The measured rate matches the predicted rate to within a hundredth of a percentage point, on the first run, with no averaging across trials. The bit array is 11.7 KB. A Python set holding 10,000 short strings is roughly 50 to 80 times larger. The next section pushes both numbers up by two orders of magnitude and watches whether the prediction still holds.

3. A million-URL blocklist benchmark

At one million URLs and a 1% false-positive target, the filter occupies 1.14 MB versus 116.64 MB for a Python set: a 102× memory ratio. That ratio is the entire reason Bloom filters keep showing up in production systems. The benchmark behind that number uses the same forty-line implementation from section 2, driving it with a URL safety check as the motivating use case: before fetching a link, ask the filter whether it appears in a blocklist of known phishing or malware domains.

3.1. The benchmark and what it does not measure

The setup is synthetic. One million URLs are generated deterministically from an integer range so the run is reproducible across machines; 100,000 stranger URLs are generated the same way for the negative queries. The script lives in demo_blocklist.py in the companion repo.

What this benchmark does not measure:

  • Real-world URL distributions. Production URLs share substrings and would interact with non-cryptographic hashes differently from SHA-256.
  • Insert/lookup throughput beyond pure Python overhead. SHA-256 in CPython dominates the per-op cost; a Rust implementation with MurmurHash3 would be roughly two orders of magnitude faster.
  • Steady-state behavior under churn. Bloom filters don’t support deletion, so a long-lived blocklist needs a rotation strategy, which is its own topic.

The point of the run is to validate the theory at scale, not to claim production numbers.

3.2. The numbers

building bloom filter for n=1,000,000, target p=0.01
  derived m = 9,585,059 bits (1.14 MB), k = 7
  inserts: 7.61s
  lookups: 0.32s

false positives: 1026 / 100,000
  measured FPR:  1.0260%
  predicted FPR: 1.0039%

memory cost of holding the same URLs:
  python set:    116.64 MB
  bloom filter:  1.14 MB
  ratio:         102x

The same (1 - e^{-kn/m})^k formula from section 2 predicts 1.0039%; the actual measurement comes in at 1.0260%. The gap is about 0.02 percentage points on a single run, which is within the statistical noise of querying only 100,000 strangers (the standard error on a binomial proportion of 1% over 100k trials is about 0.03 percentage points). The math does not just hold at the toy scale of section 2; it holds at the scale a real system would actually use.

The 102× memory ratio is approximate but conservative. The set measurement counts each URL string’s sys.getsizeof plus the hash table’s own bytes. It ignores Python object header sharing, string interning, and the slab allocator’s quantization, all of which would push the set’s true cost slightly higher. A more careful measurement might land at 70-80×; the point is the same.

3.3. What “1% false positives” buys you

The numeric win is memory. The structural win is what the filter does for the system around it. A typical blocklist deployment looks like this:

  1. The application asks the Bloom filter whether a URL is blocked.
  2. If the filter says no, the URL is definitely safe and the system serves the request immediately.
  3. If the filter says yes, the URL is probably blocked, and the system falls through to an authoritative check: a database lookup, an RPC to a blocklist service, or a slower disk-resident structure.

At a 1% false-positive rate, 99% of legitimate traffic skips the authoritative check entirely. The expensive lookup runs only on actual matches plus the 1% of misclassified safe URLs. This is the reason Bloom filters appear so often in three places: databases use them in LSM-tree read paths to avoid disk seeks for non-existent keys, CDNs use them for cache-fill decisions and security tooling uses them for the URL example above. The slow path stays slow, but it runs so rarely that the system’s average latency tracks the fast path.

The filter cannot be the source of truth, because it lies on the positive side by design. It is a first-line filter, sitting in front of something authoritative. Treating it as anything else is the most common way to misuse it.

Filter-first deployment: 99% of queries skip the authoritative checkIncoming URLBloom filter1.14 MB, k = 7says nosays yesServe immediatelyAuthoritative check(DB / RPC / disk)~99% of legitimate trafficServe or blockactual matches + 1% misclassifiedSlow path stays slow but runs so rarely the system's average latency tracks the fast pathFilter-first deployment: 99% of queries skip the authoritative checkIncoming URLBloom filter1.14 MB, k = 7says nosays yesServe immediatelyAuthoritative check(DB / RPC / disk)~99% of legitimate trafficServe or blockactual matches + 1% misclassifiedSlow path stays slow but runs so rarely the system's average latency tracks the fast path

3.4. Limits of the plain Bloom filter

A few limitations worth naming.

No deletes. A standard Bloom filter cannot remove items, because unsetting a bit for one item would corrupt the membership signal for every other item that shares that bit. The standard workaround is the Counting Bloom Filter, which stores small counters instead of single bits, at roughly 4× the memory. If a blocklist needs to evict entries (a phishing URL gets taken down), counting is the upgrade path.

No graceful resizing. The filter is sized for a fixed nn and starts degrading once you exceed it. Scalable Bloom Filters address this by chaining sub-filters at increasing capacities, but the result is no longer a single bit array. For a blocklist that grows by a known small factor per year, sizing at the headroom is fine; for unbounded growth, a different structure is warranted.

No counting how many times an item was added. A Bloom filter answers “have I seen this?”, not “how often?”. If frequency matters, the Count-Min Sketch is the structure to reach for; the design choices are similar in spirit (probabilistic, sub-linear memory, tunable error), but the math is different.

Each of these is a different paper and a different post. The plain Bloom filter is what to use when the question is exactly “is this item in the set?”, the answer “no” must be correct, and the cost of an occasional unnecessary downstream check is low. For a million-URL blocklist, that fits. The entire implementation, from add to for_capacity to the measurement that validates the theory, fits in less code than the README that explains it.