Building a Key-Value Store From Scratch, Vol. 2: Performance Tuning
Vol 1 left MiniKV running at 17k writes per second, with three named suspects for the slowness and a closing note about a possible perf-focused follow-up. This is the follow-up. Before any of those three got attention though, the profiler surfaced a fourth suspect that dominated everything else by an order of magnitude and rearranged the running order of the next four sections.
The goal of this volume is to improve the performance of MiniKV, with a “profile first, optimize second” methodology. Wire measurement into the harness, look at what’s actually expensive, fix that. Sometimes the result is a 15× improvement on the first cut. Other times the result is a clean implementation, a shifted profile, and a bench that doesn’t move. Both kinds of result are in this post, because both kinds happened.
1. Profiling first, optimizing second
Before any code changed, the bench harness needed two things it didn’t have: a concurrency knob, so contention was actually exercised, and a profiler attached to the running binary, so the cost of each operation was visible. With those in place, the first measurement told a story none of vol 1’s three suspects predicted, and the rest of this section is that story.
1.1. Two knobs the harness was missing
The bench from vol 1 §5 has two limits that hide problems. It runs strictly sequentially against a single connection, so any lock contention inside the server is invisible: every operation acquires the mutex uncontested. And it reports ops/sec = count / sum(latencies), which only equals true wall-clock throughput when there’s nothing happening in the loop between operations. Both limits stop being acceptable the moment the question shifts from “how fast is this in isolation” to “how does this behave under load.”
Two changes, then, before any optimization. The first is a -concurrency N flag on cmd/minikv-bench that opens N client connections and splits each run’s keyspace across N workers. A barrier between the write phase and the read phase keeps the shape of vol 1’s measurement intact. The second is an explicit wall-clock argument on Summarize:
func Summarize(samples []time.Duration, wallClock time.Duration) Summary {
// ...
opsPerSec := 0.0
if wallClock > 0 {
opsPerSec = float64(len(sorted)) / wallClock.Seconds()
}
// ...
}At concurrency=1 with the existing single-threaded loop, the new formula lands within ~2% of the old number, well inside run-to-run noise. At concurrency=8 with eight workers in flight it’s the difference between an honest throughput figure and one that overstates by ~8x.
1.2. A profiler the binary was missing
Vol 1’s MiniKV had no way to introspect itself beyond slog. The cheapest fix is net/http/pprof, gated by an opt-in flag so nothing changes about the production runtime when nobody asks for it:
pprofAddr := flag.String("pprof-addr", "",
"HTTP address for net/http/pprof handlers; empty disables")The endpoint runs on its own HTTP listener, never on the cache port. The distroless runtime image stays the same, with pprof served from inside the binary and no external debug tooling baked in. The Compose stack publishes the pprof port on host 16060 so go tool pprof can scrape it from the laptop while the bench drives load from inside the container network:
go tool pprof http://127.0.0.1:16060/debug/pprof/profile?seconds=10A throwaway port and an opt-in flag aren’t architecture decisions; they’re the cheapest possible way to get a useful profile out of an existing binary.
1.3. The baseline that exposed an asymmetry
With both knobs in place, two baselines. Concurrency=1 reproduces vol 1’s configuration exactly: 5 runs × 1000 keys × 128-byte values. Concurrency=8 is the same total work, split across eight client connections instead of one.
The cross-service numbers below are reference points, not scoreboard entries. MiniKV is not chasing Redis or Memcached on absolute throughput. The interesting comparison is between MiniKV’s own two columns: what concurrency does to MiniKV specifically.
| service | workload | conc=1 ops/s | conc=8 ops/s | scaling |
|---|---|---|---|---|
| minikv | write | 17,341 | 14,378 | 0.83x ⚠ |
| minikv | read | 47,334 | 211,245 | 4.46x |
| redis | write | 46,025 | 123,470 | 2.68x |
| redis | read | 48,500 | 115,939 | 2.39x |
| memcached | write | 48,754 | 260,140 | 5.34x |
| memcached | read | 47,546 | 282,169 | 5.93x |
Reads scale ~4.5x. The read path’s bottleneck is round-trip latency on each connection, and adding connections is the right answer. Writes go backwards. Redis scales writes 2.7x, Memcached 5.3x; MiniKV regresses to 83% of its single-client throughput.
That’s the signature of a fat critical section. Whatever happens inside Store.mu on the write path is held long enough that more clients hurt rather than help. The lock itself isn’t the cause. It’s the visible queue forming around whatever work the lock is serializing.
1.4. The CPU profile, then a surprise
A ten-second CPU profile, captured against the running container during a sustained concurrency=8 write workload (top three nodes account for 88% of cumulative CPU):
| flat | flat% | sum% | cum | cum% | function |
|---|---|---|---|---|---|
| 1.68s | 34.78% | 34.78% | 2.44s | 50.52% | runtime.mapiternext |
| 1.44s | 29.81% | 64.60% | 4.25s | 87.99% | store.(*Store).removeExpiredLocked |
| 0.29s | 6.00% | 70.60% | 0.37s | 7.66% | store.storedItem.expired |
Eighty-eight percent of cumulative CPU lives inside removeExpiredLocked, with runtime.mapiternext showing up at fifty percent. That’s the Go runtime walking the items map. The function body is six lines:
func (s *Store) removeExpiredLocked(now time.Time) {
for key, item := range s.items {
if item.expired(now) {
s.deleteLocked(key, item)
}
}
}The call site is one line, at the top of Set:
func (s *Store) Set(key string, value []byte, ttl time.Duration) error {
s.mu.Lock()
defer s.mu.Unlock()
if len(value) > s.config.MaxValueBytes {
return ErrValueTooLarge
}
now := s.now()
s.removeExpiredLocked(now) // O(N) over the whole store, every Set
// ...
}Vol 1 §2.2 described expiry as lazy: “when the store touches a key, it checks the deadline first and removes the entry if expired.” Vol 1 §2.4 added a periodic ticker to handle keys nobody reads again. Both are real, both still in the code, both working as documented. The unconditional full-map sweep on every successful write is a third expiration mechanism that the post never described and that nobody on this side of the keyboard asked for.
The numbers fall out instantly once you see it. After the bench runs for a few seconds the store holds 200k-300k keys. Every Set does, in order: take the lock, walk all 200k-300k entries to check expiresAt, then do the actual write. That’s the whole write path. It isn’t “a hash-map insert behind a mutex.” It’s an O(N) scan behind a mutex, with a hash-map insert at the end.
This is the thing profiling exists for. Three plausible suspects from vol 1, none of them dominant. The actual dominant cost was a divergence between what the post said the code did and what the code actually did, and the only way to find it was to run the binary under measurement and read the result.
1.5. What this changes for the rest of the post
The original sequence (attack allocations, then shard the mutex, then rework LRU bookkeeping) still makes sense, but every step in that plan now shares a different starting point. Removing the per-Set sweep is one obvious change that has to land before any of the planned work earns its keep. So the next section opens there, before circling back to the allocation story underneath. Once the sweep is gone the profile will be different, and the steps after that reshape against whatever the new dominant cost turns out to be. There’s a real chance one of them shrinks to noise. Fine. A post about perf optimization that claims every planned change was a win is a post about marketing, not measurement.
The harness has limits worth naming up front. With pprof attached and a concurrency knob, this volume can show contention shape (does throughput rise or fall as N grows?), per-function CPU cost, and allocation density at the call site. It can’t show what persistence behavior costs, or how the cache holds up across minutes of mixed pressure, or how clients with pipelining behave. None of those are in scope. The bench remains a microscope for the write path, not a production load test.
2. The write path’s actual cost
The dominant cost on the write path was not an allocation. It was a six-line function called unconditionally at the top of every Set, walking the entire item map looking for expired entries that, in this bench, didn’t exist. Removing that one call lifted concurrency=8 writes from 14k ops/s (regressing under contention) to ~200k ops/s (scaling cleanly). The originally-planned allocation work followed and produced a 3.2× cut in object allocations across the hot path, with no visible effect on throughput at the bench’s scale. Both results were in the profile from the start. The job was to read it correctly.
2.1. Removing the sweep
The fix is mechanical. The unconditional s.removeExpiredLocked(now) at the top of Set moves into the memory-pressure branch where it actually belongs:
projected := s.memoryBytes - oldSize + size
if projected > s.config.MaxMemoryBytes {
s.removeExpiredLocked(now) // moved from top of Set
oldSize = 0
if old, ok := s.items[key]; ok {
oldSize = old.size
}
projected = s.memoryBytes - oldSize + size
if projected > s.config.MaxMemoryBytes {
if !s.evictUntilFitsLocked(key, projected) {
return ErrMemoryLimitExceeded
}
projected = s.memoryBytes - oldSize + size
}
}This restores the design vol 1 §3.2 actually claimed, “on a memory-pressure write, the store sweeps expired entries first.” The original code accidentally did the sweep on every write.
A small safety net joins the top of Set to fill the gap. If the key being written is itself expired, treat it as missing for accounting purposes:
oldSize := 0
if old, ok := s.items[key]; ok {
if old.expired(now) {
s.deleteLocked(key, old)
} else {
oldSize = old.size
}
}This makes Set honor the same “expired means missing” rule Get, Delete, and Incr already had per vol 1 §2.2. The implementation had quietly left Set out, presumably because the unconditional sweep was making the gap invisible.
After these two changes the conc=1 bench moved from ~17k writes/s to ~46k, and conc=8 from 14k (negative scaling) to ~200k. Reads were unchanged because the sweep only ran on the write path.
2.2. The pressure case
Profiling again, this time against a sustained 500k-write concurrency=8 workload designed to provoke memory pressure. The result was unwelcome:
| flat | flat% | sum% | cum | cum% | function |
|---|---|---|---|---|---|
| 1.92s | 20.25% | 45.36% | 4.48s | 47.26% | store.(*Store).removeExpiredLocked |
The sweep was back. Different call site, same function, same complexity.
The path is straightforward to walk. The bench writes more keys than fit in the 64-MiB limit, so once a few seconds have passed almost every Set is a pressured write, which means removeExpiredLocked runs. And in this bench every key has TTL=0 because the bench never asks for an expiration. So removeExpiredLocked walks the entire map looking for expired items, finds none, returns. Per pressured write, in the worst case.
The fix is one integer. Track the count of items currently holding a non-zero TTL; if it’s zero, skip the sweep entirely.
type Store struct {
// ...
ttlCount int
}Set increments when the new value carries a TTL; deleteLocked (the choke point for every removal path) decrements when the removed item carried a TTL. The pressure-side sweep gates on s.ttlCount > 0. Workloads without TTLs skip the O(N) walk entirely, and workloads that do use TTLs keep vol 1 §3.2’s strict “expired-before-live” guarantee.
The same 500k-write workload now sustains ~215k writes/s at concurrency=8 even with eviction firing on basically every write. The CPU profile shifts dramatically: removeExpiredLocked falls out of the top 20 entirely, and runtime/internal/syscall.Syscall6 (the kernel doing TCP reads and writes) becomes the new dominant cost at 54%. That’s a healthier shape, since most of the time is now in the kernel rather than in a wasted Go loop.
2.3. Honoring vol 1’s “revisited later”
Vol 1 §1.4 said:
The rule: callers pass slices in and receive slices out, but they can’t alias the map’s contents. A known trade-off, revisited later.
This is the later. The store had been doing a defensive cloneBytes(value) on every Set, hedging against the parser someday reusing its read buffer. Looking at the actual parser, every ReadCommand allocates a fresh []byte via make([]byte, size+2) and the slice doesn’t outlive the command. The parser is already giving the store a privately-owned buffer; the second copy is redundant.
The contract tightens to match. Drop the cloneBytes on Set, and state the new ownership rule on the public godoc:
// Set stores value under key with the given TTL.
//
// The store takes ownership of value's underlying bytes. Callers must not
// mutate the slice after Set returns. Get returns a private copy, so callers
// may safely mutate slices returned from Get.
func (s *Store) Set(key string, value []byte, ttl time.Duration) error {
// ...
}Get still copies on the way out, so the read path remains insulated from caller mutation. Only the write contract changed.
The test that protected the old contract had to go too. TestStoreCopiesValuesAcrossSetAndGet was specifically asserting “mutating the input after Set doesn’t affect storage”, which is exactly the property the new contract gives up. Renamed it to TestStoreInsulatesGetCallersFromEachOther and trimmed it to assert only the surviving guarantee. The other ten tests passed unchanged.
2.4. The allocation cuts the profile said wouldn’t matter
The original plan listed two more places to attack on the read of every command: bufio.Reader.ReadString (which allocates internal strings.Builder growths to assemble the line) and writeLine (which builds a temporary string via line + "\r\n" before each write). The CPU profile after the sweep fix said both were sub-1%. They got addressed anyway, since allocation count is its own measure and worth tracking independently of CPU share.
The parser change uses ReadSlice, which returns a slice into bufio’s internal buffer with zero allocations. The single string() conversion that follows is unavoidable because strings.Fields further down still wants a string. Net: instead of multiple internal strings.Builder.grow allocations to assemble the line, we get one string allocation to use it.
lineBytes, err := p.reader.ReadSlice('\n')
// ...
line := string(lineBytes[:len(lineBytes)-2]) // strip CRLF, one allocationwriteLine was even simpler. Two WriteString calls instead of one, no string concatenation. The bufio.Writer doesn’t care; it’s all going into the same buffer.
func writeLine(writer *bufio.Writer, line string) error {
if _, err := writer.WriteString(line); err != nil {
return err
}
_, err := writer.WriteString("\r\n")
return err
}Total allocation count over a comparable workload window dropped from 754k objects to 237k, a 3.2× reduction. Throughput at concurrency=8 stayed at ~215k writes/s, the same as before the alloc cuts, within noise across three repeated runs.
2.5. Calibration notes before the numbers
Two honest items before the bench table.
The new bench harness from §1 had a slight issue. The original benchmark.Run from vol 1 §5 interleaved writes and reads per run: write the run’s keys, read those keys back, then move to the next run. The §1 refactor that added concurrency accidentally collapsed both phases out to the outer loop, all writes across all runs, then all reads. Invisible at vol 1’s published 5×1000 configuration (1 MB total, no eviction), fatal at 5×60000 because keys written first were evicted before their reads came around. Restored the per-run interleave. The numbers from §1 stand because the small workload didn’t trigger eviction; anyone re-running at larger sizes needs the harness fix.
The other note is workload-dependent. The numbers below use the same configuration as vol 1 §5: 5 runs × 1000 keys × 128-byte values. That doesn’t trigger memory pressure, so the ttlCount change in §2.2 makes no difference at this size. To exercise the pressure path we run a separate sustained-load configuration: 5 runs × 60,000 keys × 128 bytes, ~66 MiB intended occupancy against the 64-MiB limit, so almost every Set past the first ~250k operations is pressured. That number is reported below as a calibration receipt for §2.2, not the headline.
| service | workload | conc=1, vol 1 | conc=1, vol 2 | conc=8, vol 1 | conc=8, vol 2 |
|---|---|---|---|---|---|
| minikv | write | 17,341 | 46,118 | 14,378 | 216,191 |
| minikv | read | 47,334 | 47,752 | 211,245 | 220,293 |
| redis | write | 46,025 | 48,370 | 123,470 | 128,957 |
| memcached | write | 48,754 | 51,384 | 260,140 | 263,301 |
Vol 1’s MiniKV writes were ~36% of Memcached’s at concurrency=1; vol 2’s are ~90%. At concurrency=8 they were 5.5% of Memcached’s because they regressed under contention; vol 2’s are ~82%, and they exceed Redis (~129k) at the same concurrency. Reads were already at parity in vol 1; they remain there.
The pressure run, in the same units:
| service | workload | count | mean | p50 | p95 | max | ops/sec |
|---|---|---|---|---|---|---|---|
| minikv | write | 300,000 | 0.036ms | 0.033ms | 0.055ms | 5.790ms | 213,620 |
| minikv | read | 300,000 | 0.036ms | 0.034ms | 0.054ms | 2.999ms | 215,838 |
The next section takes what the post-§2 profile is now showing (kernel TCP traffic and Store.mu contention) and asks whether sharding the global mutex still earns its place in the plan, or whether the bench has shifted enough that the answer is now somewhere else.
3. Splitting the global mutex
Sharding the global mutex into sixteen independently-locked partitions is structurally clean and barely visible at this bench’s scale. Each key now lives in one of N shards, each shard owns its own map, LRU list, memory budget and lock. Tests pass. Per-shard contention is eliminated. Benching at concurrency 8 and concurrency 32, with Shards set to 1, 16, or 64, all three configurations land within run-to-run noise of each other. After section 2 killed the unconditional sweep, the lock simply stopped being the gating factor; sharding it now is structurally correct preparation for concurrency this bench doesn’t reach.
3.1. Shape of the split
The store stops owning the map directly. It owns N shards, each of which owns a map, an LRU list, a memory counter, and a TTL counter:
type shard struct {
mu sync.Mutex
items map[string]storedItem
lru *list.List
memoryBytes int
ttlCount int
maxValueBytes int
maxMemoryBytes int
itemOverheadBytes int
now func() time.Time
}
type Store struct {
shards []*shard
}The public API stays exactly the same. Every method routes:
func (s *Store) Set(key string, value []byte, ttl time.Duration) error {
return s.shardFor(key).set(key, value, ttl)
}The hash is FNV-1a-32, inlined to stay allocation-free on the hot path. Calling hash/fnv.New32a() per request would allocate a hasher struct, which is exactly what section 2 spent effort eliminating, so the loop is open-coded:
func shardIndex(key string, n int) uint32 {
const (
offset uint32 = 2166136261
prime uint32 = 16777619
)
h := offset
for i := 0; i < len(key); i++ {
h ^= uint32(key[i])
h *= prime
}
return h % uint32(n)
}Each shard’s memory budget is MaxMemoryBytes / Shards rounded down. With the default 64 MiB and 16 shards, that’s 4 MiB per shard. Static division was a deliberate choice over a global atomic counter shared across all shards: a global counter would force cross-shard locks during eviction and re-introduce the contention that the split was supposed to remove.
3.2. The LRU guarantee gives ground
Vol 1 §3.1 called the eviction policy “exact least-recently-used”, with the recency refresh moving a node in a single store-wide list. Sharding makes that statement narrower. The new guarantee is exact LRU within a shard, approximate across the store: a key is evicted only if it’s the LRU-oldest of its shard’s items, not necessarily the LRU-oldest of the whole store.
In practice that’s a small loss with FNV-1a’s distribution and reasonably diverse keys. A heavily-accessed key still won’t be evicted while its shard has older entries to evict first. The pathological case is a key set whose hashes happen to all land on the same shard while other shards stay empty: possible in principle, unlikely with FNV on real keys, and not something the bench provoked.
The alternative was rejected on engineering grounds. A truly global LRU under sharding would either require a single shared list (locked on every update, defeating the split) or a cross-shard eviction protocol that walks every shard’s LRU tail pointer to find the global oldest (much more code, locks pairs of shards under contention, easy to deadlock). Either path costs more than the LRU fairness it preserves. The pragmatic answer in production caches like Memcached is the same one MiniKV lands on here: per-shard LRU with the relaxation stated explicitly.
3.3. The tests had to be told
The change is internal but it’s not invisible to existing tests. Several store and server tests use very small MaxMemoryBytes (4 or 6 bytes total) to test the eviction policy deterministically, vol 1’s pattern. With sixteen default shards those budgets divide to zero per shard, and every write returns SERVER_ERROR memory limit exceeded on the first attempt.
The fix isn’t to soften the tests; the tests are right about what they’re testing. They get an explicit Shards: 1:
s := New(Config{
MaxValueBytes: 16,
MaxMemoryBytes: 6,
ItemOverheadBytes: 0,
Shards: 1,
})Eight tests across internal/store and internal/server got the addition. They’re testing eviction policy (which item gets evicted, in what order, under what conditions), and policy is deterministic only when there’s a single shared list to apply it to. Tests that use DefaultConfig() (64 MiB total, 4 MiB per shard) keep the default 16 shards and pass unchanged, which is reassurance that the multi-shard path works for any realistically-sized workload.
3.4. What the bench actually says
The honest comparison isn’t vol 2’s sharded numbers against vol 1’s, since section 2 already did the heavy lifting. The interesting comparison is sharded against not-sharded with the rest of section 2 in place. The new -shards flag on the binary lets us run the same code at three configurations:
| concurrency | Shards=1 | Shards=16 | Shards=64 |
|---|---|---|---|
| 8 (writes) | 208,000 | 219,000 | 215,000 |
| 32 (writes) | 317,000 | 313,000 | 328,000 |
All within ±5% of each other across two repetitions per cell. The CPU profile against a sustained 16-worker write load tells the same story:
| flat% | sum% | cum% | function |
|---|---|---|---|
| 46.35% | 46.35% | 46.35% | runtime/internal/syscall.Syscall6 |
| 15.53% | 61.88% | 15.53% | runtime.futex |
| 5.84% | 67.72% | 5.84% | runtime.procyield |
| … | … | … | … |
| 0.96% | 74.62% | 12.79% | runtime.lock2 |
Forty-six percent of CPU is in Syscall6, the kernel doing TCP reads and writes. Fifteen percent is in runtime.futex, futex syscalls coming from the Go scheduler. The Store’s own functions don’t appear in the top twenty. Whatever lock pressure is left is mostly Go’s GMP scheduler (the goroutine scheduler, which locks its own per-P work queues during work-stealing), not the store’s sync.Mutex, and sharding our mutex doesn’t help with the runtime’s locks.
Why didn’t sharding help more? Because after section 2 the critical section is short. The lock is held for a few map operations and an LRU pointer move, on the order of a hundred nanoseconds. At eight or thirty-two worker connections that’s not enough hold time to produce queue-spamming contention. The lock looked like the bottleneck in vol 1’s profile because the unconditional sweep was holding it for tens of microseconds per write. Once the sweep was gone, the lock stopped being the gating factor.
3.5. Why the default still has multiple shards
The honest read of this section is that the bench at the scale exercised doesn’t notice the difference between Shards=1 and Shards=16. Setting the default to 1 would simplify the model: one LRU, vol 1’s exact-LRU guarantee preserved. Two things kept the default at 16 anyway.
First, the structural argument. The bench runs at 8 to 32 concurrent connections in a Docker compose network. A real cache server faces hundreds or thousands. Lock-contention profiles are concurrency-shaped, and there’s a clean concurrency point past which a single mutex starts losing to a sharded one. The bench just doesn’t reach that point. Defaulting to 1 would hide a problem that reappears the moment anyone deploys this at meaningfully higher concurrency.
Second, the symmetry argument. Sharding sixteen ways doesn’t underperform a single mutex at any concurrency this bench exercises. The cost is sixteen sync.Mutex values, sixteen empty *list.Lists, and a constant inlined-FNV hash per operation. That cost is invisible in the profile. There’s nothing to gain from removing it.
The lesson has two parts. The pattern was right to add: it composes correctly with the rest of the store, the tests prove the per-shard policy is preserved, and the LRU relaxation is stated honestly. It is also not load-bearing at this bench’s scale, and the bench numbers say so directly. Both halves matter, and the discipline of measuring whether the pattern was needed is what makes the difference between a design that earns its complexity and one that just inherits it.
The next section returns to vol 1’s first named suspect from §5.2: the LRU MoveToFront that fires on every successful read and write, untouched so far. With the sweep gone and the lock split, that’s the largest piece of work still inside the critical section.
4. Rethinking LRU bookkeeping
The LRU bookkeeping vol 1 §5.2 named as a write-path suspect was always two costs: a list.Element allocation per new-key insert, and a MoveToFront pointer dance on every successful read and write. Replacing container/list with an intrusive doubly-linked list eliminates the allocation entirely and trims the pointer dance. The alloc profile confirms the cut. The bench at concurrency 8 and 32 doesn’t notice the change — same shape section 3 produced from sharding, and for the same reason.
4.1. The intrusive list
container/list wraps each value in a separately-allocated *list.Element whose Value field carries the key as interface{}. That works, but PushFront allocates an element per call, and reading the underlying key requires a pointer hop and a type assertion. The store can do better by embedding the list pointers directly on storedItem:
type storedItem struct {
key string
value []byte
expiresAt time.Time
size int
prev *storedItem
next *storedItem
}The shard owns head and tail pointers, and the operations are direct pointer rewrites:
type lruList struct {
head *storedItem
tail *storedItem
}
func (l *lruList) pushFront(item *storedItem) {
item.prev = nil
item.next = l.head
if l.head != nil {
l.head.prev = item
}
l.head = item
if l.tail == nil {
l.tail = item
}
}
func (l *lruList) moveToFront(item *storedItem) {
if l.head == item {
return
}
l.remove(item)
l.pushFront(item)
}No allocation, no type assertion, no separate element layer. The shard’s items map holds map[string]*storedItem so eviction can read the back-pointer’s key field directly to remove it from the map.
4.2. What changed in the shard
The structural improvement is on the write path. Previously, every Set on an existing key allocated a fresh list.Element for the refreshed-position node, even though the storedItem itself was overwritten in place in the map. With the intrusive list the storedItem itself is the LRU node, and an existing-key update mutates fields and moves the node:
expiry := expiresAt(now, ttl)
if existing, ok := sh.items[key]; ok {
// Reuse the storedItem and refresh recency; no new allocation.
hadTTL := !existing.expiresAt.IsZero()
existing.value = value
existing.expiresAt = expiry
existing.size = size
sh.lru.moveToFront(existing)
if hadTTL && expiry.IsZero() {
sh.ttlCount--
} else if !hadTTL && !expiry.IsZero() {
sh.ttlCount++
}
} else {
item := &storedItem{
key: key,
value: value,
expiresAt: expiry,
size: size,
}
sh.items[key] = item
sh.lru.pushFront(item)
if !expiry.IsZero() {
sh.ttlCount++
}
}The TTL accounting picked up a small wrinkle worth flagging: when an existing key is overwritten the new value’s TTL might differ from the old, and ttlCount has to track the delta. The four cases (“had TTL, has TTL”; “had, has none”; “none, has TTL”; “none, none”) all need correct handling or the post-§2 sweep-skip optimization quietly breaks. The branch above honors all four.
For new-key inserts the alloc count is exactly one: the *storedItem itself. That replaces vol 1’s pair of allocations (one list.Element plus the storedItem in the map’s bucket), so the count is the same. The storedItem is now heap-allocated separately because the LRU pointers need a stable address.
4.3. The alloc profile shift, the bench non-shift
The alloc profile after the change, against the same sustained workload as section 2’s measurement:
| flat% | sum% | cum% | function |
|---|---|---|---|
| 38.58% | 38.58% | 67.03% | server.execute |
| 17.86% | 56.45% | 32.87% | protocol.(*Parser).ReadCommand |
| 15.00% | 71.45% | 15.00% | strings.Fields |
| 12.86% | 84.31% | 12.86% | fmt.Sprintf |
| 11.25% | 95.57% | 11.25% | store.(*shard).get |
| 4.33% | 99.90% | 4.33% | store.(*shard).set |
container/list.(*List).insertValue is gone. It was 18.5% of total allocations in section 3’s profile. The shard’s set method holds 4.3% now, which is the per-new-key *storedItem heap allocation; the share dropped because existing-key updates now allocate nothing instead of one element each. The remaining alloc heads (server.execute, fmt.Sprintf, strings.Fields) all live on the read-path response side, outside this section’s scope.
The bench numbers at concurrency 8 and 32, three repeats per cell:
| concurrency | post-§3 ops/s | post-§4 ops/s |
|---|---|---|
| 8 (writes) | ~219,000 | ~208,000 |
| 8 (reads) | ~220,000 | ~210,000 |
| 32 (writes) | ~313,000 | ~316,000 |
All within ±5% noise. The alloc cuts were already sub-1% of CPU after section 2, and trimming them further can’t move throughput at this bench’s scale. Section 3’s lesson repeats with a slightly different pattern: the planned change landed cleanly, the alloc profile confirms the shift, and the bench says the change doesn’t matter at the workload exercised.
4.4. The door I left closed
A second LRU rewrite lives on the other side of one specific guarantee. Redis’s approximate-LRU policy samples K items at eviction time and evicts the oldest of the sample, with no recency bookkeeping on read or write at all. That’s the largest possible cut: the per-op MoveToFront disappears entirely, and the LRU pointers on storedItem go with it.
The cost is another step away from exactness. Section 3 already relaxed vol 1’s “exact LRU” to “exact LRU within a shard, approximate across the store.” Approximate-via-sampling would relax it again, to “approximate-within-shard via sampling, approximate across the store.” That’s a bigger guarantee shift, and the bench doesn’t say it’s needed at this scale.
For a hypothetical vol 3 the choice is interesting. At concurrency an order of magnitude past what this bench exercises, the constant per-op cost of moveToFront matters more than bookkeeping accuracy, and Redis-style sampling becomes the right answer. At this volume’s scale the trade isn’t worth the relaxation. The non-decision is on the record.
The last section turns the bench back on at vol 1’s exact published configuration, lays out the contributions per section, and reflects on which optimizations earned their complexity, which were structurally right but dormant at this scale, and what’s left on the table for whoever picks this up next.
5. The honest re-benchmark
After four sections of work, MiniKV writes ~47,900 operations per second at vol 1’s exact bench configuration, up from 16,900 when vol 1 closed. That matches Redis to within run-to-run noise and trails Memcached by about 5%. At concurrency 8 it writes ~215,600 ops/s, up from a regressing 14,400, ahead of Redis at the same concurrency and at roughly 77% of Memcached. Section 2’s removal of the unconditional sweep at the top of every Set produced almost all of that throughput delta. Sections 3 and 4 landed cleanly, shifted the alloc profile in predicted directions, and didn’t move the bench. All of that belongs in the post because all of it is what the build actually produced.
5.1. Final numbers, vol 1’s configuration
Same workload as vol 1 §5: 5 runs × 1000 keys × 128-byte values, against the Compose stack with MiniKV, Redis, and Memcached. Vol 2 numbers below are medians of three repeated runs. Vol 1’s are the published numbers from §5.2 for concurrency=1, and a re-bench against vol 1’s exact code via vol 2’s harness for concurrency=8 (vol 1’s harness was single-connection only).
| service | workload | vol 1, conc=1 | vol 2, conc=1 | vol 1, conc=8 | vol 2, conc=8 |
|---|---|---|---|---|---|
| minikv | write | 16,935 | 47,921 | 14,378 | 215,635 |
| minikv | read | 43,622 | 46,716 | 211,245 | 221,014 |
| redis | write | 50,654 | 46,190 | 123,470 | 125,836 |
| redis | read | 51,389 | 46,393 | 115,939 | 135,852 |
| memcached | write | 52,027 | 50,268 | 260,140 | 280,133 |
| memcached | read | 52,682 | 48,819 | 282,169 | 281,252 |
Why vol 1’s MiniKV regressed
The vol 1, conc=8 column shows what vol 1’s MiniKV does under contention it never previously exercised: writes regress, reads scale. The regressing-writes signature is exactly what §1’s profiling found.
Why Redis and Memcached drifted between volumes
The Redis and Memcached numbers shifted slightly between vol 1’s published table and this one, mostly within ±5%, partly because the host was different and partly because vol 2’s harness reports wall-clock-based ops/sec instead of vol 1’s count / sum(latencies). The shift is small enough not to change the cross-service story and the Redis and Memcached columns are still calibration anchors, not a competition.
Why Redis didn’t scale
Redis runs command execution on a single thread by design (no internal locking, no per-thread GC pauses, no shared-state coordination), which caps throughput at what one core can do. Memcached and MiniKV are both multi-threaded: Memcached has worker threads, MiniKV has Go goroutines behind a sync.Mutex now held for ~100ns post-§2. At conc=1 all three sit near 46-50k because the bottleneck is TCP round-trip latency, not CPU. At conc=8 the architectures diverge: Redis 2.7×, MiniKV 4.5×, Memcached 5.6×.
Two caveats keep this from being a Redis takedown. Redis 6+ has an opt-in io-threads N config (defaults to 1; the bench uses defaults) that parallelizes network I/O and typically lifts conc=8 throughput close to 200k ops/s. And real Redis workloads almost always pipeline commands across a connection, which lets a single core easily hit 1M+ ops/s by skipping the RTT cost between commands. MiniKV doesn’t pipeline at all. The honest read of the conc=8 column is “what default-configured, non-pipelined Redis does in this setup,” not “what Redis can do.”
5.2. Where the throughput came from
The post-section breakdown of MiniKV writes:
| stage | conc=1 ops/s | conc=8 ops/s |
|---|---|---|
| vol 1 closing | 16,935 | 14,378 |
| after §2 (sweep removed, ttlCount, ownership, parser/writer) | 46,118 | 216,191 |
| after §3 (sharding) | 46,000 | 219,000 |
| after §4 (intrusive LRU) | 47,921 | 215,635 |
Section 2 produced the entire throughput delta. The sweep on every Set was the bottleneck, and removing it took MiniKV from “regresses under contention” to “scales cleanly to ~215k writes/s with 8 connections.” After that, sections 3 and 4 hit a critical section already so short that the rest of the contention budget had moved out of MiniKV’s code and into the kernel TCP path.
5.3. What the volume actually taught
The post pitch was profiler-driven write-path optimization. The three suspects vol 1 named at its close were LRU bookkeeping that fires on every successful write, two allocations per Set, and a single global mutex. None of them was the primary cost. The actual primary cost was a divergence between what vol 1’s prose described as lazy expiry and what the code did. The code walked the entire item map at the top of every successful Set, looking for expired entries that, in the bench, didn’t exist.
That is the core lesson, and it generalizes. A perf post that says “I had three suspects, the profile confirmed all three, I cut all three, the bench got faster” is about confirming priors. A perf post that says “I had three suspects, the profile pointed at none of them, I cut what the profile actually pointed at, the bench moved by 15x” is about the practice the title gestured at: profiling first, optimizing second.
The other beat is the absence of a throughput result on sections 3 and 4. Sharding the global mutex (vol 1 §1.1‘s explicit “no sharding” non-decision, addressed) produced no measurable bench change at this scale. Replacing container/list with an intrusive linked list eliminated an entire alloc-profile head and produced no measurable bench change either. Both implementations are correct, both shifted their profiles in the predicted directions, and both are honest negative results at the workload exercised. MiniKV keeps the sharding pattern and the intrusive LRU because they compose with the rest of the system. The post shouldn’t pretend they were load-bearing when they weren’t.
5.4. What’s still on the table
Five things this volume didn’t touch, in rough order of size.
Read-path response writer. This is the largest remaining source of allocations: server.execute and fmt.Sprintf formatting the VALUE %s %d header hold ~50% of the post-§4 alloc profile between them. Cutting them is a few WriteString calls and one strconv.Itoa, structurally similar to section 2’s writeLine change. Untouched here because section 2 was scoped to the write path.
Approximate LRU via Redis-style sampling. The §4.4 closed door: drops the per-op MoveToFront to zero and the LRU pointers on storedItem with it, at the cost of relaxing exactness another step. Worth the trade at concurrency an order of magnitude past what this bench reaches; not worth the relaxation here.
Persistence. The store has none. Every restart drops state. A standard cache server would have at least an opt-in append-only log; the volume’s scope didn’t include it.
Pipelining. The current protocol is request-response per command. A pipelined client could batch commands across one TCP write and read the responses in batch, which is where a meaningful chunk of the post-§4 kernel-TCP overhead actually lives. A candidate for a future protocol-focused volume.
Higher-concurrency benchmarks. The bench tops out at 32 worker connections in a single-host Compose network. The contention story sharding addresses would show at concurrencies the bench doesn’t reach; demonstrating it convincingly needs a different harness and probably a different host setup.
5.5. The point of the series
Vol 1 said the goal was to make the hard parts visible. The hard parts of cache server correctness turned out to be the boundaries: expired-means-missing across every command, accounting that survives failed writes, eviction that prefers expired-then-LRU. The hard parts of cache server performance, at least at this implementation’s scale, turned out to be reading what the binary actually does versus what the prose says it does. Vol 1 said expiry was lazy. The code wasn’t. The profile said so in the first ten seconds of the first measurement, and almost everything else in this volume followed from that moment.
What MiniKV is now: a sharded, intrusive-LRU, profiler-instrumented cache server that matches Redis at low concurrency and runs ahead of it at moderate concurrency, in roughly 800 lines of Go. What it still isn’t: production-ready. The next section is the calibration coda on that, before the post closes.
6. What the bench doesn’t measure
The bench measures throughput on a single workload (fixed-size writes and reads through a TCP protocol) against three implementations of “key-value cache.” That is what it measures, and it is also nearly all of what it measures. Every other feature production caches are evaluated on lives outside the bench’s frame, and MiniKV has none of them.
- Durability: state lives only in memory; a process restart drops every key. Redis has AOF (append-only file) and RDB snapshots; Memcached has
ext-storefor SSD-backed entries. MiniKV has neither, by scope and by design. - Replication, HA, and clustering: one process, one host. There is no primary-replica replication, no failover, no horizontal sharding across nodes. Redis Sentinel handles failover; Redis Cluster handles horizontal sharding via hash slots. Relying on client-side consistent hashing across independent instances is an option though, like Memcached.
- Authentication and TLS: anyone reaching the listening port can read or write any key. Redis has ACL and AUTH (with per-user permissions in 6+); Memcached has SASL. Both support TLS termination. MiniKV has neither.
- Data types: keys point at byte blobs, with
incrinterpreting the blob as an unsigned integer when it parses. Redis has strings, lists, hashes, sets, sorted sets, streams, bitmaps, HyperLogLog, geo, and modules. Memcached’s value model is closer to MiniKV’s but adds CAS (check-and-set) tokens for optimistic concurrency, which MiniKV doesn’t have. - Multi-key operations: every command operates on exactly one key. Redis has
MGET,MSET,SUNION, multi-keyEXISTS; Memcached has multi-keyget. MiniKV has none. - Pub/sub, transactions, and scripting: Redis-only and all useful: pub/sub channels for fan-out,
MULTI/EXECfor atomic command groups, Lua scripting for server-side logic. MiniKV has none of the three (Memcached also doesn’t). - Operational surface: Redis has
INFO,MONITOR,SLOWLOG, latency tracking, runtimeCONFIG GET/SET. Memcached hasstats,stats slabs,stats items. MiniKV has slog and pprof. That is the entire surface.
The bench numbers say MiniKV pushes ~215k writes per second through a TCP loop at concurrency 8, ahead of Redis at the same workload and shape. They don’t say anything about what happens to those writes when the process dies, what happens to the connection when an unauthenticated client tries to scan the keyspace, or what happens to the data model when a workload needs anything other than a byte blob. The two production caches the bench compares against have answers for all three. MiniKV has answers for none (at least, yet).