$ emrebener
home personal-projects mini key-value store (minikv)

Mini Key-Value Store (MiniKV)

MiniKV is a from-scratch cache server in roughly 3,800 lines of Go, written to study how production caches actually work. It speaks the Memcached text protocol over TCP, so that any off-the-shelf Memcached client (libmemcached, gomemcache, pymemcache) can talk to it. It runs in a distroless Docker image, ships with a benchmark stack that compares it against Redis and Memcached on the same network, and across three volumes of build-and-write went from 17k writes/sec to ~215k at concurrency 8 — ahead of Redis at that shape, at roughly 77% of Memcached (Redis didn’t scale at conc=8 because it’s single-threaded, but that’s besides the point).

The wire protocol and the store

Keys are short ASCII tokens, values are uninterpreted byte blobs (similar to how Memcached works). set replaces a blob, get returns it, delete removes it — standard kv store functionality. incr interprets the stored blob as an unsigned base-10 integer.

The protocol is simply text on the command line:

set <key> <bytes>\r\n
<value>\r\n

Values are blobs, so they can contain spaces, newlines, or anything else. The command line declares a byte count, the parser reads exactly that many bytes, and a trailing CRLF (\r\n) closes the value. No quoting rules, no escaping, no special cases for binary content — and you can test the server with nc easily.

nc is netcat — the Unix utility that opens a raw TCP (or UDP) connection to an address and wires stdin/stdout to it. It’s the standard “manually poke a TCP server” tool.

The internals consist of three layers: parser, store, and server, where:

  • The parser owns the CRLF stream and produces typed Command values.
  • The store owns the items map and is mutex-guarded.
  • The server reads commands, calls the store, writes responses, and otherwise stays out of the way.

Memory, expiration, and eviction

A cache that never forgets and never says “no” is just a liability with a friendly protocol. MiniKV closes that with two constraints.

TTL is an optional field on set, not a separate expire command:

set <key> <ttl-seconds> <bytes>\r\n
<value>\r\n

ttl=0 means no expiry. A write defines value and lifetime together, never leaving them in a halfway state. Once a key expires, every command treats it as gone — get returns END, incr returns NOT_FOUND, even delete returns NOT_FOUND. Returning DELETED for an entry no reader could observe would leak the lazy-cleanup implementation through the protocol surface.

Memory is accounted explicitly: each item contributes len(key) + len(value) + item-overhead-bytes to a running total, checked against a configurable budget on every write. The invariant: failed writes leave state exactly as it was. If a value is too large to ever fit (MaxValueBytes), or exceeds the budget after eviction, the store rejects it before touching the map. No half-updates, no accounting drift.

Eviction is layered on top. When a write hits the limit, the store sweeps expired entries first; only if the budget is still exceeded does LRU run over live items. Same accounting throughout — there is no second idea of memory. Recency rules per command:

  • get → refresh if live
  • set → refresh only on a successful write
  • incr → refresh only after every check passes
  • delete → no refresh

If room can’t be made, the response is SERVER_ERROR memory limit exceeded and the original value survives intact.

Profile first, optimize second

Vol 1 closed at 17k writes/sec — about 36% of Memcached on the same workload — and named three suspects:

  • LRU bookkeeping firing on every successful write
  • Two allocations per Set
  • A single global mutex

Vol 2 was supposed to cut all three. Instead, profiling pointed at a fourth suspect that dominated everything else by an order of magnitude.

The first move was instrumentation, not code change. A -pprof-addr flag turned on net/http/pprof on a separate listener (never on the cache port — operational hygiene), and the bench got a -concurrency N knob plus an honest wall-clock-based ops/sec calculation. Both took ten lines.

The first ten-second CPU profile against a sustained eight-client write workload showed something none of the three suspects predicted: 88% of cumulative CPU lived in removeExpiredLocked, a six-line function called unconditionally at the top of every successful Set:

func (s *Store) removeExpiredLocked(now time.Time) {
    for key, item := range s.items {
        if item.expired(now) {
            s.deleteLocked(key, item)
        }
    }
}

Vol 1 had described expiration as lazy (“when the store touches a key, it checks the deadline”). The code was doing something different — an O(N) walk of the entire item map on every write, looking for expired entries that, in this bench, didn’t exist. The function ran with the store’s mutex held, so under concurrent load every writer queued up behind a 200,000-entry scan before doing actual work.

Removing that single call:

concurrencybeforeafter
117k/sec46k/sec
814k/sec215k/sec

A 15× lift on the contended write path, from finding and deleting one wrong line.

The rest of the volume is the calibration honest people owe the reader. With the sweep gone, the three originally-named suspects all got addressed:

  • Sharded mutex. 16 partitions with per-shard memory budgets and an inlined FNV-1a hash for routing. Public API unchanged. No measurable bench improvement at the workload exercised.
  • Intrusive LRU. Replaced container/list (which allocates a *list.Element per insert and stores keys as interface{}) with doubly-linked-list pointers embedded directly on storedItem. Per-update allocations dropped to zero; container/list.insertValue fell out of the alloc profile entirely. Still no measurable bench improvement.

Both shipped because they compose correctly with the rest of the system; neither was load-bearing at this scale, and the post says so directly. A perf write-up that pretends every planned optimization was a win is a write-up about marketing rather than measurement.

The lesson generalizes. The actual dominant cost was a divergence between what the prose said the code did and what the code actually did, and the only way to find it was to run the binary under measurement and read the result.

From process to service

The next layer of work turned the binary into something with operational shape.

The seven-flag command line moved into a key=value file parsed by a 30-line bufio.Scanner loop — no viper, no koanf, no YAML library. The project ships zero non-stdlib dependencies, on the grounds that pulling in 200KB of transitive imports to read seven primitive keys makes a learning project less legible. Unknown keys fail loudly with a line number — a typo’d max-memmory-bytes silently degrading to a default would be the worst class of config bug.

TLS terminates at the listener when both tls-cert and tls-key are set; there’s no third “tls = on” boolean, because the presence of the two paths is itself the switch.

AUTH is a new first-class command (AUTH <token>\r\n). Per-connection state is one bool, and token comparison goes through subtle.ConstantTimeCompare even though network jitter would probably hide a timing leak on a short token — it removes a footgun nobody should have to re-litigate per-deployment. The five behaviors:

server configfirst commandresponseconnection
no auth-tokenanynormal responsestays open
no auth-tokenauth ...CLIENT_ERROR auth not configuredclosed
auth-token setauth <correct>AUTHENTICATEDstays open
auth-token setauth <wrong>CLIENT_ERROR auth failedclosed
auth-token setanything elseCLIENT_ERROR auth requiredclosed

Two new commands round out the protocol. mget is a server-side loop over the existing Get with a single trailing END and absent keys silently skipped — same shape Memcached and Redis both ship.

CAS (compare-and-swap) is the more interesting one. The obvious-looking design has a sharp edge: a per-key version counter (start at 1, bump on each write) lets a deleted-and-recreated key alias the original key’s CAS token, so a client holding a stale token can accidentally CAS a value it has never seen. MiniKV uses a process-global atomic.Uint64 instead — the token uniquely identifies one specific value at one specific moment, and once minted is never reused. Cost: one atomic increment per successful mutation, which on x86 compiles to LOCK XADD. Single-digit-percent overhead against a write that already takes a few hundred nanoseconds.

Observability splits into three surfaces:

  • STATS over the wire — STAT <name> <value>\r\n lines in Memcached’s exact shape, so existing Memcached-aware tooling (the Prometheus memcached_exporter) reads it without modification.
  • /healthz — a one-handler HTTP probe that returns 200 if the listener accepted the request. That’s it. For liveness checks that just need to know the process responds.
  • /doctor — the deeper one. Snapshots stats, runs named checks (memory pressure, shard balance via max/min item ratio), returns 503 with failures named first if anything tripped.

Lifecycle hygiene gets three defenses:

  • idle-timeout enforced via SetReadDeadline before each ReadCommand. Conveniently bounds the TLS handshake too, since that runs inside the first read.
  • max-connections as a counted-semaphore cap that rejects rather than queues. SOMAXCONN is already the kernel’s queue; a userland queue on top is the same problem twice.
  • Graceful shutdown that fires an immediate SetReadDeadline on every active connection, so in-flight commands finish writing while idle ones exit through the same code path. Anything still alive after the configured shutdown-timeout gets force-closed.

Operational packaging

The Docker image is a two-stage build: the Go binary compiled statically, then copied into gcr.io/distroless/static-debian12:nonroot. No shell, no nc, no debug tooling. Health checks run ping from outside against the published port — if the parser is broken or the session loop is wedged, ping fails in exactly the same place a real command would. An HTTP health endpoint baked into the cache port would just be a separate parser that could stay green while the protocol path quietly rotted.

The benchmark stack is a docker compose file with four services — minikv, redis, memcached, and a bench Go client — talking to each other over container-to-container TCP. Not host-to-container, which would introduce a different network path for each server and quietly corrupt the comparison. There’s no benchmark-specific server mode and no “fast-path” build flag: the binary the bench measures is the same binary the Dockerfile produces.