Building a Key-Value Store From Scratch, Vol. 1
“Reinventing the Wheel” is a series where I build existing tools from scratch in their smallest demonstrable form, not to ship them but to understand how they actually work. This is the first one.
Most explanations of how cache servers work skip the parts that make them hard. They show you a hash map and call it a day. But a map is not a cache. A cache has a wire protocol that survives binary values without quoting rules. Expiration semantics have to stay consistent across every command. Memory limits need to fail writes cleanly without corrupting state, and the eviction policy has to have defined behavior, not just “we’ll drop something.”
MiniKV builds those pieces one at a time: a byte-counted TCP protocol, TTLs baked into the write path, explicit memory accounting, LRU eviction, and enough packaging to benchmark it against Redis and Memcached. The goal isn’t a production cache. It’s to make the hard parts visible.
1. Wire protocol and in-memory core
The first slice is not a tiny Redis. It’s a TCP server with one boring idea: a key points at a byte blob, and everything else can wait. That sounds underwhelming until you try to build the next layer on top of mush. Before expiration or eviction, you need two things you can actually trust: a precise definition of what a value is, and a clean boundary where network bytes become a typed operation.
1.1. Blobs, not data structures
Redis has strings, lists, hashes, sets, streams, sorted sets. Great product, rough first milestone. The simpler model here is closer to Memcached: keys are plain tokens, values are uninterpreted []byte. set replaces a blob, get returns it, delete removes it. The store does not know whether the bytes are text, JSON, or someone having a very bad day with serialization.
The one exception is incr, which treats the current blob as an unsigned base-10 integer. I kept it because counters are a real cache primitive, and because it forces crisp error boundaries: missing key, non-integer value, and overflow are not the same failure. If the store shrugs and returns “bad stuff happened”, the protocol becomes guesswork. Underneath: a map behind a mutex. No sharding, no clever allocator. The point is to get the semantics right before optimizing anything.
1.2. Byte-counted set is the protocol’s first real design choice
The protocol is text-based, but set is byte-counted:
set <key> <bytes>\r\n
<value>\r\nIf values are blobs, they can contain spaces, newlines, or anything else, which rules out “split on spaces and pray.” Instead, the command line declares how many bytes are coming. The parser reads exactly that many and expects a trailing CRLF. No quoting rules, no escaping, no special cases for binary content.
This isn’t trying to be Memcached-compatible, but it borrows the same lesson: keep the command line human-readable, keep the value path honest about bytes. You can debug it with nc without turning the parser into a quoting tutorial.
1.3. Keep parsing, execution, and storage separate
Three boundaries: parser, store, server. The parser owns the CRLF stream and produces typed commands. The store owns the map. The server sits between them: read a command, call the store, write the response, and that’s all it does. After ReadCommand() returns, the server isn’t re-parsing strings to figure out what the parser already knew. Protocol mistakes become client errors. Store outcomes become command responses. The socket loop doesn’t get to freelance.
This also made testing less painful. My first instinct was to test through net.Pipe, which quickly turned into a test about pipe behavior rather than protocol behavior. The better seam was an io.Reader, an io.Writer, and a store. Production uses the TCP connection; tests use strings and buffers.
1.4. The store copies byte slices on purpose
The store copies values on both Set and Get. Not free, but it gives the store a clean ownership rule: without a copy on Set, the parser could reuse an input buffer and silently corrupt a stored value; without a copy on Get, response-writing code could do the same in reverse. Both bugs are the kind that surface at 1 a.m. under load.
The rule: callers pass slices in and receive slices out, but they can’t alias the map’s contents. A known trade-off, revisited later.
2. Expiration and memory discipline
A cache that never forgets and never says “no” is just a liability with a friendlier protocol. Two constraints fix that: expiration says a value stops existing after a time, and memory limits say a write can fail before the process eats the machine. Both force the same question: what does the store actually own right now?
2.1. TTL belongs in the write path
Rather than a separate expire command, TTL is an optional field on set:
set <key> <ttl-seconds> <bytes>\r\n
<value>\r\nFor example, setting the key session:abc to the value hello with a 60-second TTL looks like this on the wire:
set session:abc 60 5\r\n
hello\r\nA separate expire command creates a weird halfway state where the value exists but its lifetime is maybe forthcoming. The cleaner rule is that a write defines the value and its lifetime together. ttl-seconds = 0 means no expiry. The store records a deadline as an expiresAt timestamp alongside the bytes.
2.2. Expired means missing
Once a key expires, every command treats it as gone: get returns END, incr returns NOT_FOUND, delete returns NOT_FOUND. That last one is the opinionated choice: you could argue for returning DELETED since the entry is still in the map. But if no reader can observe the value, reporting a successful delete is leaking an implementation detail.
Cleanup is lazy: when the store touches a key, it checks the deadline first and removes the entry if expired. This changes the locking shape. Since a read can now trigger a delete, the single mutex becomes necessary rather than optional.
2.3. Count memory explicitly, even if the number is imperfect
Each item tracks an accounted size: len(key) + len(value) + item-overhead-bytes. Not actual Go heap usage; that depends on bucket layout, slice headers, and allocator behavior I’m not going to pretend to know precisely. What matters is a consistent pressure signal: replace an item, subtract the old size and add the new one; delete or expire an item, subtract its size; if a write would push past the configured limit, reject it before touching the map.
The key invariant is that failed writes leave the store exactly as it was. No half-updated values, no accounting drift. Over the limit means SERVER_ERROR memory limit exceeded; a separate MaxValueBytes check rejects values too large to ever fit, regardless of current usage.
2.4. The cleanup loop lives at the process edge
Lazy expiry has one gap: a dead key nobody reads again still counts against accounted memory. So the store exposes CleanupExpired() and cmd/minikv runs it on a ticker. The goroutine lives in the process, not inside internal/store. Background work has lifecycle questions (when does it start, how does it stop, which logger does it use?) that are process concerns, not map concerns. Tests call CleanupExpired() directly with an injected clock, which beats sleeping and hoping.
3. Eviction policy
Memory limits that only reject writes are bounded storage, not a cache. Under pressure, the store should make space before giving up. That distinction, reject vs. evict, is what makes this slice interesting.
3.1. LRU
Exact least-recently-used: the store keeps a container/list alongside the item map, and each item points at its list node. Every recency refresh moves a node. Not free; this is pointer work on every read and write. But the behavior is easy to explain, easy to test, and makes the benchmark comparison against Redis and Memcached meaningful. The oldest untouched live item loses. No policy flag, because one defensible policy beats an options menu with nothing to choose between.
3.2. Expired entries lose before live entries
Eviction doesn’t replace expiration, it sits after it. On a memory-pressure write, the store sweeps expired entries first. If that clears enough space, no live item gets touched. Only then does LRU run over live items. This avoids the embarrassing outcome where a useful key gets evicted while expired entries are still sitting in the map counting against the limit.
Eviction reuses the same accounting as everything else: len(key) + len(value) + item-overhead-bytes. No second idea of memory.
3.3. Recency is part of command semantics
The rules, precisely:
- get → refresh if live; remove and return missing if expired
- set → refresh only on successful write
- incr → refresh only after all checks pass (parse, overflow, memory, eviction)
- delete → no recency update; remove item and subtract accounted bytes
The incr case bites you if you’re not careful. Incrementing 9 to 10 grows the stored blob by one byte, and with explicit accounting that counts as a write, which can trigger eviction. If eviction can’t make space, the counter stays at 9. More importantly: if a replacement can’t fit, the old value must survive intact. The implementation protects the key being updated while evicting others. Fail to make room, return SERVER_ERROR memory limit exceeded, leave the store unchanged.
4. Operational packaging
A cache nobody can run without a ritual isn’t useful. This slice makes the server start cleanly on a laptop, inside Docker, and in the exact shape the benchmark uses, because operational friction distorts what you measure.
4.1. Container defaults should work in containers
The default listen address changed from 127.0.0.1:11211 to 0.0.0.0:11211. Inside a container, binding to loopback means the process only listens within its network namespace. You can publish -p 11211:11211 all day and still fail to reach it from the host. The new default just works:
docker run --rm -p 11211:11211 mini-kv-storeLoopback is still available explicitly: -addr 127.0.0.1:11211. Same pattern for the rest of the flags: sane defaults for the packaged path, full control when you need it.
4.2. Health belongs in the protocol
ping returns PONG. The server is already a TCP protocol server, and adding an HTTP listener for health checks would mean another port, another parser, and a surface that tells you nothing about whether the real command path works. Instead:
printf 'ping\r\n' | nc 127.0.0.1 11211
PONGIf the parser is broken or the session loop is wedged, ping fails in exactly the same place a real command would. An HTTP health endpoint might stay green while the protocol path quietly rots.
4.3. The image should contain the server, not the toolbox
Two-stage Dockerfile: Go binary compiled statically in the build stage, copied into gcr.io/distroless/static-debian12:nonroot for the runtime. No shell, no nc, no debug tooling, which is the point. Health checks run ping from outside against the published port. Runtime images carry runtime dependencies; everything else belongs in build stages or on your laptop.
4.4. The Makefile is a shortcut, not a new interface
The Makefile is simple. Thin aliases for the underlying Go and Docker commands: fmt, test, build, run, docker-build, docker-run, and the benchmark stack targets.
fmt:
gofmt -w ./cmd ./internal
test:
go test ./...
run:
go run ./cmd/minikv -addr $(ADDR)
build:
go build -o bin/$(BINARY) ./cmd/minikv
docker-build:
docker build -t $(IMAGE) .
docker-run:
docker run --rm -p 11211:11211 $(IMAGE)
bench-stack-up:
docker compose up -d --build minikv redis memcached
bench-stack-down:
docker compose down
bench:
docker compose run --rm bench $(BENCH_ARGS)There is no grand task runner hidden here. make test runs go test ./.... make docker-build runs the Docker build. make run ADDR=127.0.0.1:11211 passes the same flag the binary already exposes. The bench-stack-* and bench targets are the same docker compose commands shown in section 5, kept here so the entry path stays consistent.
That is the useful level of abstraction for this stage. The Makefile shortens the entry path without creating a second operational language. If a command fails, the reader can copy the underlying Go or Docker command and keep moving.
5. Benchmark stack
The benchmark is a calibration tool, not a scoreboard.
It answers one question: when MiniKV, Redis, and Memcached run in the same local Docker network, how do repeated fixed-size writes and reads behave through their normal TCP protocols? It’s a calibration tool. It doesn’t measure persistence, pipelining, clustering, memory fragmentation, or the thousand things that make a production cache interesting. MiniKV is NOT a peer of Redis or Memcached; the point is just to have a real baseline to measure against.
5.1. Run every server the same way
Four Compose services: minikv, redis (7-alpine), memcached (1.6-alpine), and bench (golang:1.22-alpine). The client talks to every server by service name over container-to-container TCP, not host-to-container, which would introduce a different network path for each server and quietly corrupt the comparison.
docker compose up -d --build minikv redis memcached
docker compose run --rm bench -runs 5 -keys 1000 -value-bytes 128
docker compose downNo benchmark-specific binary, no special server mode. If the packaged path is slow or broken, the numbers should reflect that.
5.2. Measure a small workload and report distribution, not vibes
Five runs, 1,000 keys, 128-byte values.
The workload is simple. For each run, the client writes N keys with the same fixed-size payload, then reads those same keys back. It repeats that for the configured number of runs and records one latency sample per operation.
This is still a small benchmark. Fixed-size values avoid testing allocator behavior across varied payloads. Only sequential writes are performed, so concurrency isn’t a concern here.
A local Docker compose stack was used to serve the tools locally. Those limits are intentional because this section is about calibration, not theater.
| service | workload | count | min | mean | p50 | p95 | max | ops/sec |
|---|---|---|---|---|---|---|---|---|
| minikv | write | 5000 | 0.020ms | 0.059ms | 0.053ms | 0.104ms | 0.660ms | 16935.33 |
| minikv | read | 5000 | 0.017ms | 0.022ms | 0.021ms | 0.028ms | 0.169ms | 43622.29 |
| redis | write | 5000 | 0.017ms | 0.019ms | 0.018ms | 0.023ms | 0.296ms | 50654.21 |
| redis | read | 5000 | 0.017ms | 0.019ms | 0.018ms | 0.023ms | 0.433ms | 51389.37 |
| memcached | write | 5000 | 0.016ms | 0.019ms | 0.017ms | 0.023ms | 0.777ms | 52027.72 |
| memcached | read | 5000 | 0.017ms | 0.018ms | 0.017ms | 0.024ms | 0.189ms | 52682.11 |
Honestly, reads are not too bad: within ~25% of Redis and Memcached at p50 on this single-connection workload. Writes are roughly 3x slower. The gap has a few likely causes: every successful set moves a node in the LRU list regardless of memory pressure, writes do two allocations per operation (copy in, item struct) while reads do one, and Redis and Memcached have both accumulated years of write-path optimization that MiniKV simply hasn’t. How much of the gap is the LRU bookkeeping specifically versus the allocations are a profiler question for another day (perhaps a volume 2 can be dedicated for performance improvements).
The number that matters isn’t 16k writes/sec. It’s that the system is now measurable, the baseline is real, and the next change has something honest to compare against.
5.3. What actually makes a cache hard
The interesting thing about building this is where the complexity actually lives. It’s not in the map; that’s ten lines. It’s in the boundaries: expired means missing across every command, not just get. A failed write leaves state exactly as it was. Eviction and expiration share the same accounting rather than each inventing their own idea of memory.
Each rule is simple to state and surprisingly easy to violate as features accumulate. The TTL check that get does correctly gets skipped in incr. The defensive copy gets removed as a “premature optimization.” The cleanup goroutine moves inside the store because it’s “more convenient,” and suddenly tests depend on timing.
The next pieces (persistence, pipelining, concurrent clients) will stress the same boundaries in new ways. MiniKV now has a baseline to measure against, but more usefully, it has decent invariants.