Adopt binary protobuf encoding for control-plane records (Actor/Worker) #307

Description

Summary

Switch the on-the-wire encoding of Actor and Worker records in the Valkey state store from protojson to binary protobuf. The project is pre-alpha with no production data, so this is a clean cutover and needs no migration path. It is the cheapest, highest-leverage step toward the 1 billion record north star: it roughly halves per-record value size (609 B to ~321 B) and encodes ~4x faster, with no schema changes and no change to the locking path.

Motivation

The store keeps every Actor/Worker record resident in RAM with no TTL, so per-record size directly sets the capacity bill. At the 1B target, encoding alone moves the provisioned fleet from ~3.8 TB to ~1.4 TB (~2.4 TB saved).

Measurements (full methodology and all five encodings in docs/record-encoding-benchmarks.md):

| Metric | protojson (today) | binary protobuf | binary + field trims |
| --- | --- | --- | --- |
| Value size | 609 B | ~321 B | 170 B |
| In-Valkey (w/ overhead) | 731 B | n/a | 283 B |
| Encode CPU | 3124 ns | 721 ns (~4x faster) | ~same |
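
As a quick sanity check on the figures above, the ratios and the raw footprint at 1B records can be recomputed directly. This is a minimal sketch: the helper names (`reductionPct`, `speedup`, `rawTB`) are made up for illustration, and it only models raw resident bytes in decimal terabytes, not whatever provisioning factors the fleet figures in the Motivation include.

```go
package main

import "fmt"

// records is the 1B north-star target from this issue.
const records = 1e9

// reductionPct returns the percentage size reduction going from before to after.
func reductionPct(before, after float64) float64 {
	return (1 - after/before) * 100
}

// speedup returns how many times faster the new timing is than the old one.
func speedup(oldNs, newNs float64) float64 {
	return oldNs / newNs
}

// rawTB returns the resident size, in decimal terabytes, of n records
// that each occupy bytesPerRecord bytes.
func rawTB(bytesPerRecord, n float64) float64 {
	return bytesPerRecord * n / 1e12
}

func main() {
	// Value size: protojson 609 B vs binary protobuf ~321 B.
	fmt.Printf("value-size reduction: %.1f%%\n", reductionPct(609, 321)) // 47.3%

	// Encode CPU: 3124 ns vs 721 ns.
	fmt.Printf("encode speedup: %.2fx\n", speedup(3124, 721)) // 4.33x

	// In-Valkey size (with overhead) at 1B records.
	fmt.Printf("protojson at 1B: %.3f TB\n", rawTB(731, records))             // 0.731 TB
	fmt.Printf("binary + field trims at 1B: %.3f TB\n", rawTB(283, records)) // 0.283 TB
}
```

The in-Valkey figure for binary protobuf without trims is listed as n/a, so it is left out here rather than estimated.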

Binary protobuf is the necessary-but-not-sufficient first step: a constant-factor win that extends runway. It does not change the cost class of holding 1B mostly-idle records in RAM. That is the job of hot/cold tiering, tracked separately (see #12).

Proposed change

All in cmd/ateapi/internal/store/ateredis/ateredis.go:

  • Write path: replace protojson.Marshal with proto.Marshal for Actor and Worker records (Create/Update).
  • Read path: replace protojson.Unmarshal with proto.Unmarshal.

Clean cutover. Since there's no production data, any existing dev keyspace is flushed (records are a re-derivable cache anyway). The version-check logic stays in Go, so the WATCH/MULTI optimistic-concurrency paths are unaffected. The lock-release Lua (ateredis.go:578) is also unaffected: it compares an opaque, ephemeral lock token by byte equality and never touches the record encoding, so the change is invisible to it (verified against a real Valkey cluster, including a token carrying an embedded NUL byte).

Non-goals / out of scope

  • No change to the lock release Lua (ateredis.go:578). The compare-and-delete is opaque byte equality and is binary-safe (verified against a real Valkey cluster, including an embedded NUL byte in the token).
  • Field trims (170 B / 283 B) are a separate phase. They involve semantic decisions about droppable/derivable fields and carry regression risk.
  • zstd + dictionary (112 B) is a different tradeoff: encoding costs ~1 ms/record, roughly 1400x the 721 ns of binary protobuf, so it is only justified if RAM becomes the hard binding constraint.
  • Hot/cold tiering and any transactional-store direction are tracked elsewhere.

Acceptance criteria

  • Actor and Worker records are written and read as binary protobuf.
  • Round-trip equivalence test against a real Valkey cluster (not just miniredis) for both record types.
  • No change to store.Interface; existing store tests pass unmodified.
  • Lock acquire/release behavior unchanged.

Risks

  • Hard cutover: any pre-existing dev data won't decode, so keyspaces must be flushed when this change is deployed. Acceptable pre-alpha; must land before any production use.
  • Debuggability: values in valkey-cli are no longer human-readable.
  • Test coverage gap: miniredis can't run cluster commands, so add at least one real-cluster round-trip test.
