docs(specs): add Staged Insert Specification #177
Planned edits before merge
While discussing the staged-insert spec, we surfaced that Two edits are planned. They will be applied as a final commit on this branch immediately before merge, after the matching implementation PR in
Edit 1: Add a "Plugin codecs" sub-section to the Codec compatibility matrix
After the built-in compatibility matrix (currently the last row is "Other custom codec"), add:
(Other plugins — Edit 2 — Replace the
Revised planned edits: superseding my previous comment
@dimitri-yatsenko corrected my framing: That means my earlier "Edit 2: replace The real divide is array size, not codec preference. Revised plan:
Revised Edit 1: Add a "When to use staged insert" callout near the top of the spec
Add this paragraph after the Overview, before §Scope:
Revised Edit 2 — Add
| Codec | Plugin package | Staged insert? | Notes |
|---|---|---|---|
| `<zarr@>` | `dj-zarr-codecs` | Not typical | Use ordinary `insert1` with a numpy or zarr array; the codec serializes to Zarr internally. Staged insert is rarely the right tool here because `<zarr@>`'s `encode` requires a materialized array. For streaming Zarr writes that don't fit in memory, use `<object@>` with `staged.store(field, '.zarr')` and open Zarr directly. |
This is more honest than putting <zarr@> in the "supported" column — it technically inherits the protocol from SchemaCodec, but it's not the right tool for the staged use case.
Dropped: the previous "replace the <object@> Zarr example" edit
The current <object@> Zarr example in the spec stays. It correctly shows the streaming pattern that <zarr@> can't serve. I'll just add a small note pointing readers to <zarr@> + insert1 for the in-memory case:
`<object@>`: Streaming Zarr / HDF5 / multi-file directories
Use `<object@>` when the data is built up incrementally and doesn't fit in memory. For Zarr arrays that do fit in memory, use `<zarr@>` with ordinary `insert1` instead; it's simpler and yields a typed fetch result.
# [existing example unchanged]
Merge sequencing unchanged
These edits still land as one final commit just before merge, after the implementation PR in datajoint-python ships the generalized gate.
Test & validate
| Test | Asserts |
|---|---|
| `test_codec_admitted_by_staged_insert_gate` | A table whose field uses this codec accepts `with table.staged_insert1` without raising. |
| `test_staged_write_lands_at_canonical_path` | After a clean exit, the written content exists at the path the codec returned (schema-addressed canonical, or hash-addressed canonical after the rename). |
| `test_staged_insert_metadata_matches_encode` | The metadata dict assigned to `staged.rec[field]` on finalization is structurally equal to what the same codec's `encode()` would produce for equivalent content. |
| `test_staged_insert_fetch_roundtrip` | After staged insert, fetching the field returns a value indistinguishable from what an ordinary `insert1` of the same content would have produced. |
| `test_staged_cleanup_on_exception` | Raising inside the `with` block leaves no row inserted and no canonical artifact (and for hash-addressed codecs, no staging artifact). |
| `test_staged_primary_key_required` | Calling `staged.open()` or `staged.store()` before all primary key attributes are set on `staged.rec` raises `DataJointError`. |
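The behaviors these tests assert (primary key required before a path is handed out, cleanup on exception, metadata written to the record on clean exit) can be illustrated with a toy context manager. This is a minimal sketch, not DataJoint's implementation: `ToyStagedInsert`, its constructor arguments, the metadata keys, and the use of `ValueError` in place of `DataJointError` are all assumptions made for illustration.

```python
import os
import tempfile


class ToyStagedInsert:
    """Toy stand-in for a staged insert; NOT DataJoint's implementation."""

    def __init__(self, root, rows, primary_key=("id",)):
        self.root = root                # directory playing the role of the object store
        self.rows = rows                # list playing the role of the table
        self.primary_key = primary_key
        self.rec = {}                   # the row being drafted
        self._written = []              # (field, path) pairs handed out while drafting

    def store(self, field, ext=""):
        # The primary key must be complete before a path can be derived.
        missing = [k for k in self.primary_key if k not in self.rec]
        if missing:
            raise ValueError(f"primary key not set: {missing}")
        key = "_".join(str(self.rec[k]) for k in self.primary_key)
        path = os.path.join(self.root, f"{key}_{field}{ext}")
        self._written.append((field, path))
        return path

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # Unwinding: remove artifacts, insert nothing, let the error propagate.
            for _, path in self._written:
                if os.path.exists(path):
                    os.remove(path)
            return False
        # Finalization: compute metadata from what was written, then insert the row.
        for field, path in self._written:
            self.rec[field] = {"path": path, "size": os.path.getsize(path)}
        self.rows.append(dict(self.rec))
        return False


root = tempfile.mkdtemp()
rows = []
with ToyStagedInsert(root, rows) as staged:
    staged.rec["id"] = 1
    with open(staged.store("data", ".bin"), "wb") as f:
        f.write(b"abc")
```

The row is appended only after every artifact has been measured, so a failure anywhere in the `with` body leaves both the table and the store directory untouched.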
Additional for hash-addressed codecs
| Test | Asserts |
|---|---|
| `test_staged_dedup_hit` | Two staged inserts of the same content to different primary keys produce one canonical hash-addressed object; both rows reference it. |
| `test_staged_concurrent_canonical_collision` | A staging-to-canonical rename whose destination is concurrently created falls through to the dedup branch without error. |
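The dedup and collision behavior in these two tests can be sketched on a local filesystem. This is an illustration under stated assumptions, not the spec'd implementation: `finalize_hash_addressed` is a hypothetical helper, the store is a temporary directory rather than an object store, and `os.link` stands in for the staging-to-canonical rename because it fails when the destination already exists.

```python
import base64
import hashlib
import os
import tempfile


def finalize_hash_addressed(staging_path, canonical_dir):
    """Move a staged file to its content-addressed location, deduplicating."""
    h = hashlib.md5()
    with open(staging_path, "rb") as f:
        # Stream the content so large files are never fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    token = base64.b32encode(h.digest()).decode("ascii").rstrip("=").lower()
    canonical = os.path.join(canonical_dir, token)
    try:
        # os.link raises if the destination exists, so a writer that got
        # there first is detected rather than overwritten.
        os.link(staging_path, canonical)
    except FileExistsError:
        pass  # dedup branch: identical content is already stored
    os.remove(staging_path)  # the staging artifact is gone in both branches
    return canonical


store = tempfile.mkdtemp()
staging_dir = os.path.join(store, "_staging")
canonical_dir = os.path.join(store, "_hash")
os.makedirs(staging_dir)
os.makedirs(canonical_dir)

paths = []
for name in ("row_a", "row_b"):
    staged = os.path.join(staging_dir, name)
    with open(staged, "wb") as f:
        f.write(b"same content")
    paths.append(finalize_hash_addressed(staged, canonical_dir))
```

Both calls return the same canonical path, and the second one takes the dedup branch. A real object store would need its own create-if-absent primitive in place of `os.link`.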
Part 2 — Implement the conformance tests in dj-zarr-codecs
Once the datajoint-python implementation PR merges, open a PR against dj-zarr-codecs that:
- Bumps the `datajoint-python` pin in `pyproject.toml` (`pixi.toml`) from `rev = "f4b02583251c"` to the merged implementation commit.
- Adds `tests/test_staged_insert.py` implementing the six SchemaCodec conformance tests above against `<zarr@>`:

  ```python
  class TestZarrStagedConformance:
      def test_codec_admitted_by_staged_insert_gate(self, schema): ...
      def test_staged_write_lands_at_canonical_path(self, schema): ...
      def test_staged_insert_metadata_matches_encode(self, schema): ...
      def test_staged_insert_fetch_roundtrip(self, schema): ...
      def test_staged_cleanup_on_exception(self, schema): ...
      def test_staged_primary_key_required(self, schema): ...
  ```

  Each test exercises a small `<zarr@>` table, comparing staged-insert behavior against ordinary `insert1` for the same array.
- Adds `<zarr@>`-specific tests beyond the generic conformance contract:

  | Test | Asserts |
  |---|---|
  | `test_staged_zarr_shape_dtype_recorded` | Staged-inserted `<zarr@>` metadata column contains `shape`, `dtype`, `store`, and `provenance` matching what `<zarr@>`'s `encode()` would have produced. |
  | `test_staged_zarr_chunked_write_roundtrip` | Open Zarr via `zarr.open(staged.store(field, '.zarr'))`, write in chunks larger than memory budget, fetch, assert chunk-by-chunk equality. Demonstrates the streaming case `<zarr@>` was previously not designed for. |
  | `test_zarr_insert1_still_works` | Regression guard: the existing `test_numpy_array_roundtrip` / `test_zarray_roundtrip` tests still pass after the gate generalization. (The `<zarr@>` `insert1` path is the idiomatic one for in-memory arrays and must not regress.) |
- Confirms the codec compatibility matrix claim by running the conformance suite against both schema-addressed (`<zarr@>`) and hash-addressed (a small `<blob@>`-style sanity test) codecs to prove the design isn't `<zarr@>`-specific.
Sequencing
| Step | Repo | Status |
|---|---|---|
| 1. Spec review | datajoint-docs #177 | this PR |
| 2. Implementation | datajoint-python | not yet opened — references the spec as design |
| 3. Spec edits (Zarr framing, conformance section) commit | datajoint-docs #177 | held until step 2 merges |
| 4. dj-zarr-codecs conformance tests | dj-zarr-codecs | held until step 3 merges |
| 5. Merge sequence: step 2 → 3 → 4 | | |
Steps 2 and 4 land sequentially because step 4 needs the implementation to pass. Step 3 (the spec) lands between them because the spec describes what step 2 shipped and what step 4 validates.
Deferred:
Read this carefully against the dj-python source, and caught up on On the hash algorithm (Open Decision #3). The decision says SHA-256 is spec'd "to match today's
md5_digest = hashlib.md5(data).digest()
return base64.b32encode(md5_digest).decode("ascii").rstrip("=").lower()
Two ways to reconcile:
On the hash-addressed canonical path. Spec line 162 gives On the
So three shapes today (encode, staged, spec). The conformance test will catch the impl-side divergence, but the spec itself should pick whichever is normative and signal that the others converge to it.
On Small related: On forward-looking pieces. On staging vs canonical path consistency. Spec gives staging as On the cross-link to On the On the conformance contract (comment #3). The six required + two hash-addressed tests are well-scoped. One addition worth considering: a On
None of this is a showstopper; the spec's structure and the sequencing plan are both right. Mostly nudges around making "matches today" claims actually match today (or signaling that they're aspirational), and a couple of small forward-looking framing improvements.
Thank you @MilagrosMarin, every claim you pulled from source was correct. Pushed ec3a0dd with corrections.
Addressed in this commit
Tracked for the final pre-merge commit on this branch
These ship alongside the Zarr framing (already in
No action
Re-review whenever you have time. If you'd rather see the conformance section and the rejection test now (rather than at final pre-merge), say the word and I'll fold them into this PR.
Thanks @dimitri-yatsenko, verified ✅ Hash algo / canonical-path /
On your question: defer the conformance section + rejection test to the final pre-merge commit. The spec is reviewable as a design doc now; the conformance section becomes meaningful only once the impl PR is concrete enough that the test names anchor to real assertions. Folding it in now risks drift between conformance and what ships. The PR reads well in its current state.
Defines the staged-insert contract as a normative spec so the implementation has a single source of truth and third-party codec authors have a documented protocol to implement.

Covers:
- Lifecycle (setup → drafting → finalization → unwinding)
- The codec-side staged-write protocol (staged_handle / finalize_staged / cleanup_staged on the Codec base class)
- Two concrete lifecycle variants: schema-addressed (handle at canonical path, finalize computes metadata) and hash-addressed (handle at _staging path, finalize hashes content and renames to canonical _hash/ path with dedup)
- Path-construction shapes for both addressing schemes
- Per-codec metadata contracts (testable invariants matching each codec's encode() output)
- Atomicity model (at-most-once with cleanup; not transactional)
- Concurrency behavior (per-PK, hash dedup, transaction interaction, BaseException leakage)
- Codec compatibility matrix (the four built-in object-store codecs in, in-table and reference codecs explicitly out)
- Worked examples for <object@>, <npy@>, <blob@>, <attach@>
- Future-work scope notes for filepath staging, multi-row variants, and resumable inserts

Implementation is deferred to a follow-up PR in datajoint-python; this spec is the design that PR will reference.

Nav: add under Reference → Specifications → Data Operations alongside data-manipulation.md and autopopulate.md.
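The codec-side protocol named in this commit message (three hooks on the base class, with non-participating codecs refusing by default, and finalize metadata matching `encode()`) might look roughly like the following. Only the three method names come from the PR; the signatures, the `StagedWriteUnsupported` exception (a stand-in for `DataJointError`), the `ToyFileCodec` class, and its metadata keys are assumptions for illustration.

```python
import os
import tempfile


class StagedWriteUnsupported(Exception):
    """Hypothetical stand-in for DataJointError."""


class Codec:
    """Toy base class: staged writes are refused unless a codec opts in."""

    name = "codec"

    def _refuse(self):
        raise StagedWriteUnsupported(f"<{self.name}> does not support staged insert")

    def staged_handle(self, key, field, ext=""):
        self._refuse()

    def finalize_staged(self, handle):
        self._refuse()

    def cleanup_staged(self, handle):
        self._refuse()


class ToyFileCodec(Codec):
    """Schema-addressed toy: the handle is the canonical path itself."""

    name = "toyfile"

    def __init__(self, root):
        self.root = root

    def _path(self, key, field, ext):
        return os.path.join(self.root, f"{key}_{field}{ext}")

    def encode(self, data, key, field, ext=""):
        # Ordinary insert path: the content is already in memory.
        path = self._path(key, field, ext)
        with open(path, "wb") as f:
            f.write(data)
        return {"path": path, "size": len(data), "ext": ext}

    def staged_handle(self, key, field, ext=""):
        return self._path(key, field, ext)

    def finalize_staged(self, handle):
        # Invariant: equal to what encode() returns for the same content.
        return {"path": handle, "size": os.path.getsize(handle),
                "ext": os.path.splitext(handle)[1]}

    def cleanup_staged(self, handle):
        if os.path.exists(handle):
            os.remove(handle)


codec = ToyFileCodec(tempfile.mkdtemp())
handle = codec.staged_handle("1", "data", ".bin")
with open(handle, "wb") as f:
    f.write(b"payload")
staged_meta = codec.finalize_staged(handle)
encoded_meta = codec.encode(b"payload", "1", "data", ".bin")
```

Comparing `staged_meta` with `encoded_meta` is the kind of structural-equality check the per-codec metadata contract describes.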
Adds <zarr@> (from dj-zarr-codecs) as a first-class supported codec in the staged-insert spec:
- New "Concrete protocol behavior" subsection describing both usage paths: ordinary insert1 (canonical for in-memory arrays) and staged_insert1 (for arrays too large to materialize, via direct FSMap-driven Zarr writes).
- New row in the Codec compatibility matrix.
- New Examples entry showing both paths side-by-side; demoted the generic <object@> example to a multi-file/directory fallback.
…ert spec
Corrections grounded in datajoint-python master:
- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
corrected to `_hash/{schema}/{content_hash}` (flat) or
`_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
hash_registry.build_hash_path. The {schema} segment is load-bearing for
isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
output `{path, store, size, ext, is_dir, item_count, timestamp}`
(builtin_codecs/object.py:166-174). Noted the two-place convergence work
the impl PR will do (StagedInsert._compute_metadata refactor; earlier
draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
AttachCodec.encode return raw bytes, and the dict shape comes from the
chained <hash@> codec — the impl PR refactors them to return dicts
directly. Also noted that HashCodec's three-way documented inconsistency
will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
are forward-looking vs as-shipped, with source line numbers as anchors.
Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)
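The corrected hash token and flat canonical path from this commit message can be reproduced with the standard library. The two hashing lines mirror the snippet quoted in the review; the function names `content_token` and `flat_hash_path` and the schema name `my_schema` are hypothetical, and the subfolded variant is omitted because its fold scheme is not given here.

```python
import base64
import hashlib


def content_token(data: bytes) -> str:
    """MD5 digest (16 bytes) encoded as base32, padding stripped, lowercased."""
    md5_digest = hashlib.md5(data).digest()
    return base64.b32encode(md5_digest).decode("ascii").rstrip("=").lower()


def flat_hash_path(schema: str, token: str) -> str:
    """Flat layout: _hash/{schema}/{content_hash}."""
    return f"_hash/{schema}/{token}"


token = content_token(b"example payload")
path = flat_hash_path("my_schema", token)
```

A 16-byte digest is 128 bits, and base32 carries 5 bits per character, so the unpadded token is always 26 characters long.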
Every example in §Examples now includes the @Schema class declaration with definition string, matching the house style in codec-api.md. Readers can copy a complete, self-contained snippet rather than mentally fill in the table schema. int32 used throughout per the core-types-in-docs convention. Covers <zarr@> (both ordinary and staged paths), <object@>, <npy@>, <blob@>, <attach@>.
21b781a to fb2b228
Summary
New normative spec at `src/reference/specs/staged-insert.md` defining the staged-insert contract for all object-store codecs (not just `<object@>`). The implementation in `datajoint-python` is deferred to a follow-up PR; this spec is what that PR will reference.
What changed
- `src/reference/specs/staged-insert.md` (new)
- `mkdocs.yaml`: `Reference → Specifications → Data Operations`
- `src/reference/specs/data-manipulation.md`
Spec contents
- `Codec` base class: `staged_handle`, `finalize_staged`, `cleanup_staged`. Default raises `DataJointError` so non-participation is explicit.
- `<object@>`, `<npy@>`, future schema-addressed codecs.
- `_staging/` path; finalize streams + hashes, moves to canonical `_hash/` path with dedup), used by `<blob@>`, `<attach@>`, `<hash@>`.
- `finalize_staged` returns a dict structurally equal to what its `encode()` would produce for the same content: a testable invariant.
- `BaseException`.
- `<object@>`, `<npy@>`, `<blob@>`, `<attach@>`); in-table and reference codecs explicitly out.
- `<filepath@>` staging, multi-row variants, resumable inserts).
Open spec decisions surfaced for your review
These are spec-level decisions I made with my best judgment; flag any you'd push back on:
- `<attach@>` filename API: spec'd as `staged.set_filename(field, 'name.ext')` parallel to `staged.rec[k] = v`. Alternative was a convention like `staged.rec[f'{field}_filename']`. The explicit helper avoids name-collision risk with real attributes.
- `_staging/{schema}/{table}/{field}_{token}{ext}`. Orphans are traceable to their schema/table for GC. Alternative was a flat `_staging/{token}`.
- `<blob@>`/`<hash@>` behavior. Worth committing to in the spec so plugin authors don't accidentally pick something else.
- `KeyboardInterrupt` handling: spec says "not caught; orphans reclaimed by GC." Did not add a recommendation to mask SIGINT; it felt like over-prescription.
What this PR does not do
- `datajoint-python`. The implementation PR will land after this spec is merged.
- `src/how-to/staged-insert.md`, that's PR #175, still open). Once the implementation lands, a follow-up PR will update the how-to to reference this spec for normative details.
Test plan
- `mkdocs serve` and confirm the new spec page renders under `Reference → Specifications → Data Operations`
- `codec-api.md`, `data-manipulation.md`, `type-system.md`, `object-store-configuration.md`, the how-to, garbage-collection)
- `<object@>` behavior today (run a quick `<object@>` smoke test against current `master`)