docs: dedicated how-to page for staged insert#175
Conversation
`staged_insert1` was buried in `use-object-storage.md` (subsection "Write Directly to Object Storage") with no nav entry, so readers scanning the How-To sidebar couldn't find it and `insert-data.md` didn't link to it. Promote it to its own how-to: - New `src/how-to/staged-insert.md` covering overview, atomicity guarantees, full API (`rec`, `store`, `open`, `fs`), Zarr/HDF5/ streaming patterns, limitations, and troubleshooting. - Add nav entry under Object Storage in `mkdocs.yaml`. - Replace the section in `use-object-storage.md` with a short pointer and link to the new page. - Cross-link from `insert-data.md` (inline after "Insert with Blobs" and in See Also).
MilagrosMarin
left a comment
There was a problem hiding this comment.
Thanks @dimitri-yatsenko! Verified the API surface and atomicity contract against src/datajoint/staged_insert.py and tests/integration/test_object.py::TestStagedInsert:
✅ API matches source — staged.rec, staged.store(field, ext='') → fsspec.FSMap, staged.open(field, ext='', mode='wb') → IO, staged.fs → AbstractFileSystem.
✅ Atomicity is accurately described end-to-end. Worth highlighting: the doc correctly says database-insert failure during exit also cleans up — confirmed because _finalize() is called inside the outer try (staged_insert.py:304-310), so a duplicate-PK error gets caught and _cleanup() runs.
✅ Cross-links from insert-data.md and use-object-storage.md solve the discoverability gap. Nav placement under "How-To → Object Storage" is right.
✅ The pattern of not explicitly assigning staged.rec['field'] = z (which is shown in the dj-python source docstring) is actually more correct than the source docstring — _finalize overwrites it with computed metadata anyway. Good call.
Two accuracy issues worth fixing before merge:
1. <blob@> support claim is incorrect. Line 15 says:
It is only available for object-typed fields (
<...@>syntax) and codecs that support direct storage handles — primarily<object@>(Zarr / HDF5 / multi-file) and<blob@>written via a file handle.
But staged_insert.py:100-101 explicitly:
if not (attr.codec and attr.codec.name == "object"):
raise DataJointError(f"Attribute '{field}' is not an <object> type")Only <object@> passes. The tests only exercise <object@local>. A <blob@> field would raise today. Either drop the <blob@> clause or note it as a planned extension.
2. "Content hash for hash-addressed codecs" describes nothing the implementation does. Line 64:
Computes object metadata (size, manifest, content hash for hash-addressed codecs) from the staged objects.
_compute_metadata (:165-249) sets "hash": None in both the directory and single-file branches. Since the only supported codec is <object@> (path-addressed), no hash is ever computed. Suggest just "Computes object metadata (size, manifest) from the staged objects."
Minor (optional):
<object@store>named-store form isn't mentioned — Quick Start uses<object@>(default store), but tests use<object@local>. One-line note in API Reference would round it out.- The context manager catches
Exception, notBaseException, soKeyboardInterruptmid-write leaves staged objects behind. Worth a bullet in Limitations so readers know not to rely on cleanup for Ctrl+C scenarios.
Otherwise this is a clean, well-structured how-to — much better than the buried subsection. Happy to approve once the two accuracy items are fixed.
From @MilagrosMarin's review on #175: - Drop the inaccurate '<blob@> written via a file handle' claim. staged_insert.py:100-101 explicitly rejects anything except codec name == 'object', so only <object@> is supported. Note the actual error behavior instead. - Drop 'content hash for hash-addressed codecs' from the metadata list. _compute_metadata always sets hash: None for both directory and single-file branches; no hash is ever computed. - Mention the named-store form '<object@name>' alongside '<object@>' in the Table.staged_insert1 API reference. - Add a Limitations bullet noting that cleanup catches Exception not BaseException, so KeyboardInterrupt mid-write can leave staged objects behind; point to the garbage-collection how-to.
|
Thanks @MilagrosMarin — pushed e0f4de0 addressing all four points. Required fixes:
Optional (both done): Ready for another look. |
|
@MilagrosMarin — thanks again for the careful read here. Your accuracy points led directly to a broader design question that's worth flagging: once we asked why only I've opened a parallel spec PR that pins this down end-to-end: #177 — Staged Insert Specification. It defines the lifecycle, the codec-side protocol ( For #175: it's correct as-is for the current |
MilagrosMarin
left a comment
There was a problem hiding this comment.
Thanks @dimitri-yatsenko — all four points verified in e0f4de0. <blob@> claim dropped, "content hash" phrase dropped, named-store form added, BaseException/Ctrl+C caveat in Limitations with pointer to garbage-collection.md (target exists). Spec-PR scoping (#177) is sensible — will review separately.
Approving.
…ert spec
Corrections grounded in datajoint-python master:
- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
corrected to `_hash/{schema}/{content_hash}` (flat) or
`_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
hash_registry.build_hash_path. The {schema} segment is load-bearing for
isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
output `{path, store, size, ext, is_dir, item_count, timestamp}`
(builtin_codecs/object.py:166-174). Noted the two-place convergence work
the impl PR will do (StagedInsert._compute_metadata refactor; earlier
draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
AttachCodec.encode return raw bytes, and the dict shape comes from the
chained <hash@> codec — the impl PR refactors them to return dicts
directly. Also noted that HashCodec's three-way documented inconsistency
will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
are forward-looking vs as-shipped, with source line numbers as anchors.
Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)
…ert spec
Corrections grounded in datajoint-python master:
- Hash algorithm: spec said sha256/hex; corrected to MD5+base32 → 26-char
lowercase token, matching hash_registry.compute_hash (hash_registry.py:51-67).
- Hash-addressed canonical path: spec said `_hash/{h[:2]}/{h[2:4]}/{h}`;
corrected to `_hash/{schema}/{content_hash}` (flat) or
`_hash/{schema}/{fold_*}/{content_hash}` (subfolded), matching
hash_registry.build_hash_path. The {schema} segment is load-bearing for
isolation; subfolding is per-store-tunable.
- <object@> normative metadata shape: pinned to ObjectCodec.encode's actual
output `{path, store, size, ext, is_dir, item_count, timestamp}`
(builtin_codecs/object.py:166-174). Noted the two-place convergence work
the impl PR will do (StagedInsert._compute_metadata refactor; earlier
draft of this spec).
- <blob@>/<attach@> shape: clarified that today's BlobCodec.encode and
AttachCodec.encode return raw bytes, and the dict shape comes from the
chained <hash@> codec — the impl PR refactors them to return dicts
directly. Also noted that HashCodec's three-way documented inconsistency
will be consolidated as part of the same refactor.
- Implementation-status banner: added at top of spec to signal which pieces
are forward-looking vs as-shipped, with source line numbers as anchors.
Items still in flight (planned for final pre-merge commit on this branch):
- Conformance test section (incl. new test_staged_handle_rejects_non_participating_codecs
per Milagros' suggestion)
- Cross-link sequencing vs PR #175 (how-to)
- aa0f66d Zarr framing edits (already in)
Summary
staged_insert1is the right tool for inserting multi-GB Zarr/HDF5/streaming objects with atomic database+storage semantics, but it had no nav entry and no cross-link frominsert-data.md— it was a subsection insideuse-object-storage.mdthat readers only found by scrolling. Promote it to a dedicated How-To with full coverage.Changes
src/how-to/staged-insert.md— overview, "How it Works" (atomicity contract on both clean exit and exception), full API (rec,store,open,fs), patterns for Zarr/HDF5/instrument streaming, limitations, troubleshooting, See Also.How-To → Object Storage, sandwiched betweenUse Object StorageandUse NPY Codec.use-object-storage.md: replaced the "Write Directly to Object Storage" subsection with a brief pointer + link, so readers in the object-storage flow still get the handoff but the canonical content lives on the new page.insert-data.md: inline note after "Insert with Blobs" and an entry at the top of See Also, so readers landing on the standard insert page discover the right tool for large objects.Why a dedicated page (not a more prominent subsection)
Per offline discussion: the topic is valuable enough to warrant its own discoverability surface. Keeping it as a subsection — even renamed and repositioned — wouldn't appear in the left sidebar TOC, which is the highest-traffic discovery path.
Test plan
mkdocs serveand verifyStaged Insertappears in the left sidebar under How-To → Object Storage/how-to/staged-insert/— confirm page renders end-to-end, all code blocks syntax-highlight, See Also links resolve/how-to/use-object-storage/— confirm the pointer near the bottom links to the new page/how-to/insert-data/— confirm both the inline "For multi-GB arrays..." and the See Also entry link to the new pagemkdocs-linkcheck(or equivalent) clean