feat(dataset): decentralized fine-tuning — dataset subsystem (P2–P6)#641
Open
bussyjd wants to merge 6 commits into
…(P2)
The serving core for type=dataset offers: a signed, hash-chained version
log plus an owner-hosted HTTP server that streams versioned dataset
artifacts to paying or owner-admitted members. No bytes leave the host
un-gated.
internal/dataset/:
- versionlog.go signed (secp256k1), hash-chained DatasetVersion log over a
canonical, length-prefixed, domain-separated digest; offline Verify walks
v1..head rejecting reorder / tamper / middle-removal. signer.go is the
secp256k1 adapter (same key kind as on-chain registration; no new custody).
- store.go atomic temp+rename JSON persistence of {versions,
entitlements} so a continuously-served dataset survives a restart.
- entitlement.go token-hash -> max-version map enforcing the version scope
the bare membership gate cannot express.
- artifacts.go file-backed, seekable artifact source (Range-ready).
- server.go device-auth (reuses groupauth) + payment-minted member
tokens + entitle() gate (membership AND version) + Range/206 download via
http.ServeContent, sending the whole-file hash on 200 and 206 alike.
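The hash-chained log described for versionlog.go can be illustrated with a minimal sketch. It is an assumption-laden stand-in, not the PR's code: the `Version` struct, the `domainTag` value, the field order, and the helper names are all hypothetical, and the secp256k1 signature over each digest is left out to keep the example stdlib-only. Without that signature an attacker could recompute the digests, so this only demonstrates the chaining and the offline walk.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

// Version is a hypothetical, simplified stand-in for a DatasetVersion entry.
// The signature over Digest is deliberately omitted in this sketch.
type Version struct {
	Number       uint64
	ManifestHash string
	PrevDigest   [32]byte // digest of the previous entry; zero for v1
	Digest       [32]byte
}

// domainTag is an assumed domain-separation label, not the real one.
const domainTag = "example/dataset-version/v1"

// digest hashes the fields in a fixed order, each prefixed with its length,
// so two different field tuples cannot serialize to the same byte stream.
func digest(number uint64, manifestHash string, prev [32]byte) [32]byte {
	h := sha256.New()
	writeField := func(b []byte) {
		var n [8]byte
		binary.BigEndian.PutUint64(n[:], uint64(len(b)))
		h.Write(n[:])
		h.Write(b)
	}
	var num [8]byte
	binary.BigEndian.PutUint64(num[:], number)
	writeField([]byte(domainTag))
	writeField(num[:])
	writeField([]byte(manifestHash))
	writeField(prev[:])
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// appendVersion links a new entry to the current head of the log.
func appendVersion(entries []Version, manifestHash string) []Version {
	var prev [32]byte
	if len(entries) > 0 {
		prev = entries[len(entries)-1].Digest
	}
	n := uint64(len(entries) + 1)
	return append(entries, Version{
		Number: n, ManifestHash: manifestHash,
		PrevDigest: prev, Digest: digest(n, manifestHash, prev),
	})
}

// verify walks v1..head and rejects reordering, tampering, and removal.
func verify(entries []Version) error {
	var prev [32]byte
	for i, v := range entries {
		if v.Number != uint64(i+1) {
			return errors.New("version numbers out of sequence")
		}
		if v.PrevDigest != prev {
			return fmt.Errorf("v%d: broken link to previous entry", v.Number)
		}
		if v.Digest != digest(v.Number, v.ManifestHash, v.PrevDigest) {
			return fmt.Errorf("v%d: digest mismatch", v.Number)
		}
		prev = v.Digest
	}
	return nil
}

func main() {
	var entries []Version
	for _, m := range []string{"hash-a", "hash-b", "hash-c"} {
		entries = appendVersion(entries, m)
	}
	fmt.Println("intact:", verify(entries))

	tampered := append([]Version(nil), entries...)
	tampered[1].ManifestHash = "evil"
	fmt.Println("tampered:", verify(tampered))

	removed := []Version{entries[0], entries[2]}
	fmt.Println("middle removed:", verify(removed))
}
```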
groupauth gains additive Mint / RegisterHash / HashToken so a settled
payment (not an owner approval) can mint a version-scoped token, and so the
server can rehydrate paying members by hash after a restart.
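A rough sketch of the token-hash -> max-version idea, under stated assumptions: the `Entitlements` type and the lowercase `mint`, `hashToken`, and `allowed` helpers below are hypothetical illustrations, not the actual groupauth `Mint` / `RegisterHash` / `HashToken` signatures. The point it shows is that only token hashes need to be persisted, and that the gate checks membership and version together.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entitlements maps a token hash to the highest version its holder may fetch.
// Hypothetical shape; the real entitlement.go may differ.
type Entitlements map[string]uint64

// hashToken returns the hex SHA-256 of a bearer token, so only hashes are stored.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// mint creates a random token scoped to maxVersion and records only its hash.
func (e Entitlements) mint(maxVersion uint64) (string, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	token := hex.EncodeToString(raw)
	e[hashToken(token)] = maxVersion
	return token, nil
}

// allowed is the combined gate: the token must be known AND cover the version.
func (e Entitlements) allowed(token string, version uint64) bool {
	limit, ok := e[hashToken(token)]
	return ok && version <= limit
}

func main() {
	live := Entitlements{}
	token, err := live.mint(2) // e.g. minted after a payment settles for v2
	if err != nil {
		panic(err)
	}

	// Simulate a restart: only the hash -> version map was persisted.
	restored := Entitlements{}
	for hash, limit := range live {
		restored[hash] = limit
	}

	fmt.Println("v2 after restart:", restored.allowed(token, 2))
	fmt.Println("v3 after restart:", restored.allowed(token, 3))
}
```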
Tests: table-driven + httptest + a fuzz over the canonical digest + -race;
85% coverage. Version metadata reaches buyers via the 402 extra.dataset
wired in P1.
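The Range/206 path described for server.go might look roughly like the sketch below. It assumes a hex-encoded SHA-256 in the X-Dataset-File-Hash header and serves from memory; `artifactHandler` and `probe` are hypothetical names, and the device-auth and entitlement gates are omitted. Setting the header before calling `http.ServeContent` is what puts the same whole-file commitment on both full and partial responses.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// artifactHandler serves data with Range support and always advertises the
// hash of the whole file, never of the requested slice.
func artifactHandler(name string, data []byte) http.Handler {
	sum := sha256.Sum256(data)
	fileHash := hex.EncodeToString(sum[:])
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Set before ServeContent so the header is present on 200 and 206.
		w.Header().Set("X-Dataset-File-Hash", fileHash)
		// Any io.ReadSeeker works here; a file-backed source would pass an *os.File.
		http.ServeContent(w, r, name, time.Time{}, bytes.NewReader(data))
	})
}

// probe issues a GET with an optional Range header and returns the status
// code plus the advertised whole-file hash.
func probe(h http.Handler, rangeHeader string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, "/artifact", nil)
	if rangeHeader != "" {
		req.Header.Set("Range", rangeHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	res := rec.Result()
	defer res.Body.Close()
	return res.StatusCode, res.Header.Get("X-Dataset-File-Hash")
}

func main() {
	h := artifactHandler("sft.jsonl", []byte("0123456789"))
	fullCode, fullHash := probe(h, "")
	partCode, partHash := probe(h, "bytes=2-5")
	fmt.Println(fullCode, partCode, fullHash == partHash)
}
```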
obol dataset from|version|publish|approve|verify|status and obol buy dataset, wiring the P2 server into the CLI.
- internal/dataset/client.go Fetch: Range-resumable download that verifies the whole-file SHA-256 against the server's X-Dataset-File-Hash commitment; refuses an unverifiable download and rejects a hash mismatch without ever finalizing the output file.
- bundle.go reads a bundle dir's manifest.json -> manifestHash + the training artifact's whole-file hash + size (generic manifest envelope).
- keyfile.go loads or creates the owner's secp256k1 signing key (0600).
- cmd/obol/dataset.go owner commands (ingest a bundle as a signed version, host the artifact server + Cloudflare tunnel, admit workers, walk the signed chain offline, show status) + `obol buy dataset`.
Local end-to-end validated: from -> publish -> buy yields a byte-identical, hash-verified artifact; append produces a distinct signed version; offline verify walks the chain. Dataset package coverage ~84%.
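A minimal sketch of the verifying, resumable download follows. It is not the PR's `Fetch`: the `.part` suffix, the hex hash encoding, and the `fetch` / `demo` names are assumptions for illustration, and a `.part` file that is already complete (which would draw a 416) is not handled. What it does show is the fail-closed shape: no hash header means no download, and the output path only appears after the whole-file SHA-256 matches.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"time"
)

// fetch downloads url into dst, resuming from dst+".part" if it exists.
// dst is only created once the whole-file SHA-256 matches the server's header.
func fetch(url, dst string) error {
	part := dst + ".part"
	f, err := os.OpenFile(part, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	offset, err := f.Seek(0, io.SeekEnd) // bytes already on disk
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if offset > 0 {
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-", offset))
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	want := resp.Header.Get("X-Dataset-File-Hash")
	if want == "" {
		return errors.New("no file hash from server: refusing unverifiable download")
	}
	switch resp.StatusCode {
	case http.StatusPartialContent: // server honoured the Range: append
	case http.StatusOK: // server sent the whole file: start over
		if err := f.Truncate(0); err != nil {
			return err
		}
		if _, err := f.Seek(0, io.SeekStart); err != nil {
			return err
		}
	default:
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	if _, err := io.Copy(f, resp.Body); err != nil {
		return err // keep the .part file so a later call can resume
	}
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return err
	}
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != want {
		os.Remove(part) // never promote unverified bytes
		return fmt.Errorf("hash mismatch: got %s, want %s", got, want)
	}
	return os.Rename(part, dst)
}

// demo serves a small artifact, seeds an interrupted download, then resumes it.
func demo() (string, error) {
	data := []byte("{\"text\":\"hello\"}\n")
	sum := sha256.Sum256(data)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Dataset-File-Hash", hex.EncodeToString(sum[:]))
		http.ServeContent(w, r, "sft.jsonl", time.Time{}, bytes.NewReader(data))
	}))
	defer srv.Close()
	dir, err := os.MkdirTemp("", "fetch-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	dst := filepath.Join(dir, "sft.jsonl")
	if err := os.WriteFile(dst+".part", data[:5], 0o600); err != nil {
		return "", err
	}
	if err := fetch(srv.URL, dst); err != nil {
		return "", err
	}
	out, err := os.ReadFile(dst)
	return string(out), err
}

func main() { fmt.Println(demo()) }
```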
A pluggable privacy stage that strips PII from a dataset's JSONL before it is published or sold. Default is a dependency-free regex redactor (emails, IPs, keys, card/SSN-shaped numbers, home paths, phones); set OBOL_ANONYMIZER_MODEL to a Hugging Face token-classification PII model for ML-grade detection (cached under the obol data dir via HF_HOME). Typed, deterministically-indexed placeholders keep cross-message references linkable without revealing the underlying values. Embedded skill: internal/embed/skills/dataset-anonymize/. Validated: masks email/IP/key/path/card with no raw leak; deterministic indexing across records; embed skills test green.
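The deterministic, typed placeholder idea can be sketched as below. The two patterns, the `[TYPE_N]` placeholder format, and the `Redactor` type are assumptions for illustration; the default redactor in the PR covers more categories and may format placeholders differently. The property being shown is that the same value always maps to the same placeholder, so references stay linkable across records.

```go
package main

import (
	"fmt"
	"regexp"
)

// patterns is an illustrative subset; the real redactor covers more kinds.
var patterns = []struct {
	kind string
	re   *regexp.Regexp
}{
	{"EMAIL", regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)},
	{"IP", regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)},
}

// Redactor gives each distinct value a stable, per-type index.
type Redactor struct {
	seen   map[string]string // kind + value -> placeholder
	counts map[string]int    // kind -> last index handed out
}

// NewRedactor returns a redactor with empty state. Reuse one instance across
// all records of a dataset to keep the indexing consistent.
func NewRedactor() *Redactor {
	return &Redactor{seen: map[string]string{}, counts: map[string]int{}}
}

// Redact replaces every match with a typed placeholder such as [EMAIL_1].
func (r *Redactor) Redact(text string) string {
	for _, p := range patterns {
		text = p.re.ReplaceAllStringFunc(text, func(match string) string {
			key := p.kind + "\x00" + match
			if placeholder, ok := r.seen[key]; ok {
				return placeholder // value seen before: reuse its index
			}
			r.counts[p.kind]++
			placeholder := fmt.Sprintf("[%s_%d]", p.kind, r.counts[p.kind])
			r.seen[key] = placeholder
			return placeholder
		})
	}
	return text
}

func main() {
	r := NewRedactor()
	fmt.Println(r.Redact("mail alice@example.com from 10.0.0.1"))
	fmt.Println(r.Redact("reply to alice@example.com, cc bob@example.com"))
}
```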
One contract (run(dataset, base_model, hyperparams) -> adapter + eval + run.manifest) over mock / mlx-lora / unsloth / axolotl / torchtune backends, selected by --backend. Every backend reads the same sft.jsonl artifact. The runner binds each result to the dataset's content-address (manifestHash) in run.manifest — the provenance link from a fine-tuned model back to the data it trained on. The shared idea across real backends is "regex-extract the eval metric from stdout"; --dry-run validates the dataset without invoking a backend; the mock backend validates the contract + provenance anywhere. Embedded skill: internal/embed/skills/finetune-backend/. Validated: mock run emits eval.json + run.manifest with dataset_hash == manifestHash; dry-run path; embed test green.
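The "regex-extract the eval metric from stdout" step and the provenance binding could be sketched like this. The `eval_loss=` log format, the `RunManifest` fields other than `dataset_hash`, and the sample values are all assumptions; each real backend prints its metrics differently and would need its own pattern.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

// evalLossRe matches an assumed log format such as "eval_loss=1.37".
var evalLossRe = regexp.MustCompile(`eval_loss[=:]\s*([0-9]*\.?[0-9]+)`)

// extractEvalLoss returns the last eval loss a backend printed to stdout.
func extractEvalLoss(stdout string) (float64, error) {
	matches := evalLossRe.FindAllStringSubmatch(stdout, -1)
	if len(matches) == 0 {
		return 0, errors.New("no eval_loss found in backend output")
	}
	return strconv.ParseFloat(matches[len(matches)-1][1], 64)
}

// RunManifest is a hypothetical shape; only dataset_hash is named in the text.
type RunManifest struct {
	Backend     string  `json:"backend"`
	BaseModel   string  `json:"base_model"`
	DatasetHash string  `json:"dataset_hash"` // the dataset's manifestHash
	EvalLoss    float64 `json:"eval_loss"`
}

func main() {
	stdout := "step 10 eval_loss=1.92\nstep 20 eval_loss=1.37\n"
	loss, err := extractEvalLoss(stdout)
	if err != nil {
		panic(err)
	}
	manifest := RunManifest{
		Backend:     "mock",
		BaseModel:   "example-base-model",   // placeholder value
		DatasetHash: "example-manifest-hash", // placeholder value
		EvalLoss:    loss,
	}
	out, err := json.MarshalIndent(manifest, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```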
End-to-end user guide: anonymize -> record a signed version -> publish (host + tunnel + membership/version gate) -> federated discovery -> buy (verifying download) -> fine-tune with provenance. Discovery is the existing type-agnostic catalog: a priced dataset surfaces in /api/services.json with its pinned version metadata and the obol-router federates it unchanged (the catalog-entry shape is schema-validated in the type=dataset catalog tests). No central hub — each operator owns their dataset. docs/guides/monetize-dataset.md.
flows/hf-surface-smoke.sh validates the "decentralized HF" surfaces end to end: (1) dataset hub anonymize -> sign -> publish -> buy (resumable, hash-verified); (2) inference + dataset offers in a federated catalog; (3) fine-tune on a real GPU box (spark) with run.manifest bound to the bought dataset's content-address; (4) discovery via obol-router federation; (5) cross-check against the obol-exex ERC-8004 indexer. Each surface is independent (missing prereq -> SKIP).
plans/dataset-subscription-v1.1-pitch.md is the held-for-later, diagram-based pitch for continuous escrow subscriptions (reserve multi-deadline voucher -> capture per shipped version -> void the tail), honest about the new per-epoch capture wiring it needs and its dependency on the payout leg.
Decentralized fine-tuning — dataset subsystem (P2–P6)
Turns a versioned dataset into an owned, monetizable product: sign it, host it behind a membership+version gate, sell it over HTTP with resumable hash-verified downloads, anonymize it before it leaves the host, fine-tune on it with provenance, and discover it across stacks. This is P2–P6 of the decentralized fine-tuning plan; P1 (the `type=dataset` catalog/CRD side) is #640, the complementary half.

What's in it (`internal/dataset/`)

- `versionlog.go` + `signer.go`: signed (secp256k1), hash-chained version log; offline `Verify` walks v1→head rejecting reorder/tamper/middle-removal. No new key custody (same key kind as on-chain registration).
- `store.go` / `entitlement.go` / `artifacts.go`: atomic temp+rename JSON persistence of `{versions, entitlements}`; token-hash→max-version map enforcing the version scope the bare membership gate can't; seekable file-backed artifact source.
- `server.go`: device-auth (reuses `groupauth`) + payment-minted member tokens + `entitle()` gate (membership and version) + Range/`206` download via `http.ServeContent`, sending the whole-file hash on `200` and `206` alike. Member tokens persist by hash → survive restart (no re-pay).
- `client.go` + `cmd/obol/dataset.go`: `obol dataset from/version/publish/approve/verify/status` + `obol buy dataset`: a Range-resumable download that verifies the whole-file SHA-256 against the server's commitment and fails closed on mismatch / missing hash.
- `dataset-anonymize`: dependency-free regex redactor by default; set `OBOL_ANONYMIZER_MODEL` for ML-grade detection (cached under the obol data dir). Deterministic typed placeholders.
- `finetune-backend`: one contract over `mock`/`mlx-lora`/`unsloth`/`axolotl`/`torchtune`; `run.manifest` binds `dataset_hash` to the bought version, giving provenance from a model back to its data.
- `docs/guides/monetize-dataset.md`: end-to-end user guide.

`groupauth` gains additive `Mint`/`RegisterHash`/`HashToken` so a settled payment (not an owner approval) can mint a version-scoped token, and the server can rehydrate paying members by hash after a restart.

Validation

- Table-driven + `httptest` + a fuzz over the canonical digest, all under `-race`; dataset package 85% coverage.
- `flows/hf-surface-smoke.sh`: 5/5 surfaces green, the full "decentralized HF" loop end to end; `run.manifest` `dataset_hash` == the bought version's `manifestHash`.
- `go build ./...`, `go vet`, full affected-package `go test ./...` green.

Held for later

`plans/dataset-subscription-v1.1-pitch.md`: continuous escrow subscriptions (reserve → capture-per-shipped-version → void), gated on the batch-settlement payout leg.

Note on the router

Surfacing datasets through obol-router's federated catalog needed a 1-line aggregator change (`discoveryEntries()`: pass non-routable offers through while keeping them out of chat-routing). That lives with the router PR's codebase, not here.