
feat(dataset): decentralized fine-tuning — dataset subsystem (P2–P6)#641

Open
bussyjd wants to merge 6 commits into feat/decentralized-auto-research from feat/dataset-subsystem


@bussyjd bussyjd commented Jun 14, 2026

Contributor

Decentralized fine-tuning — dataset subsystem (P2–P6)

Turns a versioned dataset into an owned, monetizable product: sign it, host it behind a membership+version gate, sell it over HTTP with resumable hash-verified downloads, anonymize it before it leaves the host, fine-tune on it with provenance, and discover it across stacks. This is P2–P6 of the decentralized fine-tuning plan; P1 (the type=dataset catalog/CRD side) is #640, the complementary half.

Base: this stacks on #639 (it reuses internal/research/groupauth for member tokens — the dataset server imports groupauth, not P1's CRD). The only shared file is groupauth.go (+50 lines: additive Mint/RegisterHash/HashToken). When #639 merges, rebasing onto main drops nothing from this diff (it's already P2–P6 only). Related: #640 (catalog/type=dataset).

What's in it (internal/dataset/)

| Piece | What it does |
| --- | --- |
| P2 versionlog.go + signer.go | secp256k1-signed, hash-chained version log over a canonical length-prefixed, domain-separated digest; offline Verify walks v1→head rejecting reorder/tamper/middle-removal. No new key custody (same key kind as on-chain registration). |
| P2 store.go / entitlement.go / artifacts.go | atomic temp+rename persistence of {versions, entitlements}; token-hash→max-version map enforcing the version scope the bare membership gate can't; seekable file-backed artifact source. |
| P2 server.go | device-auth (reuses groupauth) + payment-minted member tokens + entitle() gate (membership and version) + Range/206 download via http.ServeContent, sending the whole-file hash on 200 and 206 alike. Member tokens persist by hash → survive restart (no re-pay). |
| P3 client.go + cmd/obol/dataset.go | obol dataset from/version/publish/approve/verify/status + obol buy dataset: a Range-resumable download that verifies the whole-file SHA-256 against the server's commitment and fails closed on mismatch / missing hash. |
| P4 skill dataset-anonymize | PII masking before sale: built-in regex redactor by default, BYO OBOL_ANONYMIZER_MODEL for ML-grade detection (cached under the obol data dir). Deterministic typed placeholders. |
| P5 skill finetune-backend | one contract over mock/mlx-lora/unsloth/axolotl/torchtune; run.manifest binds dataset_hash to the bought version, giving provenance from a model back to its data. |
| P6 docs/guides/monetize-dataset.md | E2E guide. |
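
To make the P2 version log concrete, here is a minimal sketch of a length-prefixed, domain-separated digest and an offline chain walk. It is an illustration under assumptions, not the code in versionlog.go: the `Version` struct, the `domainTag` value, and the field order are hypothetical, and the secp256k1 signature check is left out because it needs a non-stdlib library.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

// domainTag is a hypothetical domain separator; the PR's real value is not shown.
const domainTag = "obol/dataset/version/v1"

// Version is a simplified stand-in for the PR's DatasetVersion.
type Version struct {
	Number       uint64
	ManifestHash []byte // content address of the bundle manifest
	PrevDigest   []byte // digest of the previous version; empty for v1
}

// writeField appends an 8-byte big-endian length and then the bytes, so
// field boundaries cannot shift without changing the digest.
func writeField(buf *bytes.Buffer, b []byte) {
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], uint64(len(b)))
	buf.Write(n[:])
	buf.Write(b)
}

// Digest computes the canonical digest of one version.
func Digest(v Version) []byte {
	var buf bytes.Buffer
	writeField(&buf, []byte(domainTag))
	var num [8]byte
	binary.BigEndian.PutUint64(num[:], v.Number)
	writeField(&buf, num[:])
	writeField(&buf, v.ManifestHash)
	writeField(&buf, v.PrevDigest)
	sum := sha256.Sum256(buf.Bytes())
	return sum[:]
}

// Append builds the next version, linked to the current head.
func Append(chain []Version, manifestHash []byte) []Version {
	var prev []byte
	if len(chain) > 0 {
		prev = Digest(chain[len(chain)-1])
	}
	next := Version{Number: uint64(len(chain) + 1), ManifestHash: manifestHash, PrevDigest: prev}
	return append(chain, next)
}

// VerifyChain walks v1..head and rejects gaps, reordering, and broken links.
// Signature verification is intentionally omitted from this sketch.
func VerifyChain(chain []Version) error {
	var prev []byte
	for i, v := range chain {
		if v.Number != uint64(i+1) {
			return errors.New("version numbers are not contiguous from 1")
		}
		if !bytes.Equal(v.PrevDigest, prev) {
			return fmt.Errorf("version %d does not link to its predecessor", v.Number)
		}
		prev = Digest(v)
	}
	return nil
}

func main() {
	var chain []Version
	for _, m := range []string{"m1", "m2", "m3"} {
		chain = Append(chain, []byte(m))
	}
	fmt.Println("intact:", VerifyChain(chain))
	// Dropping the middle entry breaks both numbering and linkage.
	fmt.Println("middle removed:", VerifyChain([]Version{chain[0], chain[2]}))
}
```

Hash linking alone does not protect the head entry, since nothing follows it; in this sketch that gap would be closed by the per-version signature the PR describes.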

groupauth gains additive Mint/RegisterHash/HashToken so a settled payment (not an owner approval) can mint a version-scoped token, and the server can rehydrate paying members by hash after a restart.
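
A rough sketch of the token-hash to max-version idea follows. It assumes a token is stored only as its hex SHA-256 and that an entitlement is a ceiling (a token bought for version N may fetch versions 1..N); both are assumptions about the design, and `Entitlements`, `Grant`, and `Allowed` are hypothetical names. `HashToken` borrows the name from the PR, but its signature here is a guess.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// HashToken returns the hex SHA-256 of a bearer token, so that only the hash
// ever needs to be persisted. (Signature assumed, not taken from groupauth.)
func HashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// Entitlements maps a token hash to the highest version its holder may fetch.
type Entitlements struct {
	mu      sync.RWMutex
	ceiling map[string]int
}

func NewEntitlements() *Entitlements {
	return &Entitlements{ceiling: make(map[string]int)}
}

// Grant records the version ceiling for a token, e.g. after a settled payment.
// An existing, higher ceiling is never lowered.
func (e *Entitlements) Grant(token string, maxVersion int) {
	h := HashToken(token)
	e.mu.Lock()
	defer e.mu.Unlock()
	if maxVersion > e.ceiling[h] {
		e.ceiling[h] = maxVersion
	}
}

// Allowed reports whether the token is known and covers the requested version.
func (e *Entitlements) Allowed(token string, version int) bool {
	e.mu.RLock()
	defer e.mu.RUnlock()
	limit, ok := e.ceiling[HashToken(token)]
	return ok && version >= 1 && version <= limit
}

func main() {
	e := NewEntitlements()
	e.Grant("paid-for-v2", 2)
	fmt.Println(e.Allowed("paid-for-v2", 2)) // within scope
	fmt.Println(e.Allowed("paid-for-v2", 3)) // newer version, not paid for
	fmt.Println(e.Allowed("unknown", 1))     // not a member at all
}
```

Keying the map by hash is what would let a server reload entitlements after a restart without ever having stored the raw tokens.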

Validation

  • Unit: table-driven + httptest + a fuzz over the canonical digest, all under -race; dataset package 85% coverage.
  • flows/hf-surface-smoke.sh — 5/5 surfaces green, the full "decentralized HF" loop end to end:
    1. dataset hub: anonymize → sign → publish → buy (byte-identical, hash-verified)
    2. inference + dataset offers in a federated catalog
    3. fine-tune on a real GPU box (aarch64) — run.manifest dataset_hash == the bought version's manifestHash
    4. discovery via obol-router federation
    5. cross-check via the obol-exex ERC-8004 indexer
  • go build ./..., go vet, full affected-package go test ./... green.

Held for later

plans/dataset-subscription-v1.1-pitch.md — continuous escrow subscriptions (reserve → capture-per-shipped-version → void), gated on the batch-settlement payout leg.

Note on the router

Surfacing datasets through obol-router's federated catalog needed a 1-line aggregator change (discoveryEntries() — pass non-routable offers through while keeping them out of chat-routing). That lives with the router PR's codebase, not here.
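
The shape of that aggregator change can be illustrated with a small sketch. Everything here is hypothetical (the `Offer` type, the `routable` rule, and both function bodies); it only shows the idea of one list feeding discovery while a filtered view feeds chat routing, and is not the router's code.

```go
package main

import "fmt"

// Offer is a hypothetical, simplified catalog entry.
type Offer struct {
	Name string
	Type string // e.g. "inference" or "dataset"
}

// routable is an assumed rule: only inference offers can serve chat traffic.
func routable(o Offer) bool { return o.Type == "inference" }

// discoveryEntries passes every offer through, routable or not.
func discoveryEntries(offers []Offer) []Offer {
	return append([]Offer(nil), offers...)
}

// chatRoutes keeps non-routable offers out of chat routing.
func chatRoutes(offers []Offer) []Offer {
	var out []Offer
	for _, o := range offers {
		if routable(o) {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	offers := []Offer{{"chat-a", "inference"}, {"corpus-b", "dataset"}}
	fmt.Println(len(discoveryEntries(offers)), "discoverable")
	fmt.Println(len(chatRoutes(offers)), "chat-routable")
}
```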

bussyjd added 6 commits June 14, 2026 17:15
…(P2)

The serving core for type=dataset offers: a signed, hash-chained version
log plus an owner-hosted HTTP server that streams versioned dataset
artifacts to paying or owner-admitted members. No bytes leave the host
un-gated.

internal/dataset/:
- versionlog.go  signed (secp256k1), hash-chained DatasetVersion log over a
  canonical, length-prefixed, domain-separated digest; offline Verify walks
  v1..head rejecting reorder / tamper / middle-removal. signer.go is the
  secp256k1 adapter (same key kind as on-chain registration; no new custody).
- store.go       atomic temp+rename JSON persistence of {versions,
  entitlements} so a continuously-served dataset survives a restart.
- entitlement.go token-hash -> max-version map enforcing the version scope
  the bare membership gate cannot express.
- artifacts.go   file-backed, seekable artifact source (Range-ready).
- server.go      device-auth (reuses groupauth) + payment-minted member
  tokens + entitle() gate (membership AND version) + Range/206 download via
  http.ServeContent, sending the whole-file hash on 200 and 206 alike.
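
As an illustration of the server.go item above, the following sketch gates a download and serves it through http.ServeContent with the whole-file hash attached to every response. It is a simplification under assumptions: the artifact is an in-memory string rather than a file-backed source, `authorized` is a hypothetical stand-in for entitle(), and the route is made up. The X-Dataset-File-Hash header name is the one the P3 commit message mentions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// artifactHandler gates a download and serves it with Range support.
// authorized stands in for the PR's entitle() check (membership and version).
func artifactHandler(data string, authorized func(*http.Request) bool) http.Handler {
	sum := sha256.Sum256([]byte(data))
	fileHash := hex.EncodeToString(sum[:])
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !authorized(r) {
			http.Error(w, "not entitled", http.StatusForbidden)
			return
		}
		// Set before ServeContent so 200 and 206 both carry the whole-file hash.
		w.Header().Set("X-Dataset-File-Hash", fileHash)
		http.ServeContent(w, r, "sft.jsonl", time.Time{}, strings.NewReader(data))
	})
}

func allowAll(*http.Request) bool { return true }
func denyAll(*http.Request) bool  { return false }

// get issues an in-process request and returns status, body, and hash header.
func get(h http.Handler, rangeHeader string) (int, string, string) {
	req := httptest.NewRequest(http.MethodGet, "/dataset/v1", nil)
	if rangeHeader != "" {
		req.Header.Set("Range", rangeHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String(), rec.Header().Get("X-Dataset-File-Hash")
}

func main() {
	h := artifactHandler("0123456789", allowAll)
	fmt.Println(get(h, ""))         // full download
	fmt.Println(get(h, "bytes=4-")) // resumed download
}
```

Sending the hash of the whole file even on a partial response is what lets a resuming client verify the assembled result rather than just the last chunk.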

groupauth gains additive Mint / RegisterHash / HashToken so a settled
payment (not an owner approval) can mint a version-scoped token, and so the
server can rehydrate paying members by hash after a restart.
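
Rehydrating members after a restart depends on the persisted state never being observed half-written. A minimal sketch of the temp+rename pattern that store.go is described as using is shown below; the `State` shape, the file names, and `SaveAtomic`/`Load` are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// State is a simplified stand-in for the persisted {versions, entitlements}.
type State struct {
	Versions     []string       `json:"versions"`
	Entitlements map[string]int `json:"entitlements"` // token hash -> max version
}

// SaveAtomic writes to a temp file in the same directory, then renames it over
// the target, so readers see either the old file or the new one, never a mix.
func SaveAtomic(path string, s State) error {
	data, err := json.Marshal(s)
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".state-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // cleans up on failure; harmless after rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to disk before publishing
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// Load reads the state back, as a server would on startup.
func Load(path string) (State, error) {
	var s State
	data, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(data, &s)
	return s, err
}

// demo saves twice into a scratch directory and reloads the result.
func demo() (State, error) {
	dir, err := os.MkdirTemp("", "store-demo")
	if err != nil {
		return State{}, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "state.json")
	first := State{Versions: []string{"v1"}, Entitlements: map[string]int{"abc": 1}}
	if err := SaveAtomic(path, first); err != nil {
		return State{}, err
	}
	second := State{Versions: []string{"v1", "v2"}, Entitlements: map[string]int{"abc": 2}}
	if err := SaveAtomic(path, second); err != nil {
		return State{}, err
	}
	return Load(path)
}

func main() {
	fmt.Println(demo())
}
```

The temp file is created in the target's directory because a rename is only atomic within one filesystem.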

Tests: table-driven + httptest + a fuzz over the canonical digest + -race;
85% coverage. Version metadata reaches buyers via the 402 extra.dataset
wired in P1.

obol dataset from|version|publish|approve|verify|status and obol buy
dataset, wiring the P2 server into the CLI.

- internal/dataset/client.go  Fetch: Range-resumable download that verifies
  the whole-file SHA-256 against the server's X-Dataset-File-Hash commitment;
  refuses an unverifiable download and rejects a hash mismatch without ever
  finalizing the output file.
- bundle.go   reads a bundle dir's manifest.json -> manifestHash + the
  training artifact's whole-file hash + size (generic manifest envelope).
- keyfile.go  load-or-create the owner's secp256k1 signing key (0600).
- cmd/obol/dataset.go  owner commands (ingest a bundle as a signed version,
  host the artifact server + Cloudflare tunnel, admit workers, walk the
  signed chain offline, show status) + `obol buy dataset`.
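
The verifying download described for client.go can be sketched as follows. This is an assumption-laden illustration, not the PR's Fetch: the `.part` suffix, the signature, and the error texts are invented, and only the X-Dataset-File-Hash header name comes from the commit message. The `demo` helper spins up a local test server so the sketch runs on its own.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"strings"
	"time"
)

// Fetch resumes from dest+".part" and finalizes dest only after the hash matches.
func Fetch(url, dest string) error {
	part := dest + ".part"
	f, err := os.OpenFile(part, os.O_CREATE|os.O_RDWR|os.O_APPEND, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()
	st, err := f.Stat()
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	if st.Size() > 0 { // ask only for the bytes we do not have yet
		req.Header.Set("Range", fmt.Sprintf("bytes=%d-", st.Size()))
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	want := resp.Header.Get("X-Dataset-File-Hash")
	if want == "" {
		return fmt.Errorf("no file hash from server: refusing unverifiable download")
	}
	if resp.StatusCode == http.StatusOK {
		err = f.Truncate(0) // full body sent: discard partial bytes
	} else if resp.StatusCode != http.StatusPartialContent {
		err = fmt.Errorf("unexpected status %s", resp.Status)
	}
	if err != nil {
		return err
	}
	if _, err := io.Copy(f, resp.Body); err != nil {
		return err
	}
	if _, err := f.Seek(0, io.SeekStart); err != nil { // rewind to hash it all
		return err
	}
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return err
	}
	if got := hex.EncodeToString(h.Sum(nil)); got != want {
		os.Remove(part) // fail closed: never finalize a bad file
		return fmt.Errorf("hash mismatch: got %s, want %s", got, want)
	}
	return os.Rename(part, dest)
}

// demo serves a small artifact locally and resumes from the given partial bytes.
func demo(partial string) (string, error) {
	const data = "hello, dataset\n"
	sum := sha256.Sum256([]byte(data))
	hash := hex.EncodeToString(sum[:])
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Dataset-File-Hash", hash)
		http.ServeContent(w, r, "sft.jsonl", time.Time{}, strings.NewReader(data))
	}))
	defer srv.Close()
	dir, err := os.MkdirTemp("", "fetch-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	dest := dir + "/sft.jsonl"
	if err := os.WriteFile(dest+".part", []byte(partial), 0o600); err != nil {
		return "", err
	}
	if err := Fetch(srv.URL, dest); err != nil {
		return "", err
	}
	out, err := os.ReadFile(dest)
	return string(out), err
}

func main() {
	fmt.Println(demo("hello")) // resumed and verified
	fmt.Println(demo("jello")) // corrupt partial bytes: rejected
}
```

The second call shows why the check covers the whole file: the resumed tail is correct, but the bytes already on disk are not, and only a whole-file hash catches that. A fully downloaded `.part` file would draw a 416 from the server, which this sketch simply reports as an error.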

Local end-to-end validated: from -> publish -> buy yields a byte-identical,
hash-verified artifact; append produces a distinct signed version; offline
verify walks the chain. Dataset package coverage ~84%.

A pluggable privacy stage that strips PII from a dataset's JSONL before it
is published or sold. Default is a dependency-free regex redactor (emails,
IPs, keys, card/SSN-shaped numbers, home paths, phones); set
OBOL_ANONYMIZER_MODEL to a Hugging Face token-classification PII model for
ML-grade detection (cached under the obol data dir via HF_HOME). Typed,
deterministically-indexed placeholders keep cross-message references
linkable without revealing the underlying values.
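
The "typed, deterministically-indexed placeholder" idea can be shown with a small sketch. It is written in Go for consistency with the rest of the PR, which says nothing about the skill's implementation language; the two patterns, the `[KIND_n]` format, and the `Redactor` type are illustrative assumptions and cover far less than the categories listed above.

```go
package main

import (
	"fmt"
	"regexp"
)

type rule struct {
	kind string
	re   *regexp.Regexp
}

// Illustrative patterns only; a real redactor needs many more, and stricter ones.
var rules = []rule{
	{"EMAIL", regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)},
	{"IP", regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)},
}

// Redactor gives each distinct raw value one stable, typed placeholder.
type Redactor struct {
	seen   map[string]string // kind + raw value -> placeholder
	counts map[string]int    // kind -> last index handed out
}

func NewRedactor() *Redactor {
	return &Redactor{seen: map[string]string{}, counts: map[string]int{}}
}

// placeholder returns the existing placeholder for a value or mints the next one.
func (r *Redactor) placeholder(kind, raw string) string {
	key := kind + "\x00" + raw
	if p, ok := r.seen[key]; ok {
		return p
	}
	r.counts[kind]++
	p := fmt.Sprintf("[%s_%d]", kind, r.counts[kind])
	r.seen[key] = p
	return p
}

// Redact masks every match; the same value maps to the same placeholder
// across calls on one Redactor, so references stay linkable.
func (r *Redactor) Redact(text string) string {
	for _, rl := range rules {
		kind := rl.kind
		text = rl.re.ReplaceAllStringFunc(text, func(m string) string {
			return r.placeholder(kind, m)
		})
	}
	return text
}

func main() {
	r := NewRedactor()
	fmt.Println(r.Redact("mail alice@example.com from 10.0.0.1"))
	fmt.Println(r.Redact("cc bob@example.com and alice@example.com"))
}
```

Reusing one `Redactor` across all records of a dataset is what keeps the indexing consistent from message to message.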

Embedded skill: internal/embed/skills/dataset-anonymize/.
Validated: masks email/IP/key/path/card with no raw leak; deterministic
indexing across records; embed skills test green.

One contract (run(dataset, base_model, hyperparams) -> adapter + eval +
run.manifest) over mock / mlx-lora / unsloth / axolotl / torchtune backends,
selected by --backend. Every backend reads the same sft.jsonl artifact. The
runner binds each result to the dataset's content-address (manifestHash) in
run.manifest — the provenance link from a fine-tuned model back to the data
it trained on. The shared idea across real backends is "regex-extract the
eval metric from stdout"; --dry-run validates the dataset without invoking a
backend; the mock backend validates the contract + provenance anywhere.
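
A sketch of what such a contract might look like, with a mock backend and the stdout metric extraction, is given below. Only the dataset_hash field name is taken from this PR; the `Backend` interface, the other manifest fields, the `eval_loss` log format, and the regex are hypothetical, and the skill itself may not be written in Go.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

// RunManifest is a simplified stand-in for run.manifest.
type RunManifest struct {
	Backend     string  `json:"backend"`
	BaseModel   string  `json:"base_model"`
	DatasetHash string  `json:"dataset_hash"` // the dataset's manifestHash
	EvalLoss    float64 `json:"eval_loss"`
}

// Backend is the assumed common contract: train and return the tool's stdout.
type Backend interface {
	Name() string
	Train(datasetPath, baseModel string) (stdout string, err error)
}

// mockBackend exercises the contract without a GPU or any training library.
type mockBackend struct{}

func (mockBackend) Name() string { return "mock" }
func (mockBackend) Train(_, _ string) (string, error) {
	return "step 10\neval_loss=0.4200\n", nil
}

// evalLossRe assumes the backend prints a line such as "eval_loss=0.42".
var evalLossRe = regexp.MustCompile(`eval_loss\s*[=:]\s*([0-9]*\.?[0-9]+)`)

// parseEvalLoss regex-extracts the eval metric from backend output.
func parseEvalLoss(stdout string) (float64, error) {
	m := evalLossRe.FindStringSubmatch(stdout)
	if m == nil {
		return 0, errors.New("no eval_loss found in backend output")
	}
	return strconv.ParseFloat(m[1], 64)
}

// Run trains with any backend and binds the result to the dataset's hash.
func Run(b Backend, datasetPath, manifestHash, baseModel string) (RunManifest, error) {
	out, err := b.Train(datasetPath, baseModel)
	if err != nil {
		return RunManifest{}, err
	}
	loss, err := parseEvalLoss(out)
	if err != nil {
		return RunManifest{}, err
	}
	return RunManifest{
		Backend:     b.Name(),
		BaseModel:   baseModel,
		DatasetHash: manifestHash,
		EvalLoss:    loss,
	}, nil
}

func main() {
	m, err := Run(mockBackend{}, "sft.jsonl", "abc123", "some-base-model")
	fmt.Printf("%+v %v\n", m, err)
}
```

Because the runner, not the backend, writes the dataset hash into the manifest, every backend would produce the same provenance link regardless of which training tool ran.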

Embedded skill: internal/embed/skills/finetune-backend/.
Validated: mock run emits eval.json + run.manifest with dataset_hash ==
manifestHash; dry-run path; embed test green.

End-to-end user guide: anonymize -> record a signed version -> publish
(host + tunnel + membership/version gate) -> federated discovery -> buy
(verifying download) -> fine-tune with provenance. Discovery is the existing
type-agnostic catalog: a priced dataset surfaces in /api/services.json with
its pinned version metadata and the obol-router federates it unchanged (the
catalog-entry shape is schema-validated in the type=dataset catalog tests).
No central hub — each operator owns their dataset.

docs/guides/monetize-dataset.md.

flows/hf-surface-smoke.sh validates the "decentralised HF" surfaces end to
end: (1) dataset hub anonymize -> sign -> publish -> buy (resumable,
hash-verified); (2) inference + dataset offers in a federated catalog;
(3) fine-tune on a real GPU box (spark) with run.manifest bound to the
bought dataset's content-address; (4) discovery via obol-router federation;
(5) cross-check against the obol-exex ERC-8004 indexer. Each surface is
independent (missing prereq -> SKIP).

plans/dataset-subscription-v1.1-pitch.md is the held-for-later, diagram-based
pitch for continuous escrow subscriptions (reserve multi-deadline voucher ->
capture per shipped version -> void the tail), honest about the new per-epoch
capture wiring it needs and its dependency on the payout leg.