RFC 0010 (draft): Analyst-skill seams on a pipeline-shaped session substrate #11
david-steeves wants to merge 3 commits into
- Lead Summary with structural-integrity framing
- Call out local SQLite / embedded log / object store as substrate options
- Cite enabling work by PR number (openclaw#5, openclaw#7) rather than contested RFC IDs
- Soften PII/PHI claims to "declare redaction intent"; defer forensic attestation to a follow-up RFC
- transform verdict retains original payload in write-restricted store
- escalate is async with pending-human-review marker + Policy-set timeout
- New goal: stage plugins are passive I/O; no payload mutation
- Define empty-evidence stamp in Proposal
- Drop "in different costumes" metaphor; tighten throat-clearing
- Anchor "seam" terminology with a definition
- Expand Unresolved Questions: supply chain, ToCToU, multi-analyst conflict, contract-RFC location
Codex review: needs real behavior proof before merge.
Reviewed June 8, 2026, 6:08 PM ET / 22:08 UTC.

Summary
Reproducibility: not applicable. This is an RFC proposal, not a bug report. I checked current main for matching analyst-skill pipeline text and found no existing implementation or RFC that would make it obsolete.

Review metrics: 2 noteworthy metrics.

Merge readiness
Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:
Review details

Best possible solution: Keep this as a draft RFC until maintainer discussion settles the architecture, then update metadata/status and link an implementation issue before merge.

Do we have a high-confidence way to reproduce the issue? Not applicable; this is an RFC proposal, not a bug report. I checked current main for matching analyst-skill pipeline text and found no existing implementation or RFC that would make it obsolete.

Is this the best way to solve the issue? Unclear as a final solution. The draft is a plausible architecture direction, but it deliberately defers the API/config contract and calls out unresolved supply-chain and policy semantics that need maintainer direction.

Full review comments:
Overall correctness: patch is correct
AGENTS.md: not found in the target repository.
Codex review notes: model gpt-5.5, reasoning high; reviewed against e938e93198f4.

Label changes:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.
Working on perf numbers to compare the impact of just the shape, without reviewers doing real analysis.
👋 First-time contributor opening RFC 0010 (draft): Analyst-skill seams on a pipeline-shaped session substrate (this PR).

TL;DR of the RFC. Shape OpenClaw's session/transcript substrate as a 3-stage data pipeline (

Why I think now is the moment. Three in-flight threads converge on the same gap (no named place where policy can run on flowing session data, and no structural separation between producer and evaluator):
Bench numbers

I built a bench lab to put numbers on the cost:
M5 Pro native, 10 concurrent synthetic agents, 3 payload size classes, 6 substrate variants, 3 runs each, median reported. Headline numbers (p50):
* file-share's 100 MB number is fake-cheap: RSS ballooned to 6.6 GB holding the read-back cache. Honest substrate comparison has to declare read-back semantics.

What the numbers tell me, and the design implication I want maintainer steer on

The pipeline shape is structurally cheap at chat-message sizes, but the multiplier vs file-share grows with payload size: 35× at 5 MB, ~180× at 100 MB. That turns into a hot complaint thread the day someone runs a long-context coding session emitting multi-MB tool outputs.

So my read: the seam-and-evidence pipeline is not a free upgrade you flip on for every OpenClaw instance. It's a real cost worth paying only when something justifies it: regulated deployments, parental-controls, hosted multi-tenant, finance/legal claws with mandatory analyst review. For a hobbyist on a single calculator-claw, paying 5× latency to stamp empty evidence into 5 SQLite databases is the wrong tradeoff.

This flips what I'd originally scoped as a post-MVP nice-to-have into something the MVP probably has to anticipate from day one: governance is configurable, not default-on, and the configuration unit is per-claw (or per-instance), not per-runtime.

A Gateway-level "governed mode" setting, configurable per-claw:
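As a rough illustration of what such a setting could look like, here is a minimal Python sketch of a per-claw governance switch resolved at the Gateway. Everything named here is an assumption made for the example: `GovernanceMode`, `ClawGovernance`, `GatewayConfig`, the three mode names, and the claw IDs are hypothetical and are not taken from the RFC or the harness. Only the `pii-reviewer` team name appears elsewhere in this thread.

```python
from dataclasses import dataclass, field
from enum import Enum


class GovernanceMode(Enum):
    """Hypothetical strictness levels; the RFC does not name these."""
    OFF = "off"              # claw talks to its store directly, no seams
    PASS_ONLY = "pass-only"  # pipeline shape on, no analysts attached
    GOVERNED = "governed"    # analysts run at the seams


@dataclass(frozen=True)
class ClawGovernance:
    """Governance settings for a single claw (illustrative fields only)."""
    mode: GovernanceMode = GovernanceMode.OFF
    analysts: tuple[str, ...] = ()


@dataclass
class GatewayConfig:
    """Gateway-level config: a default plus per-claw overrides."""
    default: ClawGovernance = field(default_factory=ClawGovernance)
    per_claw: dict[str, ClawGovernance] = field(default_factory=dict)

    def resolve(self, claw_id: str) -> ClawGovernance:
        # Claws without an explicit entry fall back to the default,
        # so existing claws need no changes to keep working.
        return self.per_claw.get(claw_id, self.default)


config = GatewayConfig(
    per_claw={
        "finance-claw": ClawGovernance(GovernanceMode.GOVERNED, ("pii-reviewer",)),
    }
)
print(config.resolve("finance-claw").mode.value)
print(config.resolve("calculator-claw").mode.value)
```

Keeping the default at `OFF` and resolving per claw mirrors the "configurable, not default-on" position argued above; whether the real configuration unit is a claw ID or something else is left open here.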
Three properties: (a) existing claws need zero code changes; (b) different claws can be governed at different strictness levels without forking the runtime; (c) a compromised claw cannot bypass the pipeline, because the only "store" handle it has is one end of the pipe. Analog: service-mesh sidecar (Istio/Linkerd).

If MVP ships pipeline as "on for everything," the bench numbers predict an adoption-blocking complaint thread. If MVP ships it as "available, opt-in per claw," the same numbers become a feature: of course it's slower, that's what governance is paying for.

Newbie questions I'd love steer on before I push this further
Happy to iterate on shape, scope, or terminology.
The previous draft was framed as "RFC 0012: reference review teams + policy composition + clarifying prompts." Wrong shape.

Bench numbers + the per-claw "governed mode" reframing already shipped to RFC 0010 PR (openclaw/rfcs#11 comment 4656114434 on 2026-06-09). We are awaiting maintainer reaction; opening follow-up RFCs now is pushing rope.

Normal in mature RFC processes (TC39, W3C, Rust, K8s KEPs):
- Stage 1: reference impls + harness in a sibling repo, cited from the open PR as evidence (we are here)
- Stage 2: stabilize through implementation experience, tagged release
- Stage 3: open follow-up RFCs at maintainer-requested granularity

Plan changes:
- All artifacts (review teams, policies, control-plane config, clarifying prompts, mock data, harness) move into one new public repo `openclaw-test-harnesses`. openclaw-rfcs stays docs-only.
- Reference review teams renamed: hr-reviewer -> pii-reviewer, snowflake-reviewer -> phi-reviewer (describes review purpose, not source format; mock-data dirs keep source-shape names).
- Public from day 1: accelerates an answer to Q3 of the 2026-06-09 reply (research/review WG question) by putting concrete evidence in front of maintainers.
- Deliverable is one PR comment on openclaw/rfcs#11, not a new RFC PR. Cite a pinned SHA / tagged release (rfc-0010-v1), WPT / test262 precedent. Include a Proof Shape table with deep-links.
- Stage 2 / Stage 3 follow-up RFCs (0012/0013/0014/0015) listed as placeholders only, drafted only if maintainers ask.

New file: plans/pr-comment-when-impl-ships.md, the draft PR comment itself, ready to paste at step 10 (after rfc-0010-v1 tag + sanity gates green). Numbers placeholders to fill from the perf-test-vs-bench run.

Wall-clock budget drops from ~8h to ~6.5h (no RFC prose drafting).
Cross-references in implementation.md and HANDOFF.md updated to reflect the actual 2026-06-09 reply state — bench numbers ARE shipped; the M5 re-run produces final citable hardware numbers but isn't gating the evidence track.
Stage-1 evidence track for openclaw/rfcs#11 (RFC 0010 - analyst-skill seams on a pipeline-shaped session substrate). Reference implementation + functional test harness + perf harness that delegates to openclaw-pipeline-bench.

Built per plan: openclaw-pipeline-bench/plans/2026-06-19-policy-harness-extension.md

What's included:
- control-plane/: openclaw.config.yaml JSON Schema + Renovate-style preset composition example (pii-reviewer + phi-reviewer wired up).
- clarifying-prompts/: decision-tree YAML schema + 2 upload + 2 analyze prompt sets with skip-logic, conditional follow-ups, and enforcement modes (advisory / suppresses / escalates). Novel surface; closest analog Stripe Radar 4-action.
- policies/: layered policy YAML (system + shared-lib + team-scoped). Composition rule: max(verdicts), PASS < WARN < BLOCK, scoped layer cannot downgrade system. Precedent: AWS IAM explicit-deny + MS Purview most-restrictive-wins.
- teams/pii-reviewer/ + teams/phi-reviewer/: 6 reference analyst Python modules doing real work (SSN regex, name+DOB cooccurrence, salary detection, PHI marker, patient-id re-id vector, free-text PHI scan). Renamed from hr-reviewer/snowflake-reviewer per plan.
- mock-data/: deterministic Faker/Synthea-shaped generators + annotated CSVs with row-range -> expected-verdict mapping. HR-PII 10k rows; PHI-snowflakey 5k rows (override --rows 250000 for canonical artifact).
- harness/policy_eval/: composition engine + ingest+assert runner + 14 pytest asserts (all green). Verdict tape emitted per run.
- harness/perf/: pipeline-real-claws and pipeline-real-claws-pass-only variants with their own SQLite-3-stage substrate that mirrors bench contract but plugs in real analysts at the seams. Workload streamed from mock CSVs.
- scripts/generate_comparison_report.py: side-by-side report vs sibling openclaw-pipeline-bench bench/results/.
- Makefile + docker-compose.yml: install / gen-data / policy-test / perf-test / perf-test-vs-bench / all / clean.
Verified at commit time:
- make install passes
- gen-data produces both csv.gz files deterministically
- policy-eval runner: 1800/1800 row-level assertions pass
- pytest asserts: 14/14 pass
- smoke perf run emits RESULT line (p50 ~ 4ms in-memory on MacBook)
- docker compose build succeeds for both perf variants

Not yet run (M4 Pro session work):
- Full perf bench (3 runs x 360s per variant)
- perf-test-vs-bench comparison report with real numbers
- rfc-0010-v1 tag once sanity gates green

Naming note: parent plan uses hyphenated policy-eval; Python module is policy_eval (snake_case). Documented in harness/policy_eval/__init__.py.
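The composition rule listed under policies/ above (max(verdicts) with PASS < WARN < BLOCK, where a scoped layer cannot downgrade the system layer) can be sketched in a few lines of Python. This is an illustrative sketch only, not the code in harness/policy_eval/; `Verdict`, `compose`, and the layer keys are hypothetical names chosen for the example.

```python
from enum import IntEnum


class Verdict(IntEnum):
    """Ordered so that a larger value is more restrictive."""
    PASS = 0
    WARN = 1
    BLOCK = 2


def compose(layer_verdicts: dict[str, Verdict]) -> Verdict:
    """Most-restrictive-wins: the composed verdict is the max across layers.

    Because the result is a max, a team-scoped PASS can never lower a
    system-level WARN or BLOCK.
    """
    if not layer_verdicts:
        raise ValueError("at least one policy layer is required")
    return max(layer_verdicts.values())


result = compose(
    {"system": Verdict.BLOCK, "shared-lib": Verdict.PASS, "team": Verdict.PASS}
)
print(result.name)  # BLOCK
```

Encoding the ordering in an `IntEnum` means the "cannot downgrade" property falls out of `max` rather than needing a separate check per layer.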
Synthesized 3 Amazon Principal SDE workshop perspectives on David's framing problem:

> "Data analysis either happens on the substrate/pipeline or inside openclaw as folks try to put governance in each claw or processed/filtered/etc inside openclaw proper. This analysis needs to happen somewhere... I am not saying this is the invention of doing data governance in openclaw, I am saying this is the correct architecture and this is what it'll cost. I don't know or claim to know the cost of the other data analysis and governance designs for openclaw, but these numbers are pretty sick."

Output: a drop-in PR-comment section he can paste verbatim into the openclaw/rfcs#11 follow-up when he green-lights. Five sub-sections:

## Where governance lives, and what we measured
Names three honest places (per-claw, runtime, substrate) and their cost-of-change shapes. IAM/VPC/S3-SSE analogy anchors the factoring argument. Explicit: "The contribution of this RFC is not that OpenClaw should have governance; that need is uncontroversial. The contribution is where the seam goes."

### Our cost
The bench table verbatim.

### What we did not measure
"We have not benchmarked the per-claw or in-runtime factorings on the same workload, and we are not going to estimate their cost. If anyone runs governance-in-the-runtime or governance-per-claw against the same offered load with comparable analyst work, we will publish the three tables next to each other."

### Why this design pays back: evolution
"The cost of a better detector is a PR against a skill."

### Ask of maintainers
"We are not asking for a ship decision. We are asking for direction on a specific question: do you accept the substrate as the governance seam for OpenClaw?"

Voice is senior Amazon principal: calm, declarative, no marketing, no claim that alternatives are slower, no claim of invention.

Workflow: 4 agents, ~282k subagent tokens. David's call when to paste. RFC update stays parked (Task #8).
…view)

Ready-to-paste markdown comment that:
- Leads with the architectural framing (3 places governance can live; IAM/VPC/S3-SSE analogy; the "contribution is where the seam goes" move)
- Reproduces the M4 bench numbers from the cost-of-governance section, including the real-claws row (8.90 ms p50, 72 eps, 1800/1800 asserts)
- Calls out what was not measured (no per-claw or in-runtime comparison; signed evidence, supply chain, ReDoS, TOCTOU all unmeasured)
- Frames evolution as "the cost of a better detector is a PR against a skill"
- Asks the narrow question: do maintainers accept the substrate as the governance seam? Either answer is useful
- Pins openclaw-test-harnesses@23a42f8 and openclaw-pipeline-bench@0c38e2a

For David's review before posting to openclaw/rfcs#11.
Summary
Proposes shaping OpenClaw's session/transcript substrate as a three-stage data pipeline (`raw → processed → curated`) so that stage boundaries become explicit seams where designated analyst skills can read, classify, transform, or hold payloads, and reserves the substrate, the analyst registry, and Policy artifacts to the Gateway control plane so agent-plane code cannot rewrite its own supervisor.

The pipeline is a contract about shape, not backend: local SQLite, an embedded log, an object store, or a managed remote queue can each back any stage. Analyst skills run in the existing skill runtime and emit verdicts (`pass`, `transform`, `block`, `escalate`) shaped as redacted evidence in the form RFC 0003 already defines.

Why now
Three in-flight threads converge on the same structural gap — a place where policy can be evaluated against flowing session data, with the supervisor structurally separated from the supervised:
Today's substrate has no named moments where policy can run on flowing data, and no structural separation between the code that produces data and the code that evaluates it. Filing this now, before #7 hardens, lets the pipeline shape inform the SDK rather than retrofit onto it.
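To make the seam idea concrete, here is a minimal Python sketch of one record crossing a stage boundary with analysts attached. It is an illustration under stated assumptions, not the RFC's contract: `Record`, `Evidence`, `Analyst`, `cross_seam`, `redact_ssn`, and the in-memory `originals` list (standing in for a write-restricted store) are all hypothetical. Only the stage names and the four verdict names come from the Summary, and `escalate` is simplified here to a synchronous hold.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

STAGES = ("raw", "processed", "curated")  # stage names from the Summary


@dataclass
class Evidence:
    analyst: str
    verdict: str  # "pass", "transform", "block", or "escalate"


@dataclass
class Record:
    payload: str
    stage: str = "raw"
    evidence: list[Evidence] = field(default_factory=list)


# Hypothetical analyst signature: payload in, (verdict, payload) out.
Analyst = Callable[[str], tuple[str, str]]

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_ssn(payload: str) -> tuple[str, str]:
    """Toy analyst: redact anything shaped like a US SSN."""
    if SSN.search(payload):
        return "transform", SSN.sub("[REDACTED]", payload)
    return "pass", payload


def cross_seam(record: Record, analysts: dict[str, Analyst],
               originals: list[str]) -> Record:
    """Run every analyst at the seam, then advance or hold the record."""
    if record.stage == STAGES[-1]:
        raise ValueError("record is already at the final stage")
    next_stage = STAGES[STAGES.index(record.stage) + 1]
    payload = record.payload
    stamps = list(record.evidence)
    retained: list[str] = []
    for name, analyst in analysts.items():
        verdict, new_payload = analyst(payload)
        stamps.append(Evidence(name, verdict))
        if verdict in ("block", "escalate"):
            # Held: the record stays at its current stage, payload untouched.
            return Record(record.payload, record.stage, stamps)
        if verdict == "transform":
            retained.append(payload)  # keep the pre-transform payload
            payload = new_payload
    originals.extend(retained)
    return Record(payload, next_stage, stamps)


originals: list[str] = []
out = cross_seam(Record("ssn 123-45-6789"), {"pii-reviewer": redact_ssn}, originals)
print(out.stage, out.payload)
```

The producer never calls an analyst directly in this sketch; it only hands a record to the seam, which is the structural separation between producing and evaluating code that the paragraph above asks for.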
Scope
Status
status: draft. Opening as a draft PR to start the maintainer-discussion thread per the recently merged RFC lifecycle docs. I'll start a `maintainer-discussion` thread on Discord under my identity.

Reviewer-relevant deltas already applied

v1 → v2 closed must-fix findings from an internal review panel: (a) softened PII/PHI claims to "declare redaction intent" with forensic attestation explicitly deferred; (b) `transform` retains original payload in a write-restricted store for Policy audit; (c) `escalate` is async with `pending-human-review` marker and Policy-configured timeout; (d) stage plugins constrained to passive I/O, with payload mutation exclusively the analyst-skill surface; (e) empty-evidence stamp defined inline; (f) citations use PR numbers rather than contested RFC IDs.

Unresolved questions
Eight, listed in the RFC. The ones I'd most like maintainer steer on early:
First-time contributor; happy to iterate on shape, scope, or terminology.