feat(core): add decompose-to-ui-kit + boolean parity verifiers (Phase 1 of #225) #241
HomenShum wants to merge 11 commits into OpenCoworkAI:main from
Conversation
Adds a new agent tool that decomposes the current artifact into a ui_kits/<slug>/ folder structure (index.html + components/*.tsx + tokens.css + manifest.json + README.md), shaped for handoff to a downstream coding agent (Claude Code, Cursor, etc.).
- New tool factory in packages/core/src/tools/decompose-to-ui-kit.ts follows the existing factory + AgentTool + typebox pattern from done.ts and generate-image-asset.ts.
- New "Decompose to UI Kit" item in the chat AddMenu, gated on having a current design and not currently generating.
- New triggerDecompose store action + decomposePrompt.ts hook, mirroring the polishPrompt.ts pattern but user-triggered (no auto-fire). Sends the prompt as a silent follow-up so the chat reads as one continuous run.
- Output carries schemaVersion: 1 in manifest.json so downstream consumers can evolve safely.
- Decomposition is prompt-driven (the model identifies repeated DOM subtrees and emits the structured plan); the tool just persists to the virtual fs in a single atomic call.
i18n keys added in en + zh-CN. No new dependencies. Closes the Phase 1 ask in OpenCoworkAI#225.
10 new unit tests cover: typical decomposition, slug sanitization, fallback slug, manifest schemaVersion, token CSS grouping, token name normalization, README rendering, empty inputs, return shape, and undefined-fs handling.
Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (1026 desktop + 252 core tests pass)
Signed-off-by: homen <hshum2018@gmail.com>
Adds a deterministic parity verifier the agent calls AFTER decompose_to_ui_kit
and uses to self-correct before calling done. No LLM judge involved — the
parity report is reproducible from the raw HTML / CSS strings.
Three signals comparing source index.html vs ui_kits/<slug>/index.html and
ui_kits/<slug>/tokens.css:
1. Element count parity — structural tag distribution (div/section/button/
h1-h6/table/etc.), weighted 0.4 in overall score
2. Visible text coverage — % of unique source words present in decomposed,
weighted 0.3
3. Token coverage — % of unique hex / rgb / px / rem values from source
captured in tokens.css (gaps capped at 8 to keep agent context small),
weighted 0.3
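Under those weights, the overall deterministic score is a plain weighted sum. A minimal sketch in TypeScript (the signal names and helper are illustrative, not the PR's actual identifiers):

```typescript
// Sketch of the weighted deterministic parity score.
// Each signal is a ratio in [0, 1]; weights follow the 0.4 / 0.3 / 0.3 split above.
interface ParitySignals {
  elementCountParity: number;  // structural tag-distribution similarity
  visibleTextCoverage: number; // fraction of unique source words found in the output
  tokenCoverage: number;       // fraction of source hex/rgb/px/rem values in tokens.css
}

function parityScore(s: ParitySignals): number {
  return (
    0.4 * s.elementCountParity +
    0.3 * s.visibleTextCoverage +
    0.3 * s.tokenCoverage
  );
}
```

A score below the 0.85 threshold would then drive the re-call loop described next.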
Returns a ParityReport with an explicit gaps list. If parityScore < 0.85
the prompt instructs the agent to re-call decompose_to_ui_kit with
adjustments addressing the specific gaps, then re-verify. Iterates at most
twice to avoid loops; final done() summary honestly states the achieved
parityScore + remaining gaps.
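The visible-text coverage signal (2 above) can be sketched as a set-overlap ratio. The tokenizer here is an illustrative simplification, not the PR's actual implementation:

```typescript
// Sketch of signal 2: fraction of unique visible source words that appear
// in the decomposed output. A crude word tokenizer stands in for the real one.
function visibleTextCoverage(source: string, decomposed: string): number {
  const words = (s: string) =>
    new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const src = words(source);
  if (src.size === 0) return 1; // nothing to cover
  const out = words(decomposed);
  let hit = 0;
  for (const w of src) if (out.has(w)) hit++;
  return hit / src.size;
}
```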
Pattern mirrors done.ts: deterministic checker run during the agent's own
turn so it can self-correct before declaring the artifact complete.
7 new unit tests cover: high-parity faithful decomposition, low-parity thin
decomposition, missing artifact handling, hardcoded values absent from
tokens.css, undefined-fs fallback, byte-identical input, and pass/fail
summary text.
decomposePrompt.ts updated for both EN and ZH locales to walk the agent
through the verify-and-iterate loop explicitly.
Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (252 core + all other packages, 17 new tests across
decompose-to-ui-kit + verify-ui-kit-parity)
Signed-off-by: homen <hshum2018@gmail.com>
…ension scoring

Adds the vision-LLM judge counterpart to the existing deterministic verify_ui_kit_parity. Renders the decomposed ui_kits/<slug>/index.html in a hidden window via the host-injected renderUiKit callback, screenshots it, and asks a multimodal model to compare against the source artifact via the host-injected judgeVisualParity callback.

Scoring is BOOLEAN-per-dimension, NOT floating-point — matches NodeBench's established rule patterns (pipeline_operational_standard.md 10-gate boolean catalog, eval_flywheel.md boolean evaluators, agent_run_verdict_workflow.md bounded enum verdicts). The judge answers 12 standard checks on every run (across layout / color / typography / content / components dimensions), each yes/no with an explicit reason string. The aggregate parityScore is DERIVED as passCount/totalChecks (never LLM-arbitrary). Status is a bounded enum (verified / needs_review / needs_iteration / failed) thresholded deterministically:
- 100% passed -> verified
- >=85% passed -> needs_review
- >=60% passed -> needs_iteration
- <60% passed -> failed

Why boolean over floating-point: lower judge variance (yes/no is harder to fudge than a number), every failure has a clear actionable reason, the score is derived rather than LLM-arbitrary, and results are comparable across runs, models, and time. Failure-of-judge counts as failure-of-parity (HONEST_SCORES rule from agentic_reliability.md).

Pattern mirrors generate-image-asset.ts: the host injects two callbacks (renderUiKit, judgeVisualParity). Without them the tool returns status="unavailable" and the agent falls back to the deterministic verifier. decomposePrompt.ts (EN + ZH) updated to call BOTH verifiers and reconcile gaps before deciding to iterate or finish.
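The deterministic threshold mapping can be sketched as follows (names are illustrative; the thresholds come from the commit message above):

```typescript
// Sketch: derived score + bounded-enum status from boolean check results.
type ParityStatus = 'verified' | 'needs_review' | 'needs_iteration' | 'failed';

function deriveStatus(passCount: number, totalChecks: number): {
  parityScore: number;
  status: ParityStatus;
} {
  const parityScore = passCount / totalChecks; // derived, never LLM-reported
  if (parityScore === 1) return { parityScore, status: 'verified' };
  if (parityScore >= 0.85) return { parityScore, status: 'needs_review' };
  if (parityScore >= 0.6) return { parityScore, status: 'needs_iteration' };
  return { parityScore, status: 'failed' };
}
```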
17 new unit tests cover: status thresholds across the verified / needs_review / needs_iteration / failed bands, all-pass/all-fail/partial check sets, missing fs/render/judge callbacks, missing artifacts, missing source image, source image format validation, abort signal threading, and the HONEST_SCORES guarantee that every check carries a reason.
Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (276 core including 17 new + 1026 desktop + others)
Signed-off-by: homen <hshum2018@gmail.com>
…ks + toast feedback

The verify_ui_kit_visual_parity tool was returning status="unavailable" because the host hadn't injected its two callbacks. This commit completes the wiring so the visual judge runs LIVE during decompose. Three new pieces:
1. apps/desktop/src/main/render-ui-kit.ts (~110 LOC)
   Hidden BrowserWindow + offscreen render + capturePage. Mirrors the done-verify.ts pattern. Loads the decomposed ui_kits/<slug>/index.html in a sandboxed offscreen window, waits for did-finish-load + a 1500ms settle window for fonts/CSS, then captures a PNG and returns it as a base64 data URL. Honors AbortSignal + a 12s hard timeout.
2. apps/desktop/src/main/judge-visual-parity.ts (~230 LOC)
   Vision-LLM judge with the same 12 standard boolean parity checks as the in-core tool. Decoupled from cfg plumbing — takes a runVisionPrompt callback the host wires using its existing generation pipeline. Asks the model to answer each check yes/no with a reason, parses defensively (code-fence strip + balanced-brace extract), and returns structured per-check answers that the in-core tool normalizes into a deterministic parityScore + bounded-enum status.
3. apps/desktop/src/main/index.ts wiring
   Constructs both callbacks at runGenerate time and passes them to generateViaAgent's deps. The judge re-uses the SAME model/apiKey/baseUrl/wire/capabilities as the active generation request, so we don't need a separate judge config — whatever model the user picked for generation is the model that judges parity. If the model isn't vision-capable the judge throws and the agent falls back to the deterministic verify_ui_kit_parity.
Bonus: the triggerDecompose store action now surfaces three toasts covering all branches (busy / no-artifact-yet / decomposing-now), with i18n keys in en + zh-CN. Previously the action silently no-op'd when conditions weren't met, which the user caught during dogfood.
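The "code-fence strip + balanced-brace extract" defensive parse can be sketched like this (illustrative, not the exact parser in judge-visual-parity.ts; note that braces inside JSON strings would defeat this simple depth counter):

```typescript
// Defensive extraction of a JSON object from a model reply: strip markdown
// code fences, then JSON.parse the first balanced {...} span.
function extractJson(reply: string): unknown {
  const stripped = reply.replace(/```(?:json)?/g, '').trim();
  const start = stripped.indexOf('{');
  if (start === -1) throw new Error('no JSON object in reply');
  let depth = 0;
  for (let i = start; i < stripped.length; i++) {
    if (stripped[i] === '{') depth++;
    else if (stripped[i] === '}' && --depth === 0) {
      return JSON.parse(stripped.slice(start, i + 1));
    }
  }
  throw new Error('unbalanced braces in reply');
}
```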
Verified:
- pnpm lint clean (1 noShadowRestrictedNames fix on local `escape` var)
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (276 core + 1026 desktop + others)
- Live-DOM dogfood with Playwright in browser mode passed all 12 checks, including the new menu item rendering and a console-error-clean reload
Signed-off-by: homen <hshum2018@gmail.com>
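The host-injection pattern with its honest "unavailable" fallback (renderUiKit / judgeVisualParity as optional deps, as described in the commits above) can be sketched as follows; the type and function names are illustrative:

```typescript
// Sketch: the visual-parity tool takes its callbacks as injected deps and
// degrades to 'unavailable' when the host does not provide them, instead of
// crashing. The agent then falls back to the deterministic verifier.
type RenderUiKitFn = (html: string, signal?: AbortSignal) => Promise<string>;
type JudgeVisualParityFn = (
  sourceImg: string,
  candidateImg: string,
  signal?: AbortSignal,
) => Promise<{ checks: { passed: boolean; reason: string }[] }>;

interface VisualParityDeps {
  renderUiKit?: RenderUiKitFn;
  judgeVisualParity?: JudgeVisualParityFn;
}

function visualParityStatus(deps: VisualParityDeps): 'ready' | 'unavailable' {
  return deps.renderUiKit && deps.judgeVisualParity ? 'ready' : 'unavailable';
}
```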
…t in done summary
Two pieces, no defer:
1. docs/benchmarks/DECOMPOSE_TO_UI_KIT.md (~280 lines)
- Full methodology: 4-stage pipeline + 12 standard boolean checks +
status thresholds + cost methodology + cache key derivation
- Real numbers from four cross-tier runs on the same NodeBench Reports
source (cached): Opus reference, Pro+Pro with iteration loop
demonstration, mixed Flash-Lite-decompose + Pro-judge, cheapest tier
- Specific gap signal showing the verify-and-iterate loop climbing
parity 0.69 -> 0.78 in one self-correcting round
- Recommendation matrix: production / continuous-eval / CI-smoke
- Reproducibility instructions with exact CLI commands
- Honest non-claims section (no claim of universal parity, no claim
gpt-image-1 mockups are production-quality, no claim cheap tier
hits 0.85 first-pass)
- Documented model failures (Kimi K2.6 truncation via OpenRouter,
GLM 4.6V malformed JSON)
- Citations to 2026 VLM-as-judge research + NodeBench's own internal
boolean-evaluator rule patterns
2. decomposePrompt.ts updated (EN + ZH) — done summary MUST report:
- Deterministic verifier passCount/totalChecks + status
- Visual judge passCount/12 + status
- Visual judge judgeCostUsd (this run's self-verify spend)
- Remaining unfixed gaps with failed-check ids + why
"Do NOT hide cost. Do NOT inflate scores. Failed checks count as failed."
The cost surfacing is prompt-driven (the agent always reports it in
chat) — orthogonal to a future UI cost meter, but ensures honest cost
accounting today without renderer surgery.
Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
Signed-off-by: homen <hshum2018@gmail.com>
Tracks the boolean-rubric methodology + reproducible cross-tier results for verify_ui_kit_visual_parity. docs/ is gitignored per CLAUDE.md so this lives at repo root alongside README.md / CONTRIBUTING.md. Re-publishes the decompose-to-ui-kit benchmark previously committed to docs/benchmarks/ (which was silently dropped by .gitignore). Signed-off-by: homen <hshum2018@gmail.com>
When `verify_ui_kit_visual_parity` resolves, the renderer now reads `judgeCostUsd`, `passCount`, `totalChecks`, and `status` defensively from the structured ParityReport and pushes a toast — the operator sees a per-decompose cost row without needing a new dashboard. The variant flips to `success` for verified/needs_review, `info` otherwise. It reads the result shape with bracket access so future contract drift degrades to a silent fallback rather than crashing the renderer. New i18n keys `sidebar.decomposeJudgeResultTitle/Description` in en + zh-CN. README + README.zh-CN now mention the Decompose to UI Kit feature under "What's new" + Generation features so the entry point is discoverable from the repo landing page.
Signed-off-by: homen <hshum2018@gmail.com>
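The defensive bracket-access read can be sketched like this (field names follow the commit message; the helper itself is illustrative, not the PR's code):

```typescript
// Sketch: pull toast fields out of an untyped report with bracket access and
// typed fallbacks, so a missing or renamed field degrades quietly instead of
// throwing in the renderer.
function readJudgeToast(details: unknown): {
  passCount: number;
  totalChecks: number;
  status: string;
  judgeCostUsd: number;
} {
  const d = (details ?? {}) as Record<string, unknown>;
  const num = (v: unknown, fallback: number) =>
    typeof v === 'number' && Number.isFinite(v) ? v : fallback;
  return {
    passCount: num(d['passCount'], 0),
    totalChecks: num(d['totalChecks'], 0),
    status: typeof d['status'] === 'string' ? d['status'] : 'unknown',
    judgeCostUsd: num(d['judgeCostUsd'], 0),
  };
}
```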
Side-by-side hero image: source.png (gpt-image input) vs rendered.png (agent-emitted ui_kit, headless-rendered) from the e2e-opus-final PoC run. parityScore badge (0.90) and status are derived deterministically from the 12-check boolean rubric — passCount / totalChecks — not an LLM-fabricated float. Hosted in this branch under website/public/screenshots/decompose-to-ui-kit.png so the github raw URL renders inline on github.com regardless of upstream merge state. Both README.md and README.zh-CN.md now embed the same image with a matching subcaption that calls out: real run (not mock), source -> rendered direction, and the deterministic-derivation invariant. Signed-off-by: homen <hshum2018@gmail.com>
4-frame reel from the e2e-nodebench-iter PoC run:
1. SOURCE gpt-image input
2. ITER-0 parityScore 0.82, status needs_iteration, 6 gaps surfaced
3. ITER-1 parityScore 0.78, status needs_iteration, 5 gaps surfaced
4. HONEST Δ score -0.04, Δ gaps -1 -- agent fixed some gaps but
regressed on layout, boolean rubric exposes the drift
instead of hiding it (HONEST_SCORES rule)
Both gif (393.9 KB, 1080px wide, 10fps) and mp4 (224.2 KB, h.264 yuv420p)
shipped under website/public/demos/ to match the existing demo asset
convention. README.md + README.zh-CN.md embed the gif inline directly
under the side-by-side hero with subcaption explaining the deliberate
choice to show drift, plus a link to the mp4 for quality-sensitive
viewers.
Hosted in this branch so the github raw URL renders inline on github.com
regardless of upstream merge state.
Signed-off-by: homen <hshum2018@gmail.com>
Closes Phase 1 of OpenCoworkAI#225. Signed-off-by: homen <hshum2018@gmail.com>
Findings

- [Major] Decompose loop success can never trigger on the first clean pass — `decomposePrompt.ts` requires both verifiers to return `verified` or `needs_review`, but the deterministic verifier only returns `ok` or `needs_iteration`. That forces an unnecessary extra iteration even when deterministic parity already passed, which adds avoidable cost and can regress a good bundle. Evidence: `apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:67`, `packages/core/src/tools/verify-ui-kit-parity.ts:35`, `packages/core/src/tools/verify-ui-kit-parity.ts:294`.
  Suggested fix: `const deterministicPass = deterministic.status === 'ok'; const visualPass = visual.status === 'unavailable' || visual.status === 'verified' || visual.status === 'needs_review'; if (deterministicPass && visualPass) { /* call done */ }`

- [Major] `verify_ui_kit_visual_parity({slug})` has no source image on the default runtime path — the tool defaults to `source.png`, but the agent FS is initialized with `index.html`, frames, and skills only, while `preparePromptContext()` keeps attachments in prompt context instead of persisting them into the virtual FS. In normal runs the visual verifier therefore degrades to `unavailable` instead of actually judging parity. Evidence: `packages/core/src/tools/verify-ui-kit-visual-parity.ts:291`, `apps/desktop/src/main/index.ts:294`, `apps/desktop/src/main/index.ts:879`, `apps/desktop/src/main/prompt-context.ts:287`.
  Suggested fix: `const firstImage = promptContext.attachments.find((a) => a.imageDataUrl); if (firstImage?.imageDataUrl) { await fs.create('source.png', firstImage.imageDataUrl); }`

- [Major] Judge/render failures do not fall back to a structured result — `makeJudgeVisualParity()` throws on empty or non-JSON model replies, and `verify_ui_kit_visual_parity` awaits both `renderUiKit()` and `judgeVisualParity()` without catching those failures. On a text-only model or malformed judge response, the tool errors instead of returning the advertised `status: "unavailable"` path. Evidence: `apps/desktop/src/main/judge-visual-parity.ts:153`, `apps/desktop/src/main/judge-visual-parity.ts:197`, `packages/core/src/tools/verify-ui-kit-visual-parity.ts:311`.
  Suggested fix: `try { const candidateImg = await renderUiKit(decomposed.content, signal); const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal); /* existing normalization... */ } catch (error) { const report = unavailableReport(error instanceof Error ? error.message : String(error)); return { content: [{ type: 'text', text: report.summary }], details: report }; }`

- [Major] This still looks like partial work for #225, so `Closes #225` is misleading — the public PR template says to use `Closes` only when the issue is fully resolved, but this diff stops at emitting a `ui_kits/<slug>/` handoff bundle and explicitly tells the agent not to continue into the downstream prototype flow. Evidence: `.github/PULL_REQUEST_TEMPLATE.md:11`, `.changeset/decompose-to-ui-kit.md:9`, `apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:58`, `packages/core/src/tools/decompose-to-ui-kit.ts:155`.
  Suggested fix: `Refs #225`

Summary

- Review mode: initial
- Found 4 issues: one decompose-loop contract bug, two visual-verifier runtime gaps, and one incomplete issue-closure claim.

Testing

- Not run (automation). Suggested: add one agent-runtime integration test that seeds an image attachment into the virtual FS and asserts `verify_ui_kit_visual_parity` can read it, plus one unit test that judge/render failures return a structured fallback instead of throwing.
open-codesign Bot
> - If it returns status="unavailable", the host hasn't injected the judge callback. Proceed with step 4's deterministic report alone.
> - If it returns successfully, read each checks[].passed + reason. Failed checks are the things to fix.
> 6. Reconcile both reports:
>    - Both status ∈ {verified, needs_review} (12/12 or 11/12 checks passed): call done
verify_ui_kit_parity never returns verified or needs_review, so the success branch described here can never fire. Even a clean deterministic pass gets forced into the iteration path and burns another full decompose cycle.
Suggested fix:
const deterministicPass = deterministic.status === 'ok';
const visualPass =
  visual.status === 'unavailable' ||
  visual.status === 'verified' ||
  visual.status === 'needs_review';

> return { content: [{ type: 'text', text: report.summary }], details: report };
> }
> const sourcePath = params.sourceImagePath ?? 'source.png';
This tool defaults to source.png, but nothing in the runtime seeds an attached image into the agent FS before the tool runs. On the default path that means the visual verifier returns unavailable instead of ever judging parity.
Suggested fix:
const firstImage = promptContext.attachments.find((a) => a.imageDataUrl);
if (firstImage?.imageDataUrl) {
  await fs.create('source.png', firstImage.imageDataUrl);
}

> };
> logger.info('[verify_ui_kit_visual_parity] step=render', { slug: params.slug });
> const candidateImg = await renderUiKit(decomposed.content, signal);
The fallback described in the prompt only works for missing dependencies/files. If renderUiKit() or judgeVisualParity() throws (for example on a text-only model or malformed JSON), this tool currently errors instead of returning a structured unavailable/failed report.
Suggested fix:
try {
  const candidateImg = await renderUiKit(decomposed.content, signal);
  const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal);
} catch (error) {
  const report = unavailableReport(error instanceof Error ? error.message : String(error));
  return { content: [{ type: 'text', text: report.summary }], details: report };
}

> Add **Decompose to UI Kit** — one-click in the chat sidebar emits a `ui_kits/<slug>/` folder shaped for coding-agent handoff (`index.html` + `components/*.tsx` + `tokens.css` + `manifest.json` + `README.md`). Built-in deterministic + vision verifiers self-check parity using a 12-question boolean rubric (`parityScore = passCount / totalChecks`, no LLM-fabricated floats) and re-iterate on gaps. Per-decompose cost surfaces inline as a toast.
> Closes Phase 1 of #225.
This release note says the phase is closed, but the shipped flow still stops at producing a handoff bundle and explicitly does not continue into the downstream prototype step. Per the public PR template, this should be Refs #225 unless the full issue scope is implemented here.
Suggested fix: Refs #225

…itespace

CodeQL js/bad-tag-filter (HIGH) on PR OpenCoworkAI#241: the literal `<\/script>` / `<\/style>` patterns in stripTags() left bodies behind for HTML5-tolerated end-tag forms like `</script >` (trailing space) and `</script foo="bar">` (end-tag attributes, silently ignored by browsers per spec). A crafted source HTML could leak script/style body text into the visible-word vocabulary used for parity coverage scoring.

Fix: mirror the opening-tag pattern's `\b[^>]*` on the close tag too. The `\b` after the tag name prevents over-matching `</scripts>` while the `[^>]*` consumes any tolerated end-tag content up to the closing `>`.

Regression test covers all 4 previously-vulnerable forms:
- `</script >` (trailing whitespace)
- `</script foo="bar">` (end-tag attrs)
- `</style >` (style branch)
- `</SCRIPT>` (case)

Asserts none of the 4 leaked tokens appear in the parity report when the decomposition correctly omits the script/style content.

Signed-off-by: homen <hshum2018@gmail.com>
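The hardened close-tag pattern can be sketched as follows (an illustrative stripTags, not the repo's exact function; the key detail is the `\b[^>]*` mirrored onto the end tag):

```typescript
// Sketch of the fixed stripTags(): the close-tag regex mirrors the open-tag
// regex's \b[^>]* so HTML5-tolerated end-tag forms like "</script >",
// "</script foo=\"bar\">", and "</SCRIPT>" all match, while "</scripts>"
// does not (no word boundary after "script" there).
function stripTags(html: string): string {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\b[^>]*>/gi, ' ')
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\b[^>]*>/gi, ' ')
    .replace(/<[^>]+>/g, ' '); // drop remaining markup, keep visible text
}
```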
Findings

- [Major] Decompose loop success can never trigger on the first clean pass — `decomposePrompt.ts` still requires both verifiers to return `verified` or `needs_review`, but `verify_ui_kit_parity` only returns `ok` or `needs_iteration`. A clean deterministic pass therefore gets forced into another full decompose cycle, which adds avoidable cost and can regress a good bundle. Evidence: `apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:72`, `packages/core/src/tools/verify-ui-kit-parity.ts:35`, `packages/core/src/tools/verify-ui-kit-parity.ts:300`.
  Suggested fix: `const deterministicPass = deterministic.status === 'ok'; const visualPass = visual.status === 'unavailable' || visual.status === 'verified' || visual.status === 'needs_review'; if (deterministicPass && visualPass) { /* call done */ }`

- [Major] `verify_ui_kit_visual_parity({slug})` still has no source image on the default runtime path — the tool defaults to `source.png`, but the runtime only prepares attachment context and passes images to the model wire; it never persists an attached image into the agent FS. On the normal path the visual verifier therefore returns `unavailable` instead of actually judging parity. Evidence: `packages/core/src/tools/verify-ui-kit-visual-parity.ts:291`, `apps/desktop/src/main/prompt-context.ts:287`, `apps/desktop/src/main/index.ts:914`, `packages/core/src/index.ts:203`.
  Suggested fix: `const firstImage = promptContext.attachments.find((a) => a.imageDataUrl); if (firstImage?.imageDataUrl) { await fs.create('source.png', firstImage.imageDataUrl); }`

- [Major] Judge/render failures still bypass the advertised structured fallback — `makeJudgeVisualParity()` throws on empty or non-JSON model replies, and `verify_ui_kit_visual_parity` still awaits both `renderUiKit()` and `judgeVisualParity()` without catching those failures. On a text-only model or malformed judge response, the tool errors instead of returning the promised `status: "unavailable"` path. Evidence: `apps/desktop/src/main/judge-visual-parity.ts:198`, `apps/desktop/src/main/judge-visual-parity.ts:210`, `packages/core/src/tools/verify-ui-kit-visual-parity.ts:312`.
  Suggested fix: `try { const candidateImg = await renderUiKit(decomposed.content, signal); const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal); /* existing normalization... */ } catch (error) { const report = unavailableReport(error instanceof Error ? error.message : String(error)); return { content: [{ type: 'text', text: report.summary }], details: report }; }`

- [Major] The issue-closing claim is still overstated for this slice — the public PR template says to use `Closes` only when the issue is fully resolved, but this diff still stops at emitting a `ui_kits/<slug>/` handoff bundle and explicitly tells the agent not to continue into the downstream prototype step. Evidence: `.github/PULL_REQUEST_TEMPLATE.md:11`, `.changeset/decompose-to-ui-kit.md:9`, `apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:58`, `packages/core/src/tools/decompose-to-ui-kit.ts:155`.
  Suggested fix: `Refs #225`

Summary

- Review mode: follow-up after new commits. The new commit fixes the script/style stripping issue, but these 4 major issues are still present on the latest head.

Testing

- Not run (automation). Suggested: add one agent-runtime integration test that seeds an image attachment into the virtual FS and asserts `verify_ui_kit_visual_parity` can read it, plus one unit test that render/judge failures return a structured fallback instead of throwing.
open-codesign Bot
> - If it returns status="unavailable", the host hasn't injected the judge callback. Proceed with step 4's deterministic report alone.
> - If it returns successfully, read each checks[].passed + reason. Failed checks are the things to fix.
> 6. Reconcile both reports:
>    - Both status ∈ {verified, needs_review} (12/12 or 11/12 checks passed): call done
verify_ui_kit_parity still never returns verified or needs_review, so this success branch can never fire. Even a clean deterministic pass gets forced into another full decompose cycle.
Suggested fix:
const deterministicPass = deterministic.status === 'ok';
const visualPass =
  visual.status === 'unavailable' ||
  visual.status === 'verified' ||
  visual.status === 'needs_review';
if (deterministicPass && visualPass) {
  // call done
}

> return { content: [{ type: 'text', text: report.summary }], details: report };
> }
> const sourcePath = params.sourceImagePath ?? 'source.png';
This tool still defaults to source.png, but nothing in the runtime persists an attached image into the agent FS before the tool runs. On the default path that means the visual verifier returns unavailable instead of ever judging parity.
Suggested fix:
const firstImage = promptContext.attachments.find((a) => a.imageDataUrl);
if (firstImage?.imageDataUrl) {
  await fs.create('source.png', firstImage.imageDataUrl);
}

> };
> logger.info('[verify_ui_kit_visual_parity] step=render', { slug: params.slug });
> const candidateImg = await renderUiKit(decomposed.content, signal);
The fallback described in the prompt still only covers missing dependencies/files. If renderUiKit() or judgeVisualParity() throws (for example on a text-only model or malformed JSON), this tool errors instead of returning a structured unavailable report.
Suggested fix:
try {
  const candidateImg = await renderUiKit(decomposed.content, signal);
  const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal);
} catch (error) {
  const report = unavailableReport(error instanceof Error ? error.message : String(error));
  return { content: [{ type: 'text', text: report.summary }], details: report };
}

> Add **Decompose to UI Kit** — one-click in the chat sidebar emits a `ui_kits/<slug>/` folder shaped for coding-agent handoff (`index.html` + `components/*.tsx` + `tokens.css` + `manifest.json` + `README.md`). Built-in deterministic + vision verifiers self-check parity using a 12-question boolean rubric (`parityScore = passCount / totalChecks`, no LLM-fabricated floats) and re-iterate on gaps. Per-decompose cost surfaces inline as a toast.
> Closes Phase 1 of #225.
This release note still says the phase is closed, but the shipped flow stops at producing a handoff bundle and explicitly does not continue into the downstream prototype step. Per the public PR template, this should be Refs #225 unless the full issue scope is implemented here.
Suggested fix: Refs #225
Summary
Phase 1 of #225: a single-image → componentized `ui_kit/` decomposition pipeline that emits a coding-agent-ready bundle, plus deterministic + vision verifiers that self-check parity using a 12-question boolean rubric and re-iterate on gaps. Uses existing `userImages` plumbing (PR #193) and adds three new agent tools that mirror existing patterns (`done.ts` / `generate-image-asset.ts`). Ends in the chat sidebar with a one-click trigger that fires a structured prompt, walks the agent through decompose → verify → reconcile → done, and surfaces per-decompose cost as a toast. No new prod deps, no SQLite schema change, in-memory output via the Files panel.

This PR closes the Phase 1 part of #225 only. The Phase 2 (gpt-image-2 generation in the loop) and Phase 3 (multi-page flow) cuts I committed to in the issue thread are intentionally not included.
Type of change
Linked issue
Closes #225 (Phase 1 only — Phase 2/3 deferred per my comment)
What's in here
3 new agent tools in `packages/core/src/tools/`:

- `decompose-to-ui-kit.ts` — orchestrator. Takes a source image (from chat context) + design brief, emits `ui_kits/<slug>/{index.html, components/*.tsx, tokens.css, manifest.json, README.md}` to the virtual FS. Output carries `schemaVersion: 1` so downstream coding agents (Claude Code, Cursor) can evolve safely.
- `verify-ui-kit-parity.ts` — deterministic verifier. 3 signals: element-count parity, visible-text coverage, token coverage. Returns a `ParityReport` with a `passCount/totalChecks`-derived score (no LLM in the loop, no floats).
- `verify-ui-kit-visual-parity.ts` — vision-LLM judge wrapper. Takes a host-injected `judgeVisualParity` callback, runs a 12-check boolean rubric across 5 dimensions (layout / color / typography / content / components), returns `parityScore = passCount / totalChecks` and a bounded-enum `status` (verified | needs_review | needs_iteration | failed | unavailable).

Host wiring in `apps/desktop/src/main/`:

- `render-ui-kit.ts` — offscreen `BrowserWindow.capturePage()` for the rendered ui_kit
- `judge-visual-parity.ts` — vision-judge prompt builder + LLM dispatcher using the existing `complete()` provider abstraction
- `agent.ts` deps interface, mirroring how `generate_image_asset` was wired

Renderer:

- `AddMenu.tsx` — new "Decompose to UI Kit" entry, disabled when no artifact / generation in flight
- `Sidebar.tsx` — `triggerDecompose(designId, locale)` action wired to the menu item
- `store.ts` — 3-branch toast feedback (busy / unavailable / started) + per-tool-call cost row when the visual judge resolves
- `hooks/decomposePrompt.ts` — locale-aware (EN/ZH) structured prompt that walks the agent through decompose → verify → reconcile → iterate (max 2) → done with an HONEST cost summary

Tests — full vitest coverage in `*.test.ts` next to each tool:

- `decompose-to-ui-kit.test.ts` (263 LOC)
- `verify-ui-kit-parity.test.ts` (180 LOC)
- `verify-ui-kit-visual-parity.test.ts` (295 LOC)

i18n — 9 new keys × EN + ZH for the menu entry, toast titles/descriptions, and cost row.
Design decisions
Boolean rubric, not floats. Every visual parity check is `{passed: boolean}`, and `parityScore = passCount / totalChecks` is derived from those. The `status` field is a bounded enum derived from thresholds (100% → `verified`, ≥85% → `needs_review`, ≥60% → `needs_iteration`, <60% → `failed`). No LLM-fabricated confidence floats, no scoring inflation. Aligns with the project's `HONEST_SCORES` precedent (`done.ts`'s `verified: boolean` field).

Host-injected callbacks, not framework lock-in. `verify-ui-kit-visual-parity.ts` doesn't import any LLM SDK or any Electron API. It takes `RenderUiKitFn` and `JudgeVisualParityFn` as deps. If the host doesn't inject them (e.g. a future headless CLI), the tool returns `status: 'unavailable'` honestly instead of crashing. Mirrors how `generate_image_asset` is keyed on `deps.generateImageAsset`.

In-memory output via Files panel, no schema bump. Per my open question in the issue thread, this PR ships option (a): the `ui_kits/<slug>/` folder lands in the design's virtual FS, surfaces in the existing Files panel, and uses the existing ZIP export for handoff to a coding agent. No SQLite migration, smallest blast radius, consistent with how `polishPrompt.ts`'s second pass mutates only in-memory state.

`schemaVersion: 1` on the manifest. Downstream consumers (Claude Code, Cursor) need a stable contract. Adding fields requires no version bump; renaming or removing fields requires `schemaVersion: 2` and a parallel-emit window.

Anti-hallucination guardrails

The deterministic verifier (`verify-ui-kit-parity.ts`) checks visible-text coverage on the emitted ui_kit vs the source brief — if the agent dropped any text content, it fails BEFORE the LLM judge runs. This catches data hallucination cheaply. The LLM judge then handles only semantic-quality dimensions (visual hierarchy, color harmony, typography pairing, etc.).

Cost surfacing

Every `verify_ui_kit_visual_parity` resolution pushes a toast with `passCount/totalChecks · status · $cost.NNNN`. It reads defensively from `result.details` so future contract drift degrades silently rather than crashing the renderer. The `done` tool's prompt-driven summary additionally requires the agent to report total run cost, per the `HONEST_STATUS` precedent.

Checklist
- Read `docs/VISION.md`, `docs/PRINCIPLES.md`, and `CLAUDE.md` before starting
- Commits signed off (`git commit -s`)
- `pnpm lint && pnpm typecheck && pnpm test` passes locally (1026 tests pass on this branch as of `d6f3a00`)
- Changeset added (`pnpm changeset`) — see `.changeset/decompose-to-ui-kit.md`
- Docs updated: `BENCHMARKS.md` (new), `README.md` + `README.zh-CN.md` (Decompose to UI Kit feature card + hero PNG + iter-reel GIF)

Dependency additions (if any)
None. All three new tools use only `@mariozechner/pi-agent-core`'s `AgentTool` factory pattern that's already a prod dep.

Screenshots / recordings (UI changes)
Side-by-side hero — source vs agent-emitted ui_kit (`e2e-opus-final` run, parityScore 0.90):

4-frame reconcile reel from the `e2e-nodebench-iter` run (iter-0 → iter-1 with honest score drift 0.82 → 0.78 — boolean rubric exposes the regression instead of hiding it):

MP4 version for higher fidelity.
Live-recorded session demo (real Electron app, no stitching) — recording in progress, will edit this PR description when the GIF is ready. ETA same day.
Cross-tier benchmarks
`BENCHMARKS.md` at repo root has the full methodology + run-by-run real-data results across model tiers (Opus, Pro+Pro+iterate, Kimi+Gemini3, NodeBench iter), reproducibility instructions, honest non-claims, and research citations (WebDevJudge, Prometheus-Vision, Trust-but-Verify ICCV 2025).
Scope discipline notes
- The lockfile diff (`pnpm-lock.yaml`) is mechanical regen. This is over the soft 400-LOC bar in CONTRIBUTING.md, but it's been pre-discussed in "[Feature]: image 2 is already impressive enough; what's most needed is how to turn the generated UI into components and then into a prototype!" #225, and the change is a single concern (one new feature path, no refactor mixed in). Happy to split into 3 PRs (per-tool) if the maintainer prefers — say the word.
- Follow-up ideas are filed in the Ideas category, not bundled here. Each is a meaningful subsystem that deserves alignment before code.

Branch state at PR open
- Behind `upstream/main` (`chore(deps)` bumps including pi-agent-core 0.67.68 → 0.70.2; my branch is on 0.67.68)
- The bump may touch the `AgentTool` shape; I'll handle that in the rebase pass.

Why this is ready to review now
- Benchmark results are real run data in `BENCHMARKS.md`, not synthetic

Looking forward to feedback. Happy to address structural concerns first before iterating on smaller polish.