feat(core): add decompose-to-ui-kit + boolean parity verifiers (Phase 1 of #225) #241
HomenShum wants to merge 11 commits into OpenCoworkAI:main from
Conversation
Adds a new agent tool that decomposes the current artifact into a ui_kits/<slug>/ folder structure (index.html + components/*.tsx + tokens.css + manifest.json + README.md), shaped for handoff to a downstream coding agent (Claude Code, Cursor, etc.).
- New tool factory in packages/core/src/tools/decompose-to-ui-kit.ts follows the existing factory + AgentTool + typebox pattern from done.ts and generate-image-asset.ts.
- New "Decompose to UI Kit" item in the chat AddMenu, gated on having a current design and not currently generating.
- New triggerDecompose store action + decomposePrompt.ts hook, mirroring the polishPrompt.ts pattern but user-triggered (no auto-fire). Sends the prompt as a silent follow-up so the chat reads as one continuous run.
- Output carries schemaVersion: 1 in manifest.json so downstream consumers can evolve safely.
- Decomposition is prompt-driven (the model identifies repeated DOM subtrees and emits the structured plan); the tool just persists to the virtual fs in a single atomic call.
i18n keys added in en + zh-CN. No new dependencies. Closes the Phase 1 ask in OpenCoworkAI#225.
10 new unit tests cover: typical decomposition, slug sanitization, fallback slug, manifest schemaVersion, token CSS grouping, token name normalization, README rendering, empty inputs, return shape, and undefined-fs handling.
Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (1026 desktop + 252 core tests pass)
Signed-off-by: homen <hshum2018@gmail.com>
Adds a deterministic parity verifier the agent calls AFTER decompose_to_ui_kit
and uses to self-correct before calling done. No LLM judge involved — the
parity report is reproducible from the raw HTML / CSS strings.
Three signals comparing source index.html vs ui_kits/<slug>/index.html and
ui_kits/<slug>/tokens.css:
1. Element count parity — structural tag distribution (div/section/button/
h1-h6/table/etc.), weighted 0.4 in overall score
2. Visible text coverage — % of unique source words present in decomposed,
weighted 0.3
3. Token coverage — % of unique hex / rgb / px / rem values from source
captured in tokens.css (gaps capped at 8 to keep agent context small),
weighted 0.3
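Under those weights, the overall deterministic score is a plain weighted sum. A minimal sketch in TypeScript (the signal names and helper are illustrative, not the PR's actual identifiers):

```typescript
// Sketch of the weighted deterministic parity score.
// Each signal is a ratio in [0, 1]; weights follow the 0.4 / 0.3 / 0.3 split above.
interface ParitySignals {
  elementCountParity: number;  // structural tag-distribution similarity
  visibleTextCoverage: number; // fraction of unique source words found in the output
  tokenCoverage: number;       // fraction of source hex/rgb/px/rem values in tokens.css
}

function parityScore(s: ParitySignals): number {
  return (
    0.4 * s.elementCountParity +
    0.3 * s.visibleTextCoverage +
    0.3 * s.tokenCoverage
  );
}
```

A score below the 0.85 threshold would then drive the re-call loop described next.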
Returns a ParityReport with an explicit gaps list. If parityScore < 0.85
the prompt instructs the agent to re-call decompose_to_ui_kit with
adjustments addressing the specific gaps, then re-verify. Iterates at most
twice to avoid loops; final done() summary honestly states the achieved
parityScore + remaining gaps.
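The visible-text coverage signal (2 above) can be sketched as a set-overlap ratio. The tokenizer here is an illustrative simplification, not the PR's actual implementation:

```typescript
// Sketch of signal 2: fraction of unique visible source words that appear
// in the decomposed output. A crude word tokenizer stands in for the real one.
function visibleTextCoverage(source: string, decomposed: string): number {
  const words = (s: string) =>
    new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const src = words(source);
  if (src.size === 0) return 1; // nothing to cover
  const out = words(decomposed);
  let hit = 0;
  for (const w of src) if (out.has(w)) hit++;
  return hit / src.size;
}
```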
Pattern mirrors done.ts: deterministic checker run during the agent's own
turn so it can self-correct before declaring the artifact complete.
7 new unit tests cover: high-parity faithful decomposition, low-parity thin
decomposition, missing artifact handling, hardcoded values absent from
tokens.css, undefined-fs fallback, byte-identical input, and pass/fail
summary text.
decomposePrompt.ts updated for both EN and ZH locales to walk the agent
through the verify-and-iterate loop explicitly.
Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (252 core + all other packages, 17 new tests across
decompose-to-ui-kit + verify-ui-kit-parity)
Signed-off-by: homen <hshum2018@gmail.com>
…ension scoring

Adds the vision-LLM judge counterpart to the existing deterministic verify_ui_kit_parity. Renders the decomposed ui_kits/<slug>/index.html in a hidden window via the host-injected renderUiKit callback, screenshots it, and asks a multimodal model to compare against the source artifact via the host-injected judgeVisualParity callback.

Scoring is BOOLEAN-per-dimension, NOT floating-point — matches NodeBench's established rule patterns (pipeline_operational_standard.md 10-gate boolean catalog, eval_flywheel.md boolean evaluators, agent_run_verdict_workflow.md bounded enum verdicts). The judge answers 12 standard checks on every run (across layout / color / typography / content / components dimensions), each yes/no with an explicit reason string. The aggregate parityScore is DERIVED as passCount/totalChecks (never LLM-arbitrary). Status is a bounded enum (verified / needs_review / needs_iteration / failed) thresholded deterministically:
- 100% passed -> verified
- >=85% passed -> needs_review
- >=60% passed -> needs_iteration
- <60% passed -> failed

Why boolean over floating-point: lower judge variance (yes/no is harder to fudge than a number), every failure has a clear actionable reason, the score is derived rather than LLM-arbitrary, and results are comparable across runs, models, and time. Failure-of-judge counts as failure-of-parity (HONEST_SCORES rule from agentic_reliability.md).

Pattern mirrors generate-image-asset.ts: the host injects two callbacks (renderUiKit, judgeVisualParity). Without them the tool returns status="unavailable" and the agent falls back to the deterministic verifier. decomposePrompt.ts (EN + ZH) updated to call BOTH verifiers and reconcile gaps before deciding to iterate or finish.
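The deterministic threshold mapping can be sketched as follows (names are illustrative; the thresholds come from the commit message above):

```typescript
// Sketch: derived score + bounded-enum status from boolean check results.
type ParityStatus = 'verified' | 'needs_review' | 'needs_iteration' | 'failed';

function deriveStatus(passCount: number, totalChecks: number): {
  parityScore: number;
  status: ParityStatus;
} {
  const parityScore = passCount / totalChecks; // derived, never LLM-reported
  if (parityScore === 1) return { parityScore, status: 'verified' };
  if (parityScore >= 0.85) return { parityScore, status: 'needs_review' };
  if (parityScore >= 0.6) return { parityScore, status: 'needs_iteration' };
  return { parityScore, status: 'failed' };
}
```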
17 new unit tests cover: status thresholds across the verified / needs_review / needs_iteration / failed bands, all-pass/all-fail/partial check sets, missing fs/render/judge callbacks, missing artifacts, missing source image, source image format validation, abort signal threading, and the HONEST_SCORES guarantee that every check carries a reason.
Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (276 core including 17 new + 1026 desktop + others)
Signed-off-by: homen <hshum2018@gmail.com>
…ks + toast feedback

The verify_ui_kit_visual_parity tool was returning status="unavailable" because the host hadn't injected its two callbacks. This commit completes the wiring so the visual judge runs LIVE during decompose. Three new pieces:
1. apps/desktop/src/main/render-ui-kit.ts (~110 LOC)
   Hidden BrowserWindow + offscreen render + capturePage. Mirrors the done-verify.ts pattern. Loads the decomposed ui_kits/<slug>/index.html in a sandboxed offscreen window, waits for did-finish-load + a 1500ms settle window for fonts/CSS, then captures a PNG and returns it as a base64 data URL. Honors AbortSignal + a 12s hard timeout.
2. apps/desktop/src/main/judge-visual-parity.ts (~230 LOC)
   Vision-LLM judge with the same 12 standard boolean parity checks as the in-core tool. Decoupled from cfg plumbing — takes a runVisionPrompt callback the host wires using its existing generation pipeline. Asks the model to answer each check yes/no with a reason, parses defensively (code-fence strip + balanced-brace extract), and returns structured per-check answers that the in-core tool normalizes into a deterministic parityScore + bounded-enum status.
3. apps/desktop/src/main/index.ts wiring
   Constructs both callbacks at runGenerate time and passes them to generateViaAgent's deps. The judge re-uses the SAME model/apiKey/baseUrl/wire/capabilities as the active generation request, so we don't need a separate judge config — whatever model the user picked for generation is the model that judges parity. If the model isn't vision-capable the judge throws and the agent falls back to the deterministic verify_ui_kit_parity.
Bonus: the triggerDecompose store action now surfaces three toasts covering all branches (busy / no-artifact-yet / decomposing-now), with i18n keys in en + zh-CN. Previously the action silently no-op'd when conditions weren't met, which the user caught during dogfood.
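The "code-fence strip + balanced-brace extract" defensive parse can be sketched like this (illustrative, not the exact parser in judge-visual-parity.ts; note that braces inside JSON strings would defeat this simple depth counter):

```typescript
// Defensive extraction of a JSON object from a model reply: strip markdown
// code fences, then JSON.parse the first balanced {...} span.
function extractJson(reply: string): unknown {
  const stripped = reply.replace(/```(?:json)?/g, '').trim();
  const start = stripped.indexOf('{');
  if (start === -1) throw new Error('no JSON object in reply');
  let depth = 0;
  for (let i = start; i < stripped.length; i++) {
    if (stripped[i] === '{') depth++;
    else if (stripped[i] === '}' && --depth === 0) {
      return JSON.parse(stripped.slice(start, i + 1));
    }
  }
  throw new Error('unbalanced braces in reply');
}
```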
Verified:
- pnpm lint clean (1 noShadowRestrictedNames fix on local `escape` var)
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (276 core + 1026 desktop + others)
- Live-DOM dogfood with Playwright in browser mode passed all 12 checks, including the new menu item rendering and a console-error-clean reload
Signed-off-by: homen <hshum2018@gmail.com>
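The host-injection pattern with its honest "unavailable" fallback (renderUiKit / judgeVisualParity as optional deps, as described in the commits above) can be sketched as follows; the type and function names are illustrative:

```typescript
// Sketch: the visual-parity tool takes its callbacks as injected deps and
// degrades to 'unavailable' when the host does not provide them, instead of
// crashing. The agent then falls back to the deterministic verifier.
type RenderUiKitFn = (html: string, signal?: AbortSignal) => Promise<string>;
type JudgeVisualParityFn = (
  sourceImg: string,
  candidateImg: string,
  signal?: AbortSignal,
) => Promise<{ checks: { passed: boolean; reason: string }[] }>;

interface VisualParityDeps {
  renderUiKit?: RenderUiKitFn;
  judgeVisualParity?: JudgeVisualParityFn;
}

function visualParityStatus(deps: VisualParityDeps): 'ready' | 'unavailable' {
  return deps.renderUiKit && deps.judgeVisualParity ? 'ready' : 'unavailable';
}
```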
…t in done summary
Two pieces, no defer:
1. docs/benchmarks/DECOMPOSE_TO_UI_KIT.md (~280 lines)
- Full methodology: 4-stage pipeline + 12 standard boolean checks +
status thresholds + cost methodology + cache key derivation
- Real numbers from four cross-tier runs on the same NodeBench Reports
source (cached): Opus reference, Pro+Pro with iteration loop
demonstration, mixed Flash-Lite-decompose + Pro-judge, cheapest tier
- Specific gap signal showing the verify-and-iterate loop climbing
parity 0.69 -> 0.78 in one self-correcting round
- Recommendation matrix: production / continuous-eval / CI-smoke
- Reproducibility instructions with exact CLI commands
- Honest non-claims section (no claim of universal parity, no claim
gpt-image-1 mockups are production-quality, no claim cheap tier
hits 0.85 first-pass)
- Documented model failures (Kimi K2.6 truncation via OpenRouter,
GLM 4.6V malformed JSON)
- Citations to 2026 VLM-as-judge research + NodeBench's own internal
boolean-evaluator rule patterns
2. decomposePrompt.ts updated (EN + ZH) — done summary MUST report:
- Deterministic verifier passCount/totalChecks + status
- Visual judge passCount/12 + status
- Visual judge judgeCostUsd (this run's self-verify spend)
- Remaining unfixed gaps with failed-check ids + why
"Do NOT hide cost. Do NOT inflate scores. Failed checks count as failed."
The cost surfacing is prompt-driven (the agent always reports it in
chat) — orthogonal to a future UI cost meter, but ensures honest cost
accounting today without renderer surgery.
Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
Signed-off-by: homen <hshum2018@gmail.com>
Tracks the boolean-rubric methodology + reproducible cross-tier results for verify_ui_kit_visual_parity. docs/ is gitignored per CLAUDE.md so this lives at repo root alongside README.md / CONTRIBUTING.md. Re-publishes the decompose-to-ui-kit benchmark previously committed to docs/benchmarks/ (which was silently dropped by .gitignore). Signed-off-by: homen <hshum2018@gmail.com>
When `verify_ui_kit_visual_parity` resolves, the renderer now reads `judgeCostUsd`, `passCount`, `totalChecks`, and `status` defensively from the structured ParityReport and pushes a toast — the operator sees a per-decompose cost row without needing a new dashboard. The variant flips to `success` for verified/needs_review, `info` otherwise. It reads the result shape with bracket access so future contract drift degrades to a silent fallback rather than crashing the renderer. New i18n keys `sidebar.decomposeJudgeResultTitle/Description` in en + zh-CN. README + README.zh-CN now mention the Decompose to UI Kit feature under "What's new" + Generation features so the entry point is discoverable from the repo landing page.
Signed-off-by: homen <hshum2018@gmail.com>
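The defensive bracket-access read can be sketched like this (field names follow the commit message; the helper itself is illustrative, not the PR's code):

```typescript
// Sketch: pull toast fields out of an untyped report with bracket access and
// typed fallbacks, so a missing or renamed field degrades quietly instead of
// throwing in the renderer.
function readJudgeToast(details: unknown): {
  passCount: number;
  totalChecks: number;
  status: string;
  judgeCostUsd: number;
} {
  const d = (details ?? {}) as Record<string, unknown>;
  const num = (v: unknown, fallback: number) =>
    typeof v === 'number' && Number.isFinite(v) ? v : fallback;
  return {
    passCount: num(d['passCount'], 0),
    totalChecks: num(d['totalChecks'], 0),
    status: typeof d['status'] === 'string' ? d['status'] : 'unknown',
    judgeCostUsd: num(d['judgeCostUsd'], 0),
  };
}
```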
Side-by-side hero image: source.png (gpt-image input) vs rendered.png (agent-emitted ui_kit, headless-rendered) from the e2e-opus-final PoC run. parityScore badge (0.90) and status are derived deterministically from the 12-check boolean rubric — passCount / totalChecks — not an LLM-fabricated float. Hosted in this branch under website/public/screenshots/decompose-to-ui-kit.png so the github raw URL renders inline on github.com regardless of upstream merge state. Both README.md and README.zh-CN.md now embed the same image with a matching subcaption that calls out: real run (not mock), source -> rendered direction, and the deterministic-derivation invariant. Signed-off-by: homen <hshum2018@gmail.com>
4-frame reel from the e2e-nodebench-iter PoC run:
1. SOURCE gpt-image input
2. ITER-0 parityScore 0.82, status needs_iteration, 6 gaps surfaced
3. ITER-1 parityScore 0.78, status needs_iteration, 5 gaps surfaced
4. HONEST Δ score -0.04, Δ gaps -1 -- agent fixed some gaps but
regressed on layout, boolean rubric exposes the drift
instead of hiding it (HONEST_SCORES rule)
Both gif (393.9 KB, 1080px wide, 10fps) and mp4 (224.2 KB, h.264 yuv420p)
shipped under website/public/demos/ to match the existing demo asset
convention. README.md + README.zh-CN.md embed the gif inline directly
under the side-by-side hero with subcaption explaining the deliberate
choice to show drift, plus a link to the mp4 for quality-sensitive
viewers.
Hosted in this branch so the github raw URL renders inline on github.com
regardless of upstream merge state.
Signed-off-by: homen <hshum2018@gmail.com>
Closes Phase 1 of OpenCoworkAI#225. Signed-off-by: homen <hshum2018@gmail.com>
Findings

- [Major] Decompose loop success can never trigger on the first clean pass — `decomposePrompt.ts` requires both verifiers to return `verified` or `needs_review`, but the deterministic verifier only returns `ok` or `needs_iteration`. That forces an unnecessary extra iteration even when deterministic parity already passed, which adds avoidable cost and can regress a good bundle. Evidence: `apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:67`, `packages/core/src/tools/verify-ui-kit-parity.ts:35`, `packages/core/src/tools/verify-ui-kit-parity.ts:294`.
  Suggested fix: `const deterministicPass = deterministic.status === 'ok'; const visualPass = visual.status === 'unavailable' || visual.status === 'verified' || visual.status === 'needs_review'; if (deterministicPass && visualPass) { /* call done */ }`

- [Major] `verify_ui_kit_visual_parity({slug})` has no source image on the default runtime path — the tool defaults to `source.png`, but the agent FS is initialized with `index.html`, frames, and skills only, while `preparePromptContext()` keeps attachments in prompt context instead of persisting them into the virtual FS. In normal runs the visual verifier therefore degrades to `unavailable` instead of actually judging parity. Evidence: `packages/core/src/tools/verify-ui-kit-visual-parity.ts:291`, `apps/desktop/src/main/index.ts:294`, `apps/desktop/src/main/index.ts:879`, `apps/desktop/src/main/prompt-context.ts:287`.
  Suggested fix: `const firstImage = promptContext.attachments.find((a) => a.imageDataUrl); if (firstImage?.imageDataUrl) { await fs.create('source.png', firstImage.imageDataUrl); }`

- [Major] Judge/render failures do not fall back to a structured result — `makeJudgeVisualParity()` throws on empty or non-JSON model replies, and `verify_ui_kit_visual_parity` awaits both `renderUiKit()` and `judgeVisualParity()` without catching those failures. On a text-only model or malformed judge response, the tool errors instead of returning the advertised `status: "unavailable"` path. Evidence: `apps/desktop/src/main/judge-visual-parity.ts:153`, `apps/desktop/src/main/judge-visual-parity.ts:197`, `packages/core/src/tools/verify-ui-kit-visual-parity.ts:311`.
  Suggested fix: `try { const candidateImg = await renderUiKit(decomposed.content, signal); const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal); /* existing normalization... */ } catch (error) { const report = unavailableReport(error instanceof Error ? error.message : String(error)); return { content: [{ type: 'text', text: report.summary }], details: report }; }`

- [Major] This still looks like partial work for #225, so `Closes #225` is misleading — the public PR template says to use `Closes` only when the issue is fully resolved, but this diff stops at emitting a `ui_kits/<slug>/` handoff bundle and explicitly tells the agent not to continue into the downstream prototype flow. Evidence: `.github/PULL_REQUEST_TEMPLATE.md:11`, `.changeset/decompose-to-ui-kit.md:9`, `apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:58`, `packages/core/src/tools/decompose-to-ui-kit.ts:155`.
  Suggested fix: `Refs #225`

Summary

- Review mode: initial
- Found 4 issues: one decompose-loop contract bug, two visual-verifier runtime gaps, and one incomplete issue-closure claim.

Testing

- Not run (automation). Suggested: add one agent-runtime integration test that seeds an image attachment into the virtual FS and asserts `verify_ui_kit_visual_parity` can read it, plus one unit test that judge/render failures return a structured fallback instead of throwing.
open-codesign Bot
> - If it returns status="unavailable", the host hasn't injected the judge callback. Proceed with step 4's deterministic report alone.
> - If it returns successfully, read each checks[].passed + reason. Failed checks are the things to fix.
> 6. Reconcile both reports:
>    - Both status ∈ {verified, needs_review} (12/12 or 11/12 checks passed): call done
verify_ui_kit_parity never returns verified or needs_review, so the success branch described here can never fire. Even a clean deterministic pass gets forced into the iteration path and burns another full decompose cycle.
Suggested fix:
const deterministicPass = deterministic.status === 'ok';
const visualPass =
  visual.status === 'unavailable' ||
  visual.status === 'verified' ||
  visual.status === 'needs_review';

> return { content: [{ type: 'text', text: report.summary }], details: report };
> }
> const sourcePath = params.sourceImagePath ?? 'source.png';
This tool defaults to source.png, but nothing in the runtime seeds an attached image into the agent FS before the tool runs. On the default path that means the visual verifier returns unavailable instead of ever judging parity.
Suggested fix:
const firstImage = promptContext.attachments.find((a) => a.imageDataUrl);
if (firstImage?.imageDataUrl) {
  await fs.create('source.png', firstImage.imageDataUrl);
}

> };
> logger.info('[verify_ui_kit_visual_parity] step=render', { slug: params.slug });
> const candidateImg = await renderUiKit(decomposed.content, signal);
The fallback described in the prompt only works for missing dependencies/files. If renderUiKit() or judgeVisualParity() throws (for example on a text-only model or malformed JSON), this tool currently errors instead of returning a structured unavailable/failed report.
Suggested fix:
try {
  const candidateImg = await renderUiKit(decomposed.content, signal);
  const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal);
} catch (error) {
  const report = unavailableReport(error instanceof Error ? error.message : String(error));
  return { content: [{ type: 'text', text: report.summary }], details: report };
}

> Add **Decompose to UI Kit** — one-click in the chat sidebar emits a `ui_kits/<slug>/` folder shaped for coding-agent handoff (`index.html` + `components/*.tsx` + `tokens.css` + `manifest.json` + `README.md`). Built-in deterministic + vision verifiers self-check parity using a 12-question boolean rubric (`parityScore = passCount / totalChecks`, no LLM-fabricated floats) and re-iterate on gaps. Per-decompose cost surfaces inline as a toast.
> Closes Phase 1 of #225.
This release note says the phase is closed, but the shipped flow still stops at producing a handoff bundle and explicitly does not continue into the downstream prototype step. Per the public PR template, this should be Refs #225 unless the full issue scope is implemented here.
Suggested fix: Refs #225

…itespace

CodeQL js/bad-tag-filter (HIGH) on PR OpenCoworkAI#241: the literal `<\/script>` / `<\/style>` patterns in stripTags() left bodies behind for HTML5-tolerated end-tag forms like `</script >` (trailing space) and `</script foo="bar">` (end-tag attributes, silently ignored by browsers per spec). A crafted source HTML could leak script/style body text into the visible-word vocabulary used for parity coverage scoring.

Fix: mirror the opening-tag pattern's `\b[^>]*` on the close tag too. The `\b` after the tag name prevents over-matching `</scripts>` while the `[^>]*` consumes any tolerated end-tag content up to the closing `>`.

Regression test covers all 4 previously-vulnerable forms:
- `</script >` (trailing whitespace)
- `</script foo="bar">` (end-tag attrs)
- `</style >` (style branch)
- `</SCRIPT>` (case)

Asserts none of the 4 leaked tokens appear in the parity report when the decomposition correctly omits the script/style content.

Signed-off-by: homen <hshum2018@gmail.com>
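The hardened close-tag pattern can be sketched as follows (an illustrative stripTags, not the repo's exact function; the key detail is the `\b[^>]*` mirrored onto the end tag):

```typescript
// Sketch of the fixed stripTags(): the close-tag regex mirrors the open-tag
// regex's \b[^>]* so HTML5-tolerated end-tag forms like "</script >",
// "</script foo=\"bar\">", and "</SCRIPT>" all match, while "</scripts>"
// does not (no word boundary after "script" there).
function stripTags(html: string): string {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\b[^>]*>/gi, ' ')
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\b[^>]*>/gi, ' ')
    .replace(/<[^>]+>/g, ' '); // drop remaining markup, keep visible text
}
```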
Findings

- [Major] Decompose loop success can never trigger on the first clean pass — `decomposePrompt.ts` still requires both verifiers to return `verified` or `needs_review`, but `verify_ui_kit_parity` only returns `ok` or `needs_iteration`. A clean deterministic pass therefore gets forced into another full decompose cycle, which adds avoidable cost and can regress a good bundle. Evidence: `apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:72`, `packages/core/src/tools/verify-ui-kit-parity.ts:35`, `packages/core/src/tools/verify-ui-kit-parity.ts:300`.
  Suggested fix: `const deterministicPass = deterministic.status === 'ok'; const visualPass = visual.status === 'unavailable' || visual.status === 'verified' || visual.status === 'needs_review'; if (deterministicPass && visualPass) { /* call done */ }`

- [Major] `verify_ui_kit_visual_parity({slug})` still has no source image on the default runtime path — the tool defaults to `source.png`, but the runtime only prepares attachment context and passes images to the model wire; it never persists an attached image into the agent FS. On the normal path the visual verifier therefore returns `unavailable` instead of actually judging parity. Evidence: `packages/core/src/tools/verify-ui-kit-visual-parity.ts:291`, `apps/desktop/src/main/prompt-context.ts:287`, `apps/desktop/src/main/index.ts:914`, `packages/core/src/index.ts:203`.
  Suggested fix: `const firstImage = promptContext.attachments.find((a) => a.imageDataUrl); if (firstImage?.imageDataUrl) { await fs.create('source.png', firstImage.imageDataUrl); }`

- [Major] Judge/render failures still bypass the advertised structured fallback — `makeJudgeVisualParity()` throws on empty or non-JSON model replies, and `verify_ui_kit_visual_parity` still awaits both `renderUiKit()` and `judgeVisualParity()` without catching those failures. On a text-only model or malformed judge response, the tool errors instead of returning the promised `status: "unavailable"` path. Evidence: `apps/desktop/src/main/judge-visual-parity.ts:198`, `apps/desktop/src/main/judge-visual-parity.ts:210`, `packages/core/src/tools/verify-ui-kit-visual-parity.ts:312`.
  Suggested fix: `try { const candidateImg = await renderUiKit(decomposed.content, signal); const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal); /* existing normalization... */ } catch (error) { const report = unavailableReport(error instanceof Error ? error.message : String(error)); return { content: [{ type: 'text', text: report.summary }], details: report }; }`

- [Major] The issue-closing claim is still overstated for this slice — the public PR template says to use `Closes` only when the issue is fully resolved, but this diff still stops at emitting a `ui_kits/<slug>/` handoff bundle and explicitly tells the agent not to continue into the downstream prototype step. Evidence: `.github/PULL_REQUEST_TEMPLATE.md:11`, `.changeset/decompose-to-ui-kit.md:9`, `apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:58`, `packages/core/src/tools/decompose-to-ui-kit.ts:155`.
  Suggested fix: `Refs #225`

Summary

- Review mode: follow-up after new commits. The new commit fixes the script/style stripping issue, but these 4 major issues are still present on the latest head.

Testing

- Not run (automation). Suggested: add one agent-runtime integration test that seeds an image attachment into the virtual FS and asserts `verify_ui_kit_visual_parity` can read it, plus one unit test that render/judge failures return a structured fallback instead of throwing.
open-codesign Bot
> - If it returns status="unavailable", the host hasn't injected the judge callback. Proceed with step 4's deterministic report alone.
> - If it returns successfully, read each checks[].passed + reason. Failed checks are the things to fix.
> 6. Reconcile both reports:
>    - Both status ∈ {verified, needs_review} (12/12 or 11/12 checks passed): call done
verify_ui_kit_parity still never returns verified or needs_review, so this success branch can never fire. Even a clean deterministic pass gets forced into another full decompose cycle.
Suggested fix:
const deterministicPass = deterministic.status === 'ok';
const visualPass =
  visual.status === 'unavailable' ||
  visual.status === 'verified' ||
  visual.status === 'needs_review';
if (deterministicPass && visualPass) {
  // call done
}

> return { content: [{ type: 'text', text: report.summary }], details: report };
> }
> const sourcePath = params.sourceImagePath ?? 'source.png';
This tool still defaults to source.png, but nothing in the runtime persists an attached image into the agent FS before the tool runs. On the default path that means the visual verifier returns unavailable instead of ever judging parity.
Suggested fix:
const firstImage = promptContext.attachments.find((a) => a.imageDataUrl);
if (firstImage?.imageDataUrl) {
  await fs.create('source.png', firstImage.imageDataUrl);
}

> };
> logger.info('[verify_ui_kit_visual_parity] step=render', { slug: params.slug });
> const candidateImg = await renderUiKit(decomposed.content, signal);
The fallback described in the prompt still only covers missing dependencies/files. If renderUiKit() or judgeVisualParity() throws (for example on a text-only model or malformed JSON), this tool errors instead of returning a structured unavailable report.
Suggested fix:
try {
  const candidateImg = await renderUiKit(decomposed.content, signal);
  const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal);
} catch (error) {
  const report = unavailableReport(error instanceof Error ? error.message : String(error));
  return { content: [{ type: 'text', text: report.summary }], details: report };
}

> Add **Decompose to UI Kit** — one-click in the chat sidebar emits a `ui_kits/<slug>/` folder shaped for coding-agent handoff (`index.html` + `components/*.tsx` + `tokens.css` + `manifest.json` + `README.md`). Built-in deterministic + vision verifiers self-check parity using a 12-question boolean rubric (`parityScore = passCount / totalChecks`, no LLM-fabricated floats) and re-iterate on gaps. Per-decompose cost surfaces inline as a toast.
> Closes Phase 1 of #225.
This release note still says the phase is closed, but the shipped flow stops at producing a handoff bundle and explicitly does not continue into the downstream prototype step. Per the public PR template, this should be Refs #225 unless the full issue scope is implemented here.
Suggested fix: Refs #225
Summary
Phase 1 of #225: a single-image → componentized `ui_kit/` decomposition pipeline that emits a coding-agent-ready bundle, plus deterministic + vision verifiers that self-check parity using a 12-question boolean rubric and re-iterate on gaps. Uses existing `userImages` plumbing (PR #193) and adds three new agent tools that mirror existing patterns (`done.ts` / `generate-image-asset.ts`). Ends in the chat sidebar with a one-click trigger that fires a structured prompt, walks the agent through decompose → verify → reconcile → done, and surfaces per-decompose cost as a toast. No new prod deps, no SQLite schema change, in-memory output via the Files panel.

This PR closes the Phase 1 part of #225 only. The Phase 2 (gpt-image-2 generation in the loop) and Phase 3 (multi-page flow) cuts I committed to in the issue thread are intentionally not included.
Type of change
Linked issue
Closes #225 (Phase 1 only — Phase 2/3 deferred per my comment)
What's in here
3 new agent tools in `packages/core/src/tools/`:

- `decompose-to-ui-kit.ts` — orchestrator. Takes a source image (from chat context) + design brief, emits `ui_kits/<slug>/{index.html, components/*.tsx, tokens.css, manifest.json, README.md}` to the virtual FS. Output carries `schemaVersion: 1` so downstream coding agents (Claude Code, Cursor) can evolve safely.
- `verify-ui-kit-parity.ts` — deterministic verifier. 3 signals: element-count parity, visible-text coverage, token coverage. Returns a `ParityReport` with a `passCount/totalChecks`-derived score (no LLM in the loop, no floats).
- `verify-ui-kit-visual-parity.ts` — vision-LLM judge wrapper. Takes a host-injected `judgeVisualParity` callback, runs a 12-check boolean rubric across 5 dimensions (layout / color / typography / content / components), returns `parityScore = passCount / totalChecks` and a bounded-enum `status` (verified | needs_review | needs_iteration | failed | unavailable).

Host wiring in `apps/desktop/src/main/`:

- `render-ui-kit.ts` — offscreen `BrowserWindow.capturePage()` for the rendered ui_kit
- `judge-visual-parity.ts` — vision-judge prompt builder + LLM dispatcher using the existing `complete()` provider abstraction
- `agent.ts` deps interface, mirroring how `generate_image_asset` was wired

Renderer:

- `AddMenu.tsx` — new "Decompose to UI Kit" entry, disabled when no artifact / generation in flight
- `Sidebar.tsx` — `triggerDecompose(designId, locale)` action wired to the menu item
- `store.ts` — 3-branch toast feedback (busy / unavailable / started) + per-tool-call cost row when the visual judge resolves
- `hooks/decomposePrompt.ts` — locale-aware (EN/ZH) structured prompt that walks the agent through decompose → verify → reconcile → iterate (max 2) → done with an HONEST cost summary

Tests — full vitest coverage in `*.test.ts` next to each tool:

- `decompose-to-ui-kit.test.ts` (263 LOC)
- `verify-ui-kit-parity.test.ts` (180 LOC)
- `verify-ui-kit-visual-parity.test.ts` (295 LOC)

i18n — 9 new keys × EN + ZH for the menu entry, toast titles/descriptions, and cost row.
Design decisions
Boolean rubric, not floats. Every visual parity check is `{passed: boolean}`, and `parityScore = passCount / totalChecks` is derived from those. The `status` field is a bounded enum derived from thresholds (100% → `verified`, ≥85% → `needs_review`, ≥60% → `needs_iteration`, <60% → `failed`). No LLM-fabricated confidence floats, no scoring inflation. Aligns with the project's `HONEST_SCORES` precedent (`done.ts`'s `verified: boolean` field).

Host-injected callbacks, not framework lock-in. `verify-ui-kit-visual-parity.ts` doesn't import any LLM SDK or any Electron API. It takes `RenderUiKitFn` and `JudgeVisualParityFn` as deps. If the host doesn't inject them (e.g. a future headless CLI), the tool returns `status: 'unavailable'` honestly instead of crashing. Mirrors how `generate_image_asset` is keyed on `deps.generateImageAsset`.

In-memory output via Files panel, no schema bump. Per my open question in the issue thread, this PR ships option (a): the `ui_kits/<slug>/` folder lands in the design's virtual FS, surfaces in the existing Files panel, and uses the existing ZIP export for handoff to a coding agent. No SQLite migration, smallest blast radius, consistent with how `polishPrompt.ts`'s second pass mutates only in-memory state.

`schemaVersion: 1` on the manifest. Downstream consumers (Claude Code, Cursor) need a stable contract. Adding fields requires no version bump; renaming or removing fields requires `schemaVersion: 2` and a parallel-emit window.

Anti-hallucination guardrails

The deterministic verifier (`verify-ui-kit-parity.ts`) checks visible-text coverage on the emitted ui_kit vs the source brief — if the agent dropped any text content, it fails BEFORE the LLM judge runs. This catches data hallucination cheaply. The LLM judge then handles only semantic-quality dimensions (visual hierarchy, color harmony, typography pairing, etc.).

Cost surfacing

Every `verify_ui_kit_visual_parity` resolution pushes a toast with `passCount/totalChecks · status · $cost.NNNN`. It reads defensively from `result.details` so future contract drift degrades silently rather than crashing the renderer. The `done` tool's prompt-driven summary additionally requires the agent to report total run cost, per the `HONEST_STATUS` precedent.

Checklist
- Read `docs/VISION.md`, `docs/PRINCIPLES.md`, and `CLAUDE.md` before starting
- Commits signed off (`git commit -s`)
- `pnpm lint && pnpm typecheck && pnpm test` passes locally (1026 tests pass on this branch as of `d6f3a00`)
- Changeset added (`pnpm changeset`) — see `.changeset/decompose-to-ui-kit.md`
- Docs updated: `BENCHMARKS.md` (new), `README.md` + `README.zh-CN.md` (Decompose to UI Kit feature card + hero PNG + iter-reel GIF)

Dependency additions (if any)
None. All three new tools use only `@mariozechner/pi-agent-core`'s `AgentTool` factory pattern that's already a prod dep.

Screenshots / recordings (UI changes)
Side-by-side hero — source vs agent-emitted ui_kit (`e2e-opus-final` run, parityScore 0.90):

4-frame reconcile reel from the `e2e-nodebench-iter` run (iter-0 → iter-1 with honest score drift 0.82 → 0.78 — boolean rubric exposes the regression instead of hiding it):

MP4 version for higher fidelity.
Live-recorded session demo (real Electron app, no stitching) — recording in progress, will edit this PR description when the GIF is ready. ETA same day.
Cross-tier benchmarks
`BENCHMARKS.md` at repo root has the full methodology + run-by-run real-data results across model tiers (Opus, Pro+Pro+iterate, Kimi+Gemini3, NodeBench iter), reproducibility instructions, honest non-claims, and research citations (WebDevJudge, Prometheus-Vision, Trust-but-Verify ICCV 2025).
Scope discipline notes
- The lockfile diff (`pnpm-lock.yaml`) is mechanical regen. This is over the soft 400-LOC bar in CONTRIBUTING.md, but it's been pre-discussed in "[Feature]: image 2 is already impressive enough; what's most needed is how to turn the generated UI into components and then into a prototype!" #225, and the change is a single concern (one new feature path, no refactor mixed in). Happy to split into 3 PRs (per-tool) if the maintainer prefers — say the word.
- Follow-up ideas are filed in the Ideas category, not bundled here. Each is a meaningful subsystem that deserves alignment before code.

Branch state at PR open
- Behind `upstream/main` (`chore(deps)` bumps including pi-agent-core 0.67.68 → 0.70.2; my branch is on 0.67.68)
- The bump may touch the `AgentTool` shape; I'll handle that in the rebase pass.

Why this is ready to review now
- Benchmark results are real run data in `BENCHMARKS.md`, not synthetic

Looking forward to feedback. Happy to address structural concerns first before iterating on smaller polish.