fix: Route visual grounding fallback through a dedicated feature and prompt#87
Conversation
The fallback was re-calling FEATURE_GROUNDER with the hierarchy stripped,
which re-uses a prompt built around matching ui_elements. When an element
is visible in the screenshot but missing from the hierarchy, that call
reliably fails, the tap never happens, and the planner correctly re-plans
the same act -- producing a 110-iteration loop on Visual grounding failed.
Introduce FEATURE_VISUAL_GROUNDER with a simpler coordinate-only prompt
(act + screenshot -> {x, y, reason}). VisualGrounder now calls the new
feature instead of reusing the primary grounder.
📝 WalkthroughWalkthroughIntroduces a new visual grounder feature constant and wires it through the goal executor system. The constant Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
|
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/goal-executor/src/prompts/visual-grounder.md (1)
20-22: Unify the output contract wording to a single JSON shape.Line 20–22 describes top-level
{x, y, reason}/{isError: true, reason}, while Line 29–37 requires{"output": {...}}. Keeping one canonical shape will reduce model drift.Proposed prompt wording adjustment
-- If the element is clearly visible: return `{x, y, reason}` with the center coordinates. -- If the element is not visible at all: return `{isError: true, reason}`. Do not guess coordinates. +- If the element is clearly visible: return `{"output":{"x":<int>,"y":<int>,"reason":"..."}}`. +- If the element is not visible at all: return `{"output":{"isError":true,"reason":"..."}}`. Do not guess coordinates.Also applies to: 29-37
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/goal-executor/src/prompts/visual-grounder.md` around lines 20 - 22, Unify the response contract in visual-grounder.md so the prompt always expects a single JSON top-level shape: return {"output": {...}}; inside "output" include either {x, y, reason} for a visible element or {isError: true, reason} when not visible or uncertain. Update all guidance sentences (the earlier lines that mention `{x, y, reason}` / `{isError: true, reason}` and the later lines that reference `{"output": {...}}`) to use this canonical shape and wording, and ensure examples and instructions consistently reference the "output" wrapper and the two allowed inner objects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/goal-executor/src/prompts/visual-grounder.md`:
- Around line 20-22: Unify the response contract in visual-grounder.md so the
prompt always expects a single JSON top-level shape: return {"output": {...}};
inside "output" include either {x, y, reason} for a visible element or {isError:
true, reason} when not visible or uncertain. Update all guidance sentences (the
earlier lines that mention `{x, y, reason}` / `{isError: true, reason}` and the
later lines that reference `{"output": {...}}`) to use this canonical shape and
wording, and ensure examples and instructions consistently reference the
"output" wrapper and the two allowed inner objects.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 77479e21-176c-460d-ac38-f7e38af41b59
📒 Files selected for processing (5)
packages/common/src/constants.tspackages/goal-executor/src/ActionExecutor.test.tspackages/goal-executor/src/ai/AIAgent.tspackages/goal-executor/src/ai/VisualGrounder.tspackages/goal-executor/src/prompts/visual-grounder.md
Summary
When the grounder returns
needsVisualGrounding: true, the fallback was re-callingFEATURE_GROUNDERwith the hierarchy stripped. That prompt is built around matchingui_elementsby index, so the call reliably fails on canvas/WebView targets — the tap never happens, the screen doesn't change, and the planner correctly re-plans the sameact. Result: a 110-iteration loop onVisual grounding failed.This PR introduces a dedicated
FEATURE_VISUAL_GROUNDERwith a simpler coordinate-only prompt.VisualGroundernow calls the new feature instead of reusing the primary grounder, so the fallback gets a prompt designed to return{x, y, reason}from a screenshot + description.Changes
packages/common/src/constants.ts— addFEATURE_VISUAL_GROUNDER = 'visual-grounder'.packages/goal-executor/src/prompts/visual-grounder.md— new minimal prompt:act+ screenshot in,{x, y, reason}or{isError, reason}out.packages/goal-executor/src/ai/AIAgent.ts— import the new constant; route it to the new prompt in_getPromptKeyForFeature.packages/goal-executor/src/ai/VisualGrounder.ts— switchfeaturefromFEATURE_GROUNDERtoFEATURE_VISUAL_GROUNDER; drop the stale "no hierarchy" trick comment.packages/goal-executor/src/ActionExecutor.test.ts— existing visual-fallback test now asserts the feature name on both calls (FEATURE_GROUNDERfor primary,FEATURE_VISUAL_GROUNDERfor fallback).Test plan
npm test --workspace=packages/goal-executor— the three visual-grounding tests pass (records visual fallback in trace, preserves ground failure when no fallback, surfaces terminal provider failures from fallback). Three unrelated Gemini-thinking-level tests fail onmaintoo — not touched here.action.ground(feature=grounder) followed byaction.visual_fallback(feature=visual-grounder).{isError: true}; confirm the step fails once with a clear reason instead of spinning.Follow-up
Backend model-config should get an entry for the
visual-grounderfeature so it's independently tunable. Until then, unknown features fall back to the default model config on the backend, which still works but isn't independently configurable.Summary by CodeRabbit
Release Notes