Enable Gemini-3.5-flash cua #2273
Gemini 3.x emits predefined function names and argument shapes that differ from the 2.5 computer-use vocabulary. Map the 3.x names onto the canonical 2.5 handlers, tolerate the new argument shapes (coordinate-less type, keys arrays, scroll magnitude_in_pixels, drag start/end pairs), treat take_screenshot as a recognized no-op, and always return a screenshot function response even when a turn produced no executable actions so the model is never left without an observation. Only the click/take_screenshot aliases and click/navigate argument shapes were confirmed from live gemini-3.5-flash traffic; the remaining aliases follow the same drop-the-qualifier pattern and fall through to the existing unknown-action warning if wrong. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
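The alias step described in this commit can be sketched roughly as below. This is a minimal illustration, not the code in GoogleCUAClient.ts: `resolveFunctionName`, `NO_OP_NAMES`, and the `Resolved` type are hypothetical names, and only the `click`, `left_click`, and `type` entries are taken from this PR's diff excerpt and diagram.

```typescript
// Hypothetical sketch of the 3.x -> 2.5 name mapping. A Map is used so that
// names such as "constructor" cannot accidentally match an object prototype key.
const NAME_ALIASES = new Map<string, string>([
  ["click", "click_at"], // described as confirmed from live traffic
  ["left_click", "click_at"], // inferred alias, per the commit message
  ["type", "type_text_at"], // inferred alias, per the commit message
]);

// Names that are recognized but never touch the browser.
const NO_OP_NAMES = new Set<string>(["take_screenshot"]);

type Resolved =
  | { kind: "noop"; name: string }
  | { kind: "action"; name: string };

function resolveFunctionName(name: string): Resolved {
  if (NO_OP_NAMES.has(name)) {
    return { kind: "noop", name };
  }
  // Unmapped names pass through unchanged, so a downstream switch can still
  // reach its existing unknown-action warning.
  return { kind: "action", name: NAME_ALIASES.get(name) ?? name };
}
```

Keeping the canonical 2.5 names as the pass-through default means 2.5 traffic is untouched by the alias table.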
GoogleCUAClient read only promptTokenCount/candidatesTokenCount, dropping Gemini's cachedContentTokenCount and thoughtsTokenCount — so cached_input_tokens and reasoning_tokens were always 0 in agent metrics even though the CUA handler and updateMetrics already plumb them through. Surface both per step and in the aggregated usage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
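A rough sketch of the mapping this commit describes, assuming the four `usageMetadata` counters named above are optional numbers. The `toStepUsage` and `addUsage` helpers and the `StepUsage` interface are hypothetical names for illustration, not the client's actual API.

```typescript
// Subset of Gemini's usageMetadata that the commit message refers to.
interface GeminiUsageMetadata {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
  cachedContentTokenCount?: number;
  thoughtsTokenCount?: number;
}

// Hypothetical per-step usage shape using the metric names from this PR.
interface StepUsage {
  input_tokens: number;
  output_tokens: number;
  reasoning_tokens: number;
  cached_input_tokens: number;
}

// Convert one response's metadata, defaulting absent counters to 0.
function toStepUsage(meta: GeminiUsageMetadata = {}): StepUsage {
  return {
    input_tokens: meta.promptTokenCount ?? 0,
    output_tokens: meta.candidatesTokenCount ?? 0,
    reasoning_tokens: meta.thoughtsTokenCount ?? 0,
    cached_input_tokens: meta.cachedContentTokenCount ?? 0,
  };
}

// Accumulate a step into the running total without mutating either input.
function addUsage(total: StepUsage, step: StepUsage): StepUsage {
  return {
    input_tokens: total.input_tokens + step.input_tokens,
    output_tokens: total.output_tokens + step.output_tokens,
    reasoning_tokens: total.reasoning_tokens + step.reasoning_tokens,
    cached_input_tokens: total.cached_input_tokens + step.cached_input_tokens,
  };
}
```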
🦋 Changeset detected
Latest commit: c6e675c
The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages.
4 issues found across 6 files
Confidence score: 2/5
- In packages/core/lib/v3/agent/GoogleCUAClient.ts, double_click/triple_click are collapsed to click_at without preserving click count, so intended multi-click interactions execute as single clicks and can break Gemini-3.x tasks that depend on double/triple click semantics. Pass through and honor the click count before merging.
- In packages/core/lib/v3/agent/GoogleCUAClient.ts, right/middle click and mouse down/up actions are still unimplemented, so model-emitted click-family calls can silently no-op and leave automation flows stuck or incorrect. Implement these handlers (or explicitly gate/fail fast) before merging.
- In packages/core/lib/v3/agent/AgentProvider.ts, extending hardcoded model-to-provider mappings keeps model onboarding tied to code changes, increasing regression risk whenever new models are introduced. Switch to provider-derived/dynamic resolution instead of expanding allowlists.
- In packages/core/lib/v3/llm/LLMProvider.ts, adding support via deprecated unprefixed model IDs prolongs a legacy path and can create inconsistent model resolution behavior. Route new support through provider/model IDs and avoid expanding the deprecated mapping.
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">
<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>
</file>
Architecture diagram
sequenceDiagram
participant App as Agent Loop (run)
participant Client as GoogleCUAClient (executeStep)
participant Mapper as convertFunctionCallToAction
participant Exec as Action Executor
participant SS as Screenshot Capture
participant API as Google Gemini API
Note over App,API: Gemini-3.5-flash CUA turn (happy path)
App->>Client: executeStep(context, logger)
Client->>API: send request (history + screenshot)
API-->>Client: response (functionCalls, usageMetadata)
alt functionCall has predefined function
Client->>Mapper: for each part.functionCall
Note over Mapper: NAME_ALIASES maps 3.x names → 2.5 canonicals<br/>e.g. "click" → "click_at", "type" → "type_text_at"
Mapper->>Mapper: normalize args shape<br/>(keys: string|array, scroll: magnitudeInPixels,<br/>drag: start/end, type: optional coords)
Mapper-->>Client: normalized AgentAction (e.g. type, click, screenshot)
end
alt action.type === "screenshot"
Client->>Client: log "take_screenshot: capturing current page"<br/>no browser interaction
else action.type === "type" AND coordinates present
Client->>Exec: click (x,y left)
Client->>Exec: select all (if clearBeforeTyping)
Client->>Exec: type text
else action.type === "type" AND no coordinates
Client->>Exec: type text directly<br/>(element already focused)
else other executable actions (click_at, scroll_at, etc.)
Client->>Exec: execute action via browser
end
Note over Client: Always capture fresh screenshot after processing actions<br/>(even if no executable actions, e.g. only take_screenshot)
Client->>SS: captureScreenshot()
SS-->>Client: screenshot bytes
Client->>Client: build functionResponses: [screenshot part]
Client->>API: turn call with functionResponses
API-->>Client: next turn result (final or continue)
Client->>Client: aggregate usage (input_tokens, output_tokens,<br/>reasoning_tokens, cached_input_tokens, inference_time_ms)
Note over Client: reasoning_tokens = usageMetadata.thoughtsTokenCount<br/>cached_input_tokens = usageMetadata.cachedContentTokenCount
Client-->>App: StepResult with actions, message, usage
App->>App: accumulate totals for all steps
App-->>App: final response with full usage (including reasoning, cached)
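The two "type" branches in the diagram can be read as a small planning step. The sketch below is an illustration under assumptions: `TypeArgs`, `ExecStep`, and `planTypeSteps` are hypothetical names, and the real executor interface in this PR may look different.

```typescript
// Hypothetical argument shape: coordinates are optional for Gemini 3.x.
interface TypeArgs {
  text: string;
  x?: number;
  y?: number;
  clearBeforeTyping?: boolean;
}

// Hypothetical executor steps mirroring the diagram's messages.
type ExecStep =
  | { kind: "click"; x: number; y: number; button: "left" }
  | { kind: "select_all" }
  | { kind: "type"; text: string };

function planTypeSteps(args: TypeArgs): ExecStep[] {
  const steps: ExecStep[] = [];
  if (typeof args.x === "number" && typeof args.y === "number") {
    // Coordinates present: focus the target first, then optionally clear it.
    steps.push({ kind: "click", x: args.x, y: args.y, button: "left" });
    if (args.clearBeforeTyping) {
      steps.push({ kind: "select_all" });
    }
  }
  // With or without coordinates, the text is always typed last.
  steps.push({ kind: "type", text: args.text });
  return steps;
}
```

In this sketch `clearBeforeTyping` is ignored when no coordinates are given, which matches the diagram's coordinate-less branch (type directly into the already focused element).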
// NOTE: click and take_screenshot are confirmed from live gemini-3.5-flash
// traffic; the rest are inferred from the same drop-the-qualifier pattern
// and are safe aliases (any unmapped name still hits the warning below).
const NAME_ALIASES: Record<string, string> = {
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/agent/GoogleCUAClient.ts, line 845:
<comment>Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</comment>
<file context>
@@ -794,19 +824,62 @@ export class GoogleCUAClient extends AgentClient {
+ // NOTE: click and take_screenshot are confirmed from live gemini-3.5-flash
+ // traffic; the rest are inferred from the same drop-the-qualifier pattern
+ // and are safe aliases (any unmapped name still hits the warning below).
+ const NAME_ALIASES: Record<string, string> = {
+ click: "click_at",
+ left_click: "click_at",
</file context>
Per the Gemini 3.5 Flash computer-use spec, double_click/triple_click/right_click/middle_click/move are distinct predefined functions. The converter collapsed double/triple click to a single left click and left right_click, middle_click, and move unmapped (silent no-op). Map them to the executor's native double_click/triple_click/move actions and click with the right button. gemini-2.5 emits none of these names, so its canonical handlers are unaffected. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
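As a hedged sketch of the mapping this commit describes: the action shapes and the `convertClickFamily` helper below are hypothetical, and mapping middle_click to a click with the middle button is an assumption (the message above only spells out the right-button case).

```typescript
type MouseButton = "left" | "right" | "middle";

// Hypothetical action shapes; the real AgentAction type may differ.
type ClickFamilyAction =
  | { type: "click"; button: MouseButton; x: number; y: number }
  | { type: "double_click"; x: number; y: number }
  | { type: "triple_click"; x: number; y: number }
  | { type: "move"; x: number; y: number };

// Map a Gemini 3.x click-family name onto a distinct action instead of
// collapsing everything to a single left click.
function convertClickFamily(
  name: string,
  x: number,
  y: number,
): ClickFamilyAction | null {
  switch (name) {
    case "double_click":
      return { type: "double_click", x, y };
    case "triple_click":
      return { type: "triple_click", x, y };
    case "right_click":
      return { type: "click", button: "right", x, y };
    case "middle_click":
      // Assumption: handled like right_click but with the middle button.
      return { type: "click", button: "middle", x, y };
    case "move":
      return { type: "move", x, y };
    default:
      // Not a click-family name; the caller falls back to other handlers.
      return null;
  }
}
```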
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">
<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>
<violation number="2" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:926">
P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.
(Based on your team's feedback about adding unit tests for new behavior.)</violation>
</file>
@@ -241,6 +241,8 @@ export class GoogleCUAClient extends AgentClient {
P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.
(Based on your team's feedback about adding unit tests for new behavior.)
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/agent/GoogleCUAClient.ts, line 926:
<comment>New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.
(Based on your team's feedback about adding unit tests for new behavior.)</comment>
<file context>
@@ -893,6 +897,40 @@ export class GoogleCUAClient extends AgentClient {
+ };
+ }
+
+ case "move": {
+ const { x, y } = this.normalizeCoordinates(
+ args.x as number,
</file context>
Guard the gemini-3.x click-family cases (double/triple/right/middle click, move) so a payload missing x/y returns null instead of normalizing NaN into the executor, matching drag_and_drop. Add focused unit tests asserting the produced AgentAction type/button/coordinates for each, the missing-coord null path, and 2.5 click_at backcompat. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
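The guard can be illustrated with a standalone sketch. Assumptions, stated loudly: the 0-1000 coordinate grid and the scaling formula are guesses at what `normalizeCoordinates` does, the function is written here as a free function rather than a class method, and `toMoveAction` is a hypothetical helper.

```typescript
interface Viewport {
  width: number;
  height: number;
}

// Assumption: model coordinates lie on a 0-1000 grid scaled to the viewport.
function normalizeCoordinates(
  x: number,
  y: number,
  viewport: Viewport,
): { x: number; y: number } {
  return {
    x: Math.floor((x / 1000) * viewport.width),
    y: Math.floor((y / 1000) * viewport.height),
  };
}

// Return null when x/y are missing or non-finite, so NaN never reaches
// the executor.
function toMoveAction(
  args: Record<string, unknown>,
  viewport: Viewport,
): { type: "move"; x: number; y: number } | null {
  const { x, y } = args;
  if (
    typeof x !== "number" ||
    typeof y !== "number" ||
    !Number.isFinite(x) ||
    !Number.isFinite(y)
  ) {
    return null;
  }
  return { type: "move", ...normalizeCoordinates(x, y, viewport) };
}
```

Without the guard, a payload such as `{ y: 10 }` would pass `undefined` into the arithmetic and produce NaN for x, which is the failure this commit describes.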
why
what changed
test plan
Summary by cubic
Add support for the google/gemini-3.5-flash computer-use agent. Normalizes Gemini 3.x function names/args, preserves click-family semantics, validates click coordinates, and always returns a fresh screenshot each turn while tracking reasoning and cached tokens.

New Features
- google/gemini-3.5-flash in agent/LLM provider maps and public types; update tests.
- type, keys array/single key, magnitude_in_pixels for scroll, drag start/end pairs, and screenshot as take_screenshot.

Bug Fixes
- reasoning_tokens and cached_input_tokens in Google CUA usage and aggregate metrics.
- double_click, triple_click, right_click, middle_click, and move to native actions.

Written for commit c6e675c. Summary will update on new commits.