
Enable Gemini-3.5-flash cua #2273

Open
miguelg719 wants to merge 5 commits into main from miguelgonzalez/gemini-3-5-flash-cua
Conversation

@miguelg719

@miguelg719 miguelg719 commented Jun 24, 2026

Collaborator

why

what changed

test plan


Summary by cubic

Add support for the google/gemini-3.5-flash computer-use agent. Normalizes Gemini 3.x function names/args, preserves click-family semantics, validates click coordinates, and always returns a fresh screenshot each turn while tracking reasoning and cached tokens.

  • New Features

    • Enable google/gemini-3.5-flash in agent/LLM provider maps and public types; update tests.
    • Map Gemini 3.x functions to 2.5 handlers and accept new arg shapes: coordinate-less type, keys array/single key, magnitude_in_pixels for scroll, drag start/end pairs, and screenshot as take_screenshot.
    • Always return a screenshot function response, even when no executable actions are produced.
  • Bug Fixes

    • Track reasoning_tokens and cached_input_tokens in Google CUA usage and aggregate metrics.
    • Preserve Gemini 3.x click-family semantics by mapping double_click, triple_click, right_click, middle_click, and move to native actions.
    • Validate click-family coordinates; drop calls with missing x/y instead of passing NaN, with unit tests for conversion.
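The argument-shape tolerance listed above can be illustrated with a minimal sketch. The helper names `normalizeKeys` and `normalizeScrollMagnitude`, the singular `key` and `magnitude` fallback fields, and the default of 800 pixels are all assumptions made for illustration; they are not taken from the PR diff.

```typescript
// Hypothetical helpers; names, fallback fields, and defaults are illustrative only.
type RawArgs = Record<string, unknown>;

// Accept `keys` as either a single string ("Enter") or an array (["Control", "a"]).
function normalizeKeys(args: RawArgs): string[] {
  const raw = args.keys ?? args.key; // `key` is an assumed singular variant
  if (Array.isArray(raw)) {
    return raw.filter((k): k is string => typeof k === "string");
  }
  return typeof raw === "string" ? [raw] : [];
}

// Prefer `magnitude_in_pixels`; fall back to an assumed `magnitude` field,
// then to a caller-supplied default.
function normalizeScrollMagnitude(args: RawArgs, fallback = 800): number {
  const candidate = args.magnitude_in_pixels ?? args.magnitude;
  return typeof candidate === "number" && Number.isFinite(candidate)
    ? candidate
    : fallback;
}
```

Returning an empty array or a default instead of throwing keeps a malformed call from aborting the whole turn; whether the real converter does the same is not stated in this thread.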

Written for commit c6e675c. Summary will update on new commits.


miguelg719 and others added 3 commits June 23, 2026 10:15
Gemini 3.x emits predefined function names and argument shapes that
differ from the 2.5 computer-use vocabulary. Map the 3.x names onto the
canonical 2.5 handlers, tolerate the new argument shapes (coordinate-less
type, keys arrays, scroll magnitude_in_pixels, drag start/end pairs),
treat take_screenshot as a recognized no-op, and always return a
screenshot function response even when a turn produced no executable
actions so the model is never left without an observation.

Only the click/take_screenshot aliases and click/navigate argument
shapes were confirmed from live gemini-3.5-flash traffic; the remaining
aliases follow the same drop-the-qualifier pattern and fall through to
the existing unknown-action warning if wrong.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
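A rough sketch of the alias-then-fall-through behavior this commit message describes. The three alias entries come from this thread (`click`, `left_click`, `type`); the `KNOWN_CANONICAL` set, the `resolveFunctionName` helper, and the use of a `Map` rather than the PR's `Record` are assumptions for illustration.

```typescript
// Illustrative subset of the alias table; the full list lives in the PR diff.
const NAME_ALIASES = new Map<string, string>([
  ["click", "click_at"],
  ["left_click", "click_at"],
  ["type", "type_text_at"],
]);

// Hypothetical set of names the converter has handlers for.
const KNOWN_CANONICAL = new Set(["click_at", "type_text_at", "take_screenshot"]);

// Map a 3.x name onto its 2.5 canonical; unknown names warn and yield null.
function resolveFunctionName(
  name: string,
  warn: (message: string) => void,
): string | null {
  const canonical = NAME_ALIASES.get(name) ?? name;
  if (!KNOWN_CANONICAL.has(canonical)) {
    warn(`Unsupported function call: ${name}`);
    return null;
  }
  return canonical;
}
```

Because a wrong or missing alias resolves to a name with no handler, it ends up on the warning path rather than executing the wrong action, which is the safety property the commit message relies on.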
GoogleCUAClient read only promptTokenCount/candidatesTokenCount, dropping
Gemini's cachedContentTokenCount and thoughtsTokenCount — so cached_input_tokens
and reasoning_tokens were always 0 in agent metrics even though the CUA
handler and updateMetrics already plumb them through. Surface both per step
and in the aggregated usage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
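The token plumbing described in this commit message might look roughly like the following. The four `usageMetadata` field names are the ones the message cites; the `StepUsage` shape and the helpers `toStepUsage` and `addUsage` are hypothetical names used only for this sketch.

```typescript
// Subset of Gemini's usageMetadata, limited to the fields named in the commit message.
interface GeminiUsageMetadata {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
  cachedContentTokenCount?: number;
  thoughtsTokenCount?: number;
}

// Hypothetical per-step usage record.
interface StepUsage {
  input_tokens: number;
  output_tokens: number;
  reasoning_tokens: number;
  cached_input_tokens: number;
}

// Surface cached and reasoning tokens instead of silently dropping them.
function toStepUsage(meta?: GeminiUsageMetadata): StepUsage {
  return {
    input_tokens: meta?.promptTokenCount ?? 0,
    output_tokens: meta?.candidatesTokenCount ?? 0,
    reasoning_tokens: meta?.thoughtsTokenCount ?? 0,
    cached_input_tokens: meta?.cachedContentTokenCount ?? 0,
  };
}

// Aggregate two usage records field by field.
function addUsage(a: StepUsage, b: StepUsage): StepUsage {
  return {
    input_tokens: a.input_tokens + b.input_tokens,
    output_tokens: a.output_tokens + b.output_tokens,
    reasoning_tokens: a.reasoning_tokens + b.reasoning_tokens,
    cached_input_tokens: a.cached_input_tokens + b.cached_input_tokens,
  };
}
```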
@changeset-bot

changeset-bot Bot commented Jun 24, 2026


🦋 Changeset detected

Latest commit: c6e675c

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
| Name | Type |
| --- | --- |
| @browserbasehq/stagehand | Minor |
| @browserbasehq/stagehand-evals | Patch |
| @browserbasehq/stagehand-server-v3 | Patch |


@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor


4 issues found across 6 files

Confidence score: 2/5

  • In packages/core/lib/v3/agent/GoogleCUAClient.ts, double_click/triple_click are collapsed to click_at without preserving click count, so intended multi-click interactions execute as single clicks and can break Gemini-3.x tasks that depend on double/triple click semantics — pass through and honor the click count before merging.
  • In packages/core/lib/v3/agent/GoogleCUAClient.ts, right/middle click and mouse down/up actions are still unimplemented, so model-emitted click-family calls can silently no-op and leave automation flows stuck or incorrect — implement these handlers (or explicitly gate/fail fast) before merging.
  • In packages/core/lib/v3/agent/AgentProvider.ts, extending hardcoded model-to-provider mappings keeps model onboarding tied to code changes, increasing regression risk whenever new models are introduced — switch to provider-derived/dynamic resolution instead of expanding allowlists.
  • In packages/core/lib/v3/llm/LLMProvider.ts, adding support via deprecated unprefixed model IDs prolongs a legacy path and can create inconsistent model resolution behavior — route new support through provider/model IDs and avoid expanding the deprecated mapping.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">

<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>
</file>
Architecture diagram
sequenceDiagram
    participant App as Agent Loop (run)
    participant Client as GoogleCUAClient (executeStep)
    participant Mapper as convertFunctionCallToAction
    participant Exec as Action Executor
    participant SS as Screenshot Capture
    participant API as Google Gemini API

    Note over App,API: Gemini-3.5-flash CUA turn (happy path)

    App->>Client: executeStep(context, logger)
    Client->>API: send request (history + screenshot)
    API-->>Client: response (functionCalls, usageMetadata)

    alt functionCall has predefined function
        Client->>Mapper: for each part.functionCall
        Note over Mapper: NAME_ALIASES maps 3.x names → 2.5 canonicals<br/>e.g. "click" → "click_at", "type" → "type_text_at"
        Mapper->>Mapper: normalize args shape<br/>(keys: string|array, scroll: magnitudeInPixels,<br/>drag: start/end, type: optional coords)
        Mapper-->>Client: normalized AgentAction (e.g. type, click, screenshot)
    end

    alt action.type === "screenshot"
        Client->>Client: log "take_screenshot: capturing current page"<br/>no browser interaction
    else action.type === "type" AND coordinates present
        Client->>Exec: click (x,y left)
        Client->>Exec: select all (if clearBeforeTyping)
        Client->>Exec: type text
    else action.type === "type" AND no coordinates
        Client->>Exec: type text directly<br/>(element already focused)
    else other executable actions (click_at, scroll_at, etc.)
        Client->>Exec: execute action via browser
    end

    Note over Client: Always capture fresh screenshot after processing actions<br/>(even if no executable actions, e.g. only take_screenshot)

    Client->>SS: captureScreenshot()
    SS-->>Client: screenshot bytes

    Client->>Client: build functionResponses: [screenshot part]

    Client->>API: turn call with functionResponses
    API-->>Client: next turn result (final or continue)

    Client->>Client: aggregate usage (input_tokens, output_tokens,<br/>reasoning_tokens, cached_input_tokens, inference_time_ms)
    Note over Client: reasoning_tokens = usageMetadata.thoughtsTokenCount<br/>cached_input_tokens = usageMetadata.cachedContentTokenCount

    Client-->>App: StepResult with actions, message, usage

    App->>App: accumulate totals for all steps
    App-->>App: final response with full usage (including reasoning, cached)
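The "always capture a fresh screenshot" step in the diagram can be sketched as below. Everything here is an assumption for illustration: the `FunctionResponsePart` shape, the `buildFunctionResponses` helper, and the choice to attribute the observation to `take_screenshot` when nothing executed are not confirmed by this thread.

```typescript
// Hypothetical, simplified function-response part carrying the observation.
interface FunctionResponsePart {
  name: string;
  screenshotBase64: string;
}

// Build one response per executed call; if the turn produced no executable
// actions, still emit a single screenshot response so the model gets an observation.
function buildFunctionResponses(
  executedNames: string[],
  screenshotBase64: string,
): FunctionResponsePart[] {
  const names =
    executedNames.length > 0 ? executedNames : ["take_screenshot"]; // assumed fallback name
  return names.map((name) => ({ name, screenshotBase64 }));
}
```

The point of the non-empty guarantee is that the next request to the model never goes out without a current view of the page.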

Comment thread packages/core/lib/v3/agent/GoogleCUAClient.ts Outdated
Comment thread packages/core/lib/v3/agent/AgentProvider.ts
// NOTE: click and take_screenshot are confirmed from live gemini-3.5-flash
// traffic; the rest are inferred from the same drop-the-qualifier pattern
// and are safe aliases (any unmapped name still hits the warning below).
const NAME_ALIASES: Record<string, string> = {

@cubic-dev-ai cubic-dev-ai Bot Jun 24, 2026


P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/agent/GoogleCUAClient.ts, line 845:

<comment>Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</comment>

<file context>
@@ -794,19 +824,62 @@ export class GoogleCUAClient extends AgentClient {
+    // NOTE: click and take_screenshot are confirmed from live gemini-3.5-flash
+    // traffic; the rest are inferred from the same drop-the-qualifier pattern
+    // and are safe aliases (any unmapped name still hits the warning below).
+    const NAME_ALIASES: Record<string, string> = {
+      click: "click_at",
+      left_click: "click_at",
</file context>
Collaborator Author


addressed

Comment thread packages/core/lib/v3/llm/LLMProvider.ts
Per the Gemini 3.5 Flash computer-use spec, double_click/triple_click/
right_click/middle_click/move are distinct predefined functions. The
converter collapsed double/triple click to a single left click and left
right/middle click + move unmapped (silent no-op). Map them to the
executor's native double_click/triple_click/move actions and click with
the right button. gemini-2.5 emits none of these names, so its canonical
handlers are unaffected.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
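One way to picture the click-family mapping this commit message describes. The action shape and the `mapClickFamily` helper are hypothetical, and mapping `middle_click` to a click with the middle button is an assumption by analogy with the right-button case stated in the message.

```typescript
type MouseButton = "left" | "right" | "middle";

// Hypothetical, simplified action shape for the executor.
type ClickFamilyAction =
  | { type: "click"; button: MouseButton; x: number; y: number }
  | { type: "double_click" | "triple_click" | "move"; x: number; y: number };

// Keep each 3.x function distinct instead of collapsing it to a single left click.
function mapClickFamily(name: string, x: number, y: number): ClickFamilyAction | null {
  switch (name) {
    case "double_click":
      return { type: "double_click", x, y };
    case "triple_click":
      return { type: "triple_click", x, y };
    case "move":
      return { type: "move", x, y };
    case "right_click":
      return { type: "click", button: "right", x, y };
    case "middle_click":
      return { type: "click", button: "middle", x, y }; // assumed, see lead-in
    default:
      return null; // not a click-family name; handled elsewhere
  }
}
```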

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor


2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="packages/core/lib/v3/agent/GoogleCUAClient.ts">

<violation number="1" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:845">
P2: Gemini-3.x click-family actions are only partially implemented; right/middle click and mouse down/up are still unhandled. When the model emits these function calls, the agent logs unsupported and performs no action.</violation>

<violation number="2" location="packages/core/lib/v3/agent/GoogleCUAClient.ts:926">
P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.

(Based on your team's feedback about adding unit tests for new behavior.)</violation>
</file>


Comment thread packages/core/lib/v3/agent/GoogleCUAClient.ts
@@ -241,6 +241,8 @@ export class GoogleCUAClient extends AgentClient {

@cubic-dev-ai cubic-dev-ai Bot Jun 25, 2026


P3: New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.

(Based on your team's feedback about adding unit tests for new behavior.)

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/core/lib/v3/agent/GoogleCUAClient.ts, line 926:

<comment>New Gemini click-family behavior lacks focused unit tests for conversion semantics and edge cases. Add tests that assert produced AgentAction types/buttons/coordinates for double/triple/right/middle/move.

(Based on your team's feedback about adding unit tests for new behavior.)</comment>

<file context>
@@ -893,6 +897,40 @@ export class GoogleCUAClient extends AgentClient {
+        };
+      }
+
+      case "move": {
+        const { x, y } = this.normalizeCoordinates(
+          args.x as number,
</file context>
Collaborator Author


addressed in c6e675c

Guard the gemini-3.x click-family cases (double/triple/right/middle click,
move) so a payload missing x/y returns null instead of normalizing NaN
into the executor, matching drag_and_drop. Add focused unit tests asserting
the produced AgentAction type/button/coordinates for each, the missing-coord
null path, and 2.5 click_at backcompat.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
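The missing-coordinate guard from this commit message could be sketched as follows. `readPoint`, `normalizePoint`, and `convertMove` are hypothetical names, and the scaling from a 0-999 model grid to viewport pixels is an assumption about what the PR's `normalizeCoordinates` does.

```typescript
interface Point {
  x: number;
  y: number;
}

// Return null unless both coordinates are finite numbers, so NaN never reaches the executor.
function readPoint(args: Record<string, unknown>): Point | null {
  const { x, y } = args;
  if (typeof x !== "number" || typeof y !== "number") return null;
  if (!Number.isFinite(x) || !Number.isFinite(y)) return null;
  return { x, y };
}

// Assumed scaling: model coordinates on a 0-999 grid mapped onto the viewport.
function normalizePoint(point: Point, viewport: { width: number; height: number }): Point {
  return {
    x: Math.floor((point.x / 1000) * viewport.width),
    y: Math.floor((point.y / 1000) * viewport.height),
  };
}

// Convert a `move` call, dropping it when x or y is missing.
function convertMove(
  args: Record<string, unknown>,
  viewport: { width: number; height: number },
): ({ type: "move" } & Point) | null {
  const point = readPoint(args);
  if (point === null) return null;
  return { type: "move", ...normalizePoint(point, viewport) };
}
```

Checking the raw arguments before normalizing matters because arithmetic on `undefined` yields `NaN`, which would otherwise pass through as a seemingly valid number.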