From 28258149b8509f14c53e7287d2928acba5a159e6 Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Sun, 21 Jun 2026 14:33:50 +0000 Subject: [PATCH 1/9] add cua skill plugin New `cua` plugin documenting both the `cua` CLI (`@onkernel/cua-cli`) and the `@onkernel/cua-agent` TS library (`CuaAgent` / `CuaAgentHarness`). Single skill covers one-shot subcommands, named sessions, transcripts, model selection across providers, library quick start, live-view handoff for manual login, and the Playwright escape hatch for deterministic actions against the underlying Kernel browser. Co-Authored-By: Claude Opus 4.7 --- README.md | 12 + plugins/cua/.claude-plugin/plugin.json | 11 + plugins/cua/skills/cua/SKILL.md | 390 +++++++++++++++++++++++++ 3 files changed, 413 insertions(+) create mode 100644 plugins/cua/.claude-plugin/plugin.json create mode 100644 plugins/cua/skills/cua/SKILL.md diff --git a/README.md b/README.md index 31232f1..9860269 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,9 @@ Official AI agent skills from the Kernel for installing useful skills for our CL # Install the video generation skill /plugin install generate-video + +# Install the cua skill (CLI + library for computer-use on Kernel) +/plugin install cua ``` ### Cursor @@ -42,6 +45,7 @@ git clone https://github.com/kernel/skills.git cp -r skills/plugins/kernel-cli ~/.claude/skills/ cp -r skills/plugins/kernel-sdks ~/.claude/skills/ cp -r skills/plugins/generate-video ~/.claude/skills/ +cp -r skills/plugins/cua ~/.claude/skills/ ``` ## Prerequisites @@ -83,6 +87,14 @@ SDK skills for building browser automation with TypeScript and Python. | **typescript-sdk** | Build automation with Kernel's Typescript SDK | | **python-sdk** | Build automation with kernel's Python SDK | +### cua + +Computer-use loop for Kernel cloud browsers — CLI for shell-driven automation and the `@onkernel/cua-agent` TS library for embedding in your own agents. 
+ +| Skill | Description | +|-------|-------------| +| **cua** | Drive Kernel cua via the `cua` CLI (one-shot subcommands, named sessions, TUI) or the `@onkernel/cua-agent` library (`CuaAgent` / `CuaAgentHarness`); covers model selection, profile persistence, transcripts, live-view handoff, and Playwright escape hatches | + ### generate-video Render smooth, deterministic MP4s from web scenes. No Kernel account required — just Chromium, Node, and ffmpeg. diff --git a/plugins/cua/.claude-plugin/plugin.json b/plugins/cua/.claude-plugin/plugin.json new file mode 100644 index 0000000..273978e --- /dev/null +++ b/plugins/cua/.claude-plugin/plugin.json @@ -0,0 +1,11 @@ +{ + "name": "cua", + "version": "1.0.0", + "description": "Drive Kernel cua: the `cua` CLI for shell-driven computer-use automation, and the @onkernel/cua-agent TS library for building your own computer-use agents on Kernel browsers", + "author": { + "name": "Kernel", + "url": "www.kernel.sh" + }, + "repository": "https://github.com/kernel/skills", + "license": "MIT" +} diff --git a/plugins/cua/skills/cua/SKILL.md b/plugins/cua/skills/cua/SKILL.md new file mode 100644 index 0000000..3b3d4cb --- /dev/null +++ b/plugins/cua/skills/cua/SKILL.md @@ -0,0 +1,390 @@ +--- +name: cua +description: Drive Kernel cua — the `cua` CLI for shell automation, or the @onkernel/cua-agent TypeScript library for building your own computer-use agents. Use when opening URLs, clicking/typing/observing in a real cloud browser via cua, chaining multi-step browser tasks across shell calls, or wiring up `CuaAgent` / `CuaAgentHarness` against a Kernel browser. Covers model selection (gpt-5.5, claude-opus-4-7, gemini-3-flash-preview, n1.5-latest), named sessions, profile persistence, transcripts, live-view handoff, and Playwright escape hatches. +--- + +# cua + +`cua` is a computer-use loop for Kernel cloud browsers. 
There are two surfaces, both backed by the same execution layer: + +- **`cua` CLI** (`@onkernel/cua-cli`) — single binary that drives a real Chrome session running in Kernel. Each subcommand returns a one-line result on stdout and a deterministic exit code, so shell agents can chain calls. +- **`@onkernel/cua-agent` library** — `CuaAgent` / `CuaAgentHarness` TypeScript classes that run the same prompt → screenshot → tool-call loop against a Kernel browser, callable from your own code. + +Both translate per-provider computer-use tool calls (OpenAI's `computer`, Anthropic's `computer_20251124`, Gemini's normalized-coordinate functions, Yutori Navigator's browser actions) into Kernel SDK `browsers.computer.*` calls and feed a fresh screenshot back to the model on every turn. + +## When to use this skill + +- **Use the CLI** when you need shell-callable computer-use steps (`cua open`, `cua click`, `cua do …`) or an interactive TUI. Best for ad-hoc agent tasks, shell pipelines, and one-shot prompts. +- **Use the library** when you need to embed cua inside a larger TS app, run a custom session repo, add your own pi tools alongside computer use, or react to per-event streams programmatically. +- **Reach for `kernel-agent-browser` instead** when you need deterministic browser scripting (semantic selectors, `find role`, `wait --text`, snapshots/refs). cua drives by screenshots; agent-browser drives by accessibility tree. +- **Reach for `kernel-typescript-sdk` instead** for raw Playwright/CDP control over a Kernel browser without an LLM in the loop. + +## Prerequisites + +- A Kernel account and API key (`KERNEL_API_KEY`). See the [`kernel-cli`](https://www.kernel.sh/docs) skill for install + auth. +- At least one model-provider API key, matched to the model you pick (table in "Model selection" below). +- Node 20+ for both the CLI install and the library. 
+ +## Install + +### CLI + +```bash +# Global install — gives you the `cua` binary on $PATH +npm i -g @onkernel/cua-cli + +# Or zero-install one-shot +npx -y -p @onkernel/cua-cli cua --help +``` + +### Library + +```bash +npm i @onkernel/cua-agent @onkernel/cua-ai @onkernel/sdk +``` + +## Environment variables + +| Env | Used for | +| --- | --- | +| `KERNEL_API_KEY` | Kernel API key (always required) | +| `OPENAI_API_KEY` | OpenAI models (`-m openai:…`) | +| `ANTHROPIC_API_KEY` | Anthropic models (`-m anthropic:…`); `ANTHROPIC_OAUTH_TOKEN` also works | +| `GOOGLE_API_KEY` / `GEMINI_API_KEY` | Google / Gemini models (`-m google:…`) | +| `YUTORI_API_KEY` | Yutori Navigator (`-m yutori:…`) | +| `TZAFON_API_KEY` | Tzafon (`-m tzafon:…`) | +| `KERNEL_BASE_URL` | Override Kernel base URL | +| `XDG_DATA_HOME` | CLI sessions/transcripts dir (defaults to `~/.local/share`) | +| `CUA_IMAGE_PROTOCOL` | Force inline image protocol (`kitty` / `iterm2` / `none` / `auto`) | + +The library auto-loads these via `getCuaEnvApiKey` if you don't pass explicit auth callbacks. + +## CLI: one-shot subcommands + +Each call provisions a fresh Kernel browser by default, runs the action, prints a one-line result, and tears the browser down. Chain via `-s ` (next section) to keep state. + +| Subcommand | What it does | Stdout | Exit code | +| --- | --- | --- | --- | +| `cua open ` | Navigate to a URL. | `ok` | 0 ok, 2 error | +| `cua click ""` | Find element matching natural-language description and click it. | `ok clicked (x, y)` or `not_found ` | 0 ok, 1 not_found, 2 error | +| `cua type "" ""` | Focus a field by description and type. | `ok typed` or `not_found ` | 0 ok, 1 not_found, 2 error | +| `cua press [...]` | Send a key combo (`cua press ctrl l`, `cua press Return`). | `ok pressed` | 0 ok, 2 error | +| `cua url` | Print the current URL. | the URL | 0 ok, 2 error | +| `cua observe [""]` | Describe the page; optionally answer a question. 
| the description | 0 ok, 2 error | +| `cua screenshot --out ` | Save a PNG. `--out -` writes bytes to stdout. | the path or `(stdout)` | 0 ok, 2 error | +| `cua do ""` | Open-ended; agent plans and acts. Bound by `--max-steps` (default 3). | the assistant's final text | 0 ok, 2 error | + +Useful flags: + +- `-m ` — pick the LLM (default `openai:gpt-5.5`). `cua models` to list. +- `--max-steps ` — bound the loop on `cua do`. +- `--profile ` — load a Kernel browser profile for persisted cookies / storage. Existing ids or names are reused; a non-id name is created if missing. Pass `--profile-no-save-changes` for read-only. +- `-v` — verbose progress on stderr (provisioning, tool calls, transcript path). + +`click` and `type` match **semantically**, not by selector — use natural-language descriptions of what's visible on screen. + +## CLI: named sessions + +Without `-s`, each subcommand provisions a brand-new browser. To keep state (cookies, URL, scroll position) across calls, allocate a named session first: + +```bash +cua --profile github session start login # provisions a Kernel browser, prints `name=login` +cua -s login open https://github.com/login +cua -s login type "email field" "$EMAIL" +cua -s login type "password field" "$PASSWORD" +cua -s login click "Sign in" +cua -s login url # prints post-login URL +cua session stop login # tears down the Kernel browser +``` + +Inspect: + +```bash +cua session list # NAME / KERNEL_ID / AGE / LIVE_URL +cua session show login # full JSON metadata +``` + +Pass `--profile` when starting the named session; later `cua -s …` calls attach to the same browser, so they don't need the profile flag. + +**Liveness**: Kernel browsers time out from inactivity. If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. + +Named-session metadata lives in `$XDG_DATA_HOME/cua/named-sessions/.json`. 
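The stdout and exit-code contract from the one-shot subcommand table can be folded into a small typed wrapper when a script needs to branch on the outcome. The following is a minimal sketch, not part of `@onkernel/cua-cli`: `interpretCuaResult` and `parseClickPoint` are hypothetical helper names, and the exact coordinate format in `ok clicked (x, y)` (plain decimal numbers separated by a comma) is an assumption based on the table.

```typescript
// Hypothetical helpers (not shipped with the cua CLI): classify a finished
// `cua` subcommand from its exit code and one-line stdout.
type CuaOutcome =
  | { status: "ok"; detail: string }
  | { status: "not_found"; detail: string }
  | { status: "error"; detail: string };

function interpretCuaResult(exitCode: number, stdout: string): CuaOutcome {
  const line = stdout.trim();
  switch (exitCode) {
    case 0:
      // Documented as success; stdout carries the result line.
      return { status: "ok", detail: line };
    case 1:
      // Documented as not_found; drop the leading marker, keep the rest.
      return { status: "not_found", detail: line.replace(/^not_found\s*/, "") };
    default:
      // 2 is the documented error code; any other code is treated the same way.
      return { status: "error", detail: line };
  }
}

// Assumption: click results look like "ok clicked (412, 96)".
function parseClickPoint(stdout: string): { x: number; y: number } | null {
  const match = /^ok clicked \((-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)\)$/.exec(stdout.trim());
  return match ? { x: Number(match[1]), y: Number(match[2]) } : null;
}
```

With Node's `child_process.spawnSync`, the `status` and `stdout` fields of the returned object can be passed straight into `interpretCuaResult`, so a script can retry on `not_found` and abort on `error`.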
+ +## CLI: free-form mode + +```bash +cua --print "open hn and tell me the top story" # one-shot, streams text +cua --print -o jsonl "..." # one-shot, streams JSONL events +cua "..." # interactive TUI (real terminal) +``` + +`--print` exits when the agent finishes; the TUI runs until Ctrl+C. Add `--jsonl-include-deltas` for token deltas, `--jsonl-include-images` for base64 screenshots in `tool_result` events. + +## Library: quick start with `CuaAgentHarness` + +The harness is the recommended entry point. It owns the session, persists every turn, handles steering / follow-up, and can swap providers mid-conversation. + +```ts +import Kernel from "@onkernel/sdk"; +import { + CuaAgentHarness, + InMemorySessionRepo, + NodeExecutionEnv, +} from "@onkernel/cua-agent"; +import type { AssistantMessage } from "@onkernel/cua-ai"; + +const client = new Kernel({ apiKey: process.env.KERNEL_API_KEY! }); +const browser = await client.browsers.create({ stealth: true }); + +const repo = new InMemorySessionRepo(); +const session = await repo.create({ id: "research" }); + +const harness = new CuaAgentHarness({ + browser, + client, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + model: "openai:gpt-5.5", + session, +}); + +const textOf = (m: AssistantMessage) => + m.content.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("").trim(); + +const first = await harness.prompt("Open example.com and describe what you see."); +console.log(textOf(first)); + +// Swap providers mid-session — CUA tools and the default prompt refresh. +await harness.setModel("anthropic:claude-opus-4-7"); +const second = await harness.prompt("Open the most relevant link from what you found."); +console.log(textOf(second)); + +await client.browsers.deleteByID(browser.session_id); +``` + +While a turn is running: `steer()` injects course corrections, `followUp()` queues the next instruction, `subscribe()` streams underlying agent events, and `compact()` collapses long transcripts. 
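The `textOf` helper in the quick start above depends only on the shape of the message's content blocks. Here is a self-contained sketch of the same logic with stand-in types, assuming the real `AssistantMessage` from `@onkernel/cua-ai` carries at least `text` blocks; the `toolCall` block kind and the `SketchAssistantMessage` name are hypothetical and exist only to show that non-text blocks are skipped.

```typescript
// Stand-in types for illustration only; the real AssistantMessage type
// may carry more block kinds and more fields.
type SketchContentBlock =
  | { type: "text"; text: string }
  | { type: "toolCall"; name: string }; // hypothetical non-text block

interface SketchAssistantMessage {
  content: SketchContentBlock[];
}

// Same logic as the quick start's textOf: keep text blocks, drop the rest,
// concatenate in order, and trim surrounding whitespace.
const textOf = (m: SketchAssistantMessage): string =>
  m.content
    .flatMap((b) => (b.type === "text" ? [b.text] : []))
    .join("")
    .trim();

const sample: SketchAssistantMessage = {
  content: [
    { type: "text", text: " Example " },
    { type: "toolCall", name: "computer" },
    { type: "text", text: "Domain " },
  ],
};

console.log(textOf(sample)); // prints "Example Domain"
```

Text blocks are joined without a separator, so any spacing between them has to come from the blocks themselves.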
+ +### When to use `CuaAgent` instead + +Reach for `CuaAgent` (extends pi `Agent`) when you want raw control — direct `state.messages` access, custom streaming, explicit prompt/continue/queue, no session repo. The shape is the same except you assign `agent.state.model = …` instead of calling `setModel()`. + +```ts +import { CuaAgent } from "@onkernel/cua-agent"; + +const agent = new CuaAgent({ + browser, + client, + initialState: { + model: "openai:gpt-5.5", + systemPrompt: "You are a careful browser automation agent.", + }, +}); + +agent.subscribe((event) => { /* … */ }); +await agent.prompt("Open news.ycombinator.com and summarize the top story."); +``` + +### CLI vs library vs raw `CuaAgent` + +| You want to … | Use | +| --- | --- | +| Drive cua from shell scripts | CLI | +| Open-ended TUI session | CLI (`cua` no args) | +| Embed cua inside a TS app with session-backed turns | `CuaAgentHarness` | +| Add your own pi tools alongside computer use | `CuaAgentHarness` (`extraTools`) or `CuaAgent` | +| Raw pi `Agent` semantics: own message state, lifecycle events | `CuaAgent` | + +## Model selection + +Run `cua models` (or `listCuaModels()` from `@onkernel/cua-ai`) for the current catalog. At the time of writing, the four supported providers and their built-in computer-use vocabularies are: + +| Model ref | Provider | Notes | +| --- | --- | --- | +| `openai:gpt-5.5` | OpenAI | Built-in `computer` tool; default in CLI. | +| `anthropic:claude-opus-4-7` | Anthropic | Built-in `computer_20251124` tool. Supports `--thinking` levels. | +| `google:gemini-3-flash-preview` | Google | Predefined computer-use functions with 0–1000 normalized coords. | +| `yutori:n1.5-latest` | Yutori | OpenAI-compatible chat with browser action tool calls. | + +Switching models between calls or mid-session: + +- CLI: re-run with `-m `, or attach a `-s` named session with a different `-m` per call. +- Library (harness): `await harness.setModel("anthropic:claude-opus-4-7")` — CUA tools and the default system prompt refresh.
+- Library (agent): assign `agent.state.model = "anthropic:claude-opus-4-7"`. + +Not every provider's native vocabulary includes navigation. Pass `computerUseExtra: true` to add the provider-neutral `computer_use_extra` tool (`goto`, `back`, `forward`, `url`) when you need it on a model that lacks built-in navigation. + +## Browser config + +Provision the underlying Kernel browser to match the task before handing it to cua: + +```ts +const browser = await client.browsers.create({ + stealth: true, // bypass most fingerprinting; default off + headless: false, // headful => live view URL; smaller image when headless + timeout: 1800, // seconds before the Kernel browser auto-times-out + profile: { name: "github", save_changes: true }, // load + save persisted state + // proxy: { ... }, // optional outbound proxy +}); +``` + +The CLI exposes equivalents via `--profile`, `--profile-no-save-changes`, and the underlying Kernel CLI flags (the cua CLI itself doesn't surface a `--stealth` flag yet — when stealth matters, use the library or pre-create the browser via `kernel browsers create` and reuse the session). + +## Adding your own tools + +```ts +import { CuaAgentHarness } from "@onkernel/cua-agent"; +import { tool } from "@earendil-works/pi-agent-core"; + +const lookupOrder = tool({ + name: "lookup_order", + description: "Look up an order by id in our DB.", + schema: { /* … */ }, + handler: async ({ orderId }) => { + return await db.orders.findOne(orderId); + }, +}); + +const harness = new CuaAgentHarness({ + browser, client, + model: "openai:gpt-5.5", + session, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + extraTools: [lookupOrder], + computerUseExtra: true, +}); +``` + +Use `createCuaComputerTools()` directly if you want to compose the tool list yourself (e.g. 
wrap computer-use tools in a permission gate): + +```ts +import { resolveCuaRuntimeSpec } from "@onkernel/cua-ai"; +import { createCuaComputerTools } from "@onkernel/cua-agent"; + +const runtime = resolveCuaRuntimeSpec("openai:gpt-5.5"); +const tools = [ + ...createCuaComputerTools({ browser, client, toolExecutors: runtime.toolExecutors }), + lookupOrder, +]; +``` + +## Live view URL and manual login fallback + +cua's `--profile` (CLI) and `profile` (library) handle most login persistence, but stealth doesn't always beat bot detection. When automation gets stuck on a login, hand off to a human via the live view URL. + +### CLI + +```bash +cua --profile mysite session start login +cua session show login | jq -r .live_url # share this URL with the user +# user logs in manually in their browser via the live view +cua -s login url # confirm the post-login URL +cua session stop login # profile state saves on teardown +``` + +### Library + +Every Kernel browser response carries the live view URL on creation: + +```ts +const browser = await client.browsers.create({ stealth: true, headless: false }); +console.log("live view:", browser.browser_live_view_url); +// share that URL, wait for the user to finish manual login, then prompt the agent +``` + +If you only have a session id, fetch it: + +```bash +kernel browsers view +``` + +## Cross-origin iframes / Playwright escape hatch + +cua drives by clicking pixels, so cross-origin iframes (payment forms, embedded vendor widgets) work in the screenshot flow without special handling — the model just clicks them. When you need a deterministic Playwright action against the underlying browser (e.g. 
to fill a card form via a fixed selector), break out to Kernel's exec endpoint with the session id: + +```bash +# CLI: find the session id +cua session show login | jq -r .kernel_session_id + +# Run a Playwright snippet against the same browser +kernel browsers exec --code " + const frame = page.frameLocator('#payment-iframe'); + await frame.locator('#card-number').fill('4111111111111111'); + await frame.locator('#submit').click(); +" +``` + +From the library, you already have `browser.session_id` and the Kernel client, so call into the SDK directly. + +## Debugging + +- **CLI verbose**: `cua -v --print "…"` writes provisioning info, tool calls, and the transcript path to stderr. +- **Live event stream**: `cua --print -o jsonl "…"` emits one event per line (`tool_call`, `tool_result`, `assistant_text_done`, etc.). Add `--jsonl-include-images` to inline screenshots in `tool_result`. +- **Persisted transcript**: every `--print`, TUI, and `-s ` invocation appends to `$XDG_DATA_HOME/cua/sessions//.jsonl`. Exact path: + ```bash + cua -v --print "..." # stderr includes: [cua] session= + cua session show login | jq -r .transcript_path + ``` + Roles: `user`, `assistant`, `toolResult`. There's also a custom `cua-browser` entry written once per session with `kernel_session_id` / `live_url` / `profile_id`. +- **Library event subscription**: + ```ts + harness.subscribe((event) => { + // event.type === "tool_call" | "tool_result" | "assistant_text_done" | ... + }); + ``` +- **Screenshots**: `cua screenshot --out shot.png` (CLI) or inspect the `image` blocks in `toolResult` transcript entries. +- **Page URL**: `cua url` to confirm post-action navigation. `agent.state.messages` (library) holds the full message history. + +A couple of `jq` starters against a transcript path: + +```bash +# Every tool call the agent made, in order +jq -c 'select(.role == "assistant") | .content[]? 
+ | select(.type == "tool_use") | {name, input}' "$TRANSCRIPT" + +# Final assistant text (the answer) +jq -r 'select(.role == "assistant") | .content[]? + | select(.type == "text") | .text' "$TRANSCRIPT" | tail -1 +``` + +## Gotchas + +- **Element descriptions are semantic, not selectors.** `cua click "Sign in button"` looks at the screenshot — describe what the user sees, not a CSS selector. +- **Viewport defaults to 1920x1080.** Resize via `client.browsers.create({ ... })` flags if you need something else. +- **Keyboard navigation > mouse-wheel scroll.** `cua press Page_Down` / `Home` / arrow keys is more reliable than scroll wheel via the LLM. +- **Multi-step state requires `-s` (CLI) or a session-backed harness (library).** A second one-shot subcommand can't see what the first one did. +- **Profile saves on close, not continuously.** Tear down cleanly (`cua session stop`, `client.browsers.deleteByID`) or you'll lose recent state. +- **Provider tool vocab gaps.** If a model can click and type but can't navigate, set `computerUseExtra: true` (library) or pick a different model. +- **`--max-steps` defaults to 3 on `cua do`.** Bump it for non-trivial tasks. + +## Quick reference + +```bash +# CLI quickstart — one-shot, fresh browser +cua --print "open hn and tell me the top story" + +# CLI — named session for multi-step +cua --profile mysite session start work +cua -s work open https://example.com +cua -s work click "Log in" +cua -s work type "email field" "$EMAIL" +cua -s work click "Submit" +cua -s work url +cua session stop work + +# CLI — list models, switch model per call +cua models +cua --print -m anthropic:claude-opus-4-7 "..." 
+ +# Get the live view URL +cua session show work | jq -r .live_url +kernel browsers view # alternative + +# Library — minimal harness +import { CuaAgentHarness, InMemorySessionRepo, NodeExecutionEnv } from "@onkernel/cua-agent"; +const session = await new InMemorySessionRepo().create({ id: "main" }); +const harness = new CuaAgentHarness({ + browser, client, session, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + model: "openai:gpt-5.5", +}); +const result = await harness.prompt("Open example.com and click the first link."); +``` From bb8b3b3da0132e4e5dd97e8f83cf44ed4f442f57 Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Sun, 21 Jun 2026 14:58:48 +0000 Subject: [PATCH 2/9] cua skill: fix self-review findings - Browser config: CLI hardcodes stealth-on; library is the opt-out path. - Adding your own tools: drop unverified `tool()` helper, point at pi-agent-core's AgentTool shape instead. - Cross-origin section: tighten library escape-hatch sentence. - Quick reference: split the trailing TS example into its own ts fence. - Named-session relaunch tip: clarify "same profile as before". Co-Authored-By: Claude Opus 4.7 --- plugins/cua/skills/cua/SKILL.md | 34 ++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/plugins/cua/skills/cua/SKILL.md b/plugins/cua/skills/cua/SKILL.md index 3b3d4cb..f67481e 100644 --- a/plugins/cua/skills/cua/SKILL.md +++ b/plugins/cua/skills/cua/SKILL.md @@ -106,7 +106,7 @@ cua session show login # full JSON metadata Pass `--profile` when starting the named session; later `cua -s …` calls attach to the same browser, so they don't need the profile flag. -**Liveness**: Kernel browsers time out from inactivity. If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. +**Liveness**: Kernel browsers time out from inactivity. 
If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. Named-session metadata lives in `$XDG_DATA_HOME/cua/named-sessions/.json`. @@ -214,34 +214,31 @@ Not every provider's native vocabulary includes navigation. Pass `computerUseExt ## Browser config -Provision the underlying Kernel browser to match the task before handing it to cua: +The CLI always provisions stealth-on browsers and exposes profile persistence via `--profile` / `--profile-no-save-changes`. For any other browser knob — non-stealth, custom viewport, proxy, custom timeout — use the library and provision the browser yourself: ```ts const browser = await client.browsers.create({ - stealth: true, // bypass most fingerprinting; default off - headless: false, // headful => live view URL; smaller image when headless + stealth: true, // CLI hardcodes this on; flip to false only via the library + headless: false, // headful => live view URL; headless => no live view, smaller image timeout: 1800, // seconds before the Kernel browser auto-times-out profile: { name: "github", save_changes: true }, // load + save persisted state // proxy: { ... }, // optional outbound proxy }); ``` -The CLI exposes equivalents via `--profile`, `--profile-no-save-changes`, and the underlying Kernel CLI flags (the cua CLI itself doesn't surface a `--stealth` flag yet — when stealth matters, use the library or pre-create the browser via `kernel browsers create` and reuse the session). +If you need a custom-provisioned browser from the CLI, pre-create it with `kernel browsers create` and attach via `cua session …` — see the kernel-cli skill for the create flag reference. ## Adding your own tools +Pass any pi `AgentTool` (see [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the tool shape) via `extraTools`. The CUA defaults stay installed; your tools run alongside them. 
+ ```ts +import type { AgentTool } from "@onkernel/cua-agent"; import { CuaAgentHarness } from "@onkernel/cua-agent"; -import { tool } from "@earendil-works/pi-agent-core"; - -const lookupOrder = tool({ - name: "lookup_order", - description: "Look up an order by id in our DB.", - schema: { /* … */ }, - handler: async ({ orderId }) => { - return await db.orders.findOne(orderId); - }, -}); + +const lookupOrder: AgentTool = { + // shape per pi-agent-core docs: name, description, schema, run, ... +}; const harness = new CuaAgentHarness({ browser, client, @@ -312,7 +309,7 @@ kernel browsers exec --code " " ``` -From the library, you already have `browser.session_id` and the Kernel client, so call into the SDK directly. +From the library, you already have `browser.session_id` and the Kernel client — call the same exec endpoint via the SDK. ## Debugging @@ -377,9 +374,12 @@ cua --print -m anthropic:claude-opus-4-7 "..." # Get the live view URL cua session show work | jq -r .live_url kernel browsers view # alternative +``` -# Library — minimal harness +```ts +// Library — minimal harness import { CuaAgentHarness, InMemorySessionRepo, NodeExecutionEnv } from "@onkernel/cua-agent"; + const session = await new InMemorySessionRepo().create({ id: "main" }); const harness = new CuaAgentHarness({ browser, client, session, From b4f25bc0355376618d7ff324fa13c35c7db9ed9b Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Sun, 21 Jun 2026 15:48:39 +0000 Subject: [PATCH 3/9] split cua plugin into cua-cli + cua-agent skills Skills are loaded into coding-agent context, and the two audiences are distinct: agents driving the cua binary from shell vs humans asking Claude to write TS apps on @onkernel/cua-agent. Mirrors the repo's existing CLI-vs-SDK split (kernel-agent-browser vs kernel-typescript-sdk). - cua-cli: shell-callable subcommands, named sessions, profile persistence, live-view handoff, Playwright escape hatch, debugging. 
- cua-agent: CuaAgent / CuaAgentHarness quick start, browser provisioning, extraTools, setModel switching, SDK escape hatch, subscribe-based debugging. Plugin manifest description and the kernel/skills README list both. Co-Authored-By: Claude Opus 4.7 --- README.md | 5 +- plugins/cua/.claude-plugin/plugin.json | 2 +- plugins/cua/skills/cua-agent/SKILL.md | 293 +++++++++++++++++++ plugins/cua/skills/cua-cli/SKILL.md | 217 ++++++++++++++ plugins/cua/skills/cua/SKILL.md | 390 ------------------------- 5 files changed, 514 insertions(+), 393 deletions(-) create mode 100644 plugins/cua/skills/cua-agent/SKILL.md create mode 100644 plugins/cua/skills/cua-cli/SKILL.md delete mode 100644 plugins/cua/skills/cua/SKILL.md diff --git a/README.md b/README.md index 9860269..d2c203d 100644 --- a/README.md +++ b/README.md @@ -89,11 +89,12 @@ SDK skills for building browser automation with TypeScript and Python. ### cua -Computer-use loop for Kernel cloud browsers — CLI for shell-driven automation and the `@onkernel/cua-agent` TS library for embedding in your own agents. +Computer-use loop for Kernel cloud browsers — CLI for shell-driven automation and the `@onkernel/cua-agent` TS library for embedding in your own agents. One plugin, two skills (load whichever matches the task). 
| Skill | Description | |-------|-------------| -| **cua** | Drive Kernel cua via the `cua` CLI (one-shot subcommands, named sessions, TUI) or the `@onkernel/cua-agent` library (`CuaAgent` / `CuaAgentHarness`); covers model selection, profile persistence, transcripts, live-view handoff, and Playwright escape hatches | +| **cua-cli** | Drive a Kernel browser from shell via the `cua` binary: one-shot subcommands, named sessions, TUI, profile persistence, transcripts, live-view handoff | +| **cua-agent** | Build TypeScript apps that embed Kernel cua's loop with `CuaAgent` / `CuaAgentHarness`: provider switching, custom tools, session repos, event-stream debugging | ### generate-video diff --git a/plugins/cua/.claude-plugin/plugin.json b/plugins/cua/.claude-plugin/plugin.json index 273978e..436bab5 100644 --- a/plugins/cua/.claude-plugin/plugin.json +++ b/plugins/cua/.claude-plugin/plugin.json @@ -1,7 +1,7 @@ { "name": "cua", "version": "1.0.0", - "description": "Drive Kernel cua: the `cua` CLI for shell-driven computer-use automation, and the @onkernel/cua-agent TS library for building your own computer-use agents on Kernel browsers", + "description": "Kernel cua skills: `cua-cli` for shell-driven computer-use automation via the `cua` binary, and `cua-agent` for building TypeScript apps on the @onkernel/cua-agent library (CuaAgent / CuaAgentHarness)", "author": { "name": "Kernel", "url": "www.kernel.sh" diff --git a/plugins/cua/skills/cua-agent/SKILL.md b/plugins/cua/skills/cua-agent/SKILL.md new file mode 100644 index 0000000..322f908 --- /dev/null +++ b/plugins/cua/skills/cua-agent/SKILL.md @@ -0,0 +1,293 @@ +--- +name: cua-agent +description: Build TypeScript apps that embed Kernel's computer-use loop with `@onkernel/cua-agent` — `CuaAgent` and `CuaAgentHarness` classes drive a Kernel cloud browser via prompt → screenshot → tool-call loops across OpenAI, Anthropic, Google, and Yutori provider tools. 
Use when writing TS code that needs computer-use against a Kernel browser, swapping providers mid-session, adding your own pi tools alongside computer use, or hooking into the agent event stream. For shell-callable cua, see `cua-cli`. +--- + +# cua-agent + +`@onkernel/cua-agent` ships two TS classes for running a computer-use loop against a Kernel cloud browser: + +- **`CuaAgentHarness`** — recommended entry point. Session-backed turns, `setModel` mid-conversation, steering / follow-up, `subscribe()` event stream. Extends pi-agent-core's `AgentHarness`. +- **`CuaAgent`** — lower-level. Direct `state.messages` access, custom streaming, explicit prompt/continue/queue. Extends pi-agent-core's `Agent`. + +Both translate per-provider computer-use tool calls (OpenAI's `computer`, Anthropic's `computer_20251124`, Gemini's normalized-coordinate functions, Yutori Navigator's browser actions) into Kernel SDK `browsers.computer.*` calls and feed a fresh screenshot back to the model on every turn. + +## When to use this skill + +- **Use this skill** when writing TS code that embeds cua inside a larger app, needs a custom session repo, runs its own pi tools alongside computer use, or reacts to per-event streams programmatically. +- **Reach for [`cua-cli`](../cua-cli/SKILL.md)** when shell-callable computer-use is enough (`cua open`, `cua click`, `cua do`). +- **Reach for `kernel-typescript-sdk`** for raw Playwright / CDP control over a Kernel browser without an LLM in the loop. + +## Prerequisites + +- A Kernel account and `KERNEL_API_KEY`. +- At least one model-provider API key, matched to the model you pick (table below). +- Node 20+, TypeScript app or `tsx` runner. + +## Install + +```bash +npm i @onkernel/cua-agent @onkernel/cua-ai @onkernel/sdk +``` + +The three packages divide responsibility: + +- `@onkernel/cua-agent` — `CuaAgent` / `CuaAgentHarness` execution loop. 
+- `@onkernel/cua-ai` — model catalog (`getCuaModel` / `listCuaModels`), canonical CUA tool schemas, per-provider adapters. +- `@onkernel/sdk` — Kernel SDK client used to provision the browser. + +Both classes re-export the full pi-agent-core surface from `@onkernel/cua-agent`, including `NodeExecutionEnv` (via the `/node` subpath under the hood) and `InMemorySessionRepo`. Import them from `@onkernel/cua-agent` directly. + +## Environment variables + +If you don't pass explicit auth callbacks, both classes resolve provider keys via `@onkernel/cua-ai`'s `getCuaEnvApiKey`: + +| Env | Used for | +| --- | --- | +| `KERNEL_API_KEY` | Kernel API key (always required) | +| `OPENAI_API_KEY` | `openai:…` models | +| `ANTHROPIC_API_KEY` or `ANTHROPIC_OAUTH_TOKEN` | `anthropic:…` models | +| `GOOGLE_API_KEY` or `GEMINI_API_KEY` | `google:…` models | +| `YUTORI_API_KEY` | `yutori:…` models | +| `TZAFON_API_KEY` | `tzafon:…` models | + +## Quick start — `CuaAgentHarness` + +```ts +import Kernel from "@onkernel/sdk"; +import { + CuaAgentHarness, + InMemorySessionRepo, + NodeExecutionEnv, +} from "@onkernel/cua-agent"; +import type { AssistantMessage } from "@onkernel/cua-ai"; + +const client = new Kernel({ apiKey: process.env.KERNEL_API_KEY! }); +const browser = await client.browsers.create({ stealth: true }); + +const repo = new InMemorySessionRepo(); +const session = await repo.create({ id: "research" }); + +const harness = new CuaAgentHarness({ + browser, + client, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + model: "openai:gpt-5.5", + session, +}); + +const textOf = (m: AssistantMessage) => + m.content.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("").trim(); + +const first = await harness.prompt("Open example.com and describe what you see."); +console.log(textOf(first)); + +// Swap providers mid-session — CUA tools and the default prompt refresh. 
+await harness.setModel("anthropic:claude-opus-4-7"); +const second = await harness.prompt("Open the most relevant link from what you found."); +console.log(textOf(second)); + +await client.browsers.deleteByID(browser.session_id); +``` + +While a turn is running: `steer()` injects course corrections, `followUp()` queues the next instruction, `subscribe()` streams underlying agent events, and `compact()` collapses long transcripts. See [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the full harness lifecycle. + +## `CuaAgent` for raw pi `Agent` semantics + +Reach for `CuaAgent` when you want direct control — `state.messages` access, custom streaming, explicit prompt/continue/queue, no session repo. Same constructor shape except you assign `agent.state.model = …` instead of calling `setModel()`. + +```ts +import { CuaAgent } from "@onkernel/cua-agent"; + +const agent = new CuaAgent({ + browser, + client, + initialState: { + model: "openai:gpt-5.5", + systemPrompt: "You are a careful browser automation agent.", + }, +}); + +agent.subscribe((event) => { /* … */ }); +await agent.prompt("Open news.ycombinator.com and summarize the top story."); +``` + +### Harness vs Agent + +| You want to … | Use | +| --- | --- | +| Session-backed turns persisted to a repo | `CuaAgentHarness` | +| Steering, follow-up queue, compaction, branching | `CuaAgentHarness` | +| `await setModel()` mid-conversation | `CuaAgentHarness` | +| Direct `state.messages` access, no session machinery | `CuaAgent` | +| Custom streaming + explicit `prompt`/`continue`/`queue` control | `CuaAgent` | + +## Model selection and switching + +Run `listCuaModels()` from `@onkernel/cua-ai` for the current catalog. Pass either a CUA model ref (e.g. `"openai:gpt-5.5"`) or a concrete pi `Model`; the same `model` field accepts both.
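A model ref has the form `provider:model`, so a pre-flight check can map the provider prefix to the environment variables listed earlier in this skill and fail early when none is set. The following is a minimal sketch under that assumption: `providerEnvVars` and `missingProviderKey` are hypothetical helpers, not part of `@onkernel/cua-agent` or `@onkernel/cua-ai`, and the mapping simply mirrors the environment-variable table.

```ts
// Hypothetical pre-flight helper; not part of any @onkernel package.
// The mapping mirrors the "Environment variables" table in this skill.
const providerEnvVars = new Map<string, string[]>([
  ["openai", ["OPENAI_API_KEY"]],
  ["anthropic", ["ANTHROPIC_API_KEY", "ANTHROPIC_OAUTH_TOKEN"]],
  ["google", ["GOOGLE_API_KEY", "GEMINI_API_KEY"]],
  ["yutori", ["YUTORI_API_KEY"]],
  ["tzafon", ["TZAFON_API_KEY"]],
]);

// Returns null when a usable key is present; otherwise returns a message
// naming the variables that would satisfy the given model ref.
function missingProviderKey(
  modelRef: string,
  env: Record<string, string | undefined>,
): string | null {
  // Everything before the first ":" is treated as the provider prefix.
  const provider = modelRef.split(":")[0] ?? "";
  const candidates = providerEnvVars.get(provider);
  if (!candidates) return `unknown provider prefix in "${modelRef}"`;
  const found = candidates.some((name) => Boolean(env[name]));
  return found ? null : `set one of: ${candidates.join(", ")}`;
}
```

In a Node app this could be called as `missingProviderKey("openai:gpt-5.5", process.env)` before constructing the harness. Taking the environment as a parameter keeps the check easy to test.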
+ +| Model ref | Provider | Notes | +| --- | --- | --- | +| `openai:gpt-5.5` | OpenAI | Built-in `computer` tool | +| `anthropic:claude-opus-4-7` | Anthropic | Built-in `computer_20251124` tool | +| `google:gemini-3-flash-preview` | Google | Predefined CU functions, 0–1000 normalized coords | +| `yutori:n1.5-latest` | Yutori | OpenAI-compatible chat with browser action tool calls | + +Switching: + +```ts +// Harness — async, updates via pi snapshot machinery +await harness.setModel("anthropic:claude-opus-4-7"); + +// Agent — direct assignment +agent.state.model = "anthropic:claude-opus-4-7"; +``` + +In both cases CUA-owned tools and the default system prompt refresh for the next provider request. + +Not every provider's native vocabulary includes navigation (`goto`, `back`, `forward`, `url`). Pass `computerUseExtra: true` to add the provider-neutral `computer_use_extra` tool when the model can click/type but can't navigate. + +## Browser provisioning + +You own the Kernel browser lifecycle — provision before constructing the agent, tear down after: + +```ts +const browser = await client.browsers.create({ + stealth: true, // bypass most fingerprinting; default off + headless: false, // headful => live view URL; headless => no live view, smaller image + timeout: 1800, // seconds before Kernel auto-times-out the browser + profile: { name: "github", save_changes: true }, + // proxy: { ... }, +}); + +try { + // ... use browser with harness/agent ... +} finally { + await client.browsers.deleteByID(browser.session_id); +} +``` + +The `browser.browser_live_view_url` field on the create response is the URL to share when you need a human to take over (manual login on a stealth-blocked site, captcha, etc.). + +## Adding your own tools + +Pass any pi `AgentTool` (see [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the tool shape) via `extraTools`. The CUA defaults stay installed; your tools run alongside them. 
+ +```ts +import type { AgentTool } from "@onkernel/cua-agent"; +import { CuaAgentHarness } from "@onkernel/cua-agent"; + +const lookupOrder: AgentTool = { + // shape per pi-agent-core docs: name, description, schema, run, ... +}; + +const harness = new CuaAgentHarness({ + browser, client, + model: "openai:gpt-5.5", + session, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + extraTools: [lookupOrder], + computerUseExtra: true, +}); +``` + +If you want to compose the tool list yourself (e.g. wrap computer-use tools in a permission gate), reach for `createCuaComputerTools()`: + +```ts +import { resolveCuaRuntimeSpec } from "@onkernel/cua-ai"; +import { createCuaComputerTools } from "@onkernel/cua-agent"; + +const runtime = resolveCuaRuntimeSpec("openai:gpt-5.5"); +const tools = [ + ...createCuaComputerTools({ browser, client, toolExecutors: runtime.toolExecutors }), + lookupOrder, +]; +``` + +## Manual login handoff via live view URL + +Every Kernel browser response carries the live view URL on creation. When stealth doesn't beat bot detection, share that URL and wait for the human: + +```ts +const browser = await client.browsers.create({ + stealth: true, + headless: false, + profile: { name: "mysite", save_changes: true }, +}); +console.log("share with user:", browser.browser_live_view_url); + +// wait for user signal — e.g. a button, stdin, an HTTP callback — +// THEN start prompting the agent against the logged-in browser +await harness.prompt("Now click 'Settings' and read me the current value of X."); +``` + +Profile saves on browser teardown, so future runs with the same profile name skip the manual login. + +## Cross-origin iframes / Playwright escape hatch + +cua drives by clicking pixels, so cross-origin iframes work in the screenshot flow without special handling. When you need a deterministic Playwright action against the underlying browser (e.g. 
fill a card form via a fixed selector), drop to the Kernel SDK's exec endpoint with the session id you already have: + +```ts +await client.browsers.exec(browser.session_id, { + code: ` + const frame = page.frameLocator('#payment-iframe'); + await frame.locator('#card-number').fill('4111111111111111'); + await frame.locator('#submit').click(); + `, +}); +``` + +## Debugging + +- **`subscribe()`** — the harness and agent both stream pi-agent-core events. Use it to log tool calls, screenshot sizes, tokens: + ```ts + harness.subscribe((event) => { + if (event.type === "tool_call") console.log("tool:", event.toolName); + if (event.type === "assistant_text_done") console.log("text:", event.text); + }); + ``` +- **`agent.state.messages`** — full message history including image blocks (for `CuaAgent`). Inspect after a turn finishes. +- **Live view URL** — `browser.browser_live_view_url` lets you watch the agent work in real time (headful browsers only; headless browsers have no live view). +- **Custom session repo** — implement pi-agent-core's `SessionRepo` interface to persist transcripts wherever you want (JSONL on disk, S3, a DB). + +## Gotchas + +- **You own the browser lifecycle.** Always tear down with `client.browsers.deleteByID(browser.session_id)` in a `finally` block — Kernel timeouts will reclaim the browser eventually, but profile state saves on close, not continuously. +- **`setModel` is async.** It propagates through pi's snapshot machinery — `await` it before the next `prompt()`. +- **Provider tool vocab gaps.** If a model can click and type but can't navigate, set `computerUseExtra: true` to add provider-neutral `goto` / `back` / `forward` / `url`. +- **`InMemorySessionRepo` is in-process only.** Reach for a persistent `SessionRepo` implementation if you need transcripts to survive restarts. +- **`extraTools` runs alongside CUA tools, not in place of them.** To replace the defaults, build the tool list with `createCuaComputerTools()` yourself.
+- **Stealth, headless, viewport, proxy** are all `browsers.create` flags — set them when provisioning, not on the harness. + +## Quick reference + +```ts +import Kernel from "@onkernel/sdk"; +import { + CuaAgentHarness, + InMemorySessionRepo, + NodeExecutionEnv, +} from "@onkernel/cua-agent"; + +const client = new Kernel({ apiKey: process.env.KERNEL_API_KEY! }); +const browser = await client.browsers.create({ stealth: true }); + +const session = await new InMemorySessionRepo().create({ id: "main" }); + +const harness = new CuaAgentHarness({ + browser, client, session, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + model: "openai:gpt-5.5", + computerUseExtra: true, +}); + +harness.subscribe((event) => { /* ... */ }); + +try { + const first = await harness.prompt("Open example.com and click the first link."); + await harness.setModel("anthropic:claude-opus-4-7"); + const second = await harness.prompt("Now extract the page title."); +} finally { + await client.browsers.deleteByID(browser.session_id); +} +``` diff --git a/plugins/cua/skills/cua-cli/SKILL.md b/plugins/cua/skills/cua-cli/SKILL.md new file mode 100644 index 0000000..527b826 --- /dev/null +++ b/plugins/cua/skills/cua-cli/SKILL.md @@ -0,0 +1,217 @@ +--- +name: cua-cli +description: Drive a Kernel cloud browser from the shell using the `cua` CLI. Use this skill when you need to open URLs, click elements, type into fields, take screenshots, or chain multi-step browser tasks across shell calls. Supports named sessions for stateful workflows, profile persistence for logins, transcript-based debugging, and live-view handoff when stealth fails. For building your own TS agent on top of cua, see `cua-agent`. +--- + +# cua-cli + +`cua` is a single-binary CLI that drives a real Chrome session running in Kernel. It's designed for agentic use: each subcommand returns a one-line result on stdout and a deterministic exit code, so you can chain calls together and parse the output. 
An LLM picks targets semantically from screenshots — there are no CSS selectors. + +## When to use this skill + +- **Use this skill** when you need shell-callable computer-use steps (`cua open`, `cua click`, `cua do …`), an interactive TUI, or want to chain browser actions in a shell pipeline. +- **Reach for [`cua-agent`](../cua-agent/SKILL.md)** when you're writing a TypeScript app that needs to embed cua's prompt → screenshot → tool-call loop programmatically. +- **Reach for `kernel-agent-browser`** when you need deterministic browser scripting (semantic selectors, `find role`, `wait --text`, accessibility-tree snapshots). +- **Reach for `kernel-cli`** for raw Kernel browser management (`kernel browsers create`, `kernel browsers exec`, profile / proxy CRUD). + +## Prerequisites + +- A Kernel account and `KERNEL_API_KEY`. See `kernel-cli` for install + auth. +- At least one model-provider API key, matched to the model you pick (table below). +- Node 20+ for the npm install. + +## Install + +```bash +# Global install — puts `cua` on $PATH +npm i -g @onkernel/cua-cli + +# Or zero-install one-shot +npx -y -p @onkernel/cua-cli cua --help +``` + +## Environment variables + +| Env | Used for | +| --- | --- | +| `KERNEL_API_KEY` | Kernel API key (always required) | +| `OPENAI_API_KEY` | OpenAI models (`-m openai:…`) | +| `ANTHROPIC_API_KEY` | Anthropic models (`-m anthropic:…`) | +| `GOOGLE_API_KEY` / `GEMINI_API_KEY` | Google / Gemini models (`-m google:…`) | +| `YUTORI_API_KEY` | Yutori Navigator (`-m yutori:…`) | +| `TZAFON_API_KEY` | Tzafon (`-m tzafon:…`) | +| `KERNEL_BASE_URL` | Override Kernel base URL | +| `XDG_DATA_HOME` | Sessions / transcripts dir (defaults to `~/.local/share`) | +| `CUA_IMAGE_PROTOCOL` | Force inline image protocol (`kitty` / `iterm2` / `none` / `auto`) | + +## One-shot subcommands + +Each call provisions a fresh Kernel browser by default, runs the action, prints a one-line result, and tears the browser down. 
Chain via `-s ` (next section) to keep state. + +| Subcommand | What it does | Stdout | Exit code | +| --- | --- | --- | --- | +| `cua open ` | Navigate to a URL. | `ok` | 0 ok, 2 error | +| `cua click ""` | Find element matching natural-language description and click it. | `ok clicked (x, y)` or `not_found ` | 0 ok, 1 not_found, 2 error | +| `cua type "" ""` | Focus a field by description and type. | `ok typed` or `not_found ` | 0 ok, 1 not_found, 2 error | +| `cua press [...]` | Send a key combo (`cua press ctrl l`, `cua press Return`). | `ok pressed` | 0 ok, 2 error | +| `cua url` | Print the current URL. | the URL | 0 ok, 2 error | +| `cua observe [""]` | Describe the page; optionally answer a question. | the description | 0 ok, 2 error | +| `cua screenshot --out ` | Save a PNG. `--out -` writes bytes to stdout. | the path or `(stdout)` | 0 ok, 2 error | +| `cua do ""` | Open-ended; agent plans and acts. Bound by `--max-steps` (default 3). | the assistant's final text | 0 ok, 2 error | + +Useful flags: + +- `-m ` — pick the LLM (default `openai:gpt-5.5`). `cua models` to list. +- `--max-steps ` — bound the loop on `cua do`. +- `--profile ` — load a Kernel browser profile for persisted cookies / storage. Existing ids or names are reused; a non-id name is created if missing. Pass `--profile-no-save-changes` for read-only. +- `-v` — verbose progress on stderr (provisioning, tool calls, transcript path). + +`click` and `type` match **semantically**, not by selector — use natural-language descriptions of what's visible on screen. + +The cua CLI always provisions **stealth-on** browsers. If you need non-stealth or a custom viewport / proxy, pre-create the browser via `kernel browsers create` and attach the cua session to it. + +## Named sessions + +Without `-s`, each subcommand provisions a brand-new browser. 
To keep state across calls, allocate a named session first: + +```bash +cua --profile github session start login # provisions a Kernel browser, prints `name=login` +cua -s login open https://github.com/login +cua -s login type "email field" "$EMAIL" +cua -s login type "password field" "$PASSWORD" +cua -s login click "Sign in" +cua -s login url # prints post-login URL +cua session stop login # tears down the Kernel browser +``` + +Inspect: + +```bash +cua session list # NAME / KERNEL_ID / AGE / LIVE_URL +cua session show login # full JSON metadata +``` + +Pass `--profile` when starting the named session; later `cua -s …` calls attach to the same browser, so they don't need the profile flag again. + +**Liveness**: Kernel browsers time out from inactivity. If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. + +Named-session metadata lives in `$XDG_DATA_HOME/cua/named-sessions/.json`. + +## Free-form mode + +```bash +cua --print "open hn and tell me the top story" # one-shot, streams text +cua --print -o jsonl "..." # one-shot, streams JSONL events +cua "..." # interactive TUI (real terminal) +``` + +`--print` exits when the agent finishes; the TUI runs until Ctrl+C. Add `--jsonl-include-deltas` for token deltas, `--jsonl-include-images` for base64 screenshots in `tool_result` events. + +## Model selection + +Run `cua models` for the current catalog. Pick with `-m ` (default `openai:gpt-5.5`). Switch per call or per named session. + +| Model ref | Provider | +| --- | --- | +| `openai:gpt-5.5` | OpenAI (default) | +| `anthropic:claude-opus-4-7` | Anthropic (supports `--thinking off\|minimal\|low\|medium\|high\|xhigh`) | +| `google:gemini-3-flash-preview` | Google / Gemini | +| `yutori:n1.5-latest` | Yutori Navigator | + +Not every provider's native vocabulary includes navigation. 
If a model can click and type but can't navigate (`goto`, `back`, `forward`, `url`), pick a different model. + +## Live view URL and manual login fallback + +Stealth-on doesn't always beat bot detection. When automation gets stuck on a login, hand off to a human via the live view URL. + +```bash +cua --profile mysite session start login +cua session show login | jq -r .live_url # share this URL with the user +# user logs in manually in the live view +cua -s login url # confirm post-login URL +cua session stop login # profile state saves on teardown +``` + +If you only have a session id (e.g. from `cua session list`), the `kernel` CLI also surfaces it: + +```bash +kernel browsers view +``` + +## Cross-origin iframes / Playwright escape hatch + +cua drives by clicking pixels, so cross-origin iframes (payment forms, embedded vendor widgets) work in the screenshot flow without special handling — the model just clicks them. When you need a deterministic Playwright action against the underlying browser (e.g. fill a card form via a fixed selector), break out to Kernel's exec endpoint with the session id: + +```bash +# Find the session id +cua session show login | jq -r .kernel_session_id + +# Run a Playwright snippet against the same browser +kernel browsers exec --code " + const frame = page.frameLocator('#payment-iframe'); + await frame.locator('#card-number').fill('4111111111111111'); + await frame.locator('#submit').click(); +" +``` + +## Debugging + +- **Verbose stderr**: `cua -v --print "…"` writes provisioning info, tool calls, and the transcript path to stderr. +- **Live event stream**: `cua --print -o jsonl "…"` emits one event per line (`tool_call`, `tool_result`, `assistant_text_done`, etc.). Add `--jsonl-include-images` to inline screenshots in `tool_result`. +- **Persisted transcript**: every `--print`, TUI, and `-s ` invocation appends to `$XDG_DATA_HOME/cua/sessions//.jsonl`. Find the exact path: + ```bash + cua -v --print "..." 
# stderr includes: [cua] session= + cua session show login | jq -r .transcript_path + ``` + Roles: `user`, `assistant`, `toolResult`. There's also a custom `cua-browser` entry written once per session with `kernel_session_id` / `live_url` / `profile_id`. +- **Screenshots**: `cua screenshot --out shot.png` or inspect `image` blocks in `toolResult` transcript entries. +- **Page URL**: `cua url` to confirm post-action navigation. + +A few `jq` starters against a transcript path: + +```bash +# Every tool call the agent made, in order +jq -c 'select(.role == "assistant") | .content[]? + | select(.type == "tool_use") | {name, input}' "$TRANSCRIPT" + +# Final assistant text (the answer) +jq -r 'select(.role == "assistant") | .content[]? + | select(.type == "text") | .text' "$TRANSCRIPT" | tail -1 +``` + +## Gotchas + +- **Element descriptions are semantic, not selectors.** `cua click "Sign in button"` looks at the screenshot — describe what the user sees, not a CSS selector. +- **Viewport defaults to 1920x1080.** Pre-create the browser with `kernel browsers create` if you need something else. +- **Keyboard navigation > mouse-wheel scroll.** `cua press Page_Down` / `Home` / arrow keys is more reliable than scroll wheel via the LLM. +- **Multi-step state requires `-s `.** A second one-shot subcommand can't see what the first one did. +- **Profile saves on close, not continuously.** Tear down cleanly with `cua session stop` or you'll lose recent state. +- **`--max-steps` defaults to 3 on `cua do`.** Bump it for non-trivial tasks. 
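Because each one-shot subcommand reports its result through a one-line stdout and an exit code (0 ok, 1 not_found, 2 error, per the subcommand table), a calling script can classify results with a small pure function. The sketch below is illustrative only: `interpretCuaResult` and `clickedAt` are hypothetical helpers, and the exact `ok clicked (x, y)` format with integer coordinates is an assumption based on the table, not a documented contract.

```ts
// Hypothetical helpers for scripts that shell out to `cua`.
// Exit codes follow the one-shot subcommand table: 0 ok, 1 not_found, 2 error.
type CuaOutcome = {
  status: "ok" | "not_found" | "error";
  detail: string; // trimmed stdout line
};

function interpretCuaResult(exitCode: number, stdout: string): CuaOutcome {
  const detail = stdout.trim();
  if (exitCode === 0) return { status: "ok", detail };
  if (exitCode === 1) return { status: "not_found", detail };
  // Exit code 2 and any undocumented code are both treated as errors.
  return { status: "error", detail };
}

// Extracts coordinates from a line shaped like "ok clicked (412, 96)".
// Returns null when stdout does not match that assumed shape.
function clickedAt(stdout: string): { x: number; y: number } | null {
  const match = /^ok clicked \((\d+),\s*(\d+)\)/.exec(stdout.trim());
  return match ? { x: Number(match[1]), y: Number(match[2]) } : null;
}
```

One way to feed these from Node is `child_process.spawnSync`, passing its `status` and `stdout` fields. Keeping the parsing separate from process spawning lets the branching logic be tested without a Kernel browser.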
+ +## Quick reference + +```bash +# One-shot, fresh browser +cua --print "open hn and tell me the top story" + +# Named session for multi-step +cua --profile mysite session start work +cua -s work open https://example.com +cua -s work click "Log in" +cua -s work type "email field" "$EMAIL" +cua -s work click "Submit" +cua -s work url +cua session stop work + +# List models, switch model per call +cua models +cua --print -m anthropic:claude-opus-4-7 "..." + +# Get the live view URL +cua session show work | jq -r .live_url +kernel browsers view # alternative + +# Drop to Playwright for deterministic actions +cua session show work | jq -r .kernel_session_id +kernel browsers exec --code "..." +``` diff --git a/plugins/cua/skills/cua/SKILL.md b/plugins/cua/skills/cua/SKILL.md deleted file mode 100644 index f67481e..0000000 --- a/plugins/cua/skills/cua/SKILL.md +++ /dev/null @@ -1,390 +0,0 @@ ---- -name: cua -description: Drive Kernel cua — the `cua` CLI for shell automation, or the @onkernel/cua-agent TypeScript library for building your own computer-use agents. Use when opening URLs, clicking/typing/observing in a real cloud browser via cua, chaining multi-step browser tasks across shell calls, or wiring up `CuaAgent` / `CuaAgentHarness` against a Kernel browser. Covers model selection (gpt-5.5, claude-opus-4-7, gemini-3-flash-preview, n1.5-latest), named sessions, profile persistence, transcripts, live-view handoff, and Playwright escape hatches. ---- - -# cua - -`cua` is a computer-use loop for Kernel cloud browsers. There are two surfaces, both backed by the same execution layer: - -- **`cua` CLI** (`@onkernel/cua-cli`) — single binary that drives a real Chrome session running in Kernel. Each subcommand returns a one-line result on stdout and a deterministic exit code, so shell agents can chain calls. 
-- **`@onkernel/cua-agent` library** — `CuaAgent` / `CuaAgentHarness` TypeScript classes that run the same prompt → screenshot → tool-call loop against a Kernel browser, callable from your own code. - -Both translate per-provider computer-use tool calls (OpenAI's `computer`, Anthropic's `computer_20251124`, Gemini's normalized-coordinate functions, Yutori Navigator's browser actions) into Kernel SDK `browsers.computer.*` calls and feed a fresh screenshot back to the model on every turn. - -## When to use this skill - -- **Use the CLI** when you need shell-callable computer-use steps (`cua open`, `cua click`, `cua do …`) or an interactive TUI. Best for ad-hoc agent tasks, shell pipelines, and one-shot prompts. -- **Use the library** when you need to embed cua inside a larger TS app, run a custom session repo, add your own pi tools alongside computer use, or react to per-event streams programmatically. -- **Reach for `kernel-agent-browser` instead** when you need deterministic browser scripting (semantic selectors, `find role`, `wait --text`, snapshots/refs). cua drives by screenshots; agent-browser drives by accessibility tree. -- **Reach for `kernel-typescript-sdk` instead** for raw Playwright/CDP control over a Kernel browser without an LLM in the loop. - -## Prerequisites - -- A Kernel account and API key (`KERNEL_API_KEY`). See the [`kernel-cli`](https://www.kernel.sh/docs) skill for install + auth. -- At least one model-provider API key, matched to the model you pick (table in "Model selection" below). -- Node 20+ for both the CLI install and the library. 
- -## Install - -### CLI - -```bash -# Global install — gives you the `cua` binary on $PATH -npm i -g @onkernel/cua-cli - -# Or zero-install one-shot -npx -y -p @onkernel/cua-cli cua --help -``` - -### Library - -```bash -npm i @onkernel/cua-agent @onkernel/cua-ai @onkernel/sdk -``` - -## Environment variables - -| Env | Used for | -| --- | --- | -| `KERNEL_API_KEY` | Kernel API key (always required) | -| `OPENAI_API_KEY` | OpenAI models (`-m openai:…`) | -| `ANTHROPIC_API_KEY` | Anthropic models (`-m anthropic:…`); `ANTHROPIC_OAUTH_TOKEN` also works | -| `GOOGLE_API_KEY` / `GEMINI_API_KEY` | Google / Gemini models (`-m google:…`) | -| `YUTORI_API_KEY` | Yutori Navigator (`-m yutori:…`) | -| `TZAFON_API_KEY` | Tzafon (`-m tzafon:…`) | -| `KERNEL_BASE_URL` | Override Kernel base URL | -| `XDG_DATA_HOME` | CLI sessions/transcripts dir (defaults to `~/.local/share`) | -| `CUA_IMAGE_PROTOCOL` | Force inline image protocol (`kitty` / `iterm2` / `none` / `auto`) | - -The library auto-loads these via `getCuaEnvApiKey` if you don't pass explicit auth callbacks. - -## CLI: one-shot subcommands - -Each call provisions a fresh Kernel browser by default, runs the action, prints a one-line result, and tears the browser down. Chain via `-s ` (next section) to keep state. - -| Subcommand | What it does | Stdout | Exit code | -| --- | --- | --- | --- | -| `cua open ` | Navigate to a URL. | `ok` | 0 ok, 2 error | -| `cua click ""` | Find element matching natural-language description and click it. | `ok clicked (x, y)` or `not_found ` | 0 ok, 1 not_found, 2 error | -| `cua type "" ""` | Focus a field by description and type. | `ok typed` or `not_found ` | 0 ok, 1 not_found, 2 error | -| `cua press [...]` | Send a key combo (`cua press ctrl l`, `cua press Return`). | `ok pressed` | 0 ok, 2 error | -| `cua url` | Print the current URL. | the URL | 0 ok, 2 error | -| `cua observe [""]` | Describe the page; optionally answer a question. 
| the description | 0 ok, 2 error | -| `cua screenshot --out ` | Save a PNG. `--out -` writes bytes to stdout. | the path or `(stdout)` | 0 ok, 2 error | -| `cua do ""` | Open-ended; agent plans and acts. Bound by `--max-steps` (default 3). | the assistant's final text | 0 ok, 2 error | - -Useful flags: - -- `-m ` — pick the LLM (default `openai:gpt-5.5`). `cua models` to list. -- `--max-steps ` — bound the loop on `cua do`. -- `--profile ` — load a Kernel browser profile for persisted cookies / storage. Existing ids or names are reused; a non-id name is created if missing. Pass `--profile-no-save-changes` for read-only. -- `-v` — verbose progress on stderr (provisioning, tool calls, transcript path). - -`click` and `type` match **semantically**, not by selector — use natural-language descriptions of what's visible on screen. - -## CLI: named sessions - -Without `-s`, each subcommand provisions a brand-new browser. To keep state (cookies, URL, scroll position) across calls, allocate a named session first: - -```bash -cua --profile github session start login # provisions a Kernel browser, prints `name=login` -cua -s login open https://github.com/login -cua -s login type "email field" "$EMAIL" -cua -s login type "password field" "$PASSWORD" -cua -s login click "Sign in" -cua -s login url # prints post-login URL -cua session stop login # tears down the Kernel browser -``` - -Inspect: - -```bash -cua session list # NAME / KERNEL_ID / AGE / LIVE_URL -cua session show login # full JSON metadata -``` - -Pass `--profile` when starting the named session; later `cua -s …` calls attach to the same browser, so they don't need the profile flag. - -**Liveness**: Kernel browsers time out from inactivity. If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. - -Named-session metadata lives in `$XDG_DATA_HOME/cua/named-sessions/.json`. 
- -## CLI: free-form mode - -```bash -cua --print "open hn and tell me the top story" # one-shot, streams text -cua --print -o jsonl "..." # one-shot, streams JSONL events -cua "..." # interactive TUI (real terminal) -``` - -`--print` exits when the agent finishes; the TUI runs until Ctrl+C. Add `--jsonl-include-deltas` for token deltas, `--jsonl-include-images` for base64 screenshots in `tool_result` events. - -## Library: quick start with `CuaAgentHarness` - -The harness is the recommended entry point. It owns the session, persists every turn, handles steering / follow-up, and can swap providers mid-conversation. - -```ts -import Kernel from "@onkernel/sdk"; -import { - CuaAgentHarness, - InMemorySessionRepo, - NodeExecutionEnv, -} from "@onkernel/cua-agent"; -import type { AssistantMessage } from "@onkernel/cua-ai"; - -const client = new Kernel({ apiKey: process.env.KERNEL_API_KEY! }); -const browser = await client.browsers.create({ stealth: true }); - -const repo = new InMemorySessionRepo(); -const session = await repo.create({ id: "research" }); - -const harness = new CuaAgentHarness({ - browser, - client, - env: new NodeExecutionEnv({ cwd: process.cwd() }), - model: "openai:gpt-5.5", - session, -}); - -const textOf = (m: AssistantMessage) => - m.content.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("").trim(); - -const first = await harness.prompt("Open example.com and describe what you see."); -console.log(textOf(first)); - -// Swap providers mid-session — CUA tools and the default prompt refresh. -await harness.setModel("anthropic:claude-opus-4-7"); -const second = await harness.prompt("Open the most relevant link from what you found."); -console.log(textOf(second)); - -await client.browsers.deleteByID(browser.session_id); -``` - -While a turn is running: `steer()` injects course corrections, `followUp()` queues the next instruction, `subscribe()` streams underlying agent events, and `compact()` collapses long transcripts. 
- -### When to use `CuaAgent` instead - -Reach for `CuaAgent` (extends pi `Agent`) when you want raw control — direct `state.messages` access, custom streaming, explicit prompt/continue/queue, no session repo. The shape is the same except you assign `agent.state.model = …` instead of calling `setModel()`. - -```ts -import { CuaAgent } from "@onkernel/cua-agent"; - -const agent = new CuaAgent({ - browser, - client, - initialState: { - model: "openai:gpt-5.5", - systemPrompt: "You are a careful browser automation agent.", - }, -}); - -agent.subscribe((event) => { /* … */ }); -await agent.prompt("Open news.ycombinator.com and summarize the top story."); -``` - -### CLI vs library vs raw `CuaAgent` - -| You want to … | Use | -| --- | --- | -| Drive cua from shell scripts | CLI | -| Open-ended TUI session | CLI (`cua` no args) | -| Embed cua inside a TS app with session-backed turns | `CuaAgentHarness` | -| Add your own pi tools alongside computer use | `CuaAgentHarness` (`extraTools`) or `CuaAgent` | -| Raw pi `Agent` semantics: own message state, lifecycle events | `CuaAgent` | - -## Model selection - -Run `cua models` (or `listCuaModels()` from `@onkernel/cua-ai`) for the current catalog. As of writing, the four supported providers and their built-in computer-use vocabularies: - -| Model ref | Provider | Notes | -| --- | --- | --- | -| `openai:gpt-5.5` | OpenAI | Built-in `computer` tool; default in CLI. | -| `anthropic:claude-opus-4-7` | Anthropic | Built-in `computer_20251124` tool. Supports `--thinking` levels. | -| `google:gemini-3-flash-preview` | Google | Predefined computer-use functions with 0–1000 normalized coords. | -| `yutori:n1.5-latest` | Yutori | OpenAI-compatible chat with browser action tool calls. | - -Switching models mid-turn: - -- CLI: re-run with `-m `, or attach a `-s` named session with a different `-m` per call. -- Library (harness): `await harness.setModel("anthropic:claude-opus-4-7")` — CUA tools and the default system prompt refresh. 
-- Library (agent): assign `agent.state.model = "anthropic:claude-opus-4-7"`. - -Not every provider's native vocabulary includes navigation. Pass `computerUseExtra: true` to add the provider-neutral `computer_use_extra` tool (`goto`, `back`, `forward`, `url`) when you need it on a model that lacks built-in navigation. - -## Browser config - -The CLI always provisions stealth-on browsers and exposes profile persistence via `--profile` / `--profile-no-save-changes`. For any other browser knob — non-stealth, custom viewport, proxy, custom timeout — use the library and provision the browser yourself: - -```ts -const browser = await client.browsers.create({ - stealth: true, // CLI hardcodes this on; flip to false only via the library - headless: false, // headful => live view URL; headless => no live view, smaller image - timeout: 1800, // seconds before the Kernel browser auto-times-out - profile: { name: "github", save_changes: true }, // load + save persisted state - // proxy: { ... }, // optional outbound proxy -}); -``` - -If you need a custom-provisioned browser from the CLI, pre-create it with `kernel browsers create` and attach via `cua session …` — see the kernel-cli skill for the create flag reference. - -## Adding your own tools - -Pass any pi `AgentTool` (see [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the tool shape) via `extraTools`. The CUA defaults stay installed; your tools run alongside them. - -```ts -import type { AgentTool } from "@onkernel/cua-agent"; -import { CuaAgentHarness } from "@onkernel/cua-agent"; - -const lookupOrder: AgentTool = { - // shape per pi-agent-core docs: name, description, schema, run, ... 
-}; - -const harness = new CuaAgentHarness({ - browser, client, - model: "openai:gpt-5.5", - session, - env: new NodeExecutionEnv({ cwd: process.cwd() }), - extraTools: [lookupOrder], - computerUseExtra: true, -}); -``` - -Use `createCuaComputerTools()` directly if you want to compose the tool list yourself (e.g. wrap computer-use tools in a permission gate): - -```ts -import { resolveCuaRuntimeSpec } from "@onkernel/cua-ai"; -import { createCuaComputerTools } from "@onkernel/cua-agent"; - -const runtime = resolveCuaRuntimeSpec("openai:gpt-5.5"); -const tools = [ - ...createCuaComputerTools({ browser, client, toolExecutors: runtime.toolExecutors }), - lookupOrder, -]; -``` - -## Live view URL and manual login fallback - -cua's `--profile` (CLI) and `profile` (library) handle most login persistence, but stealth doesn't always beat bot detection. When automation gets stuck on a login, hand off to a human via the live view URL. - -### CLI - -```bash -cua --profile mysite session start login -cua session show login | jq -r .live_url # share this URL with the user -# user logs in manually in their browser via the live view -cua -s login url # confirm the post-login URL -cua session stop login # profile state saves on teardown -``` - -### Library - -Every Kernel browser response carries the live view URL on creation: - -```ts -const browser = await client.browsers.create({ stealth: true, headless: false }); -console.log("live view:", browser.browser_live_view_url); -// share that URL, wait for the user to finish manual login, then prompt the agent -``` - -If you only have a session id, fetch it: - -```bash -kernel browsers view -``` - -## Cross-origin iframes / Playwright escape hatch - -cua drives by clicking pixels, so cross-origin iframes (payment forms, embedded vendor widgets) work in the screenshot flow without special handling — the model just clicks them. When you need a deterministic Playwright action against the underlying browser (e.g. 
to fill a card form via a fixed selector), break out to Kernel's exec endpoint with the session id: - -```bash -# CLI: find the session id -cua session show login | jq -r .kernel_session_id - -# Run a Playwright snippet against the same browser -kernel browsers exec --code " - const frame = page.frameLocator('#payment-iframe'); - await frame.locator('#card-number').fill('4111111111111111'); - await frame.locator('#submit').click(); -" -``` - -From the library, you already have `browser.session_id` and the Kernel client — call the same exec endpoint via the SDK. - -## Debugging - -- **CLI verbose**: `cua -v --print "…"` writes provisioning info, tool calls, and the transcript path to stderr. -- **Live event stream**: `cua --print -o jsonl "…"` emits one event per line (`tool_call`, `tool_result`, `assistant_text_done`, etc.). Add `--jsonl-include-images` to inline screenshots in `tool_result`. -- **Persisted transcript**: every `--print`, TUI, and `-s ` invocation appends to `$XDG_DATA_HOME/cua/sessions//.jsonl`. Exact path: - ```bash - cua -v --print "..." # stderr includes: [cua] session= - cua session show login | jq -r .transcript_path - ``` - Roles: `user`, `assistant`, `toolResult`. There's also a custom `cua-browser` entry written once per session with `kernel_session_id` / `live_url` / `profile_id`. -- **Library event subscription**: - ```ts - harness.subscribe((event) => { - // event.type === "tool_call" | "tool_result" | "assistant_text_done" | ... - }); - ``` -- **Screenshots**: `cua screenshot --out shot.png` (CLI) or inspect the `image` blocks in `toolResult` transcript entries. -- **Page URL**: `cua url` to confirm post-action navigation. `agent.state.messages` (library) holds the full message history. - -A couple of `jq` starters against a transcript path: - -```bash -# Every tool call the agent made, in order -jq -c 'select(.role == "assistant") | .content[]? 
- | select(.type == "tool_use") | {name, input}' "$TRANSCRIPT" - -# Final assistant text (the answer) -jq -r 'select(.role == "assistant") | .content[]? - | select(.type == "text") | .text' "$TRANSCRIPT" | tail -1 -``` - -## Gotchas - -- **Element descriptions are semantic, not selectors.** `cua click "Sign in button"` looks at the screenshot — describe what the user sees, not a CSS selector. -- **Viewport defaults to 1920x1080.** Resize via `client.browsers.create({ ... })` flags if you need something else. -- **Keyboard navigation > mouse-wheel scroll.** `cua press Page_Down` / `Home` / arrow keys is more reliable than scroll wheel via the LLM. -- **Multi-step state requires `-s` (CLI) or a session-backed harness (library).** A second one-shot subcommand can't see what the first one did. -- **Profile saves on close, not continuously.** Tear down cleanly (`cua session stop`, `client.browsers.deleteByID`) or you'll lose recent state. -- **Provider tool vocab gaps.** If a model can click and type but can't navigate, set `computerUseExtra: true` (library) or pick a different model. -- **`--max-steps` defaults to 3 on `cua do`.** Bump it for non-trivial tasks. - -## Quick reference - -```bash -# CLI quickstart — one-shot, fresh browser -cua --print "open hn and tell me the top story" - -# CLI — named session for multi-step -cua --profile mysite session start work -cua -s work open https://example.com -cua -s work click "Log in" -cua -s work type "email field" "$EMAIL" -cua -s work click "Submit" -cua -s work url -cua session stop work - -# CLI — list models, switch model per call -cua models -cua --print -m anthropic:claude-opus-4-7 "..." 
- -# Get the live view URL -cua session show work | jq -r .live_url -kernel browsers view # alternative -``` - -```ts -// Library — minimal harness -import { CuaAgentHarness, InMemorySessionRepo, NodeExecutionEnv } from "@onkernel/cua-agent"; - -const session = await new InMemorySessionRepo().create({ id: "main" }); -const harness = new CuaAgentHarness({ - browser, client, session, - env: new NodeExecutionEnv({ cwd: process.cwd() }), - model: "openai:gpt-5.5", -}); -const result = await harness.prompt("Open example.com and click the first link."); -``` From 83093fe63b2897ffec14c8a37c17a2e024c86758 Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Sun, 21 Jun 2026 15:57:28 +0000 Subject: [PATCH 4/9] move cua-cli into kernel-cli, cua-agent into kernel-sdks Drop the standalone `cua` plugin. The repo organizes skills by audience (shell-driving vs SDK-authoring), not by product, so cua-cli sits alongside kernel-agent-browser in kernel-cli, and cua-agent sits alongside kernel-typescript-sdk in kernel-sdks. Users who already install those plugins get the cua skills for free. 
- plugins/kernel-cli/skills/cua-cli/SKILL.md (moved) - plugins/kernel-sdks/skills/cua-agent/SKILL.md (moved) - plugins/cua/ deleted - README install snippets and skill tables updated - cross-skill links in both SKILL.md files updated to reference the new plugin locations Co-Authored-By: Claude Opus 4.7 --- README.md | 13 +------------ plugins/cua/.claude-plugin/plugin.json | 11 ----------- plugins/{cua => kernel-cli}/skills/cua-cli/SKILL.md | 2 +- .../{cua => kernel-sdks}/skills/cua-agent/SKILL.md | 2 +- 4 files changed, 3 insertions(+), 25 deletions(-) delete mode 100644 plugins/cua/.claude-plugin/plugin.json rename plugins/{cua => kernel-cli}/skills/cua-cli/SKILL.md (98%) rename plugins/{cua => kernel-sdks}/skills/cua-agent/SKILL.md (98%) diff --git a/README.md b/README.md index d2c203d..3057be6 100644 --- a/README.md +++ b/README.md @@ -18,9 +18,6 @@ Official AI agent skills from the Kernel for installing useful skills for our CL # Install the video generation skill /plugin install generate-video - -# Install the cua skill (CLI + library for computer-use on Kernel) -/plugin install cua ``` ### Cursor @@ -45,7 +42,6 @@ git clone https://github.com/kernel/skills.git cp -r skills/plugins/kernel-cli ~/.claude/skills/ cp -r skills/plugins/kernel-sdks ~/.claude/skills/ cp -r skills/plugins/generate-video ~/.claude/skills/ -cp -r skills/plugins/cua ~/.claude/skills/ ``` ## Prerequisites @@ -76,6 +72,7 @@ Command-line interface skills for using Kernel CLI commands. 
| **kernel-cli** | Complete guide to Kernel CLI - cloud browser platform with automation, deployment, and management | | **kernel-agent-browser** | Best practices for `agent-browser -p kernel` automation, bot detection handling, iframes, login persistence | | **kernel-auth** | Setup and manage Kernel authentication connections for any website with safety checks and reauthentication support | +| **cua-cli** | Drive a Kernel browser from shell via the `cua` binary: one-shot subcommands, named sessions, TUI, profile persistence, transcripts, live-view handoff | | **profile-website-bot-detection** | Profile a website for bot detection vendors using stealth vs non-stealth Kernel browsers; compare effectiveness and identify vendor products | ### kernel-sdks @@ -86,14 +83,6 @@ SDK skills for building browser automation with TypeScript and Python. |-------|-------------| | **typescript-sdk** | Build automation with Kernel's Typescript SDK | | **python-sdk** | Build automation with kernel's Python SDK | - -### cua - -Computer-use loop for Kernel cloud browsers — CLI for shell-driven automation and the `@onkernel/cua-agent` TS library for embedding in your own agents. One plugin, two skills (load whichever matches the task). 
- -| Skill | Description | -|-------|-------------| -| **cua-cli** | Drive a Kernel browser from shell via the `cua` binary: one-shot subcommands, named sessions, TUI, profile persistence, transcripts, live-view handoff | | **cua-agent** | Build TypeScript apps that embed Kernel cua's loop with `CuaAgent` / `CuaAgentHarness`: provider switching, custom tools, session repos, event-stream debugging | ### generate-video diff --git a/plugins/cua/.claude-plugin/plugin.json b/plugins/cua/.claude-plugin/plugin.json deleted file mode 100644 index 436bab5..0000000 --- a/plugins/cua/.claude-plugin/plugin.json +++ /dev/null @@ -1,11 +0,0 @@ -{ - "name": "cua", - "version": "1.0.0", - "description": "Kernel cua skills: `cua-cli` for shell-driven computer-use automation via the `cua` binary, and `cua-agent` for building TypeScript apps on the @onkernel/cua-agent library (CuaAgent / CuaAgentHarness)", - "author": { - "name": "Kernel", - "url": "www.kernel.sh" - }, - "repository": "https://github.com/kernel/skills", - "license": "MIT" -} diff --git a/plugins/cua/skills/cua-cli/SKILL.md b/plugins/kernel-cli/skills/cua-cli/SKILL.md similarity index 98% rename from plugins/cua/skills/cua-cli/SKILL.md rename to plugins/kernel-cli/skills/cua-cli/SKILL.md index 527b826..29a0e6a 100644 --- a/plugins/cua/skills/cua-cli/SKILL.md +++ b/plugins/kernel-cli/skills/cua-cli/SKILL.md @@ -10,7 +10,7 @@ description: Drive a Kernel cloud browser from the shell using the `cua` CLI. Us ## When to use this skill - **Use this skill** when you need shell-callable computer-use steps (`cua open`, `cua click`, `cua do …`), an interactive TUI, or want to chain browser actions in a shell pipeline. -- **Reach for [`cua-agent`](../cua-agent/SKILL.md)** when you're writing a TypeScript app that needs to embed cua's prompt → screenshot → tool-call loop programmatically. 
+- **Reach for the `cua-agent` skill** (in the `kernel-sdks` plugin) when you're writing a TypeScript app that needs to embed cua's prompt → screenshot → tool-call loop programmatically. - **Reach for `kernel-agent-browser`** when you need deterministic browser scripting (semantic selectors, `find role`, `wait --text`, accessibility-tree snapshots). - **Reach for `kernel-cli`** for raw Kernel browser management (`kernel browsers create`, `kernel browsers exec`, profile / proxy CRUD). diff --git a/plugins/cua/skills/cua-agent/SKILL.md b/plugins/kernel-sdks/skills/cua-agent/SKILL.md similarity index 98% rename from plugins/cua/skills/cua-agent/SKILL.md rename to plugins/kernel-sdks/skills/cua-agent/SKILL.md index 322f908..46696a3 100644 --- a/plugins/cua/skills/cua-agent/SKILL.md +++ b/plugins/kernel-sdks/skills/cua-agent/SKILL.md @@ -15,7 +15,7 @@ Both translate per-provider computer-use tool calls (OpenAI's `computer`, Anthro ## When to use this skill - **Use this skill** when writing TS code that embeds cua inside a larger app, needs a custom session repo, runs its own pi tools alongside computer use, or reacts to per-event streams programmatically. -- **Reach for [`cua-cli`](../cua-cli/SKILL.md)** when shell-callable computer-use is enough (`cua open`, `cua click`, `cua do`). +- **Reach for the `cua-cli` skill** (in the `kernel-cli` plugin) when shell-callable computer-use is enough (`cua open`, `cua click`, `cua do`). - **Reach for `kernel-typescript-sdk`** for raw Playwright / CDP control over a Kernel browser without an LLM in the loop. 
## Prerequisites From ef22c3cf593e74f188255807b62932cdbe7ff474 Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Tue, 30 Jun 2026 12:33:30 +0000 Subject: [PATCH 5/9] review: tighten cua-cli and cua-agent skills - normalize named-session name to `login` across examples - expand liveness recovery into a copy-pasteable snippet - warn agents off the interactive TUI; require `--print` - document provider env-key precedence (anthropic, google) - add `cua --help` discovery affordance - reframe "Playwright escape hatch" as "Mixing vision and DOM": when to use Playwright on the same browser, structured extraction example, AgentTool composition note - wrap CuaAgentHarness quick start in try/finally - flesh out empty AgentTool example with name/description/schema/run - clarify @earendil-works/pi-agent-core is reference-only (not installed) - soften `computerUseExtra` story; default it on in quick reference --- plugins/kernel-cli/skills/cua-cli/SKILL.md | 62 ++++++++++++------ plugins/kernel-sdks/skills/cua-agent/SKILL.md | 64 ++++++++++++++----- 2 files changed, 90 insertions(+), 36 deletions(-) diff --git a/plugins/kernel-cli/skills/cua-cli/SKILL.md b/plugins/kernel-cli/skills/cua-cli/SKILL.md index 29a0e6a..2745aa4 100644 --- a/plugins/kernel-cli/skills/cua-cli/SKILL.md +++ b/plugins/kernel-cli/skills/cua-cli/SKILL.md @@ -36,8 +36,8 @@ npx -y -p @onkernel/cua-cli cua --help | --- | --- | | `KERNEL_API_KEY` | Kernel API key (always required) | | `OPENAI_API_KEY` | OpenAI models (`-m openai:…`) | -| `ANTHROPIC_API_KEY` | Anthropic models (`-m anthropic:…`) | -| `GOOGLE_API_KEY` / `GEMINI_API_KEY` | Google / Gemini models (`-m google:…`) | +| `ANTHROPIC_OAUTH_TOKEN` / `ANTHROPIC_API_KEY` | Anthropic models (`-m anthropic:…`); OAuth token wins if both are set | +| `GOOGLE_API_KEY` / `GEMINI_API_KEY` | Google / Gemini models (`-m google:…`); `GOOGLE_API_KEY` wins if both are set | | `YUTORI_API_KEY` | Yutori Navigator (`-m yutori:…`) 
| | `TZAFON_API_KEY` | Tzafon (`-m tzafon:…`) | | `KERNEL_BASE_URL` | Override Kernel base URL | @@ -70,6 +70,8 @@ Useful flags: The cua CLI always provisions **stealth-on** browsers. If you need non-stealth or a custom viewport / proxy, pre-create the browser via `kernel browsers create` and attach the cua session to it. +If you're unsure of a flag or subcommand, `cua --help` and `cua --help` print the current surface. + ## Named sessions Without `-s`, each subcommand provisions a brand-new browser. To keep state across calls, allocate a named session first: @@ -93,7 +95,12 @@ cua session show login # full JSON metadata Pass `--profile` when starting the named session; later `cua -s …` calls attach to the same browser, so they don't need the profile flag again. -**Liveness**: Kernel browsers time out from inactivity. If you see `error session "" is no longer alive on Kernel …`, run `cua session stop && cua --profile session start ` to re-provision with the same persisted profile. +**Liveness**: Kernel browsers time out from inactivity. If you see `error session "login" is no longer alive on Kernel …`, re-provision with the same profile and name: + +```bash +cua session stop login # safe even if the Kernel browser is already gone +cua --profile github session start login # re-attach name=login to a fresh browser, same profile +``` Named-session metadata lives in `$XDG_DATA_HOME/cua/named-sessions/.json`. @@ -107,6 +114,8 @@ cua "..." # interactive TUI (real termin `--print` exits when the agent finishes; the TUI runs until Ctrl+C. Add `--jsonl-include-deltas` for token deltas, `--jsonl-include-images` for base64 screenshots in `tool_result` events. +**If you're an agent driving cua from a shell, always pass `--print` or `--print -o jsonl`.** The bare `cua "..."` form opens an interactive TUI that needs a real terminal — it will hang in a non-interactive context. + ## Model selection Run `cua models` for the current catalog. 
Pick with `-m ` (default `openai:gpt-5.5`). Switch per call or per named session. @@ -138,22 +147,37 @@ If you only have a session id (e.g. from `cua session list`), the `kernel` CLI a kernel browsers view ``` -## Cross-origin iframes / Playwright escape hatch +## Mixing vision and DOM (Playwright on the same browser) + +cua's strength is semantic, vision-driven interaction — describe what's on screen, the model finds it. Playwright's strength is deterministic DOM access — exact selectors, structured data extraction, file uploads, network interception. Real workflows often need both, and the named-session model is built for it: every `cua -s ` session exposes a `kernel_session_id` that points at the same underlying Kernel browser, so you can interleave vision turns and Playwright snippets without losing state. -cua drives by clicking pixels, so cross-origin iframes (payment forms, embedded vendor widgets) work in the screenshot flow without special handling — the model just clicks them. When you need a deterministic Playwright action against the underlying browser (e.g. fill a card form via a fixed selector), break out to Kernel's exec endpoint with the session id: +Reach for Playwright on the cua browser when: + +- you need a **fixed selector** (form auto-fill, hidden inputs, file uploads, attribute reads). +- you want **structured extraction** (`page.$$eval` over a list) rather than asking the model to read pixels. +- you're driving a **cross-origin iframe** with a known DOM contract (payment widgets, SSO popups). cua can click iframes too, but Playwright gives you `frameLocator()` and structured assertions. +- you need to **wait on a network response or DOM condition** rather than a visual cue. 
```bash -# Find the session id -cua session show login | jq -r .kernel_session_id +# Vision turns to get logged in and to the right page +cua --profile mysite session start login +cua -s login open https://example.com/checkout +cua -s login click "Continue to payment" -# Run a Playwright snippet against the same browser +# Same browser, DOM-precise card fill +cua session show login | jq -r .kernel_session_id # → kernel browsers exec --code " const frame = page.frameLocator('#payment-iframe'); - await frame.locator('#card-number').fill('4111111111111111'); + await frame.locator('#card-number').fill(process.env.CARD_NUMBER); await frame.locator('#submit').click(); " + +# Hand control back to vision for the confirmation flow +cua -s login observe "did the payment succeed?" ``` +State (URL, cookies, storage) is shared because it's the same browser — vision and DOM steps see each other's effects. + ## Debugging - **Verbose stderr**: `cua -v --print "…"` writes provisioning info, tool calls, and the transcript path to stderr. @@ -195,23 +219,23 @@ jq -r 'select(.role == "assistant") | .content[]? cua --print "open hn and tell me the top story" # Named session for multi-step -cua --profile mysite session start work -cua -s work open https://example.com -cua -s work click "Log in" -cua -s work type "email field" "$EMAIL" -cua -s work click "Submit" -cua -s work url -cua session stop work +cua --profile github session start login +cua -s login open https://example.com +cua -s login click "Log in" +cua -s login type "email field" "$EMAIL" +cua -s login click "Submit" +cua -s login url +cua session stop login # List models, switch model per call cua models cua --print -m anthropic:claude-opus-4-7 "..." 
# Get the live view URL -cua session show work | jq -r .live_url +cua session show login | jq -r .live_url kernel browsers view # alternative -# Drop to Playwright for deterministic actions -cua session show work | jq -r .kernel_session_id +# Mix in a Playwright/DOM action against the same browser +cua session show login | jq -r .kernel_session_id kernel browsers exec --code "..." ``` diff --git a/plugins/kernel-sdks/skills/cua-agent/SKILL.md b/plugins/kernel-sdks/skills/cua-agent/SKILL.md index 46696a3..79b8ad5 100644 --- a/plugins/kernel-sdks/skills/cua-agent/SKILL.md +++ b/plugins/kernel-sdks/skills/cua-agent/SKILL.md @@ -36,7 +36,7 @@ The three packages divide responsibility: - `@onkernel/cua-ai` — model catalog (`getCuaModel` / `listCuaModels`), canonical CUA tool schemas, per-provider adapters. - `@onkernel/sdk` — Kernel SDK client used to provision the browser. -Both classes re-export the full pi-agent-core surface from `@onkernel/cua-agent`, including `NodeExecutionEnv` (via the `/node` subpath under the hood) and `InMemorySessionRepo`. Import them from `@onkernel/cua-agent` directly. +Both classes re-export the full pi-agent-core surface from `@onkernel/cua-agent`, including `NodeExecutionEnv` (via the `/node` subpath under the hood) and `InMemorySessionRepo`. Import them from `@onkernel/cua-agent` directly — you don't install `@earendil-works/pi-agent-core` separately; it's the upstream reference docs only. ## Environment variables @@ -51,6 +51,8 @@ If you don't pass explicit auth callbacks, both classes resolve provider keys vi | `YUTORI_API_KEY` | `yutori:…` models | | `TZAFON_API_KEY` | `tzafon:…` models | +When a provider lists two env vars, `getCuaEnvApiKey` returns the first one set in this order: `ANTHROPIC_OAUTH_TOKEN` then `ANTHROPIC_API_KEY`; `GOOGLE_API_KEY` then `GEMINI_API_KEY`. 
+ ## Quick start — `CuaAgentHarness` ```ts @@ -79,15 +81,17 @@ const harness = new CuaAgentHarness({ const textOf = (m: AssistantMessage) => m.content.flatMap((b) => (b.type === "text" ? [b.text] : [])).join("").trim(); -const first = await harness.prompt("Open example.com and describe what you see."); -console.log(textOf(first)); - -// Swap providers mid-session — CUA tools and the default prompt refresh. -await harness.setModel("anthropic:claude-opus-4-7"); -const second = await harness.prompt("Open the most relevant link from what you found."); -console.log(textOf(second)); +try { + const first = await harness.prompt("Open example.com and describe what you see."); + console.log(textOf(first)); -await client.browsers.deleteByID(browser.session_id); + // Swap providers mid-session — CUA tools and the default prompt refresh. + await harness.setModel("anthropic:claude-opus-4-7"); + const second = await harness.prompt("Open the most relevant link from what you found."); + console.log(textOf(second)); +} finally { + await client.browsers.deleteByID(browser.session_id); +} ``` While a turn is running: `steer()` injects course corrections, `followUp()` queues the next instruction, `subscribe()` streams underlying agent events, and `compact()` collapses long transcripts. See [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the full harness lifecycle. @@ -145,7 +149,7 @@ agent.state.model = "anthropic:claude-opus-4-7"; In both cases CUA-owned tools and the default system prompt refresh for the next provider request. -Not every provider's native vocabulary includes navigation (`goto`, `back`, `forward`, `url`). Pass `computerUseExtra: true` to add the provider-neutral `computer_use_extra` tool when the model can click/type but can't navigate. +Not every provider's native computer-use vocab includes navigation (`goto`, `back`, `forward`, `url`). 
Pass `computerUseExtra: true` on the harness or agent to add the provider-neutral `computer_use_extra` tool — safe to leave on by default; it's a no-op for providers whose native tools already cover navigation. The Quick reference at the bottom shows it set. ## Browser provisioning @@ -174,11 +178,18 @@ The `browser.browser_live_view_url` field on the create response is the URL to s Pass any pi `AgentTool` (see [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the tool shape) via `extraTools`. The CUA defaults stay installed; your tools run alongside them. ```ts +import { z } from "zod"; import type { AgentTool } from "@onkernel/cua-agent"; import { CuaAgentHarness } from "@onkernel/cua-agent"; const lookupOrder: AgentTool = { - // shape per pi-agent-core docs: name, description, schema, run, ... + name: "lookup_order", + description: "Look up an order in our backend by id.", + schema: z.object({ orderId: z.string() }), + run: async ({ orderId }) => { + const order = await db.orders.get(orderId); + return { content: [{ type: "text", text: JSON.stringify(order) }] }; + }, }; const harness = new CuaAgentHarness({ @@ -223,20 +234,39 @@ await harness.prompt("Now click 'Settings' and read me the current value of X.") Profile saves on browser teardown, so future runs with the same profile name skip the manual login. -## Cross-origin iframes / Playwright escape hatch +## Mixing vision and DOM (Playwright on the same browser) + +cua's strength is semantic, vision-driven interaction — describe what's on screen, the model finds it. Playwright's strength is deterministic DOM access — exact selectors, structured data extraction, file uploads, network interception. Real apps often need both, and the harness is built for it: you already hold the `browser.session_id`, so any Playwright snippet you ship through `client.browsers.exec` runs against the same browser the agent is driving. State (URL, cookies, storage) is shared. 
+ +Reach for Playwright on the cua browser when: + +- you need a **fixed selector** (form auto-fill, hidden inputs, file uploads, attribute reads). +- you want **structured extraction** (`page.$$eval` over a list) rather than asking the model to read pixels. +- you're driving a **cross-origin iframe** with a known DOM contract (payment widgets, SSO popups). +- you need to **wait on a network response or DOM condition** rather than a visual cue. -cua drives by clicking pixels, so cross-origin iframes work in the screenshot flow without special handling. When you need a deterministic Playwright action against the underlying browser (e.g. fill a card form via a fixed selector), drop to the Kernel SDK's exec endpoint with the session id you already have: +A common pattern — vision turn to navigate, DOM turn to extract, then back to vision: ```ts -await client.browsers.exec(browser.session_id, { +await harness.prompt("Search for 'wireless headphones' and open the results page."); + +const products = await client.browsers.exec(browser.session_id, { code: ` - const frame = page.frameLocator('#payment-iframe'); - await frame.locator('#card-number').fill('4111111111111111'); - await frame.locator('#submit').click(); + return await page.$$eval('[data-product-id]', els => + els.map(el => ({ + id: el.dataset.productId, + title: el.querySelector('.title')?.textContent, + price: el.querySelector('.price')?.textContent, + })) + ); `, }); + +await harness.prompt(`Click the cheapest product from this list: ${JSON.stringify(products)}`); ``` +You can also wire Playwright work in as an `AgentTool` (see "Adding your own tools") so the model itself decides when to switch modes — useful when "do I have a stable selector for this?" is part of the task, not a fixed plan. + ## Debugging - **`subscribe()`** — the harness and agent both stream pi-agent-core events. 
Use it to log tool calls, screenshot sizes, tokens: From 0830329b396f2f5050ea59c20a3eec27f057f6ba Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Tue, 30 Jun 2026 15:25:29 +0000 Subject: [PATCH 6/9] deslop: drop duplicated state-sharing line in cua-cli mixing section The trailing "State is shared because it's the same browser" sentence restated the same point already made in the opening paragraph. Fold the URL/cookies/storage detail into the opener and drop the duplicate. --- plugins/kernel-cli/skills/cua-cli/SKILL.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/plugins/kernel-cli/skills/cua-cli/SKILL.md b/plugins/kernel-cli/skills/cua-cli/SKILL.md index 2745aa4..7d6148b 100644 --- a/plugins/kernel-cli/skills/cua-cli/SKILL.md +++ b/plugins/kernel-cli/skills/cua-cli/SKILL.md @@ -149,7 +149,7 @@ kernel browsers view ## Mixing vision and DOM (Playwright on the same browser) -cua's strength is semantic, vision-driven interaction — describe what's on screen, the model finds it. Playwright's strength is deterministic DOM access — exact selectors, structured data extraction, file uploads, network interception. Real workflows often need both, and the named-session model is built for it: every `cua -s ` session exposes a `kernel_session_id` that points at the same underlying Kernel browser, so you can interleave vision turns and Playwright snippets without losing state. +cua's strength is semantic, vision-driven interaction — describe what's on screen, the model finds it. Playwright's strength is deterministic DOM access — exact selectors, structured data extraction, file uploads, network interception. Real workflows often need both, and the named-session model is built for it: every `cua -s ` session exposes a `kernel_session_id` that points at the same underlying Kernel browser, so you can interleave vision turns and Playwright snippets. State (URL, cookies, storage) is shared. 
Reach for Playwright on the cua browser when: @@ -176,8 +176,6 @@ kernel browsers exec --code " cua -s login observe "did the payment succeed?" ``` -State (URL, cookies, storage) is shared because it's the same browser — vision and DOM steps see each other's effects. - ## Debugging - **Verbose stderr**: `cua -v --print "…"` writes provisioning info, tool calls, and the transcript path to stderr. From 2a84841ddb2f815ea467337c8097a4a214b6c52e Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Tue, 30 Jun 2026 15:28:36 +0000 Subject: [PATCH 7/9] README: link Kernel to kernel.sh in tagline --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3057be6..8c148e6 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # Kernel Skills -Official AI agent skills from the Kernel for installing useful skills for our CLI and SDKs that you can load into popular coding agents. +Official AI agent skills from [Kernel](https://kernel.sh) for installing useful skills for our CLI and SDKs that you can load into popular coding agents. ## Installation From c14d5dc4f70610b57b6c5b62a50bea5b57c26912 Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Wed, 1 Jul 2026 16:31:23 +0000 Subject: [PATCH 8/9] reshape: put playwright_execute on equal footing with computer use MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Both skills previously showed DOM operations only through the developer-scripted client.browsers.exec / kernel browsers exec path, which requires the developer to hard-code the vision/DOM split ahead of time. cua-agent (playwright: true) and cua-cli (--playwright) both ship a built-in playwright_execute tool that gets exposed alongside computer-use tools — the model picks per action which one fits the step. This is the model-driven mixed loop shape. 
Changes: - Add --playwright flag row to cua-cli's Useful flags - Note /playwright on|off TUI slash command for mid-session toggling - Rewrite "Mixing vision and DOM" in both skills: - Lead with a peer-tools framing and a decision rubric a system prompt can carry - Feature the model-picked path (--playwright / playwright: true) as the primary shape - Demote the developer-scripted split via browsers.exec as an alternative for repeatable batch jobs or known-DOM extraction - Add playwright: true to cua-agent Quick start and Quick reference - Consolidate computerUseExtra + playwright as sibling opt-in flags in a single Model-selection callout - Clarify that extraTools is for domain tools; Playwright is not a hand-rolled entry - Add gotcha about fresh JS context per playwright_execute call - Note verified-provider matrix (Anthropic/Tzafon/Yutori e2e; OpenAI and Google unit-tested) so readers know coverage - Update frontmatter descriptions and README rows to mention the flag --- plugins/kernel-cli/skills/cua-cli/SKILL.md | 52 ++++++++++------ plugins/kernel-sdks/skills/cua-agent/SKILL.md | 59 ++++++++++++++----- 2 files changed, 80 insertions(+), 31 deletions(-) diff --git a/plugins/kernel-cli/skills/cua-cli/SKILL.md b/plugins/kernel-cli/skills/cua-cli/SKILL.md index 7d6148b..aadda10 100644 --- a/plugins/kernel-cli/skills/cua-cli/SKILL.md +++ b/plugins/kernel-cli/skills/cua-cli/SKILL.md @@ -1,6 +1,6 @@ --- name: cua-cli -description: Drive a Kernel cloud browser from the shell using the `cua` CLI. Use this skill when you need to open URLs, click elements, type into fields, take screenshots, or chain multi-step browser tasks across shell calls. Supports named sessions for stateful workflows, profile persistence for logins, transcript-based debugging, and live-view handoff when stealth fails. For building your own TS agent on top of cua, see `cua-agent`. +description: Drive a Kernel cloud browser from the shell using the `cua` CLI. 
Use this skill when you need to open URLs, click elements, type into fields, take screenshots, or chain multi-step browser tasks across shell calls. Supports named sessions for stateful workflows, profile persistence for logins, transcript-based debugging, live-view handoff, and mixing vision-driven computer-use with `playwright_execute` for DOM-precise steps the model picks per action (`--playwright`). For building your own TS agent on top of cua, see `cua-agent`. --- # cua-cli @@ -64,6 +64,7 @@ Useful flags: - `-m ` — pick the LLM (default `openai:gpt-5.5`). `cua models` to list. - `--max-steps ` — bound the loop on `cua do`. - `--profile ` — load a Kernel browser profile for persisted cookies / storage. Existing ids or names are reused; a non-id name is created if missing. Pass `--profile-no-save-changes` for read-only. +- `--playwright` — expose the `playwright_execute` tool alongside computer-use so the model can run Playwright/TS against the live browser for steps that are cleaner as DOM operations (form fills, structured extraction, `waitForSelector`). Off by default. See "Mixing vision and DOM" below. - `-v` — verbose progress on stderr (provisioning, tool calls, transcript path). `click` and `type` match **semantically**, not by selector — use natural-language descriptions of what's visible on screen. @@ -116,6 +117,8 @@ cua "..." # interactive TUI (real termin **If you're an agent driving cua from a shell, always pass `--print` or `--print -o jsonl`.** The bare `cua "..."` form opens an interactive TUI that needs a real terminal — it will hang in a non-interactive context. +Inside the TUI, `/playwright on` and `/playwright off` toggle the `playwright_execute` tool mid-session without restarting. + ## Model selection Run `cua models` for the current catalog. Pick with `-m ` (default `openai:gpt-5.5`). Switch per call or per named session. @@ -147,33 +150,45 @@ If you only have a session id (e.g. 
from `cua session list`), the `kernel` CLI a kernel browsers view ``` -## Mixing vision and DOM (Playwright on the same browser) +## Mixing vision and DOM + +Vision (computer-use tools) and DOM (Playwright) are peer capabilities against the same browser. Real workloads mix both — a login flow may be visual and bot-detected while the data extraction on the other side is a stable table with reliable selectors. cua exposes both so the *model* picks per action: -cua's strength is semantic, vision-driven interaction — describe what's on screen, the model finds it. Playwright's strength is deterministic DOM access — exact selectors, structured data extraction, file uploads, network interception. Real workflows often need both, and the named-session model is built for it: every `cua -s ` session exposes a `kernel_session_id` that points at the same underlying Kernel browser, so you can interleave vision turns and Playwright snippets. State (URL, cookies, storage) is shared. +- **Reach for computer use when**: DOM is brittle or unknown, the target is canvas/video/pixel UI, bot detection makes human-like input matter, or the check is visual ("did the modal close?"). +- **Reach for Playwright when**: selectors are stable, you want structured extraction (`page.$$eval` over a list) instead of asking the model to read pixels, you're filling many form fields or hidden inputs, or you need to wait on a network/DOM condition rather than a visual cue. -Reach for Playwright on the cua browser when: +### Model-picked (`--playwright`) -- you need a **fixed selector** (form auto-fill, hidden inputs, file uploads, attribute reads). -- you want **structured extraction** (`page.$$eval` over a list) rather than asking the model to read pixels. -- you're driving a **cross-origin iframe** with a known DOM contract (payment widgets, SSO popups). cua can click iframes too, but Playwright gives you `frameLocator()` and structured assertions. 
-- you need to **wait on a network response or DOM condition** rather than a visual cue. +Pass `--playwright` and the model gets both toolsets and chooses per step: ```bash -# Vision turns to get logged in and to the right page -cua --profile mysite session start login -cua -s login open https://example.com/checkout -cua -s login click "Continue to payment" +cua --playwright --profile mysite session start work +cua -s work --print "log in to example.com, then pull the last 10 orders as JSON" +# The model picks computer_use for the login (visual, bot-detected) +# and playwright_execute for the extraction (structured, stable DOM). +``` + +Inside `playwright_execute`, `page`, `context`, and `browser` are in scope and the code may `return` a JSON-serializable value that comes back as the tool result. Each call runs in a fresh JS context (locals don't persist), but the browser session does. Screenshots aren't auto-attached — the model requests one on a follow-up turn when it needs to see the page. + +Verified end-to-end against Anthropic, Tzafon, and Yutori CUA models; OpenAI and Google are unit-tested. In the TUI, `/playwright on` / `/playwright off` toggle the tool mid-session. 
-# Same browser, DOM-precise card fill -cua session show login | jq -r .kernel_session_id # → +### Developer-scripted split + +If you'd rather orchestrate the split yourself instead of letting the model pick — for repeatable batch jobs, or when you know the DOM shape in advance — `kernel browsers exec` runs Playwright against the same `kernel_session_id`: + +```bash +cua --profile mysite session start work +cua -s work open https://example.com/checkout +cua -s work click "Continue to payment" + +cua session show work | jq -r .kernel_session_id # → kernel browsers exec --code " const frame = page.frameLocator('#payment-iframe'); await frame.locator('#card-number').fill(process.env.CARD_NUMBER); await frame.locator('#submit').click(); " -# Hand control back to vision for the confirmation flow -cua -s login observe "did the payment succeed?" +cua -s work observe "did the payment succeed?" ``` ## Debugging @@ -233,7 +248,10 @@ cua --print -m anthropic:claude-opus-4-7 "..." cua session show login | jq -r .live_url kernel browsers view # alternative -# Mix in a Playwright/DOM action against the same browser +# Let the model mix vision and DOM via playwright_execute +cua --playwright --print "log in and pull the last 10 orders as JSON" + +# Or orchestrate the split yourself against the same session cua session show login | jq -r .kernel_session_id kernel browsers exec --code "..." ``` diff --git a/plugins/kernel-sdks/skills/cua-agent/SKILL.md b/plugins/kernel-sdks/skills/cua-agent/SKILL.md index 79b8ad5..9858479 100644 --- a/plugins/kernel-sdks/skills/cua-agent/SKILL.md +++ b/plugins/kernel-sdks/skills/cua-agent/SKILL.md @@ -1,6 +1,6 @@ --- name: cua-agent -description: Build TypeScript apps that embed Kernel's computer-use loop with `@onkernel/cua-agent` — `CuaAgent` and `CuaAgentHarness` classes drive a Kernel cloud browser via prompt → screenshot → tool-call loops across OpenAI, Anthropic, Google, and Yutori provider tools. 
Use when writing TS code that needs computer-use against a Kernel browser, swapping providers mid-session, adding your own pi tools alongside computer use, or hooking into the agent event stream. For shell-callable cua, see `cua-cli`. +description: Build TypeScript apps that embed Kernel's computer-use loop with `@onkernel/cua-agent` — `CuaAgent` and `CuaAgentHarness` classes drive a Kernel cloud browser via prompt → screenshot → tool-call loops across OpenAI, Anthropic, Google, and Yutori provider tools. Use when writing TS code that needs computer-use against a Kernel browser, swapping providers mid-session, adding your own pi tools alongside computer use, mixing vision with `playwright_execute` (via `playwright: true`) so the model picks DOM or vision per action, or hooking into the agent event stream. For shell-callable cua, see `cua-cli`. --- # cua-agent @@ -76,6 +76,7 @@ const harness = new CuaAgentHarness({ env: new NodeExecutionEnv({ cwd: process.cwd() }), model: "openai:gpt-5.5", session, + playwright: true, // expose playwright_execute so the model can pick DOM or vision per action }); const textOf = (m: AssistantMessage) => @@ -106,6 +107,7 @@ import { CuaAgent } from "@onkernel/cua-agent"; const agent = new CuaAgent({ browser, client, + playwright: true, // same flags as CuaAgentHarness — computerUseExtra, playwright, extraTools, ... initialState: { model: "openai:gpt-5.5", systemPrompt: "You are a careful browser automation agent.", @@ -149,7 +151,12 @@ agent.state.model = "anthropic:claude-opus-4-7"; In both cases CUA-owned tools and the default system prompt refresh for the next provider request. -Not every provider's native computer-use vocab includes navigation (`goto`, `back`, `forward`, `url`). Pass `computerUseExtra: true` on the harness or agent to add the provider-neutral `computer_use_extra` tool — safe to leave on by default; it's a no-op for providers whose native tools already cover navigation. 
The Quick reference at the bottom shows it set. +Two opt-in tool flags round out the default computer-use set on either class: + +- **`computerUseExtra: true`** — adds `computer_use_extra` (`goto`, `back`, `forward`, `url`) for providers whose native vocab doesn't include navigation. Safe to leave on; it's inert for providers that already navigate. +- **`playwright: true`** — adds `playwright_execute` so the model can run Playwright/TS against the live browser for DOM-precise steps (form fills, structured extraction, `waitForSelector`). See "Mixing vision and DOM" below for when it's the right pick. + +Both are off by default. The Quick reference at the bottom shows them enabled. ## Browser provisioning @@ -175,7 +182,7 @@ The `browser.browser_live_view_url` field on the create response is the URL to s ## Adding your own tools -Pass any pi `AgentTool` (see [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the tool shape) via `extraTools`. The CUA defaults stay installed; your tools run alongside them. +`extraTools` is for **domain-specific** tools the model needs — a database lookup, a payment API, an internal service call — not for exposing Playwright (that's what `playwright: true` is for). Pass any pi `AgentTool` (see [`@earendil-works/pi-agent-core`](https://www.npmjs.com/package/@earendil-works/pi-agent-core) for the tool shape); the CUA defaults stay installed and your tools run alongside them. ```ts import { z } from "zod"; @@ -234,18 +241,39 @@ await harness.prompt("Now click 'Settings' and read me the current value of X.") Profile saves on browser teardown, so future runs with the same profile name skip the manual login. -## Mixing vision and DOM (Playwright on the same browser) +## Mixing vision and DOM + +Vision (computer-use) and DOM (Playwright) are peer capabilities against the same browser. 
Real apps mix both — a login may be visual and bot-detected while the extraction on the other side is a stable table with reliable selectors. cua-agent supports two shapes for the mix depending on who you want making the call. -cua's strength is semantic, vision-driven interaction — describe what's on screen, the model finds it. Playwright's strength is deterministic DOM access — exact selectors, structured data extraction, file uploads, network interception. Real apps often need both, and the harness is built for it: you already hold the `browser.session_id`, so any Playwright snippet you ship through `client.browsers.exec` runs against the same browser the agent is driving. State (URL, cookies, storage) is shared. +Decision rubric — the same one that fits inside a system prompt if you want the model to internalize it: -Reach for Playwright on the cua browser when: +- **Reach for computer use when**: DOM is brittle or unknown, the target is canvas/video/pixel UI, bot detection makes human-like input matter, or the check is visual ("did the modal close?"). +- **Reach for Playwright when**: selectors are stable, you want structured extraction (`page.$$eval` over a list) instead of asking the model to read pixels, you're filling many form fields or hidden inputs, or you need to wait on a network/DOM condition rather than a visual cue. -- you need a **fixed selector** (form auto-fill, hidden inputs, file uploads, attribute reads). -- you want **structured extraction** (`page.$$eval` over a list) rather than asking the model to read pixels. -- you're driving a **cross-origin iframe** with a known DOM contract (payment widgets, SSO popups). -- you need to **wait on a network response or DOM condition** rather than a visual cue. 
+### Model-picked (`playwright: true`) -A common pattern — vision turn to navigate, DOM turn to extract, then back to vision: +Set `playwright: true` on the constructor and the model gets `playwright_execute` alongside its computer-use tools; it picks per action. + +```ts +const harness = new CuaAgentHarness({ + browser, client, session, + env: new NodeExecutionEnv({ cwd: process.cwd() }), + model: "openai:gpt-5.5", + playwright: true, +}); + +await harness.prompt("Log in to example.com, then return the last 10 orders as JSON."); +// The model uses computer_use for the login (visual, bot-detected) +// and playwright_execute for the extraction (structured, stable DOM). +``` + +Inside `playwright_execute`, `page`, `context`, and `browser` are in scope and the code may `return` a JSON-serializable value that comes back as the tool result. Each call runs in a fresh JS context (locals don't persist), but the browser session does (navigation, cookies, DOM state carry over). Screenshots aren't auto-attached — the model requests one on a follow-up turn when it needs to see the page. Playwright-level failures come back as tool content so the model can adapt rather than crash the turn. + +Verified end-to-end against Anthropic, Tzafon, and Yutori CUA models; OpenAI and Google are unit-tested. 
+ +### Developer-scripted split + +If you'd rather orchestrate the split yourself instead of letting the model pick — for repeatable batch jobs, or when you know the DOM shape in advance — call `client.browsers.exec` against the same `browser.session_id`: ```ts await harness.prompt("Search for 'wireless headphones' and open the results page."); @@ -265,7 +293,7 @@ const products = await client.browsers.exec(browser.session_id, { await harness.prompt(`Click the cheapest product from this list: ${JSON.stringify(products)}`); ``` -You can also wire Playwright work in as an `AgentTool` (see "Adding your own tools") so the model itself decides when to switch modes — useful when "do I have a stable selector for this?" is part of the task, not a fixed plan. +Same browser, same state — pick this shape when the split is deterministic; pick `playwright: true` when the split depends on what the model finds mid-run. ## Debugging @@ -285,8 +313,10 @@ You can also wire Playwright work in as an `AgentTool` (see "Adding your own too - **You own the browser lifecycle.** Always tear down with `client.browsers.deleteByID(browser.session_id)` in a `finally` block — Kernel timeouts will reclaim eventually but profile state saves on close, not continuously. - **`setModel` is async.** It propagates through pi's snapshot machinery — `await` it before the next `prompt()`. - **Provider tool vocab gaps.** If a model can click and type but can't navigate, set `computerUseExtra: true` to add provider-neutral `goto` / `back` / `forward` / `url`. +- **`playwright: true` is off by default.** Turn it on to let the model pick DOM operations per action; leave it off if you want vision-only or you're orchestrating the DOM split yourself via `client.browsers.exec`. +- **`playwright_execute` runs each call in a fresh JS context.** Local variables don't persist across calls — persist state on `page`/`context` (cookies, storage, current URL) or by `return`ing values that the next prompt threads back in. 
- **`InMemorySessionRepo` is in-process only.** Reach for a persistent `SessionRepo` implementation if you need transcripts to survive restarts. -- **`extraTools` runs alongside CUA tools, not in place of them.** To replace the defaults, build the tool list with `createCuaComputerTools()` yourself. +- **`extraTools` runs alongside CUA tools, not in place of them.** Reserve it for domain-specific tools (DB, APIs, internal services); Playwright is `playwright: true`, not a hand-rolled `extraTools` entry. - **Stealth, headless, viewport, proxy** are all `browsers.create` flags — set them when provisioning, not on the harness. ## Quick reference @@ -308,7 +338,8 @@ const harness = new CuaAgentHarness({ browser, client, session, env: new NodeExecutionEnv({ cwd: process.cwd() }), model: "openai:gpt-5.5", - computerUseExtra: true, + computerUseExtra: true, // provider-neutral goto/back/forward/url + playwright: true, // let the model pick DOM operations per action }); harness.subscribe((event) => { /* ... */ }); From 6632bd819892439c71b237fd805b5c28c2c8753b Mon Sep 17 00:00:00 2001 From: dprevoznik <58714078+dprevoznik@users.noreply.github.com> Date: Thu, 2 Jul 2026 01:21:22 +0000 Subject: [PATCH 9/9] =?UTF-8?q?fix(cua-agent):=20YAML=20frontmatter=20?= =?UTF-8?q?=E2=80=94=20inline=20`playwright:=20true`=20parsed=20as=20a=20n?= =?UTF-8?q?ested=20mapping=20key?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Reword the description to avoid the `word: word` colon-space token inside the unquoted YAML string. Now uses "the built-in `playwright_execute` tool" instead. 
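For illustration, a minimal sketch of the failure mode, assuming the frontmatter is read by a standard YAML parser: a plain (unquoted) scalar cannot contain a colon followed by a space, so the old description does not parse as a single string. `findUnsafePlainScalars` is a hypothetical helper written for this note, not part of cua or any YAML library, and it only inspects top-level `key: value` lines.

```typescript
// Hypothetical lint helper (an assumption for illustration, not a cua API):
// flags top-level frontmatter keys whose unquoted value contains ": ",
// which YAML does not allow inside a plain (unquoted) scalar.
function findUnsafePlainScalars(frontmatter: string): string[] {
  const unsafeKeys: string[] = [];
  for (const line of frontmatter.split("\n")) {
    // Match a top-level "key: value" pair; indented and list lines are skipped.
    const match = /^([A-Za-z_][\w-]*):[ \t]+(.*)$/.exec(line);
    if (match === null) continue;
    const key = match[1] ?? "";
    const value = match[2] ?? "";
    // Quoted scalars and block scalars (| or >) may contain ": " safely.
    if (/^["'|>]/.test(value)) continue;
    if (value.includes(": ")) unsafeKeys.push(key);
  }
  return unsafeKeys;
}

// Shortened stand-ins for the old and new description lines.
const oldFrontmatter =
  "name: cua-agent\ndescription: mixing vision with `playwright_execute` (via `playwright: true`) per action";
const newFrontmatter =
  "name: cua-agent\ndescription: mixing vision with the built-in `playwright_execute` tool per action";

const flaggedOld = findUnsafePlainScalars(oldFrontmatter); // ["description"]
const flaggedNew = findUnsafePlainScalars(newFrontmatter); // []
console.log(flaggedOld, flaggedNew);
```

Quoting the whole description value would be another way to make it parse; this patch rewords the description instead.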
--- plugins/kernel-sdks/skills/cua-agent/SKILL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/plugins/kernel-sdks/skills/cua-agent/SKILL.md b/plugins/kernel-sdks/skills/cua-agent/SKILL.md index 9858479..9dfde7b 100644 --- a/plugins/kernel-sdks/skills/cua-agent/SKILL.md +++ b/plugins/kernel-sdks/skills/cua-agent/SKILL.md @@ -1,6 +1,6 @@ --- name: cua-agent -description: Build TypeScript apps that embed Kernel's computer-use loop with `@onkernel/cua-agent` — `CuaAgent` and `CuaAgentHarness` classes drive a Kernel cloud browser via prompt → screenshot → tool-call loops across OpenAI, Anthropic, Google, and Yutori provider tools. Use when writing TS code that needs computer-use against a Kernel browser, swapping providers mid-session, adding your own pi tools alongside computer use, mixing vision with `playwright_execute` (via `playwright: true`) so the model picks DOM or vision per action, or hooking into the agent event stream. For shell-callable cua, see `cua-cli`. +description: Build TypeScript apps that embed Kernel's computer-use loop with `@onkernel/cua-agent` — `CuaAgent` and `CuaAgentHarness` classes drive a Kernel cloud browser via prompt → screenshot → tool-call loops across OpenAI, Anthropic, Google, and Yutori provider tools. Use when writing TS code that needs computer-use against a Kernel browser, swapping providers mid-session, adding your own pi tools alongside computer use, mixing vision with the built-in `playwright_execute` tool so the model picks DOM or vision per action, or hooking into the agent event stream. For shell-callable cua, see `cua-cli`. --- # cua-agent