feat(computer): native macOS computer-use tool (see + visible input + background AX control + scripting)#345

Merged: 1jehuang merged 4 commits into master from feat/macos-computer-tool on Jun 9, 2026

1jehuang commented Jun 8, 2026

What

Native macOS computer tool: the desktop analog of browser. One
action-dispatched tool that lets the agent see the screen and control the
macOS GUI, with background-capable mechanisms preferred over disrupting the
user's screen.

Closes #340. Design docs: docs/proposals/computer-use-tool.md and
docs/proposals/computer-use-maximal-control.md.

Capabilities

See: screenshot (Retina point/pixel scale aware), window_screenshot
(per-window, even occluded), ocr (Vision via a Swift bridge), ui (AX tree
with element paths).

Visible coordinate input (CGEvent): move, click, double_click,
right_click, drag, scroll, type, key chords, key_down/key_up,
cursor.

Background control via Accessibility (no cursor movement, app need not be
frontmost):
find_element, element_at, press (AXPress), set_value,
get_value, perform_action, select_menu.

Windows/apps: list_apps, list_windows, activate_app, hide_app,
quit_app, focus_window, move_window, resize_window, minimize_window,
close_window.

Clipboard/scripting/system: get_clipboard, set_clipboard,
run_applescript, run_jxa (headless scripting), wait_for (AX poll),
notify, system_state, set_brightness.

Setup: check_permissions and setup (request + deep-link to the exact
System Settings panes + poll Accessibility until granted).
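
The "poll until granted" step in setup (and the AX poll behind wait_for) can be pictured as a bounded polling loop. The following is a minimal sketch, not the PR's code: `poll_until` and its parameters are hypothetical names, and the `check` closure stands in for whatever permission or element probe the tool actually runs.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: call `check` repeatedly until it returns true
/// or `timeout` elapses. Returns whether the condition was ever met.
fn poll_until(mut check: impl FnMut() -> bool, interval: Duration, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Sleep between probes so the loop does not spin.
        thread::sleep(interval);
    }
}
```

Bounding the loop with a deadline keeps setup from waiting forever if the user never flips the toggle.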

Interface: one tool, progressive disclosure

To keep always-on prompt cost flat regardless of how many actions exist, the
schema describes only the common actions plus a discover action that returns
full specs per category
(mouse|keyboard|observe|ax|windows|apps|clipboard|scripting|system|setup|all)
on demand. The measured always-on size stays at ~700 tokens (estimated from
character count and guarded by a schema_is_compact unit test), versus ~1,020
for a fully flat schema of the same ~46 actions.

Default policy (documented in discover output): prefer background AX/scripting
over visible coordinate input when the target element is resolvable; fall back to
click/type only when AX can't reach it.
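
To make the progressive-disclosure idea concrete, here is a small sketch under stated assumptions. The `Category` enum, the `SPECS` table, the spec strings, and the four-characters-per-token estimate are all illustrative; the real tool has more categories and actions, and its estimator may differ.

```rust
/// Hypothetical subset of categories; the PR lists more (observe, ax, windows, ...).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Category {
    Mouse,
    Keyboard,
    Clipboard,
    All,
}

/// Illustrative action specs: (category, name, one-line description).
const SPECS: &[(Category, &str, &str)] = &[
    (Category::Mouse, "click", "click at x,y (points)"),
    (Category::Mouse, "drag", "drag from x,y to x2,y2"),
    (Category::Keyboard, "type", "type a string"),
    (Category::Keyboard, "key", "press a chord such as cmd+c"),
    (Category::Clipboard, "get_clipboard", "read the clipboard text"),
    (Category::Clipboard, "set_clipboard", "write the clipboard text"),
];

/// Always-on schema text: only common actions plus a pointer to `discover`.
fn compact_schema() -> String {
    "actions: screenshot, click, type, key, discover(category) for full specs".to_string()
}

/// Full specs for one category, produced only when the model asks for them.
fn discover(category: Category) -> Vec<String> {
    SPECS
        .iter()
        .filter(|(c, _, _)| category == Category::All || *c == category)
        .map(|(_, name, desc)| format!("{name}: {desc}"))
        .collect()
}

/// Rough token estimate, assuming about four characters per token (rounded up).
fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}
```

A size guard in the spirit of schema_is_compact would then assert that `estimate_tokens(&compact_schema())` stays under a fixed budget, so adding actions to `SPECS` never grows the always-on prompt.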

Design notes

  • Synthetic input: Core Graphics CGEvent (core-graphics is a direct macOS
    dependency; its highsierra feature is enabled for scroll).
  • AX read/act + windows + scripting: funneled through one osascript/JXA runner
    (osa.rs) that maps the TCC permission errors (-1719 Accessibility, -1743
    Automation) to actionable messages.
  • Screenshots: screencapture; pixel size read from the PNG IHDR (because
    CGDisplay::pixels_wide can report points on Retina).
  • OCR: Vision has no scripting bridge, so the tool shells out to swift running
    an inline VNRecognizeTextRequest; it returns strings with normalized
    bounding boxes.
  • Module layout: osa, keys, input, screen, ax, win, sys, setup, discover +
    mod dispatch. Gated behind cfg(target_os = "macos").
  • Hard limit documented: the Accessibility toggle cannot be flipped by any API
    (Apple anti-malware boundary); setup gets the user one click away.
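
The TCC error mapping mentioned for osa.rs might look roughly like the sketch below. The type and function names are hypothetical, and matching on the numeric code inside stderr is an assumption about how the runner inspects osascript output; only the codes themselves (-1719, -1743) come from the notes above.

```rust
/// Hypothetical classification of the two TCC failures named in the notes.
#[derive(Debug, PartialEq)]
enum TccError {
    Accessibility, // osascript error -1719
    Automation,    // osascript error -1743
}

/// Look for a known TCC error code in osascript's stderr text.
fn classify_tcc_error(stderr: &str) -> Option<TccError> {
    if stderr.contains("-1719") {
        Some(TccError::Accessibility)
    } else if stderr.contains("-1743") {
        Some(TccError::Automation)
    } else {
        None
    }
}

/// Turn the classification into a message the agent can act on.
fn actionable_message(err: &TccError) -> &'static str {
    match err {
        TccError::Accessibility => {
            "Accessibility permission is missing: enable it under System Settings > Privacy & Security > Accessibility."
        }
        TccError::Automation => {
            "Automation permission is missing: allow it under System Settings > Privacy & Security > Automation."
        }
    }
}
```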

Testing

Unit tests (CI-safe on macOS): keycode mapping, AppleScript escaping, PNG IHDR
parsing, discovery scoping, schema_is_compact size guard, input validation.
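
As an illustration of what the AppleScript escaping test covers, a minimal quoting helper could look like this. `applescript_quote` is a hypothetical name; the sketch assumes that only backslashes and double quotes need escaping inside an AppleScript string literal, which is the usual rule.

```rust
/// Hypothetical helper: wrap `s` in double quotes for use as an AppleScript
/// string literal, escaping backslashes and embedded double quotes.
fn applescript_quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for ch in s.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            _ => out.push(ch),
        }
    }
    out.push('"');
    out
}
```

Escaping the backslash first matters: otherwise user text ending in a backslash could swallow the closing quote and change the meaning of the generated script.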

Live tests (#[ignore]; need GUI + permissions), all pass on an M-series Retina
Mac:

cargo test -p jcode-app-core tool::computer::tests::live -- --ignored
  • check_permissions, cursor+move (lands on exact point), screenshot (2.00x
    scale), ui tree, list_windows, clipboard roundtrip, run_applescript, and
    live_background_set_value: sets a TextEdit field via AX while TextEdit is
    NOT frontmost, proving background control with no cursor movement.

Also validated end-to-end through a real model session on a freshly built binary
(jcode run --tools computer): the model called discover, check_permissions,
and screenshot and got correct results.

Notes / tradeoffs

  • Per CONTRIBUTING, treat as a reference implementation; happy to adjust scope.
  • AX targeting uses System Events element paths; a future iteration could use
    native AXUIElement handles for sturdier addressing.
  • Camera/microphone capture intentionally excluded.
  • A signed jcode.app bundle (for durable TCC grants across updates) and a
    virtual-display path for true parallel/non-interfering work are noted as
    follow-ups in the roadmap.

jeremy added 2 commits June 7, 2026 23:28
Add a macOS-only `computer` tool that lets the agent control the desktop
GUI directly: screenshot, move/click/double-click/right-click/drag the
mouse, type text, press key chords, scroll, read the focused app's
Accessibility (UI) tree, report the cursor position, and check
Accessibility/Screen-Recording permissions.

Synthetic input uses Core Graphics CGEvents (core-graphics is already a
transitive dep; the highsierra feature enables scroll). Screenshots shell
out to /usr/sbin/screencapture and report the Retina point/pixel scale so
click coordinates (points) line up with what the model sees (pixels). The
UI tree is read via System Events (osascript).

The tool is gated behind cfg(target_os = "macos") and is the desktop
analog of the existing browser tool. Closes #340.

- The osascript UI-tree handler used the reserved word `line` and referenced
  System Events terminology (`UI elements`) outside a tell block, producing a
  -2741 syntax error. Wrap the handler in `using terms from application
  "System Events"` and rename `line` to `ln`.
- CGDisplay::pixels_wide can report points (not pixels) on Retina, giving a
  wrong 1.00x scale. Read true pixel dimensions from the PNG IHDR header so the
  reported point/pixel scale is correct (e.g. 2.00x), letting the model convert
  image pixel coordinates to click points. Add a png_dimensions unit test plus
  ignored live tests for cursor/move/screenshot/ui/permissions.
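
The IHDR fix described in this commit can be sketched as follows. This is an illustration rather than the PR's code: the function names and signatures are assumptions. It relies only on the PNG layout (an 8-byte signature, then the IHDR chunk whose big-endian width and height sit at byte offsets 16 and 20).

```rust
const PNG_SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];

/// Read (width, height) in pixels from the IHDR chunk of a PNG byte buffer.
fn png_dimensions(bytes: &[u8]) -> Option<(u32, u32)> {
    if bytes.len() < 24 || bytes[..8] != PNG_SIGNATURE || &bytes[12..16] != b"IHDR" {
        return None;
    }
    let width = u32::from_be_bytes([bytes[16], bytes[17], bytes[18], bytes[19]]);
    let height = u32::from_be_bytes([bytes[20], bytes[21], bytes[22], bytes[23]]);
    Some((width, height))
}

/// Pixel-per-point scale: screenshot pixel width over display width in points.
fn retina_scale(pixel_width: u32, point_width: f64) -> f64 {
    pixel_width as f64 / point_width
}

/// Convert an image pixel coordinate into a click coordinate in points.
fn pixel_to_point(pixel: f64, scale: f64) -> f64 {
    pixel / scale
}
```

For example, a 2880-pixel-wide capture of a 1440-point-wide display gives a 2.00x scale, so image pixel x = 1000 corresponds to click point x = 500.
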
Restructure the computer tool into a directory module and expand it from
visible coordinate control into full macOS control, with background-capable
mechanisms preferred over disrupting the user's screen.

New capabilities:
- Tier 1 AX (background, no cursor): find_element, element_at, press,
  set_value, get_value, perform_action, select_menu. Verified end-to-end:
  set a TextEdit field via AX while it was NOT frontmost (no cursor move).
- Tier 2 windows/apps: list_apps/windows, activate/hide/quit_app,
  focus/move/resize/minimize/close_window, window_screenshot.
- Tier 3/4: clipboard get/set, run_applescript/run_jxa (headless scripting),
  wait_for (AX poll), notify, system_state, set_brightness, key_down/key_up.
- Tier 5: ocr via Vision (Swift bridge), per-window capture.
- setup/check_permissions: report + request + deep-link to the exact System
  Settings panes + poll Accessibility until granted (the one toggle macOS
  won't let any API flip).

Interface: one action-dispatched `computer` tool with progressive disclosure.
The always-on schema describes only common actions + a `discover` action that
returns full specs per category on demand, so the base prompt cost stays
~700 chars-est tokens regardless of how many actions exist (guarded by a
schema_is_compact test). Default policy documented: prefer background
AX/scripting over visible coordinate input when the target is resolvable.

Modules: osa (osascript/JXA + TCC error mapping), keys, input (CGEvent),
screen (capture/OCR), ax, win, sys, setup, discover. Unit tests for parsing/
discovery/schema size; ignored live tests for the full surface (all pass on a
Retina Mac). Roadmap in docs/proposals/computer-use-maximal-control.md.

Refs #340.
1jehuang changed the title from "feat(computer): native macOS computer-use tool (screenshot/click/type/key/ui)" to "feat(computer): native macOS computer-use tool (see + visible input + background AX control + scripting)" on Jun 9, 2026
… bug fixes

Exhaustive live coverage of all 42 actions surfaced and fixed three bugs:
- key_down/key_up rejected modifier-only holds (e.g. "shift"); now emit a
  FlagsChanged event with the modifier keycode.
- select_menu failed with 'Can't get menu bar 1'; address menu bar of the
  target process and activate it first. Verified it actually clicks the item.
- quit_app hung indefinitely when an app showed a modal (unsaved-changes
  sheet); now bounded and reports the dialog instead of freezing.

Robustness:
- All external commands (osascript/JXA/screencapture) run under a wall-clock
  timeout via osa::run_command_timed, so an unresponsive app can never freeze
  the agent. AX action verbs use a 10s timeout; quit uses 8s.
- dry_run param: mutating actions report intent without acting (is_mutating
  classifier + gate).
- cap_output truncates large textual results (ui tree / clipboard / ocr) at
  16k chars to protect context; images unaffected.
- Audited temp-file cleanup (all screenshot/OCR temp files removed).
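
A wall-clock timeout of the kind described for osa::run_command_timed can be approximated with the standard library alone. The sketch below is not the PR's implementation: `run_with_timeout` and `TimedOutcome` are hypothetical names, and it discards the child's output for brevity, whereas a real runner would capture stdout and stderr.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical result type: either the child finished or it was killed.
enum TimedOutcome {
    Finished(ExitStatus),
    TimedOut,
}

/// Spawn `cmd` and wait at most `timeout`, killing the child on expiry.
fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<TimedOutcome> {
    // Output is sent to null so a chatty child cannot block on a full pipe.
    let mut child = cmd.stdout(Stdio::null()).stderr(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(TimedOutcome::Finished(status));
        }
        if Instant::now() >= deadline {
            // Ignore kill errors: the child may have exited in the meantime.
            let _ = child.kill();
            child.wait()?; // reap the process so it does not linger as a zombie
            return Ok(TimedOutcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(20));
    }
}
```

With this shape, an AX verb could pass a 10 s budget and quit an 8 s one, matching the limits listed above.
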

Tests: 20 unit (incl. timeout, dry_run, cap, modifier-chord, is_mutating) +
an exhaustive coverage_tests suite that exercises every action live (8 suites,
all pass on a Retina Mac). Clippy clean.

Refs #340.

1jehuang commented Jun 9, 2026

Production hardening pass

Ran an exhaustive live coverage test of all 42 actions (new coverage_tests suite, 8 groups, all pass on a Retina Mac). It surfaced 3 real bugs, all now fixed:

  1. key_down/key_up rejected modifier-only holds (e.g. shift) — now emits a FlagsChanged event with the modifier keycode.
  2. select_menu failed with "Can't get menu bar 1" — now addresses the target process's menu bar and activates it first (verified it actually clicks the item).
  3. quit_app hung indefinitely on a modal (unsaved-changes sheet) — now bounded with a clear message.
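
The first fix above hinges on treating modifier keys differently from ordinary keys. A minimal sketch of that decision follows; the function and enum names are hypothetical and the key table is deliberately tiny. The keycodes are the standard macOS virtual keycodes (for example 0x38 for Shift), but how the tool itself maps names to codes is an assumption.

```rust
/// Which kind of keyboard event a key_down should produce.
#[derive(Debug, PartialEq)]
enum KeyEventKind {
    FlagsChanged, // modifier-only hold, e.g. "shift"
    KeyDown,      // ordinary key
}

/// macOS virtual keycodes for the modifier keys.
fn modifier_keycode(name: &str) -> Option<u16> {
    match name {
        "cmd" | "command" => Some(0x37),
        "shift" => Some(0x38),
        "alt" | "option" => Some(0x3A),
        "ctrl" | "control" => Some(0x3B),
        _ => None,
    }
}

/// A few ordinary keys, for illustration only.
fn plain_keycode(name: &str) -> Option<u16> {
    match name {
        "a" => Some(0x00),
        "return" => Some(0x24),
        "tab" => Some(0x30),
        "space" => Some(0x31),
        "escape" => Some(0x35),
        _ => None,
    }
}

/// Decide the event kind and keycode for a key_down on `name`.
fn key_event_for(name: &str) -> Option<(KeyEventKind, u16)> {
    let name = name.to_ascii_lowercase();
    if let Some(code) = modifier_keycode(&name) {
        return Some((KeyEventKind::FlagsChanged, code));
    }
    plain_keycode(&name).map(|code| (KeyEventKind::KeyDown, code))
}
```

Checking the modifier table first is what lets a bare "shift" succeed instead of being rejected as an unknown key.
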

Robustness added:

  • Timeouts on every external command (osascript/JXA/screencapture) via run_command_timed — an unresponsive app can never freeze the agent. (unit-tested)
  • dry_run param: mutating actions report intent without acting (verified live through a model session).
  • Output caps (16k chars) on ui-tree/clipboard/ocr to protect context; images unaffected.
  • Temp-file cleanup audited (all capture temp files removed).
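
The dry_run gate and the output cap can be sketched together. Everything here is illustrative: the read-only action list, the exact 16,000-character limit (the PR says "16k"), the message formats, and the `run_action` wrapper are assumptions, not the PR's code.

```rust
/// Assumed cap; "16k chars" in the PR may mean a slightly different number.
const MAX_OUTPUT_CHARS: usize = 16_000;

/// Conservative classifier: anything not known to be read-only counts as mutating.
fn is_mutating(action: &str) -> bool {
    !matches!(
        action,
        "screenshot" | "window_screenshot" | "ocr" | "ui" | "cursor" | "find_element"
            | "element_at" | "get_value" | "list_apps" | "list_windows" | "get_clipboard"
            | "system_state" | "check_permissions" | "discover"
    )
}

/// Truncate text to the cap on a character boundary and say so.
fn cap_output(text: &str) -> String {
    match text.char_indices().nth(MAX_OUTPUT_CHARS) {
        Some((byte_idx, _)) => format!(
            "{}\n[truncated at {} chars]",
            &text[..byte_idx],
            MAX_OUTPUT_CHARS
        ),
        None => text.to_string(),
    }
}

/// Gate: with dry_run set, mutating actions report intent and never execute.
fn run_action(action: &str, dry_run: bool, execute: impl FnOnce() -> String) -> String {
    if dry_run && is_mutating(action) {
        return format!("dry_run: would perform `{action}`");
    }
    cap_output(&execute())
}
```

Truncating by character index rather than byte offset avoids splitting a multi-byte character in OCR or clipboard text.
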

Tests now: 20 unit + 8 original live + 8 coverage suites; clippy clean. Validated end-to-end on a freshly built binary via jcode run --tools computer: dry_run, background AX set_value+get_value (TextEdit updated while not frontmost), screenshot, discover all behave correctly.

1jehuang merged commit 7d193e7 into master on Jun 9, 2026
5 of 9 checks passed

Development

Successfully merging this pull request may close these issues.

Native macOS computer-use tool (Accessibility + CGEvent: screenshot/click/type/key/ui)
