feat(computer): native macOS computer-use tool (see + visible input + background AX control + scripting) #345
Merged
added 2 commits
June 7, 2026 23:28
Add a macOS-only `computer` tool that lets the agent control the desktop GUI directly: screenshot, move/click/double-click/right-click/drag the mouse, type text, press key chords, scroll, read the focused app's Accessibility (UI) tree, report the cursor position, and check Accessibility/Screen-Recording permissions. Synthetic input uses Core Graphics CGEvents (core-graphics is already a transitive dep; the highsierra feature enables scroll). Screenshots shell out to /usr/sbin/screencapture and report the Retina point/pixel scale so click coordinates (points) line up with what the model sees (pixels). The UI tree is read via System Events (osascript). The tool is gated behind cfg(target_os = "macos") and is the desktop analog of the existing browser tool. Closes #340.
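To illustrate why the reported scale matters, here is a minimal sketch of converting a coordinate the model reads off the screenshot (pixels) into a click coordinate (points). The function name `pixels_to_points` and its signature are assumptions made for illustration; they are not taken from the PR's code.

```rust
/// Convert a coordinate in screenshot pixels to screen points.
///
/// `scale` is the pixel-per-point factor reported alongside the screenshot
/// (for example 2.0 on a typical Retina display, 1.0 on a non-Retina one).
/// Hypothetical helper, not the PR's actual API.
pub fn pixels_to_points(px: (f64, f64), scale: f64) -> Option<(f64, f64)> {
    // Reject zero, negative, NaN, or infinite scales instead of dividing by them.
    if !scale.is_finite() || scale <= 0.0 {
        return None;
    }
    Some((px.0 / scale, px.1 / scale))
}
```

With a 2.00x scale, a target seen at pixel (800, 600) in the image would be clicked at point (400, 300).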
- The osascript UI-tree handler used the reserved word `line` and referenced System Events terminology (`UI elements`) outside a tell block, producing a -2741 syntax error. Wrap the handler in `using terms from application "System Events"` and rename `line` to `ln`.
- CGDisplay::pixels_wide can report points (not pixels) on Retina, giving a wrong 1.00x scale. Read true pixel dimensions from the PNG IHDR header so the reported point/pixel scale is correct (e.g. 2.00x), letting the model convert image pixel coordinates to click points.

Add a png_dimensions unit test plus ignored live tests for cursor/move/screenshot/ui/permissions.
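The IHDR approach can be sketched as follows. A PNG file starts with an 8-byte signature, followed by the IHDR chunk (4-byte length, 4-byte type, then big-endian width and height), so the dimensions sit at fixed offsets. The commit mentions a `png_dimensions` test; the signature used here and the `scale_factor` helper are assumptions for illustration rather than the PR's actual code.

```rust
/// The fixed 8-byte signature that begins every PNG file.
const PNG_SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];

/// Read (width, height) in pixels from the IHDR chunk of a PNG byte buffer.
/// Returns None if the buffer is too short or is not a PNG.
pub fn png_dimensions(bytes: &[u8]) -> Option<(u32, u32)> {
    // Layout: signature (0..8), chunk length (8..12), chunk type (12..16),
    // width (16..20), height (20..24), both big-endian.
    if bytes.len() < 24 || bytes[..8] != PNG_SIGNATURE || bytes[12..16] != *b"IHDR" {
        return None;
    }
    let width = u32::from_be_bytes([bytes[16], bytes[17], bytes[18], bytes[19]]);
    let height = u32::from_be_bytes([bytes[20], bytes[21], bytes[22], bytes[23]]);
    Some((width, height))
}

/// Pixel-per-point scale from the image width in pixels and the display
/// width in points (hypothetical helper).
pub fn scale_factor(pixel_width: u32, point_width: f64) -> Option<f64> {
    if pixel_width == 0 || !point_width.is_finite() || point_width <= 0.0 {
        return None;
    }
    Some(f64::from(pixel_width) / point_width)
}
```

Only the first 24 bytes of the file are needed, so the whole screenshot does not have to be decoded to learn its size.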
Restructure the computer tool into a directory module and expand it from visible coordinate control into full macOS control, with background-capable mechanisms preferred over disrupting the user's screen.

New capabilities:
- Tier 1 AX (background, no cursor): find_element, element_at, press, set_value, get_value, perform_action, select_menu. Verified end-to-end: set a TextEdit field via AX while it was NOT frontmost (no cursor move).
- Tier 2 windows/apps: list_apps/windows, activate/hide/quit_app, focus/move/resize/minimize/close_window, window_screenshot.
- Tier 3/4: clipboard get/set, run_applescript/run_jxa (headless scripting), wait_for (AX poll), notify, system_state, set_brightness, key_down/key_up.
- Tier 5: ocr via Vision (Swift bridge), per-window capture.
- setup/check_permissions: report + request + deep-link to the exact System Settings panes + poll Accessibility until granted (the one toggle macOS won't let any API flip).

Interface: one action-dispatched tool with progressive disclosure. The always-on schema describes only common actions plus a `discover` action that returns full specs per category on demand, so the base prompt cost stays at ~700 tokens (estimated from character count) regardless of how many actions exist (guarded by a schema_is_compact test).

Default policy documented: prefer background AX/scripting over visible coordinate input when the target is resolvable.

Modules: osa (osascript/JXA + TCC error mapping), keys, input (CGEvent), screen (capture/OCR), ax, win, sys, setup, discover. Unit tests for parsing/discovery/schema size; ignored live tests for the full surface (all pass on a Retina Mac). Roadmap in docs/proposals/computer-use-maximal-control.md.

Refs #340.
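As a rough illustration of the TCC error mapping in `osa`, the sketch below classifies osascript stderr text and returns an actionable hint. The codes -1719 (Accessibility) and -1743 (Automation) are the ones named in the PR description; the type and function names, the assumption that the code appears in parentheses in stderr, and the message wording are all hypothetical.

```rust
/// Which macOS privacy permission an osascript failure points at.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TccIssue {
    Accessibility,
    Automation,
}

/// Classify osascript stderr by its error code.
/// Assumes the code is printed in parentheses, e.g. "... (-1743)".
pub fn classify_tcc_error(stderr: &str) -> Option<TccIssue> {
    if stderr.contains("(-1719)") {
        Some(TccIssue::Accessibility)
    } else if stderr.contains("(-1743)") {
        Some(TccIssue::Automation)
    } else {
        None
    }
}

/// Turn a classified issue into a message the agent can act on.
/// The wording is illustrative only.
pub fn actionable_message(issue: TccIssue) -> &'static str {
    match issue {
        TccIssue::Accessibility => {
            "Accessibility permission is missing: enable it under \
             System Settings > Privacy & Security > Accessibility."
        }
        TccIssue::Automation => {
            "Automation permission is missing: allow it under \
             System Settings > Privacy & Security > Automation."
        }
    }
}
```

Mapping raw codes to a named permission lets the tool tell the user which pane to open instead of surfacing an opaque script error.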
… bug fixes

Exhaustive live coverage of all 42 actions surfaced and fixed three bugs:
- key_down/key_up rejected modifier-only holds (e.g. "shift"); they now emit a FlagsChanged event with the modifier keycode.
- select_menu failed with 'Can't get menu bar 1'; it now addresses the menu bar of the target process and activates it first. Verified it actually clicks the item.
- quit_app hung indefinitely when an app showed a modal (unsaved-changes sheet); it is now bounded and reports the dialog instead of freezing.

Robustness:
- All external commands (osascript/JXA/screencapture) run under a wall-clock timeout via osa::run_command_timed, so an unresponsive app can never freeze the agent. AX action verbs use a 10s timeout; quit uses 8s.
- dry_run param: mutating actions report intent without acting (is_mutating classifier + gate).
- cap_output truncates large textual results (ui tree / clipboard / ocr) at 16k chars to protect context; images unaffected.
- Audited temp-file cleanup (all screenshot/OCR temp files removed).

Tests: 20 unit (incl. timeout, dry_run, cap, modifier-chord, is_mutating) + an exhaustive coverage_tests suite that exercises every action live (8 suites, all pass on a Retina Mac). Clippy clean.

Refs #340.
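The wall-clock timeout can be sketched with the standard library alone. This is not the PR's `osa::run_command_timed`; `run_with_timeout`, `TimedError`, and the 10 ms poll interval are assumptions chosen for illustration. The idea is to spawn the child, poll it with `try_wait`, and kill it once the deadline passes.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Why a timed command failed (hypothetical error type).
#[derive(Debug)]
pub enum TimedError {
    Io(std::io::Error),
    TimedOut(Duration),
}

/// Run `cmd`, returning (exited successfully, captured stdout), or kill it
/// and return `TimedOut` if it is still running after `timeout`.
pub fn run_with_timeout(
    mut cmd: Command,
    timeout: Duration,
) -> Result<(bool, String), TimedError> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::null())
        .spawn()
        .map_err(TimedError::Io)?;

    // Drain stdout on a helper thread so a chatty child cannot block on a
    // full pipe while we are polling for its exit.
    let mut stdout = child.stdout.take().expect("stdout was configured as piped");
    let reader = thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = stdout.read_to_end(&mut buf);
        buf
    });

    let deadline = Instant::now() + timeout;
    let status = loop {
        match child.try_wait().map_err(TimedError::Io)? {
            Some(status) => break status,
            None if Instant::now() >= deadline => {
                // Deadline passed: kill the child and reap it so it does not
                // linger as a zombie. The reader thread is left detached.
                let _ = child.kill();
                let _ = child.wait();
                return Err(TimedError::TimedOut(timeout));
            }
            None => thread::sleep(Duration::from_millis(10)),
        }
    };

    let bytes = reader.join().unwrap_or_default();
    Ok((status.success(), String::from_utf8_lossy(&bytes).into_owned()))
}
```

A caller would pick the timeout per action, for example 10 seconds for AX verbs and 8 seconds for quit as described above, and report the timeout to the model instead of blocking.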
Production hardening pass
Did an exhaustive live coverage test of all 42 actions (new
Robustness added:
Tests now: 20 unit + 8 original live + 8 coverage suites; clippy clean. Validated end-to-end on a freshly built binary via |
What
Native macOS `computer` tool: the desktop analog of `browser`. One action-dispatched tool that lets the agent see the screen and control the macOS GUI, with background-capable mechanisms preferred over disrupting the user's screen.
Closes #340. Design docs: `docs/proposals/computer-use-tool.md` and `docs/proposals/computer-use-maximal-control.md`.
Capabilities
- See: `screenshot` (Retina point/pixel scale aware), `window_screenshot` (per-window, even occluded), `ocr` (Vision via a Swift bridge), `ui` (AX tree with element paths).
- Visible coordinate input (CGEvent): `move`, `click`, `double_click`, `right_click`, `drag`, `scroll`, `type`, key chords, `key_down`/`key_up`, `cursor`.
- Background control via Accessibility (no cursor movement, app need not be frontmost): `find_element`, `element_at`, `press` (AXPress), `set_value`, `get_value`, `perform_action`, `select_menu`.
- Windows/apps: `list_apps`, `list_windows`, `activate_app`, `hide_app`, `quit_app`, `focus_window`, `move_window`, `resize_window`, `minimize_window`, `close_window`.
- Clipboard/scripting/system: `get_clipboard`, `set_clipboard`, `run_applescript`, `run_jxa` (headless scripting), `wait_for` (AX poll), `notify`, `system_state`, `set_brightness`.
- Setup: `check_permissions` and `setup` (request + deep-link to the exact System Settings panes + poll Accessibility until granted).
Interface: one tool, progressive disclosure
To keep always-on prompt cost flat regardless of how many actions exist, the schema describes only the common actions plus a `discover` action that returns full specs per category (`mouse|keyboard|observe|ax|windows|apps|clipboard|scripting|system|setup|all`) on demand. Measured always-on size stays at ~700 tokens, estimated from character count (guarded by a `schema_is_compact` unit test), versus ~1,020 for a fully-flat schema of the same ~46 actions.
Default policy (documented in `discover` output): prefer background AX/scripting over visible coordinate input when the target element is resolvable; fall back to click/type only when AX can't reach it.
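A minimal sketch of the progressive-disclosure idea: the always-on schema lists only a few common actions plus `discover`, and `discover` returns the specs for one category on request. The table contents, descriptions, and function names below are hypothetical stand-ins, not the PR's real schema.

```rust
/// (category, action name, one-line description). Illustrative entries only.
const ACTIONS: &[(&str, &str, &str)] = &[
    ("observe", "screenshot", "Capture the screen; reports the point/pixel scale."),
    ("mouse", "click", "Click at a point coordinate."),
    ("mouse", "drag", "Drag between two point coordinates."),
    ("keyboard", "type", "Type a string of text."),
    ("ax", "press", "Press an element via Accessibility, without moving the cursor."),
    ("ax", "set_value", "Set an element's value via Accessibility."),
    ("clipboard", "get_clipboard", "Read the clipboard."),
];

/// Actions described in the always-on schema (an assumed selection).
const COMMON: &[&str] = &["screenshot", "click", "type"];

/// Render specs as "name: description" lines.
fn render<'a>(specs: impl Iterator<Item = &'a (&'a str, &'a str, &'a str)>) -> String {
    specs
        .map(|(_, name, desc)| format!("{}: {}\n", name, desc))
        .collect()
}

/// The compact, always-on schema: common actions plus `discover`.
pub fn compact_schema() -> String {
    let mut out = render(ACTIONS.iter().filter(|a| COMMON.contains(&a.1)));
    out.push_str("discover: Return full specs for one category, or `all`.\n");
    out
}

/// Full specs for one category, or for every action when `category` is "all".
pub fn discover(category: &str) -> String {
    render(ACTIONS.iter().filter(|a| category == "all" || a.0 == category))
}
```

A size guard in the spirit of `schema_is_compact` could then assert that `compact_schema()` stays below a fixed budget while `discover("all")` is free to grow as actions are added.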
Design notes
- Input via `CGEvent` (core-graphics is a direct macOS dep, `highsierra` feature for scroll).
- osascript/JXA runner (`osa.rs`) that maps the TCC permission errors (-1719 Accessibility, -1743 Automation) to actionable messages.
- Screenshots via `screencapture`; pixel size read from the PNG IHDR (because `CGDisplay::pixels_wide` can report points on Retina).
- OCR via `swift` running an inline `VNRecognizeTextRequest`; returns strings with normalized bounding boxes.
- Modules: `osa, keys, input, screen, ax, win, sys, setup, discover` + `mod` dispatch. Gated behind `cfg(target_os = "macos")`.
- The Accessibility toggle cannot be flipped by any API (Apple anti-malware boundary); `setup` gets the user one click away.
Testing
Unit tests (CI-safe on macOS): keycode mapping, AppleScript escaping, PNG IHDR parsing, discovery scoping, `schema_is_compact` size guard, input validation.
Live tests (`#[ignore]`; need GUI + permissions), all pass on an M-series Retina Mac: … scale), ui tree, list_windows, clipboard roundtrip, run_applescript, and `live_background_set_value`: sets a TextEdit field via AX while TextEdit is NOT frontmost, proving background control with no cursor movement.
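As an example of what the AppleScript escaping tests guard, here is a minimal sketch of quoting arbitrary text as an AppleScript string literal before interpolating it into a script. The function name is hypothetical, and the sketch assumes that escaping backslashes and double quotes is sufficient for a string literal; the PR's own helper may handle more cases.

```rust
/// Quote `s` as an AppleScript string literal, escaping backslashes and
/// double quotes so user text cannot terminate the literal early.
/// Hypothetical helper, not the PR's actual function.
pub fn applescript_string_literal(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            _ => out.push(c),
        }
    }
    out.push('"');
    out
}
```

A caller could then build a script such as `format!("set the clipboard to {}", applescript_string_literal(text))` without the text being able to inject extra AppleScript.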
Also validated end-to-end through a real model session on a freshly built binary (`jcode run --tools computer`): the model called `discover`, `check_permissions`, and `screenshot` and got correct results.
Notes / tradeoffs
- … native `AXUIElement` handles for sturdier addressing.
- … virtual-display path for true parallel/non-interfering work are noted as follow-ups in the roadmap.