Desktop Control Default Active
Computer Use desktop automation — get_app_state, click, set_value, select_text, keyboard/pointer actions. Unified skill for all UI automation that routes between CDP (browser DOM) and Computer Use (desktop apps) based on the target.
browser and vision-click skills, absorbing their functionality into a unified routing system.Quick Reference
Execution Paths
| Path | When | Speed | Example |
|---|---|---|---|
| CDP | Target is web DOM (page elements, forms, buttons) | ~120 ms/action | Click a button on naver.com |
| Computer Use | Desktop apps, Chrome chrome, OS dialogs, shortcuts | ~1200 ms/action | Open Finder, press Cmd+Tab |
| CDP + CU | DOM lookup + pixel click (Canvas, iframe, WebGL) | Variable | Click a map label with no DOM ref |
Action Classes
| Class | Contract | Description | Example |
|---|---|---|---|
state-read | CU-00 / TX-00 | Read current UI state | get_app_state("Finder"), snapshot --interactive |
element-action | CU-01 / TX-01 | Click/interact by element ref | click(element_index=730), click e3 |
value-injection | CU-02 / TX-02 | Set field value directly | set_value(element_index=12, value="text") |
keyboard-action | CU-03 | Key press or combo | press_key(key="super+Tab") |
pointer-action | CU-04 | Raw coordinate click | click(x=812, y=514) |
pointer-action+vision | CU-05 | Vision-guided coordinate click | Vision lookup "Play button" then click(x,y) |
stale-recovery | CU-06 | Re-read after stale warning | Re-call get_app_state on stale |
secondary-action | CU-10 | Accessibility secondary action | perform_secondary_action(el=44, action="AXShowMenu") |
scroll-action | CU-11 | Scroll a scrollable element | scroll(element_index=9, direction="down") |
drag-action | CU-12 | Drag by coordinates | drag(from_x=100, from_y=100, to_x=400, to_y=100) |
Intent Routing
The skill decides the execution path before acting. The first line of every task must announce path=cdp, path=computer-use, or path=cdp+cu.
Decision Table
| Target | User Intent Example | Path | Action Class |
|---|---|---|---|
| Web page DOM | "open naver.com and search foo" | CDP | element-action |
| Web page DOM (read) | "what's the headline on this page?" | CDP | state-read |
| Desktop app window | "open Finder → Downloads" | CU | element-action |
| Chrome chrome (non-DOM) | "switch to the tab on the right" | CU | element-action |
| Native dialog input | "type into the macOS dialog" | CU | value-injection |
| Global shortcut | "press Command-Tab" | CU | keyboard-action |
| Arbitrary pixel | "click at (812, 514)" | CU | pointer-action |
| Canvas / iframe / WebGL | "click the Play button (no DOM ref)" | CDP+CU | pointer-action+vision |
| DOM lookup + pointer click | "find button via DOM, click with cursor" | CDP+CU | element-action → pointer-action |
Resolution Order
- Explicit opt-in: Message contains
$computer-useor/computer-use→ Computer Use path immediately. If tools unavailable, stop withprecondition failed: computer-use unavailable. - Can the target be addressed by
cli-jaw browser snapshot --interactiveref? → CDP. - Non-DOM web widget (Canvas, WebGL, iframe) visible in
get_app_statescreenshot? → Computer Useclick(x, y). - Target outside any webpage (app window, menu bar, OS dialog)? → Computer Use.
- Pixel coordinate the user gave verbatim? → Computer Use pointer-action.
- Need DOM to locate but user insists on real cursor click? → Hybrid.
- App name unclear? →
list_apps()beforeget_app_state(app). - None matched → stop and report
needs boss follow-up: ambiguous target.
Absolute Rules
These rules are non-negotiable. Violating any of them produces incorrect transcripts or wasted actions.
- Announce the path before acting. First line of every task must be
path=cdp,path=computer-use, orpath=cdp+cu. - State-first rule. Computer Use always starts each assistant turn with
get_app_state(app)before interacting. Re-call it on stale warnings, after UI-changing actions, and whenever confidence drops. - Every meaningful action records an
action_class. Classes:state-read,element-action,value-injection,keyboard-action,pointer-action,pointer-action+vision. - Never fall back silently. If the required path is unavailable, stop and report which precondition failed.
- Never claim the cursor was visible. Cursor overlay is best-effort in the current build.
- When uncertain, take a screenshot FIRST. Never chain actions through uncertainty. If two consecutive actions produced ambiguous state, the next call must be a state-read.
get_app_state(app) before any next action. Guessing indices leads to infinite correction loops. One extra state-read always beats one wrong click.Computer Use Tool Surface
The Computer Use path exposes these tools through the active runtime:
| Tool | Signature | Use |
|---|---|---|
list_apps() | — | Discover available/recent apps when name is unknown |
get_app_state | (app) | Start or refresh session; returns screenshot + accessibility tree. App may be a display name, bundle id, or full path. |
click | (app, element_index, click_count?, mouse_button?) or (app, x, y, click_count?, mouse_button?) | Click by element ref or raw coordinates |
drag | (app, from_x, from_y, to_x, to_y) | Drag by coordinates |
press_key | (app, key) | Key or combo: Return, Tab, super+c |
scroll | (app, element_index, direction, pages) | Scroll a scrollable accessibility element |
set_value | (app, element_index, value) | Set an editable accessibility element's value |
select_text | (app, element_index, text, selection?, prefix?, suffix?) | Select exact text or place the cursor before/after it inside a text element |
type_text | (app, text) | Type literal text into current focus (verify focus first!) |
perform_secondary_action | (app, element_index, action) | Invoke a secondary accessibility action |
set_value over type_text. type_text(app, text) fires keystrokes at whatever has focus — fragile unless the latest state proves the cursor is in the intended field. set_value(app, element_index, value) targets a specific element — deterministic and reliable.CDP Path — cli-jaw browser
For anything addressable by Chrome DOM: page text, form fields, buttons inside the page. 10x faster than Computer Use for web content.
Session Commands
cli-jaw browser status # is a session active?
cli-jaw browser start --agent # headless automation session
cli-jaw browser start --headless # manual headless (WSL/CI/Docker)
cli-jaw browser start # visible window (only if user asks)
cli-jaw browser navigate "https://example.com" # go to URL
cli-jaw browser open "https://example.com" # open in new tab
cli-jaw browser tabs # list tabs
Snapshot + Act
cli-jaw browser snapshot --interactive # list clickable elements (e1, e2, ...)
cli-jaw browser click e3 # click by ref
cli-jaw browser type e5 "query" --submit # type + Enter
cli-jaw browser press Enter # fallback key press
cli-jaw browser screenshot # save a PNG
cli-jaw browser text # dump visible text
CDP Action Class Mapping
| Command | Action Class | Contract |
|---|---|---|
snapshot, text, screenshot | state-read | TX-00 |
click, type ... --submit | element-action | TX-01 |
type <ref> "value" (no submit) | value-injection | TX-02 |
press Enter / press Escape | keyboard-action | — |
Preconditions
Computer Use Path
- Platform: macOS only.
- Active agent runtime must expose all Computer Use tools (
list_apps,get_app_state,click,drag,press_key,scroll,select_text,set_value,type_text,perform_secondary_action). - TCC Accessibility and AppleEvents must be granted to the controlling app.
- In packaged installs,
/Applications/Jaw.appand/Applications/Codex Computer Use.appmay be required for TCC attribution. - If app name is unknown, call
list_apps()first.
CDP Path
cli-jaw servemust be running (check withcli-jaw browser status).- A browser session must be active (start with
cli-jaw browser start --agent).
precondition failed: <name> and stop. Never silently switch paths.Transcript Format
Every action is logged in a structured transcript. The boss parses these — do not omit the path= line.
CDP Transcript
path=cdp
url=https://example.com
action=click e3
result=ok
Computer Use Transcript
path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=730)
stale_warning=no
result=ok
Hybrid Transcript
path=cdp+cu
lookup=cli-jaw browser snapshot -> bbox of "Play"
action_class=pointer-action
action=click(x=812, y=514)
result=ok
Error Transcript
path=computer-use
app=Finder
action_class=element-action
action=click(element_index=34)
stale_warning=yes
result=error: stale element tree
Stale Recovery Pattern
When stale_warning=yes is returned, the element tree has changed. Follow this exact sequence:
get_app_state(app)— re-ground; note the real element_index of your target.- Log
action_class=stale-recovery(CU-06) with a one-line note (reason=disambiguation). - Re-issue the intended action using the fresh element_index.
- After that action,
get_app_state(app)once more to confirm the effect.
# Stale warning received
path=computer-use
app=Google Chrome
action_class=stale-recovery
action=get_app_state("Google Chrome")
stale_warning=no
result=ok (52 elements -- tree changed)
# Retry with fresh index
path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=38) # same target, new index
stale_warning=no
result=ok
Vision-Click Integration
Vision-click is one specific tactic inside the broader UI targeting problem. It is legacy and generally superseded by direct coordinate clicking from screenshots.
Decision Order
- Did
cli-jaw browser snapshot --interactivereturn a ref? → CDPclick. - Is Computer Use available and target visible in screenshot? →
click(x, y)pointer-action directly (preferred). - No ref and direct coordinate unsuitable →
cli-jaw browser vision-click(Codex-only, legacy).
# CDP vision-click (inside Chrome, Codex-only)
cli-jaw browser vision-click "Submit button"
cli-jaw browser vision-click "Play button" --double
# Desktop vision-click (CU-05)
# 1. get_app_state(app) to capture screenshot
# 2. Vision model identifies coordinates
# 3. click(x=vx, y=vy) via Computer Use
element_index, use that instead.Worked Example: Chrome → Spotify Trace
A real end-to-end trace demonstrating state-first, element targeting, stale recovery, and the CDP speed switch.
1. State First
path=computer-use
app=Google Chrome
action_class=state-read
action=get_app_state("Google Chrome")
stale_warning=no
result=ok (47 elements, focused tab: open.spotify.com)
2. Element-Index Targeting (not focus-only typing)
# Click the search input field
path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=12)
stale_warning=no
result=ok
# Set search value deterministically
path=computer-use
app=Google Chrome
action_class=value-injection
action=set_value(element_index=12, value="Daft Punk")
stale_warning=no
result=ok
3. Stale Warning → Recovery
# Search results load asynchronously -- tree changes
path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=34)
stale_warning=yes
result=error: stale element tree
# Correct recovery
path=computer-use
app=Google Chrome
action_class=stale-recovery
action=get_app_state("Google Chrome")
result=ok (52 elements -- tree changed)
path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=38) # same result, new index
stale_warning=no
result=ok
4. Switch to CDP for 10x Speed
# Realize all targets are DOM nodes -- swap to CDP
path=cdp
url=https://open.spotify.com
action=cli-jaw browser snapshot --interactive
result=ok (ref IDs: e1..e89)
path=cdp
url=https://open.spotify.com
action=click e42 # "Shuffle Play" button
result=ok
Common Failures & Correct Responses
| Symptom | Correct Report |
|---|---|
| "I don't see a cursor" | cursor overlay is best-effort in the current build -- action=click(...) succeeded; visible cursor not guaranteed |
| CDP server not running | precondition failed: cli-jaw serve not running. Start with 'jaw serve' and retry. |
| Computer Use tools missing | precondition failed: computer-use unavailable |
| cli-jaw CU app missing (packaged install) | precondition failed: /Applications/Codex Computer Use.app missing. Recover: jaw doctor --tcc --fix |
| Stale warning on action | Re-call get_app_state(app) then retry; log stale_warning=yes |
| No ref found for target | If visible in screenshot, use Computer Use click(x, y). Otherwise report gap. |
| Navigation drift between snapshot and click | Re-snapshot and retry once; if still off, report. |
| CDP session died mid-task | precondition failed: cdp session terminated |
| Vision model returns "no match" | vision lookup failed for "<query>" -- suggest a more specific description |
| Non-GUI task routed here | needs boss follow-up: not GUI automation |
Contract ID Reference
All routing maps to capability contracts defined in the internal Computer Use spec. Use these IDs in transcripts when a contract name is needed.
| Path | Contract | Description |
|---|---|---|
| CDP state-read | TX-00 | Read page state via snapshot/text/screenshot |
| CDP element-action | TX-01 | Click/type by DOM ref |
| CDP value-injection | TX-02 | Set field value without submit |
| CDP hybrid-lookup | TX-03 | DOM lookup for hybrid path |
| CDP vision-lookup | TX-04 | Vision-based coordinate lookup (Codex-only) |
| CDP error-report | TX-05 | Report CDP errors |
| CU state-read | CU-00 | get_app_state screenshot + tree |
| CU element-action | CU-01 | Click by element_index |
| CU value-injection | CU-02 | set_value / type_text |
| CU keyboard-action | CU-03 | press_key combos |
| CU pointer-action | CU-04 | Raw coordinate click |
| CU pointer-action+vision | CU-05 | Vision-guided coordinate click |
| CU stale-recovery | CU-06 | Re-read after stale warning |
| CU precondition-fail | CU-07 | Report precondition failure |
| CU confirmation-prompt | CU-08 | Ask before destructive action |
| CU transcript-summary | CU-09 | After-action summary for boss |
| CU secondary-action | CU-10 | Accessibility secondary actions |
| CU scroll-action | CU-11 | Scroll a scrollable element |
| CU drag-action | CU-12 | Drag by coordinates |
"~해줘" Usage Examples
Natural Korean requests that trigger the desktop-control skill. The routing system understands both Korean and English.
"네이버에서 '서울 날씨' 검색해줘"
Path: CDP → navigate to naver.com → snapshot --interactive → type into search field → click submit
"Finder에서 다운로드 폴더 열어줘"
Path: CU → get_app_state("Finder") → click(element_index=N) on Downloads → verify with get_app_state
"크롬 탭 Gmail로 전환해줘"
Path: CU → get_app_state("Google Chrome") → find Gmail tab in accessibility tree → click(element_index=N)
"네이버 지도에서 '스타벅스' 라벨 클릭해줘"
Path: CDP+CU → get_app_state screenshot shows label but not in element tree → click(x=719, y=388) pointer-action from screenshot coords
"시스템 설정에서 다크 모드 켜줘"
Path: CU → list_apps() → get_app_state("System Settings") → navigate to Appearance → click Dark mode option
Rules: Never Do
- Do not try the wrong path once "just in case" — it burns TCC prompts and confuses logs.
- Do not silently upgrade from
state-readtoelement-actionwithout updating the action class. - Do not omit the
path=line. The boss parses it. - Do not use
curl/wgetto "read a page" — always use CDP. - Do not open a visible browser window just to inspect logs — use the Web UI debug console.
- Do not try coordinate-based click when a ref exists — use the ref.
- Do not assume the cursor is visible. Report action success/failure only.
- Do not skip
get_app_statebecause "you remember where the button is." Element indices drift with every state change. - Do not run two Computer Use actions without re-reading state between them if the previous action changed anything.
- Do not resolve uncertainty by trial and error. "Let me click 342, if not, try 357" is forbidden — take a screenshot.
- Do not use
type_text(app, text)as a shortcut for targeted form entry unless focus was verified in the latest state. - Do not retry vision-click with rephrasings — one attempt per call; report and stop on failure.
Related Skills
| Skill | Relationship |
|---|---|
browser | CDP command reference — this skill supersedes its coverage |
screen-capture | Generic macOS screenshot / webcam / video recording (unchanged, separate concern) |
vision-click | No longer auto-active; absorbed as a tactic. Install with cli-jaw skill install vision-click for the low-level recipe (NDJSON parsing, DPR correction) |