Desktop Control Default Active

Computer Use desktop automation — get_app_state, click, set_value, select_text, keyboard/pointer actions. Unified skill for all UI automation that routes between CDP (browser DOM) and Computer Use (desktop apps) based on the target.

Skill ID: desktop-control
Type: Auto-active
Platform: macOS
Requires: cli-jaw, Google Chrome
Optional: cliclick (brew)
This skill is always injected into the system prompt. You never need to load it manually. It supersedes the standalone browser and vision-click skills, absorbing their functionality into a unified routing system.

Quick Reference

Execution Paths

PathWhenSpeedExample
CDPTarget is web DOM (page elements, forms, buttons)~120 ms/actionClick a button on naver.com
Computer UseDesktop apps, Chrome chrome, OS dialogs, shortcuts~1200 ms/actionOpen Finder, press Cmd+Tab
CDP + CUDOM lookup + pixel click (Canvas, iframe, WebGL)VariableClick a map label with no DOM ref

Action Classes

ClassContractDescriptionExample
state-readCU-00 / TX-00Read current UI stateget_app_state("Finder"), snapshot --interactive
element-actionCU-01 / TX-01Click/interact by element refclick(element_index=730), click e3
value-injectionCU-02 / TX-02Set field value directlyset_value(element_index=12, value="text")
keyboard-actionCU-03Key press or combopress_key(key="super+Tab")
pointer-actionCU-04Raw coordinate clickclick(x=812, y=514)
pointer-action+visionCU-05Vision-guided coordinate clickVision lookup "Play button" then click(x,y)
stale-recoveryCU-06Re-read after stale warningRe-call get_app_state on stale
secondary-actionCU-10Accessibility secondary actionperform_secondary_action(el=44, action="AXShowMenu")
scroll-actionCU-11Scroll a scrollable elementscroll(element_index=9, direction="down")
drag-actionCU-12Drag by coordinatesdrag(from_x=100, from_y=100, to_x=400, to_y=100)

Intent Routing

The skill decides the execution path before acting. The first line of every task must announce path=cdp, path=computer-use, or path=cdp+cu.

Decision Table

TargetUser Intent ExamplePathAction Class
Web page DOM"open naver.com and search foo"CDPelement-action
Web page DOM (read)"what's the headline on this page?"CDPstate-read
Desktop app window"open Finder → Downloads"CUelement-action
Chrome chrome (non-DOM)"switch to the tab on the right"CUelement-action
Native dialog input"type into the macOS dialog"CUvalue-injection
Global shortcut"press Command-Tab"CUkeyboard-action
Arbitrary pixel"click at (812, 514)"CUpointer-action
Canvas / iframe / WebGL"click the Play button (no DOM ref)"CDP+CUpointer-action+vision
DOM lookup + pointer click"find button via DOM, click with cursor"CDP+CUelement-actionpointer-action

Resolution Order

  1. Explicit opt-in: Message contains $computer-use or /computer-use → Computer Use path immediately. If tools unavailable, stop with precondition failed: computer-use unavailable.
  2. Can the target be addressed by cli-jaw browser snapshot --interactive ref? → CDP.
  3. Non-DOM web widget (Canvas, WebGL, iframe) visible in get_app_state screenshot? → Computer Use click(x, y).
  4. Target outside any webpage (app window, menu bar, OS dialog)? → Computer Use.
  5. Pixel coordinate the user gave verbatim? → Computer Use pointer-action.
  6. Need DOM to locate but user insists on real cursor click? → Hybrid.
  7. App name unclear? → list_apps() before get_app_state(app).
  8. None matched → stop and report needs boss follow-up: ambiguous target.
get_app_state(app) | Target visible in screenshot? |-- YES + in element tree --> element_index click |-- YES + NOT in tree --> click(x, y) immediately '-- NO --> scroll/zoom/search, then re-read

Absolute Rules

These rules are non-negotiable. Violating any of them produces incorrect transcripts or wasted actions.

  1. Announce the path before acting. First line of every task must be path=cdp, path=computer-use, or path=cdp+cu.
  2. State-first rule. Computer Use always starts each assistant turn with get_app_state(app) before interacting. Re-call it on stale warnings, after UI-changing actions, and whenever confidence drops.
  3. Every meaningful action records an action_class. Classes: state-read, element-action, value-injection, keyboard-action, pointer-action, pointer-action+vision.
  4. Never fall back silently. If the required path is unavailable, stop and report which precondition failed.
  5. Never claim the cursor was visible. Cursor overlay is best-effort in the current build.
  6. When uncertain, take a screenshot FIRST. Never chain actions through uncertainty. If two consecutive actions produced ambiguous state, the next call must be a state-read.
Screenshot-before-guess (hard rule): If you are not 100% certain of the current state — which tab is focused, whether a click landed, which element_index is the target — stop and call get_app_state(app) before any next action. Guessing indices leads to infinite correction loops. One extra state-read always beats one wrong click.

Computer Use Tool Surface

The Computer Use path exposes these tools through the active runtime:

ToolSignatureUse
list_apps()Discover available/recent apps when name is unknown
get_app_state(app)Start or refresh session; returns screenshot + accessibility tree. App may be a display name, bundle id, or full path.
click(app, element_index, click_count?, mouse_button?) or (app, x, y, click_count?, mouse_button?)Click by element ref or raw coordinates
drag(app, from_x, from_y, to_x, to_y)Drag by coordinates
press_key(app, key)Key or combo: Return, Tab, super+c
scroll(app, element_index, direction, pages)Scroll a scrollable accessibility element
set_value(app, element_index, value)Set an editable accessibility element's value
select_text(app, element_index, text, selection?, prefix?, suffix?)Select exact text or place the cursor before/after it inside a text element
type_text(app, text)Type literal text into current focus (verify focus first!)
perform_secondary_action(app, element_index, action)Invoke a secondary accessibility action
Prefer set_value over type_text. type_text(app, text) fires keystrokes at whatever has focus — fragile unless the latest state proves the cursor is in the intended field. set_value(app, element_index, value) targets a specific element — deterministic and reliable.

CDP Path — cli-jaw browser

For anything addressable by Chrome DOM: page text, form fields, buttons inside the page. 10x faster than Computer Use for web content.

Session Commands

cli-jaw browser status                         # is a session active?
cli-jaw browser start --agent                  # headless automation session
cli-jaw browser start --headless               # manual headless (WSL/CI/Docker)
cli-jaw browser start                          # visible window (only if user asks)
cli-jaw browser navigate "https://example.com" # go to URL
cli-jaw browser open "https://example.com"     # open in new tab
cli-jaw browser tabs                           # list tabs

Snapshot + Act

cli-jaw browser snapshot --interactive   # list clickable elements (e1, e2, ...)
cli-jaw browser click e3                 # click by ref
cli-jaw browser type e5 "query" --submit # type + Enter
cli-jaw browser press Enter              # fallback key press
cli-jaw browser screenshot               # save a PNG
cli-jaw browser text                     # dump visible text
Ref IDs reset after every navigate. Always re-snapshot after navigating to a new page.

CDP Action Class Mapping

CommandAction ClassContract
snapshot, text, screenshotstate-readTX-00
click, type ... --submitelement-actionTX-01
type <ref> "value" (no submit)value-injectionTX-02
press Enter / press Escapekeyboard-action

Preconditions

Computer Use Path

CDP Path

If any blocker fails, report precondition failed: <name> and stop. Never silently switch paths.

Transcript Format

Every action is logged in a structured transcript. The boss parses these — do not omit the path= line.

CDP Transcript

path=cdp
url=https://example.com
action=click e3
result=ok

Computer Use Transcript

path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=730)
stale_warning=no
result=ok

Hybrid Transcript

path=cdp+cu
lookup=cli-jaw browser snapshot -> bbox of "Play"
action_class=pointer-action
action=click(x=812, y=514)
result=ok

Error Transcript

path=computer-use
app=Finder
action_class=element-action
action=click(element_index=34)
stale_warning=yes
result=error: stale element tree

Stale Recovery Pattern

When stale_warning=yes is returned, the element tree has changed. Follow this exact sequence:

  1. get_app_state(app) — re-ground; note the real element_index of your target.
  2. Log action_class=stale-recovery (CU-06) with a one-line note (reason=disambiguation).
  3. Re-issue the intended action using the fresh element_index.
  4. After that action, get_app_state(app) once more to confirm the effect.
# Stale warning received
path=computer-use
app=Google Chrome
action_class=stale-recovery
action=get_app_state("Google Chrome")
stale_warning=no
result=ok (52 elements -- tree changed)

# Retry with fresh index
path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=38)   # same target, new index
stale_warning=no
result=ok
Never retry with the old index. Always re-read state first. The index you memorized is gone after any state change.

Vision-Click Integration

Vision-click is one specific tactic inside the broader UI targeting problem. It is legacy and generally superseded by direct coordinate clicking from screenshots.

Decision Order

  1. Did cli-jaw browser snapshot --interactive return a ref? → CDP click.
  2. Is Computer Use available and target visible in screenshot? → click(x, y) pointer-action directly (preferred).
  3. No ref and direct coordinate unsuitable → cli-jaw browser vision-click (Codex-only, legacy).
# CDP vision-click (inside Chrome, Codex-only)
cli-jaw browser vision-click "Submit button"
cli-jaw browser vision-click "Play button" --double

# Desktop vision-click (CU-05)
# 1. get_app_state(app) to capture screenshot
# 2. Vision model identifies coordinates
# 3. click(x=vx, y=vy) via Computer Use
Always try ref-based click first. Vision-click consumes tokens and adds latency. If the accessibility tree exposes an element_index, use that instead.

Worked Example: Chrome → Spotify Trace

A real end-to-end trace demonstrating state-first, element targeting, stale recovery, and the CDP speed switch.

1. State First

path=computer-use
app=Google Chrome
action_class=state-read
action=get_app_state("Google Chrome")
stale_warning=no
result=ok (47 elements, focused tab: open.spotify.com)

2. Element-Index Targeting (not focus-only typing)

# Click the search input field
path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=12)
stale_warning=no
result=ok

# Set search value deterministically
path=computer-use
app=Google Chrome
action_class=value-injection
action=set_value(element_index=12, value="Daft Punk")
stale_warning=no
result=ok

3. Stale Warning → Recovery

# Search results load asynchronously -- tree changes
path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=34)
stale_warning=yes
result=error: stale element tree

# Correct recovery
path=computer-use
app=Google Chrome
action_class=stale-recovery
action=get_app_state("Google Chrome")
result=ok (52 elements -- tree changed)

path=computer-use
app=Google Chrome
action_class=element-action
action=click(element_index=38)   # same result, new index
stale_warning=no
result=ok

4. Switch to CDP for 10x Speed

# Realize all targets are DOM nodes -- swap to CDP
path=cdp
url=https://open.spotify.com
action=cli-jaw browser snapshot --interactive
result=ok (ref IDs: e1..e89)

path=cdp
url=https://open.spotify.com
action=click e42   # "Shuffle Play" button
result=ok
Speed comparison: ~120 ms per CDP action vs ~1200 ms per Computer Use action. When the target is web DOM, CDP wins by an order of magnitude. Always prefer CDP for web content.

Common Failures & Correct Responses

SymptomCorrect Report
"I don't see a cursor"cursor overlay is best-effort in the current build -- action=click(...) succeeded; visible cursor not guaranteed
CDP server not runningprecondition failed: cli-jaw serve not running. Start with 'jaw serve' and retry.
Computer Use tools missingprecondition failed: computer-use unavailable
cli-jaw CU app missing (packaged install)precondition failed: /Applications/Codex Computer Use.app missing. Recover: jaw doctor --tcc --fix
Stale warning on actionRe-call get_app_state(app) then retry; log stale_warning=yes
No ref found for targetIf visible in screenshot, use Computer Use click(x, y). Otherwise report gap.
Navigation drift between snapshot and clickRe-snapshot and retry once; if still off, report.
CDP session died mid-taskprecondition failed: cdp session terminated
Vision model returns "no match"vision lookup failed for "<query>" -- suggest a more specific description
Non-GUI task routed hereneeds boss follow-up: not GUI automation

Contract ID Reference

All routing maps to capability contracts defined in the internal Computer Use spec. Use these IDs in transcripts when a contract name is needed.

PathContractDescription
CDP state-readTX-00Read page state via snapshot/text/screenshot
CDP element-actionTX-01Click/type by DOM ref
CDP value-injectionTX-02Set field value without submit
CDP hybrid-lookupTX-03DOM lookup for hybrid path
CDP vision-lookupTX-04Vision-based coordinate lookup (Codex-only)
CDP error-reportTX-05Report CDP errors
CU state-readCU-00get_app_state screenshot + tree
CU element-actionCU-01Click by element_index
CU value-injectionCU-02set_value / type_text
CU keyboard-actionCU-03press_key combos
CU pointer-actionCU-04Raw coordinate click
CU pointer-action+visionCU-05Vision-guided coordinate click
CU stale-recoveryCU-06Re-read after stale warning
CU precondition-failCU-07Report precondition failure
CU confirmation-promptCU-08Ask before destructive action
CU transcript-summaryCU-09After-action summary for boss
CU secondary-actionCU-10Accessibility secondary actions
CU scroll-actionCU-11Scroll a scrollable element
CU drag-actionCU-12Drag by coordinates

"~해줘" Usage Examples

Natural Korean requests that trigger the desktop-control skill. The routing system understands both Korean and English.

Example 1 — Web Automation (CDP)

"네이버에서 '서울 날씨' 검색해줘"

Path: CDPnavigate to naver.com → snapshot --interactivetype into search field → click submit

Example 2 — Desktop App (Computer Use)

"Finder에서 다운로드 폴더 열어줘"

Path: CUget_app_state("Finder")click(element_index=N) on Downloads → verify with get_app_state

Example 3 — Chrome Tab Switching (Computer Use)

"크롬 탭 Gmail로 전환해줘"

Path: CUget_app_state("Google Chrome") → find Gmail tab in accessibility tree → click(element_index=N)

Example 4 — Map Click (Hybrid)

"네이버 지도에서 '스타벅스' 라벨 클릭해줘"

Path: CDP+CUget_app_state screenshot shows label but not in element tree → click(x=719, y=388) pointer-action from screenshot coords

Example 5 — System Settings (Computer Use)

"시스템 설정에서 다크 모드 켜줘"

Path: CUlist_apps()get_app_state("System Settings") → navigate to Appearance → click Dark mode option

Rules: Never Do

Related Skills

SkillRelationship
browserCDP command reference — this skill supersedes its coverage
screen-captureGeneric macOS screenshot / webcam / video recording (unchanged, separate concern)
vision-clickNo longer auto-active; absorbed as a tactic. Install with cli-jaw skill install vision-click for the low-level recipe (NDJSON parsing, DPR correction)