iphone-use — Agent API

Connect-in HTTP control for agents (Hermes, Claude MCP clients, scripts).

An agent drives the iPhone by talking to the already-running daemon over HTTP — it does not spawn its own input process. This is a hard requirement on macOS: under the responsible-process rule a spawned child's synthetic events (CGEvent) are untrusted and silently dropped, whereas the daemon runs as a launchd LaunchAgent with Screen Recording + Accessibility granted. So the agent connects in and the daemon injects through the same validated path as the human WebRTC client.

Authentication

Every /agent/* endpoint requires a bearer token. Two modes:

| Mode | When | Token to use |
|---|---|---|
| Dedicated token | PHONE_REMOTE_AGENT_TOKEN is set on the daemon | That token; the password is no longer accepted as a bearer |
| Password fallback (legacy) | PHONE_REMOTE_AGENT_TOKEN is unset | PHONE_REMOTE_PASSWORD |

Authorization: Bearer <PHONE_REMOTE_AGENT_TOKEN or PHONE_REMOTE_PASSWORD>

When the daemon runs with no password and no agent token (open LAN-dev mode) the agent API is open too — the same policy as the browser cookie gate. The bearer is compared in length-checked constant time.
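The token choice above can be mirrored on the client side. This is a minimal sketch, assuming the agent's shell exports the same two environment variables as the daemon (an assumption made for illustration); `agent_auth_header` is a hypothetical helper, not part of iphone-use.

```shell
# Hypothetical helper: print the Authorization header for /agent/* calls.
# Prefers the dedicated agent token and falls back to the legacy password.
# Prints nothing and returns 1 when neither variable is set (open LAN-dev mode).
agent_auth_header() {
  if [ -n "${PHONE_REMOTE_AGENT_TOKEN:-}" ]; then
    printf 'Authorization: Bearer %s\n' "$PHONE_REMOTE_AGENT_TOKEN"
  elif [ -n "${PHONE_REMOTE_PASSWORD:-}" ]; then
    printf 'Authorization: Bearer %s\n' "$PHONE_REMOTE_PASSWORD"
  else
    return 1
  fi
}

# Usage (HOST is your daemon's base URL):
#   AUTH=$(agent_auth_header) && curl -s -H "$AUTH" "$HOST/agent/status"
```

Checking the dedicated token first matches the table: once that token is set on the daemon, sending the password as a bearer would be rejected.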

Endpoints

| Method | Path | Purpose |
|---|---|---|
| GET | /agent/status | Auth / health probe. |
| POST | /agent/input | Inject one control message. |
| GET | /agent/screenshot | Current phone screen as PNG. |
| GET | /agent/elements | The UI as a flattened element list (L2 / WDA). |
| GET / POST | /agent/inbox | Structured results POSTed back by the phone (Shortcuts bridge); GET drains, ?peek=1 keeps. |
| POST | /agent/mode | Switch between mirror (live video) and agent (WDA element layer); they are mutually exclusive. |

GET /agent/status

Returns 200 with JSON when the bearer is valid.

{ "ok": true, "phone_target": true, "wda": false,
  "drivable": false, "mirror_state": "in_use",
  "hint": "iPhone in use — LOCK the phone to reconnect; the on-screen Connect button will not reconnect while it is in use",
  "mode": "mirror", "viewer_count": 0,
  "version": "0.2.0", "latest": "v0.2.0", "update_available": false }

phone_target is true when the iPhone Mirroring window is currently visible on-screen (a cheap ScreenCaptureKit probe at request time).

wda is true when the optional L2 element layer (a WebDriverAgent on the phone, PHONE_REMOTE_WDA_URL) is live. It unlocks /agent/elements, label-taps, and clean CJK text. Without it, all event types use fully native CGEvent injection against the Mirroring window.

mode is derived: "agent" when WDA is up (live Mirroring video is then impossible; see /agent/mode), "mirror" when the Mirroring window is up without WDA, else "offline".

version is the running daemon's version; latest is the newest GitHub release tag (refreshed daily; null until the first fetch or while offline; disable with PHONE_REMOTE_NO_UPDATE_CHECK=1). When update_available is true, upgrade with the install one-liner; the web client shows the same hint as a banner.
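The derivation of mode can be restated as a small decision function. This is only an illustration of the rule as documented, not the daemon's code; `derive_mode` is a hypothetical name, and treating phone_target as the "Mirroring window is up" signal is an assumption.

```shell
# Illustrative restatement of the documented rule (not the daemon's implementation).
# $1 = the "wda" value, $2 = the "phone_target" value from /agent/status (true/false).
derive_mode() {
  if [ "$1" = true ]; then
    echo agent      # WDA is up; live Mirroring video is unavailable
  elif [ "$2" = true ]; then
    echo mirror     # Mirroring window is up without WDA
  else
    echo offline
  fi
}

derive_mode false true   # prints: mirror
```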

POST /agent/mode

The on-phone XCUITest runner (WDA) and iPhone Mirroring are mutually exclusive — while the runner is alive, Mirroring shows “Connection Interrupted” and cannot reconnect, even with the phone locked (hardware-verified; see the WDA guide, pitfall ⑨). This endpoint orchestrates the switch:

curl -s -H "$AUTH" -X POST "$HOST/agent/mode" -d '{"mode":"agent"}'
# → {"ok":true,"mode":"agent","starting":true,"log":"~/.iphone-use/wda-mode-switch.log",
#    "hint":"if the phone is locked, unlock it once now — the launcher waits for it"}
# poll /agent/status until "wda":true (≈10 s when the runner app is already installed)

curl -s -H "$AUTH" -X POST "$HOST/agent/mode" -d '{"mode":"mirror"}'
# → {"ok":true,"mode":"mirror","switching":true,...}
# locks the phone via WDA (Mirroring connects only to a locked phone), stops the
# runner + relay, brings Mirroring frontmost and taps its Try Again button.
# Fully automatic; live video is back in ≈10 s — verify via /agent/screenshot.

The one asymmetry: mode=agent needs the phone unlocked at launch (Apple blocks launching the runner on a locked device, and Face ID can’t be automated) — if it’s locked, one human unlock completes the switch; the launcher waits. mode=mirror is fully automatic. Requires scripts/setup-wda.sh to have run once (it self-installs to ~/.iphone-use/).
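After requesting mode=agent, a script has to poll /agent/status until wda turns true. A minimal sketch of that loop follows, assuming HOST and AUTH are set as in the curl examples; `fetch_status` and `wait_for_wda` are hypothetical helpers, and the 60-second default timeout is an arbitrary choice that leaves room for a human unlock.

```shell
# Print the /agent/status JSON. Kept separate so it can be swapped out
# (for example with a stub) without touching the polling loop.
fetch_status() { curl -s -H "$AUTH" "$HOST/agent/status"; }

# Poll until the status JSON reports "wda": true.
# $1 = timeout in seconds (default 60), $2 = poll interval in seconds (default 2).
# Returns 0 once WDA is live, 1 if the timeout elapses first.
wait_for_wda() {
  _timeout=${1:-60}
  _interval=${2:-2}
  _waited=0
  while [ "$_waited" -le "$_timeout" ]; do
    if fetch_status | grep -Eq '"wda"[[:space:]]*:[[:space:]]*true'; then
      return 0
    fi
    sleep "$_interval"
    _waited=$((_waited + _interval))
  done
  return 1
}

# Usage:
#   curl -s -H "$AUTH" -X POST "$HOST/agent/mode" -d '{"mode":"agent"}'
#   wait_for_wda 60 2 || echo "WDA not up; check ~/.iphone-use/wda-mode-switch.log" >&2
```

The grep pattern tolerates optional whitespace around the colon, since the sample responses in this document show both spacings.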

POST /agent/input

Body is a single control message — the same JSON shape as the WebRTC control channel. Coordinates are normalized [0,1] over the phone's content rect (geometry-agnostic; 0,0 = top-left, 1,1 = bottom-right).

| Message | Effect |
|---|---|
| {"type":"tap","x":0.5,"y":0.5} | Tap at centre (on-device via WDA when live, else Mirroring cursor). |
| {"type":"tap","label":"新备忘录"} | (wda) Tap an element by its visible label: no coordinates, no drift. 502 if no match. |
| {"type":"scroll","x":0.5,"y":0.5,"dx":0,"dy":-40} | Scroll up (negative dy) at the anchor. dx/dy are pixel deltas. |
| {"type":"text","text":"hello"} | Type into the focused field. With wda:true any Unicode (incl. CJK) lands cleanly; otherwise US keycodes (see the CJK-IME caveat below). |
| {"type":"key","name":"return"} | One named key (return, escape, space…). |
| {"type":"shortcut","name":"home"} | System shortcut; name is one of home, spotlight, switcher. |
| {"type":"keyboard"} | (wda) Dismiss the on-screen keyboard; after typing into a web form it covers the page's own submit/next buttons. Also available as {"type":"key","name":"dismiss"}. |
| {"type":"longpress","x":0.4,"y":0.6} | Press & hold (release with a matching up). |
| {"type":"down"} / {"type":"up"} | Low-level press / release for custom drags. |

Returns 200 on accept, 400 on an unparseable message, 401 when the bearer is wrong. Optional header X-Agent-Id: <name> labels the control lease (default agent).
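A script usually wants to branch on these status codes rather than discard them. The sketch below assumes HOST and AUTH are set as in the curl examples; `explain_input_status`, `send_input`, and the AGENT_ID variable are hypothetical names introduced for illustration.

```shell
# Map the HTTP status of POST /agent/input to the meaning documented above.
explain_input_status() {
  case "$1" in
    200) echo "accepted" ;;
    400) echo "unparseable message" ;;
    401) echo "wrong bearer" ;;
    502) echo "label tap matched no element" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Send one control message ($1 = JSON body), print the interpreted result,
# and return 0 only when the daemon accepted it.
send_input() {
  _code=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "$AUTH" -H "X-Agent-Id: ${AGENT_ID:-agent}" \
    -X POST "$HOST/agent/input" -d "$1")
  explain_input_status "$_code"
  [ "$_code" = 200 ]
}

# Usage:
#   send_input '{"type":"key","name":"return"}' || echo "input rejected" >&2
```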

GET /agent/screenshot

Returns image/png of the current phone screen, captured via the built-in screencapture CLI (targets the Mirroring window by id, so it works regardless of which app is frontmost — no external binary required). 503 if the Mirroring window is not currently found.

GET /agent/elements

(Requires wda:true.) The phone's current UI flattened to agent-friendly rows, in document order:

{ "elements": [
  { "kind": "Button", "label": "新备忘录", "rect": [369, 885, 38, 38], "depth": 7 },
  ...
] }

Prefer this over /agent/screenshot for reasoning: it is text (an order of magnitude cheaper than vision), the labels feed straight into {"type":"tap","label":…}, and it works even while a human is holding the phone (no Mirroring window needed). 503 when WDA is not configured, 502 when configured but unreachable.
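To feed labels into a label-tap, a script first needs them out of the JSON. The following is a rough grep/sed sketch, assuming labels contain no escaped double quotes and that your grep supports -o (GNU and BSD grep do); `element_labels` is a hypothetical helper, and a real JSON parser would be more robust.

```shell
# Read /agent/elements JSON on stdin and print one label per line, in document order.
# Limitation: a label containing an escaped quote (\") would be cut short.
element_labels() {
  grep -o '"label"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/^"label"[[:space:]]*:[[:space:]]*"\(.*\)"$/\1/'
}

# Usage (HOST and AUTH as in the curl examples):
#   curl -s -H "$AUTH" "$HOST/agent/elements" | element_labels
#   curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"tap","label":"<one of the labels>"}'
```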

GET/POST /agent/inbox

The return path of the Shortcuts RPC bridge: the phone (an iOS Shortcut's "Get Contents of URL" action) POSTs arbitrary JSON here; an agent GETs it. GET drains the queue; GET /agent/inbox?peek=1 reads without draining. Bounded ring buffer (64), oldest dropped.

Example — a see-decide-act loop

HOST=http://192.168.0.190:44321
PW='<password>'   # PHONE_REMOTE_AGENT_TOKEN if the daemon sets one, else PHONE_REMOTE_PASSWORD
AUTH="Authorization: Bearer $PW"

# 1. confirm reachable
curl -s -H "$AUTH" "$HOST/agent/status"            # {"ok":true,"phone_target":true,"wda":true}
curl -s -H "$AUTH" "$HOST/agent/elements"          # the UI as text — prefer over vision when wda:true

# 2. see
curl -s -H "$AUTH" "$HOST/agent/screenshot" -o screen.png

# 3. act — go Home, open Spotlight, type, scroll
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"shortcut","name":"home"}'
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"shortcut","name":"spotlight"}'
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"text","text":"Notes"}'
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"tap","x":0.5,"y":0.22}'
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"scroll","x":0.5,"y":0.5,"dx":0,"dy":-60}'

Behaviour notes