Connect-in HTTP control for agents (Hermes, Claude MCP clients, scripts).
An agent drives the iPhone by talking to the already-running daemon over
HTTP — it does not spawn its own input process. This is a hard requirement on macOS:
under the responsible-process rule a spawned child's synthetic events (CGEvent) are
untrusted and silently dropped, whereas the daemon runs as a launchd LaunchAgent with Screen
Recording + Accessibility granted. So the agent connects in and the daemon
injects through the same validated path as the human WebRTC client.
Every /agent/* endpoint requires a bearer token. Two modes:
| Mode | When | Token to use |
|---|---|---|
| Dedicated token | PHONE_REMOTE_AGENT_TOKEN is set on the daemon | That token; the password is no longer accepted as a bearer |
| Password fallback (legacy) | PHONE_REMOTE_AGENT_TOKEN is unset | PHONE_REMOTE_PASSWORD |
Authorization: Bearer <PHONE_REMOTE_AGENT_TOKEN or PHONE_REMOTE_PASSWORD>
When the daemon runs with no password and no agent token (open LAN-dev mode) the agent API is open too — the same policy as the browser cookie gate. The bearer is compared in length-checked constant time.
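The token policy above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not the daemon's actual code: the function names `expected_bearer` and `bearer_ok` are hypothetical, and `hmac.compare_digest` stands in for whatever constant-time comparison the daemon really uses.

```python
import hmac
from typing import Optional


def expected_bearer(agent_token: Optional[str], password: Optional[str]) -> Optional[str]:
    """Pick the secret a bearer must match; None means the agent API is open."""
    if agent_token:   # dedicated-token mode: the password is not accepted
        return agent_token
    if password:      # legacy password fallback
        return password
    return None       # open LAN-dev mode: no password, no agent token


def bearer_ok(header: Optional[str], agent_token: Optional[str], password: Optional[str]) -> bool:
    """Check an Authorization header value against the policy (hypothetical helper)."""
    secret = expected_bearer(agent_token, password)
    if secret is None:
        return True
    prefix = "Bearer "
    if not header or not header.startswith(prefix):
        return False
    presented = header[len(prefix):].encode("utf-8")
    # compare_digest does not short-circuit on the first differing byte
    return hmac.compare_digest(presented, secret.encode("utf-8"))
```

With a dedicated token set, `bearer_ok("Bearer <password>", token, password)` is false, which matches the first table row.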
| Method | Path | Purpose |
|---|---|---|
| GET | /agent/status | Auth / health probe. |
| POST | /agent/input | Inject one control message. |
| GET | /agent/screenshot | Current phone screen as PNG. |
| GET | /agent/elements | The UI as a flattened element list (L2 / WDA). |
| GET / POST | /agent/inbox | Structured results POSTed back by the phone (Shortcuts bridge); GET drains, ?peek=1 keeps. |
| POST | /agent/mode | Switch between mirror (live video) and agent (WDA element layer) — they are mutually exclusive. |
GET /agent/status returns 200 with JSON when the bearer is valid:
{ "ok": true, "phone_target": true, "wda": false,
"drivable": false, "mirror_state": "in_use",
"hint": "iPhone in use — LOCK the phone to reconnect; the on-screen Connect button will not reconnect while it is in use",
"mode": "mirror", "viewer_count": 0,
"version": "0.2.0", "latest": "v0.2.0", "update_available": false }
phone_target is true when the iPhone Mirroring window is currently
visible on-screen (a cheap ScreenCaptureKit probe at request time).
wda is true when the optional L2 element layer (a WebDriverAgent
on the phone, PHONE_REMOTE_WDA_URL) is live — it unlocks
/agent/elements, label-taps, and clean CJK text. Without it, all event types
use fully native CGEvent injection against the Mirroring window.
mode is derived: "agent" when WDA is up (live Mirroring video is
then impossible — see /agent/mode), "mirror" when the Mirroring
window is up without WDA, else "offline".
version is the version of the running daemon; latest is the newest GitHub
release tag (refreshed daily; null until the first fetch or while offline;
disable with PHONE_REMOTE_NO_UPDATE_CHECK=1). When
update_available is true, upgrade with the install one-liner;
the web client shows the same hint as a banner.
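The derivation of mode described above can be written as a small function. This is a sketch under one assumption: that "the Mirroring window is up" corresponds to the phone_target flag. The name `derive_mode` is hypothetical.

```python
def derive_mode(wda: bool, phone_target: bool) -> str:
    """Mirror the documented rule: WDA wins, then the Mirroring window, else offline."""
    if wda:
        return "agent"    # live Mirroring video is impossible while WDA is up
    if phone_target:
        return "mirror"   # Mirroring window visible, no WDA
    return "offline"
```

Applied to the sample response above (`"phone_target": true, "wda": false`), it yields `"mirror"`, consistent with the `"mode"` field shown there.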
The on-phone XCUITest runner (WDA) and iPhone Mirroring are mutually exclusive: while the runner is alive, Mirroring shows “Connection Interrupted” and cannot reconnect, even with the phone locked (hardware-verified; see the WDA guide, pitfall ⑨). POST /agent/mode orchestrates the switch:
curl -s -H "$AUTH" -X POST "$HOST/agent/mode" -d '{"mode":"agent"}'
# → {"ok":true,"mode":"agent","starting":true,"log":"~/.iphone-use/wda-mode-switch.log",
# "hint":"if the phone is locked, unlock it once now — the launcher waits for it"}
# poll /agent/status until "wda":true (≈10 s when the runner app is already installed)
curl -s -H "$AUTH" -X POST "$HOST/agent/mode" -d '{"mode":"mirror"}'
# → {"ok":true,"mode":"mirror","switching":true,...}
# locks the phone via WDA (Mirroring connects only to a locked phone), stops the
# runner + relay, brings Mirroring frontmost and taps its Try Again button.
# Fully automatic; live video is back in ≈10 s — verify via /agent/screenshot.
The one asymmetry: mode=agent needs the phone unlocked at launch
(Apple blocks launching the runner on a locked device, and Face ID can’t be
automated) — if it’s locked, one human unlock completes the switch; the launcher
waits. mode=mirror is fully automatic. Requires
scripts/setup-wda.sh to have run once (it self-installs to
~/.iphone-use/).
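The "poll /agent/status until wda is true" step could look roughly like the following. It is a sketch only: `wait_for_wda` and its parameters are hypothetical, and the status fetch is passed in as a callable so the loop stays independent of any particular HTTP client. The default timeout is an arbitrary choice, picked to leave room for a human unlock.

```python
import time
from typing import Callable, Dict


def wait_for_wda(fetch_status: Callable[[], Dict],
                 timeout_s: float = 60.0,
                 interval_s: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll a status source until it reports "wda": true, or give up at the deadline."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch_status().get("wda") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

In practice `fetch_status` would GET /agent/status with the bearer header and return the parsed JSON.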
The body of POST /agent/input is a single control message, in the same JSON shape as the WebRTC control
channel. Coordinates are normalized to [0,1] over the phone's content
rect (geometry-agnostic; 0,0 = top-left, 1,1 = bottom-right).
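As a rough illustration of the normalization, the helper below converts a pixel position into the [0,1] range and wraps it in a tap message. The helper names are hypothetical, and so is the way the agent learns the content rect's pixel size (for example from the dimensions of a screenshot); the source does not specify this.

```python
from typing import Dict, Tuple


def to_normalized(px: float, py: float, width: float, height: float) -> Tuple[float, float]:
    """Map a pixel position inside a width x height content rect to [0,1] x [0,1]."""
    if width <= 0 or height <= 0:
        raise ValueError("content rect must have positive size")

    def clamp(v: float) -> float:
        return min(1.0, max(0.0, v))

    return clamp(px / width), clamp(py / height)


def tap_message(px: float, py: float, width: float, height: float) -> Dict:
    """Build a tap control message with normalized coordinates."""
    x, y = to_normalized(px, py, width, height)
    return {"type": "tap", "x": round(x, 4), "y": round(y, 4)}
```

Because the message carries only fractions, the same payload stays valid if the Mirroring window is resized.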
| Message | Effect |
|---|---|
| {"type":"tap","x":0.5,"y":0.5} | Tap at centre (on-device via WDA when live, else Mirroring cursor). |
| {"type":"tap","label":"新备忘录"} | (wda) Tap an element by its visible label: no coordinates, no drift. 502 if no match. |
| {"type":"scroll","x":0.5,"y":0.5,"dx":0,"dy":-40} | Scroll up (negative dy) at the anchor. dx/dy are pixel deltas. |
| {"type":"text","text":"hello"} | Type into the focused field. With wda:true any Unicode (incl. CJK) lands cleanly; otherwise US keycodes (see the CJK-IME caveat below). |
| {"type":"key","name":"return"} | One named key (return, escape, space…). |
| {"type":"shortcut","name":"home"} | System shortcut: home, spotlight, or switcher. |
| {"type":"keyboard"} | (wda) Dismiss the on-screen keyboard; after typing into a web form it covers the page's own submit/next buttons. (Also {"type":"key","name":"dismiss"}.) |
| {"type":"longpress","x":0.4,"y":0.6} | Press & hold (release with a matching up). |
| {"type":"down"} / {"type":"up"} | Low-level press / release for custom drags. |
Returns 200 on accept, 400 on an unparseable message, 401
when the bearer is wrong. Optional header X-Agent-Id: <name> labels the control
lease (default agent).
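A script-side request for this endpoint might be assembled as below, using only the Python standard library. The function name `build_input_request` is hypothetical, and sending `Content-Type: application/json` is an assumption (the curl examples in this document do not set it).

```python
import json
import urllib.request
from typing import Dict


def build_input_request(host: str, bearer: str, message: Dict,
                        agent_id: str = "agent") -> urllib.request.Request:
    """Prepare (but do not send) a POST /agent/input request."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/agent/input",
        data=json.dumps(message).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {bearer}",
            "Content-Type": "application/json",  # assumption, see lead-in
            "X-Agent-Id": agent_id,               # labels the control lease
        },
    )
```

Passing the result to `urllib.request.urlopen` sends it; non-2xx replies such as 400 or 401 surface as `urllib.error.HTTPError`, whose `code` attribute holds the status.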
GET /agent/screenshot returns image/png of the current phone screen, captured via the built-in
screencapture CLI (it targets the Mirroring window by id, so it works regardless
of which app is frontmost, and no external binary is required). Returns 503 if the Mirroring
window is not currently found.
GET /agent/elements (requires wda:true) returns the phone's current UI flattened to
agent-friendly rows, in document order:
{ "elements": [
{ "kind": "Button", "label": "新备忘录", "rect": [369, 885, 38, 38], "depth": 7 },
...
] }
Prefer this over /agent/screenshot for reasoning: it is text (an order of
magnitude cheaper than vision), the labels feed straight into
{"type":"tap","label":…}, and it works even while a human is holding the
phone (no Mirroring window needed). 503 when WDA is not configured,
502 when configured but unreachable.
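To show how the element rows feed into a label tap, here is a minimal sketch that looks up a label in a parsed /agent/elements response and returns the matching control message. The helper `label_tap_for` is hypothetical; checking locally first is optional, since the daemon itself answers 502 when no element matches.

```python
from typing import Dict, Optional


def label_tap_for(elements_doc: Dict, label: str, kind: Optional[str] = None) -> Optional[Dict]:
    """Return a label-tap message if an element with this label (and kind) is listed."""
    for row in elements_doc.get("elements", []):
        if row.get("label") == label and (kind is None or row.get("kind") == kind):
            return {"type": "tap", "label": label}
    return None  # nothing to tap; the caller may re-fetch or fall back to coordinates
```

With the sample response above, `label_tap_for(doc, "新备忘录", kind="Button")` returns `{"type": "tap", "label": "新备忘录"}`.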
/agent/inbox is the return path of the Shortcuts RPC bridge: the phone
(an iOS Shortcut's "Get Contents of URL" action) POSTs arbitrary JSON here, and
an agent GETs it. GET drains the queue; GET /agent/inbox?peek=1
reads without draining. The queue is a bounded ring buffer of 64 entries; when it is full, the oldest entry is dropped.
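The drain and peek semantics can be modelled with a bounded deque. This is a toy model of the described behaviour, not the daemon's implementation; the class name `Inbox` is hypothetical.

```python
from collections import deque
from typing import Any, List


class Inbox:
    """Bounded queue: appending to a full inbox silently drops the oldest item."""

    def __init__(self, capacity: int = 64) -> None:
        self._items: deque = deque(maxlen=capacity)

    def post(self, item: Any) -> None:
        self._items.append(item)

    def get(self, peek: bool = False) -> List[Any]:
        """Return queued items oldest-first; drain unless peek is set."""
        items = list(self._items)
        if not peek:
            self._items.clear()
        return items
```

An agent that polls less often than the phone posts should expect to lose the oldest results once more than 64 are pending.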
HOST=http://192.168.0.190:44321
PW=<password>
AUTH="Authorization: Bearer $PW"
# 1. confirm reachable
curl -s -H "$AUTH" "$HOST/agent/status" # {"ok":true,"phone_target":true,"wda":true}
# 2. see
curl -s -H "$AUTH" "$HOST/agent/elements" # the UI as text; prefer over vision when wda:true
curl -s -H "$AUTH" "$HOST/agent/screenshot" -o screen.png
# 3. act — go Home, open Spotlight, type, scroll
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"shortcut","name":"home"}'
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"shortcut","name":"spotlight"}'
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"text","text":"Notes"}'
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"tap","x":0.5,"y":0.22}'
curl -s -H "$AUTH" -X POST "$HOST/agent/input" -d '{"type":"scroll","x":0.5,"y":0.5,"dx":0,"dy":-60}'