iphone-use
Open source · macOS + iPhone Mirroring · Rust

Computer-use, but for the iPhone
Let AI agents see and drive a real phone.

Give AI agents (and your browser) eyes and hands on a real iPhone. Live WebRTC remote control in the browser; agents use one HTTP API to screenshot, read the UI element tree, tap and type — CJK text lands in one shot. Runs entirely on your own Mac, no third-party cloud.

Real-device demo: the agent types a whole CJK paragraph into Notes via the element layer
Real device · agent typed a full CJK paragraph, zero mojibake

Install in 60 seconds

One command installs the daemon; one more teaches your agent to drive the phone.

# 1) Install the daemon on the Mac (auto-signed LaunchAgent; grant permissions once, keep them forever)
$ curl -fsSL https://raw.githubusercontent.com/leeguooooo/iphone-use/main/install.sh | sh
# 2) Teach any skills-capable agent (Claude Code, etc.) to drive your phone
$ npx skills add leeguooooo/iphone-use
# 3) (optional) Add the L2 element layer: element tree + tap-by-label + clean CJK
$ ./scripts/setup-wda.sh

Three input layers

One agent API; the daemon auto-routes to the best path — fast paths get faster, the fallback never goes away.

L1 · VERBS

Shortcuts / App Intents

One curated bridge shortcut reaches native iOS APIs — battery, Health, location… Structured JSON comes back: fastest, deterministic, no vision needed.

L2 · ELEMENTS

Element tree (WebDriverAgent)

Reads iOS's own accessibility tree: tap by element label, send Unicode straight into fields — no host-cursor contention, no coordinate drift, CJK lands clean. The agent keeps seeing and acting even while a human holds the phone.

L3 · PIXELS

Mirroring + CGEvent

Screen stream + system-level input over iPhone Mirroring — the universal fallback that can see and tap anything. Also the human's low-latency WebRTC remote desktop.
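The "best path first, fallback always there" routing across the three layers can be pictured with a short Python sketch. It is an illustration only, not the daemon's actual code: the `route` and `make_layers` names, the `verb` and `scroll` action types, and the `verbs` / `wda_ready` inputs are all hypothetical stand-ins for whatever the daemon really checks.

```python
def route(action, layers):
    """Return (layer_name, result) from the first layer that accepts `action`.

    `layers` is an ordered list of (name, can_handle, handle) tuples,
    fastest path first and the universal fallback last.
    """
    for name, can_handle, handle in layers:
        if can_handle(action):
            return name, handle(action)
    raise RuntimeError("no layer accepted the action")


def make_layers(verbs, wda_ready):
    """Build an L1 -> L2 -> L3 priority list with stub handlers.

    `verbs` (a set of shortcut names) and `wda_ready` (a bool) are
    hypothetical inputs; the handlers only return a tag string.
    """
    return [
        # L1: a curated shortcut exists for this verb.
        ("L1",
         lambda a: a.get("type") == "verb" and a.get("name") in verbs,
         lambda a: f"shortcut:{a['name']}"),
        # L2: the element layer is set up and the action is a tap or text.
        ("L2",
         lambda a: wda_ready and a.get("type") in ("tap", "text"),
         lambda a: f"wda:{a['type']}"),
        # L3: the pixel path accepts anything, so routing never dead-ends.
        ("L3",
         lambda a: True,
         lambda a: f"cgevent:{a['type']}"),
    ]
```

With the element layer unavailable, a tap by label still resolves, just on L3 instead of L2, which is the point of keeping the pixel path around.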

CJK input: clean vs mangled

Same phone, same Pinyin keyboard — the real recorded results of both paths:

❌ Traditional pixel path (keycodes eaten by the Pinyin IME):
我爱上 typed 不要按 AI阿根廷鸥鸟啊 real iPh…
✅ iphone-use L2 element layer (Unicode straight into the field):
你好世界!This note was typed by the L2 layer — the whole CJK string landed in one shot.

The agent API

Three moves: see (elements or screenshot) → act → verify. A ready-made MCP server (9 tools) plugs straight into Claude.

# See: the UI as text, an order of magnitude cheaper than vision
$ curl -H "$AUTH" $HOST/agent/elements
{"elements":[{"kind":"Button", "label":"新备忘录","rect":[369,885,38,38]}…]}
# Act: tap by label / type Unicode directly
$ curl -X POST $HOST/agent/input \
    -d '{"type":"tap","label":"新备忘录"}'
$ curl -X POST $HOST/agent/input \
    -d '{"type":"text","text":"你好世界"}'
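The same see → act → verify loop can be sketched in Python using only the standard library. This is a minimal sketch, not the project's own client. It assumes `host` is the daemon's base URL, that `auth_header` is a full header line such as `Authorization: Bearer <token>` (the curl examples only show `$AUTH`), that the auth header is also sent on POST, and that a JSON content type is accepted. The helper names are hypothetical.

```python
import json
import urllib.request


def build_request(host, auth_header, path, payload=None):
    """Build (but do not send) a request for the agent API.

    `auth_header` is assumed to be a full header line, for example
    "Authorization: Bearer <token>"; the exact scheme is an assumption.
    """
    name, _, value = auth_header.partition(":")
    headers = {name.strip(): value.strip()}
    data = None
    if payload is not None:
        # ensure_ascii=False keeps CJK as literal UTF-8, as in the curl calls.
        data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
        headers["Content-Type"] = "application/json"  # assumption
    method = "GET" if payload is None else "POST"
    return urllib.request.Request(
        host.rstrip("/") + path, data=data, headers=headers, method=method
    )


def tap(label):
    """Payload for a tap by element label."""
    return {"type": "tap", "label": label}


def type_text(text):
    """Payload for typing Unicode text directly."""
    return {"type": "text", "text": text}


def labels(elements_response):
    """Extract the element labels from an /agent/elements response."""
    return [e.get("label") for e in elements_response.get("elements", [])]


def send(request):
    """Send a request and decode the body, assumed here to be JSON."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))


def see_act_verify(host, auth_header, label, text):
    """See the tree, tap `label`, type `text`, then re-read the tree."""
    seen = send(build_request(host, auth_header, "/agent/elements"))
    if label not in labels(seen):
        return None  # nothing to tap; the caller can fall back to a screenshot
    send(build_request(host, auth_header, "/agent/input", tap(label)))
    send(build_request(host, auth_header, "/agent/input", type_text(text)))
    # Verify: return the fresh labels so the caller can check the outcome.
    return labels(send(build_request(host, auth_header, "/agent/elements")))
```

Keeping request construction separate from `send` lets the payloads be inspected or logged before anything touches the phone.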

Why it isn't yet another screenshot poller

🔒 Fully self-hosted

The daemon runs on your Mac; your screen never leaves home. Password + bearer auth, login rate-limiting.

⚡ Hardware-encoded WebRTC

VideoToolbox H.264 over WebRTC — near-native latency in the browser; Cloudflare TURN for cross-network.

🤖 MCP + Skill

One config line for Claude Desktop / Claude Code; npx skills add teaches agents the “vision once → script forever” method.

🧠 The UI as text

/agent/elements turns the screen into an element list — reasoning gets drastically cheaper, labels feed straight into taps.
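As a rough illustration of "the UI as text", the element list can be flattened into one line per element for a prompt, with each label reused as a tap target. This is a sketch under assumptions: `rect` is read as `[x, y, width, height]`, which the sample response suggests but does not state, and `describe` and `tap_payload` are hypothetical helper names.

```python
def describe(elements_response):
    """Render an /agent/elements response as one text line per element."""
    lines = []
    for e in elements_response.get("elements", []):
        x, y, w, h = e["rect"]  # assumed order: x, y, width, height
        lines.append(f'{e["kind"]} "{e["label"]}" at ({x},{y}) size {w}x{h}')
    return "\n".join(lines)


def tap_payload(element):
    """Turn an element into the tap-by-label body used by /agent/input."""
    return {"type": "tap", "label": element["label"]}


sample = {"elements": [
    {"kind": "Button", "label": "新备忘录", "rect": [369, 885, 38, 38]},
]}
print(describe(sample))
# Button "新备忘录" at (369,885) size 38x38
```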

🛠 Hardware-validated engineering

Scroll has to be sent as a wheel event, Mirroring accepts only keycodes, focus races, TCC signing… every pitfall is in the docs.

📦 Grant once, keep forever

A stable local signing identity — upgrades and reinstalls never re-prompt Screen Recording / Accessibility.