
Playwright MCP vs Tap vs Browserbase: A Token-Cost Head-to-Head

May 13, 2026 · Leon Ting · 9 min read

If an MCP host (Claude Code, Cursor, Cline, Windsurf, VS Code) needs to drive a browser, three credible servers ship today: Microsoft's Playwright MCP, Browserbase's Stagehand (typically deployed on Browserbase cloud), and Tap. They look like substitutes from the outside — one tool slot, one MCP server, "give the agent a browser." Underneath, they pick three different answers to the same four questions: where does the browser run, what credentials does it see, who pays the tokens, and what happens when the page changes.

Here's the head-to-head on a single screen, then the reasoning.

The comparison table

| Dimension | Tap | Playwright MCP | Browserbase + Stagehand |
|---|---|---|---|
| Where the browser runs | Your own Chrome, locally (extension) | Local Playwright (headless, or --extension) | Cloud (Browserbase) |
| Logged-in / cookies | ✓ uses your live session | --extension mode reuses Chrome profile; headless = none | ✗ credentials must be uploaded to cloud |
| Tokens per run (after first compile) | 0 (deterministic replay) | Per-call (~9.6K for HN top-30 extract) | Per-call (LLM at runtime) |
| Drift handling | tap fix patches the plan once; later runs free | None; re-extracts every time | None; re-extracts every time |
| AI position | Compile-time only | Runtime | Runtime |
| Pricing | Free / OSS (MIT) | Free / OSS (Apache-2.0) | Paid cloud minutes + LLM tokens |
| Trust boundary | User's browser; nothing leaves the machine | User's machine (Playwright process) | Browserbase's cloud |
| Best for | Authenticated workflows you run repeatedly | Quick one-off scripts from an MCP host | Multi-tenant / isolated cloud workloads |

The thesis up front: these are not three implementations of one product, they are three different products. Each is the right answer to a different question. The interesting work is figuring out which question you actually have.

Where the browser runs — the load-bearing distinction

Everything else flows from this.

Playwright MCP launches a Playwright-driven Chromium on the same machine as the MCP server process. By default it's headless and uses an empty profile — no cookies, no extensions, no autofill, no logged-in sites. Recently Microsoft shipped a --extension flag that connects to an already-running Chrome via a companion extension, which closes most of the auth gap. You can now point Playwright MCP at your real session, with some setup friction (install the extension, keep Chrome open, configure the connection).

Stagehand on Browserbase launches a Chromium inside Browserbase's data centers and exposes it via Stagehand's act/extract/observe primitives, each of which calls an LLM. Browserbase is a paid service; Stagehand (Apache-2.0, 22K stars) is open-source and you can run it against a local browser too, but its production deployment path is the cloud. Auth in this model means uploading cookies, OAuth tokens, or full session state to Browserbase.

Tap doesn't launch a browser at all. The browser is the one you already have open — logged into GitHub, Cursor's billing page, your Shopify admin, your bank. A Chrome extension is the runtime; the MCP server orchestrates from the OS side. Your live session is the auth model. Nothing about credentials crosses a trust boundary because there's no trust boundary to cross.

This is the same architectural pattern that six unrelated indie products converged on independently. When you need logged-in state, the user's own browser is the only credential store that already has it.

Tokens: where the money goes

An MCP host invokes a tool, the tool returns data, the host's LLM continues reasoning. The interesting question is: how many tokens did the tool burn between input and output?

For a pure-Playwright tool (browser_navigate, browser_click) the answer is zero — deterministic. But the moment you ask "scrape the top 30 stories from Hacker News and return them as JSON," the tool itself needs an LLM, because Playwright doesn't understand "top 30 stories." Either the MCP host's own agent loop drives it step-by-step (DOM snapshot → reasoning → next action → repeat), or a Stagehand-style extract() call ships the page to an LLM and asks for structured output. Either way: per-call tokens.

We measured this directly in an earlier post, "Compile Once. Run Forever. Diff the Drift." Same workload (HN top 30), four arms:

| Arm | Tokens / call |
|---|---|
| A: naive LLM reads page, returns JSON (Playwright-MCP shape under an extracting host) | 9,625 |
| C: Tap router, deterministic replay, no LLM at runtime | 0 |
| C: Tap with a drift event, V report + minimal patch | 1,134 (once) |
| D: oracle MDL floor | 1 |

To be fair to Playwright MCP: Arm A isn't literally what playwright-mcp does. Playwright MCP returns DOM snapshots and exposes click/type/extract tools; the per-call token cost depends on whether the host's agent loop scrolls through the DOM with discipline or just dumps it into the model. But the architectural shape — extraction happens at runtime, the LLM sees the page on every invocation — is Arm A. That's the cost shape any runtime-extracting MCP server lives inside.

For Stagehand the shape is the same. extract({ schema }) ships the page to an LLM; act("click the submit button") ships a DOM context to an LLM. The improvements over a naive agent loop are real (better selectors, better caching, fewer retries) but the per-call ceiling is set by the architecture, not by engineering effort. Stagehand's own docs publish per-call costs in the cents range, scaling linearly with calls.

Tap's number after the first compile is 0. Not 0.001, not "very small" — zero. The compiled .plan.json is a closed-union 11-op program that the extension replays deterministically; no model is called during execution. Across 100 queries per drift, Tap is 849× cheaper than Arm A. Across 1,000 it's 8,489×.
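To make "no model is called during execution" concrete, here is a minimal sketch of what replaying a closed-union plan can look like. Everything in it is an assumption for illustration: the op names (`select`, `pick`, `limit`), the plan layout, and the dictionary standing in for a page are hypothetical and are not Tap's actual 11-op set or its `.plan.json` schema. The point is only the shape: the plan is data, the interpreter is a fixed loop, and an unknown op is an error rather than a prompt.

```python
from typing import Any

Row = dict[str, str]

def replay(plan: list[dict[str, Any]], page: dict[str, list[Row]]) -> list[Row]:
    """Replay a plan against a page with no model call anywhere in the loop.

    `page` is a stand-in for the DOM: it maps a selector string to the rows
    that selector would match. Op names are hypothetical, not Tap's real ops.
    """
    rows: list[Row] = []
    for op in plan:
        kind = op["op"]
        if kind == "select":
            # Start from whatever the selector matches on the page.
            rows = list(page.get(op["selector"], []))
        elif kind == "pick":
            # Keep only the named fields of each row.
            rows = [{k: r[k] for k in op["fields"] if k in r} for r in rows]
        elif kind == "limit":
            rows = rows[: op["n"]]
        else:
            # Closed union: anything outside the known ops is rejected.
            raise ValueError(f"unknown op: {kind!r}")
    return rows

demo_page = {
    ".story": [
        {"title": "First", "url": "https://example.com/1", "rank": "1"},
        {"title": "Second", "url": "https://example.com/2", "rank": "2"},
        {"title": "Third", "url": "https://example.com/3", "rank": "3"},
    ]
}
demo_plan = [
    {"op": "select", "selector": ".story"},
    {"op": "pick", "fields": ["title", "url"]},
    {"op": "limit", "n": 2},
]
print(replay(demo_plan, demo_page))
```

The same plan against the same page always yields the same rows, which is why the runtime token count can be exactly zero rather than merely small.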

This advantage only exists because the workload is repeated. If you scrape HN once and never again, Tap's compile step (one LLM call) is roughly the same cost as Arm A's single extraction. The compiler model needs amortization to be cheap. For one-shot research tasks, runtime extraction is the right tool.
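The amortization argument is plain arithmetic, and a short sketch makes it easy to plug in your own numbers. The two constants come from the table above; the function name and the "one patch per drift cycle" cost model are our simplification here, and the initial compile call is left out.

```python
ARM_A_TOKENS_PER_CALL = 9_625  # runtime extraction, per call (table above)
TAP_PATCH_TOKENS = 1_134       # one patch per drift event (table above)

def cost_ratio(queries_per_drift: int) -> float:
    """How many times cheaper replay-plus-one-patch is than runtime extraction.

    Assumes replay itself costs zero tokens and each drift cycle costs exactly
    one patch; the first compile is not counted.
    """
    if queries_per_drift < 1:
        raise ValueError("queries_per_drift must be at least 1")
    runtime_total = ARM_A_TOKENS_PER_CALL * queries_per_drift
    return runtime_total / TAP_PATCH_TOKENS

for n in (1, 100, 1_000):
    print(f"{n:>5} queries per drift: {cost_ratio(n):,.1f}x cheaper")
```

Because the table's token counts are rounded, the printed multipliers can differ from the quoted ones in the last digit. The ratio grows linearly with the number of queries between drifts, which is the whole argument: at one query per drift the gap is under 10×, and it only becomes dramatic with repetition.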

Drift: what happens when the page changes

Playwright MCP and Browserbase + Stagehand have no concept of drift. They re-extract from a fresh page on every call, so a site change is invisible until it produces wrong-shaped output. If a class rename causes the extraction to silently drop a field, the LLM cheerfully returns the field as null or "not found." The cost of the failure is paid downstream — in your monitoring, in your wrong dashboard, in the user noticing days later.

This is also their strength on novel pages. Because they extract from scratch every time, they handle a site they've never seen with no setup. There's no compile step to invalidate. The LLM looks at the page and figures it out.

Tap takes the opposite trade. A tap is a compiled plan; when the underlying page changes enough to break it, a cross-validator (doctor + an independent authoritative source like an RSS feed or public API) catches it. tap fix then makes one LLM call to emit a minimal patch (1,134 tokens on Sonnet through the production heal path), and the tap goes back to running for free. On a scraper that runs hourly and drifts monthly (~700 queries between drifts), the table's figures work out to roughly 5,900× cheaper than re-extracting every time (700 × 9,625 / 1,134); the multiplier passes 10,000× at about 1,200 queries per drift.
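A cross-validation check of this kind can be sketched in a few lines. This is not Tap's doctor implementation; the function name, the field names, and the 80% overlap threshold are all assumptions chosen for illustration. The idea is to compare what the plan extracted against titles from an independent source and to treat empty fields as a failure instead of passing them along as null.

```python
def detect_drift(
    extracted: list[dict],
    reference_titles: set[str],
    required_fields: tuple[str, ...] = ("title", "url"),
    min_overlap: float = 0.8,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means no drift seen.

    `reference_titles` would come from an independent source such as an RSS
    feed or public API. All names and thresholds here are hypothetical.
    """
    if not extracted:
        return ["no rows extracted"]

    problems: list[str] = []

    # Empty or missing fields are reported loudly, not returned as nulls.
    for field in required_fields:
        missing = sum(1 for row in extracted if not row.get(field))
        if missing:
            problems.append(f"{missing} row(s) missing '{field}'")

    # Compare against the independent source.
    if not reference_titles:
        problems.append("reference source returned no titles")
    else:
        titles = {row["title"] for row in extracted if row.get("title")}
        overlap = len(titles & reference_titles) / len(reference_titles)
        if overlap < min_overlap:
            problems.append(f"only {overlap:.0%} of reference titles matched")

    return problems
```

A non-empty result is the signal to spend the one patch call; an empty result means the next run stays free.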

The cost of this trade: Tap needs an authoring step. You can't point it at a brand-new site and immediately get JSON back. You spend a few minutes (or one LLM call via tap capture) compiling a plan first. That up-front cost is the price of admission for the zero-token replay.

Trust boundary: where credentials live

The single biggest difference between these three tools isn't tokens; it's the credential model. For unauthenticated work (HN, Wikipedia, public docs) the question doesn't matter — pick on price and speed. For anything behind a login, it dominates everything else.

This is the dimension where the three tools stop being substitutes. For an automation that needs to schedule, retry, and run when your laptop is asleep, you want Browserbase. For an automation that touches your live HubSpot or Stripe dashboard, you want Tap. For a one-shot agentic task in Claude Code that browses public docs and reports back, you want Playwright MCP.

How to pick

A decision tree that resolves most cases in one question each:

  1. Does the task need a logged-in session? If no — Playwright MCP is the obvious default. Free, ships in every MCP host's catalog, no setup. If yes — continue.
  2. Is it OK to upload those credentials to a third-party cloud? If yes — Browserbase + Stagehand. You get always-on execution, parallelism, isolation. If no — continue.
  3. Is this a one-shot exploration, or a workflow you'll run repeatedly? If one-shot — Playwright MCP with --extension is reasonable; the per-call tokens are paid once. If repeated — Tap. The amortization math only makes sense when "run forever" is on the table.
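The three questions above translate directly into a small function. This is only a restatement of the list in code, with parameter names we made up for the sketch; real decisions will have more inputs than three booleans.

```python
def pick_tool(needs_login: bool, cloud_credentials_ok: bool, repeated: bool) -> str:
    """Walk the three-question decision tree and return the suggested tool."""
    # 1. No logged-in session needed: the default wins.
    if not needs_login:
        return "Playwright MCP"
    # 2. Credentials may live in a third-party cloud.
    if cloud_credentials_ok:
        return "Browserbase + Stagehand"
    # 3. Local credentials: repeated workflows amortize a compile step.
    if repeated:
        return "Tap"
    return "Playwright MCP with --extension"

print(pick_tool(needs_login=True, cloud_credentials_ok=False, repeated=True))
```

Note that the order of the questions matters: the repeated-versus-one-shot question is only asked once the first two have ruled out the simpler defaults.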

And the inverse cases — where each tool clearly loses:

The honest caveats

Playwright MCP's --extension flag is recent and good. It closes the auth gap that older comparisons (including our own from April) lean on. The remaining gaps for Playwright MCP versus Tap are not auth — they are still tokens-per-call and drift handling. Both real, neither uniquely about authentication.

Browserbase + Stagehand is a serious product. Stagehand is the best-engineered LLM-driven browser agent we've seen, and Browserbase's reliability/parallelism are genuinely useful for production agentic systems with cloud-isolation requirements. The "credentials must be uploaded" point isn't a bug — it's a deliberate architectural trade. For some teams (multi-tenant SaaS that automates customer accounts under signed agreements) cloud isolation is exactly what they want. We don't have firsthand token cost numbers for Stagehand on the HN workload, so we don't claim a multiplier — only the architectural point that runtime extraction has a per-call floor that Tap's deterministic replay doesn't.

Tap's compile-time AI is not magic. The first time you author a tap, an LLM call is involved (forge step). The 0-token claim is for execution after authoring. If your workload is "do this once and never again," Tap's amortization advantage is zero. Our experiment made this explicit: at 1 query per drift, Arm C is only 8.5× cheaper than Arm A. The 849× and 10,000× numbers require repeated use.

None of this is about model quality. All three architectures could use the same Sonnet/Haiku/GPT and get different cost curves. The differences are about when the LLM enters the loop — runtime versus compile-time — not about which model is smarter.

Three different products

The frame to leave with: these aren't tiers of the same product. They are three answers to three different questions.

If your question is one of these three, the right answer is unambiguous. If it's two of them at once — "I want cloud isolation but also zero runtime tokens" — nothing in this comparison fits cleanly, and you're probably building a new category of tool, not picking one off this list.

Install Tap

brew install LeonTing1010/tap/taprun
tap mcp stdio
tap hackernews/hot

Tap is open-source under MIT. If you're picking a browser-automation MCP for a logged-in workflow that runs more than once, we want the head-to-head on your real workload. Issues and benchmarks welcome.