τ-voice is a benchmark for real-time voice agents on 278 grounded customer-service tasks across retail, airline, and telecom. It pairs deterministic, end-to-end task scoring with realistic, controllable audio — diverse personas, environmental noise, and free-form turn-taking. It does this by inheriting the tasks, tools, policy documents, and evaluator from τ-bench, so every voice number here can be read directly against its text counterpart.
The voice frontier on τ-voice has moved from ~30% pass@1 in late 2025 to ~67% today, against a current text ceiling of ~85% — voice now retains about ~79% of text capability (up from ~45% eight months earlier) and the latest jump (+29 pp in roughly two months) tracks the same reasoning-unlocks-tool-use pattern that played out in text. Across the failure analysis, 79–90% of failures are genuine agent errors, not simulator artefacts. The benchmark is doing what we hoped: surfacing real, fast progress on real tasks.
Why voice agents need end-to-end evaluation
Voice is rapidly becoming a primary interface for agentic systems.
And yet today's evaluation landscape splits voice agents in half. Audio benchmarks measure conversational dynamics — does the model interrupt politely, yield gracefully, recognise a backchannel, sound natural under noise? — but rarely check whether the agent actually solved the caller's problem. Task-completion benchmarks, conversely, are well established in the text domain (τ-bench among them): they rigorously verify that the agent called the right tool, followed the right policy, and changed the database the right way — but they assume a clean text channel and never expose the agent to real audio.
The risk is that we ship voice agents that hold a charming conversation while quietly failing the underlying task — or, conversely, agents that nail the task in writing but fall apart the moment a real caller starts talking over them.
A customer calls to make changes to their account. Background noise from a busy street and an unfamiliar accent push the speech recogniser off, and authentication fails.
- 1Does the agent ask them to spell their name?
- 2If they spell it, does the agent transcribe it correctly?
- 3If so, does it actually fix the failed authentication call — or does it lose track of the corrections spread across three turns?
None of the standard benchmarks would catch a failure like this, because each step looks fine in isolation. Measuring both dimensions together — task completion and conversational dynamics, on the same call, under realistic audio — is what lets us see these integrated failures, quantify how much of an agent's text capability survives the move to voice, and surface regressions for the people most affected: speakers with non-standard accents, callers from noisy environments, users on degraded connections. Robustness to realistic audio conditions is an accessibility issue.
This is also a particularly timely thing to measure. Audio-native models — systems that ingest and produce speech end-to-end, without an intermediate transcript — are the next frontier of agentic AI: generally available from OpenAI, Google, and xAI, and improving fast. A benchmark sensitive to both task and conversation gives us a way to track exactly how quickly that frontier is moving.
Introducing τ-voice
τ-voice is the first benchmark to combine three things that have so far been evaluated in isolation:
- Verifiable, grounded tasks — shared with the text benchmark. 278 customer-service tasks inherited from τ-bench, scored deterministically against the final database state. The tasks, the tools, the policy documents, and the evaluator are byte-for-byte identical to those used by text agents on the τ-bench leaderboard. That means the voice numbers in this post can be compared directly against a text agent on the same task — voice vs. text, on the exact same problem, no apples-to-oranges caveat.
- Live, simultaneous speech. Both the user and the agent can speak at the same time, with overlap, interruptions, and backchannels — the full-duplex regime, in contrast to the half-duplex, strictly turn-based setting of text agents. A tick-based orchestrator coordinates 200 ms audio chunks in both directions, lets the agent be interrupted mid-sentence, and gives us precise, repeatable control over turn-taking timing.
- Realistic, controllable audio. A voice user simulator that synthesises caller speech with diverse personas, mixes in environmental noise, applies telephony compression, drops frames, and decides — turn by turn — whether to interrupt, yield, or backchannel.
One implementation note worth flagging: the major voice provider APIs (OpenAI Realtime, Gemini Live, xAI Grok) don't require a simulated call to play out in real time. We can run a session at whatever pace we want without changing what the agent hears, which means the user simulator isn't held to a realtime latency or token budget — we're free to pick whatever text LLM works best for simulating the caller (GPT-4.1 in our experiments). Strong simulator, precise control over turn-taking, and reproducible runs — we did not have to choose.
The voice user simulator, end to end
Each tick, the simulator does four things in sequence: it generates the next caller utterance as text, synthesises it through a voice persona, mixes the synthesised speech with environmental audio (background noise, vocal tics, non-directed speech), and applies channel degradation (G.711 µ-law compression at 8 kHz, dynamic muffling, frame drops via a Gilbert-Elliott model). A separate LLM-driven turn-taking policy, evaluated every two seconds, decides whether to interrupt, yield, or backchannel.
Voice models are improving fast — really fast
Because τ-voice inherits its tasks, tools, and evaluator from τ-bench, voice numbers can be plotted directly on the same axis as text. The chart below is the τ-voice progress timeline — the same view rendered by the live leaderboard's Progress over time panel — overlaid with two text reference lines: the current text reasoning ceiling (~85% pass@1, Gemini 3 Pro / GPT-5.2 / Claude Opus 4.5) and a strong non-reasoning text baseline (54%, GPT-4.1).
In about eight months the voice frontier has moved from 30% (OpenAI's gpt-realtime-1.0, Aug 2025) to 67% (xAI's grok-voice-think-fast-1.0, Apr 2026), crossing the non-reasoning text line and closing most of the way to the reasoning ceiling. The biggest single move is the most recent one — a +29 pp jump in roughly two months, driven by xAI's reasoning-enabled audio-native model — and the pattern is familiar from text: adding explicit reasoning to the audio-native model unlocks a step change in tool-use reliability. Voice has gone from retaining roughly ~45% of text capability when the paper was written to ~79% today. Same domains, same evaluator, no asterisks.
To explore the full leaderboard — including per-domain breakdowns, custom submissions, the same Progress-over-time panel, and the underlying trajectories — jump straight to the τ-voice ranking (or the τ-bench text ranking for direct comparison). We've worked with every major audio-native provider so far; the chart above will keep moving.
What's actually going wrong
Before zooming into the failures themselves, it helps to see how each provider's pass@1 changes when we move from Clean (single persona, no acoustic effects, strict turn-taking) to Realistic (diverse personas, environmental noise, free-form turn-taking) on the paper-era models — the absolute drop varies, but every provider takes a hit.
Knowing each provider takes a hit is one thing; knowing which mistakes drive the hit is another. Two annotators labelled every failed simulation in two analysis cohorts — Voice-Fragile (tasks the text models pass but voice-Clean models fail) and Noise-Fragile (tasks voice-Clean passes but voice-Realistic fails) — tagging both the source and type of the first critical error.
The four failure modes that matter most
m-e-i-p, the agent hears n-e-a-p, the lookup fails, and the agent loops. Voice-native models lose name/email letters even at the spelling step — the part designed to be unambiguous.
How much does each ingredient of "realistic" compound these?
To quantify how much each part of "realistic" contributes to the failure modes above, we ran ablations on the retail domain — adding background noise, diverse accents, and turn-taking dynamics one factor at a time, on top of an otherwise clean condition.
| Condition | OpenAI | xAI | Average | |
|---|---|---|---|---|
| Clean | 45 | 71 | 48 | 55 |
| + Noise | 40−4 | 67−4 | 46−2 | 51−4 |
| + Accents | 44−1 | 60−11 | 30−18 | 44−10 |
| + Turn-taking | 33−11 | 57−14 | 52+4 | 47−7 |
| Realistic (all) | 30−15 | 45−26 | 39−10 | 38−17 |

What τ-voice does not (yet) measure
We are deliberate about scope. A few things τ-voice does simplify, and where we plan to go next:
- English only, TTS-mediated accents. Diverse accents are induced by ElevenLabs personas, not by recorded human speakers. This is enough to surface large per-provider gaps, but the absolute numbers should be read as indicative rather than definitive. We are scoping τ-voice strictly to English and to TTS-driven personas; broader language and recorded-speaker coverage is left to future benchmarks.
- Transcript injection on the user side. The user simulator reads the agent's transcript directly rather than transcribing the agent's audio. In our manual review, agent speech was intelligible in 100% of 91 sampled simulations — agent-side ASR is a non-issue today — but as voice models get more expressive this assumption may need to be revisited.
- No agent speech-quality scoring. We measure what the agent says (and whether it took the right action), not how naturally it says it. Adding tone, naturalness, and user-perception metrics is straightforward future work.
- Cascaded baselines. The framework supports cascaded ASR→LLM→TTS pipelines as well as audio-native models, but we have not yet published a head-to-head comparison. This is the cleanest way to isolate "voice modality" from "model architecture" effects.
Open, reproducible, and yours to build on
τ-voice is part of the broader τ-bench framework. Tasks, environment, voice user simulator, audio effects, turn-taking policy, and evaluation are all open source. Every result here is reproducible from a fixed seed (LLM stochasticity aside), every audio sample on the examples page comes from a real τ-voice run, and every voice submission on the leaderboard ships with its trajectories so you can replay the conversations end-to-end.
The official voice personas are held out, but a one-command script generates equivalent ones via the same ElevenLabs Voice Design API, so external developers can iterate locally and expect improvements to carry over to the official eval.
If you train voice models, evaluate voice agents, or just want to understand where today's systems break down, we'd love your contributions — new audio-native providers, cascaded ASR→LLM→TTS baselines, and pull requests against the user simulator are all welcome. Voice agents will be in production whether or not we measure them carefully. We'd rather measure them carefully.
For full details, see our paper, the code, and the leaderboard. The framework was built by Soham Ray, Keshav Dhandhania, and Victor Barres at Sierra, with Karthik Narasimhan at Princeton.