
τ-voice: benchmarking real-time voice agents on real-world tasks

A reproducible testbed for voice agents that need to listen, speak, reason, and act — under realistic phone conditions, on tasks that have a verifiable right answer.

April 2026 · Sierra Research · Paper · GitHub

TL;DR

τ-voice is a benchmark for real-time voice agents on 278 grounded customer-service tasks across retail, airline, and telecom. It pairs deterministic, end-to-end task scoring with realistic, controllable audio — diverse personas, environmental noise, and free-form turn-taking. It does this by inheriting the tasks, tools, policy documents, and evaluator from τ-bench, so every voice number here can be read directly against its text counterpart.

The voice frontier on τ-voice has moved from ~30% pass@1 in August 2025 to ~67% today, against a current text ceiling of ~85%. Voice now retains about 79% of text capability (up from about 45% when the paper was written), and the latest jump (+29 pp in roughly two months) tracks the same reasoning-unlocks-tool-use pattern that played out in text. Across the failure analysis, 79–90% of failures are genuine agent errors, not simulator artefacts. The benchmark is doing what we hoped: surfacing real, fast progress on real tasks.

Why voice agents need end-to-end evaluation

Voice is rapidly becoming a primary interface for agentic systems.

And yet today's evaluation landscape splits voice agents in half. Audio benchmarks measure conversational dynamics — does the model interrupt politely, yield gracefully, recognise a backchannel, sound natural under noise? — but rarely check whether the agent actually solved the caller's problem. Task-completion benchmarks, conversely, are well established in the text domain (τ-bench among them): they rigorously verify that the agent called the right tool, followed the right policy, and changed the database the right way — but they assume a clean text channel and never expose the agent to real audio.

The risk is that we ship voice agents that hold a charming conversation while quietly failing the underlying task — or, conversely, agents that nail the task in writing but fall apart the moment a real caller starts talking over them.

A customer calls to make changes to their account. Background noise from a busy street and an unfamiliar accent push the speech recogniser off, and authentication fails.

  1. Does the agent ask them to spell their name?
  2. If they spell it, does the agent transcribe it correctly?
  3. If so, does it actually fix the failed authentication call, or does it lose track of the corrections spread across three turns?

None of the standard benchmarks would catch a failure like this, because each step looks fine in isolation. Measuring both dimensions together — task completion and conversational dynamics, on the same call, under realistic audio — is what lets us see these integrated failures, quantify how much of an agent's text capability survives the move to voice, and surface regressions for the people most affected: speakers with non-standard accents, callers from noisy environments, users on degraded connections. Robustness to realistic audio conditions is an accessibility issue.

This is also a particularly timely thing to measure. Audio-native models — systems that ingest and produce speech end-to-end, without an intermediate transcript — are the next frontier of agentic AI: generally available from OpenAI, Google, and xAI, and improving fast. A benchmark sensitive to both task and conversation gives us a way to track exactly how quickly that frontier is moving.

Introducing τ-voice

τ-voice is the first benchmark to combine three things that have so far been evaluated in isolation:

One implementation note worth flagging: the major voice provider APIs (OpenAI Realtime, Gemini Live, xAI Grok) don't require a simulated call to play out in real time. We can run a session at whatever pace we want without changing what the agent hears, which means the user simulator isn't held to a realtime latency or token budget — we're free to pick whatever text LLM works best for simulating the caller (GPT-4.1 in our experiments). Strong simulator, precise control over turn-taking, and reproducible runs — we did not have to choose.

Fig 1 A simulated τ-voice call in the retail domain. The main timeline (top) shows six minutes of overlapping user and agent speech, mixed with realistic acoustic effects. Inset A decomposes the audio the agent actually receives — clean speech mixed with street noise, a car-horn burst, and a non-agent-directed remark. Inset B zooms into a turn-taking moment: the agent must decide what to ignore and when to yield.

The voice user simulator, end to end

Each tick, the simulator does four things in sequence: it generates the next caller utterance as text, synthesises it through a voice persona, mixes the synthesised speech with environmental audio (background noise, vocal tics, non-directed speech), and applies channel degradation (G.711 µ-law compression at 8 kHz, dynamic muffling, frame drops via a Gilbert-Elliott model). A separate LLM-driven turn-taking policy, evaluated every two seconds, decides whether to interrupt, yield, or backchannel.
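To make the telephony step concrete, here is a minimal sketch of µ-law companding, the non-linear mapping behind G.711 µ-law. It uses the continuous companding curve with µ = 255 followed by uniform 8-bit quantisation, which only approximates the piecewise-linear segment encoding in the actual G.711 standard. The function names are illustrative and are not taken from the τ-voice code, and resampling to 8 kHz is left out.

```python
import math

MU = 255.0  # companding parameter for µ-law


def mulaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] onto the companded range [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)


def quantize_8bit(y: float) -> int:
    """Uniformly quantise a companded value to one of 256 codes (0..255)."""
    return int(round((y + 1.0) / 2.0 * 255))


def dequantize_8bit(code: int) -> float:
    """Map an 8-bit code back to the companded range [-1, 1]."""
    return code / 255.0 * 2.0 - 1.0


def telephony_roundtrip(samples):
    """Compress, quantise to 8 bits, and expand again (illustrative only)."""
    return [
        mulaw_expand(dequantize_8bit(quantize_8bit(mulaw_compress(s))))
        for s in samples
    ]


if __name__ == "__main__":
    original = [0.0, 0.01, 0.25, -0.5, 1.0]
    for before, after in zip(original, telephony_roundtrip(original)):
        print(f"{before:+.4f} -> {after:+.4f}")
```

Because the curve is logarithmic, quiet samples get a disproportionate share of the 256 codes: a sample at 1% of full scale already maps to more than a fifth of the companded range, which is why 8-bit telephony audio stays intelligible while still sounding degraded.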

Fig 2 Audio generation pipeline
Each tick, four generators emit audio in parallel: Background Noise from a sound library (continuous), Bursts and Out-of-turn Speech scheduled by a Poisson process (intermittent), and the user's Speech synthesised through one of seven Voice Personas. Streams are mixed (⊕), passed through Dynamic Muffling (random fades simulating phone movement), then Channel Degradation: G.711 µ-law compression at 8 kHz and packet-loss Frame Drops modelled as a Gilbert–Elliott process. The result is the audio the agent actually receives. A separate LLM-driven turn-taking policy decides when the user interrupts, yields, or backchannels.
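The two stochastic pieces named in the caption can be sketched in a few lines. The first function is a two-state Gilbert–Elliott chain that marks frames as dropped. The second samples onset times from a homogeneous Poisson process, as one might for bursts or out-of-turn speech. All parameter values, names, and defaults here are assumptions for illustration. They are not the settings used in τ-voice.

```python
import random


def gilbert_elliott_drops(n_frames, p_good_to_bad, p_bad_to_good,
                          loss_in_good=0.0, loss_in_bad=1.0, seed=0):
    """Return a list of booleans, True where a frame is dropped.

    The channel alternates between a "good" and a "bad" state; each state
    has its own per-frame loss probability, which produces bursty losses.
    """
    rng = random.Random(seed)
    in_bad = False
    drops = []
    for _ in range(n_frames):
        # Decide whether this frame is lost given the current state.
        loss_p = loss_in_bad if in_bad else loss_in_good
        drops.append(rng.random() < loss_p)
        # Then decide whether the channel switches state for the next frame.
        flip_p = p_bad_to_good if in_bad else p_good_to_bad
        if rng.random() < flip_p:
            in_bad = not in_bad
    return drops


def poisson_event_times(rate_per_s, duration_s, seed=0):
    """Sample event onsets (seconds) from a homogeneous Poisson process."""
    rng = random.Random(seed)
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        if t >= duration_s:
            return times
        times.append(t)


if __name__ == "__main__":
    drops = gilbert_elliott_drops(1000, p_good_to_bad=0.02, p_bad_to_good=0.2)
    print(f"dropped {sum(drops)} of {len(drops)} frames")
    print("burst onsets:", [round(t, 1) for t in poisson_event_times(0.1, 60.0)])
```

With the default loss probabilities (no loss in the good state, total loss in the bad state), the long-run drop rate is p_good_to_bad / (p_good_to_bad + p_bad_to_good), and the mean burst length is 1 / p_bad_to_good frames. Fixing the seed keeps a run reproducible.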

Voice models are improving fast — really fast

Because τ-voice inherits its tasks, tools, and evaluator from τ-bench, voice numbers can be plotted directly on the same axis as text. The chart below is the τ-voice progress timeline — the same view rendered by the live leaderboard's Progress over time panel — overlaid with two text reference lines: the current text reasoning ceiling (~85% pass@1, Gemini 3 Pro / GPT-5.2 / Claude Opus 4.5) and a strong non-reasoning text baseline (54%, GPT-4.1).

Fig 3 τ-voice overall pass@1 by release date
Pass@1 on τ-voice · overall · plotted at each model's public release date. 7 models. Text reference lines from the τ-bench text leaderboard.
Voice markers show each model's overall pass@1 (average of retail, airline, and telecom) plotted at its public release date. The dashed green line tracks the running best. The two horizontal grey lines are text-leaderboard reference points (reasoning ceiling and non-reasoning baseline) on the same 278 tasks.

In about eight months the voice frontier has moved from 30% (OpenAI's gpt-realtime-1.0, Aug 2025) to 67% (xAI's grok-voice-think-fast-1.0, Apr 2026), crossing the non-reasoning text line and closing most of the way to the reasoning ceiling. The biggest single move is the most recent one, a +29 pp jump in roughly two months driven by xAI's reasoning-enabled audio-native model, and the pattern is familiar from text: adding explicit reasoning to the audio-native model unlocks a step change in tool-use reliability. Voice has gone from retaining roughly 45% of text capability when the paper was written to about 79% today. Same domains, same evaluator, no asterisks.
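The retention figures are plain ratios of voice pass@1 to the text reasoning ceiling. The sketch below reproduces them from the rounded numbers quoted in this post. The paper-era frontier of 38% is inferred here as 67 minus the +29 pp jump, and the helper name is illustrative.

```python
TEXT_CEILING = 85  # % pass@1, text reasoning ceiling quoted in this post


def retention(voice_pass1, text_pass1=TEXT_CEILING):
    """Share of text capability retained in voice, in percent."""
    return 100.0 * voice_pass1 / text_pass1


current_best = 67                  # % pass@1, Apr 2026 frontier
paper_era_best = current_best - 29  # frontier before the +29 pp jump

print(round(retention(paper_era_best)))  # 45
print(round(retention(current_best)))    # 79
```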

To explore the full leaderboard — including per-domain breakdowns, custom submissions, the same Progress-over-time panel, and the underlying trajectories — jump straight to the τ-voice ranking (or the τ-bench text ranking for direct comparison). We've worked with every major audio-native provider so far; the chart above will keep moving.

What's actually going wrong

Before zooming into the failures themselves, it helps to see how each provider's pass@1 changes when we move from Clean (single persona, no acoustic effects, strict turn-taking) to Realistic (diverse personas, environmental noise, free-form turn-taking) on the paper-era models — the absolute drop varies, but every provider takes a hit.

Fig 4 Clean vs Realistic by provider (paper · Feb 2026)
Provider  Model                                Clean  Realistic  Change
Google    gemini-live-2.5-flash-native-audio   31%    26%        −5 pp
OpenAI    gpt-realtime-1.5                     49%    35%        −14 pp
xAI       grok-voice-fast-1.0                  51%    38%        −13 pp
Google is the most robust to acoustic degradation, losing only ~17% of its clean performance vs. 24–28% for the others — even though its absolute scores are lowest. xAI tops both regimes overall. OpenAI wins one specific domain by a lot (Retail, 71% Clean — the single best per-domain score in the benchmark) but degrades the most when the audio gets messy.

Knowing each provider takes a hit is one thing; knowing which mistakes drive the hit is another. Two annotators labelled every failed simulation in two analysis cohorts — Voice-Fragile (tasks the text models pass but voice-Clean models fail) and Noise-Fragile (tasks voice-Clean passes but voice-Realistic fails) — tagging both the source and type of the first critical error.

Fig 5 Manual error analysis · 91 failed simulations (paper · Feb 2026)
Voice-Fragile cohort: 43 failures
Tasks both text models pass, but a majority of voice-Clean providers fail. Isolates the cost of going voice-native.
Error source: Agent 79% · User 21%
Source                  Error type           Count  Share of cohort
Agent (34 of 43)        Logical              13     30%
Agent                   Transcription        10     23%
Agent                   Hallucination        6      14%
Agent                   Timeout              4      9%
Agent                   VAD / unresponsive   1      2%
User simulator (9 of 43)  Logical            9      21%
Noise-Fragile cohort: 48 failures
Tasks voice-Clean passes but voice-Realistic fails. Isolates the marginal cost of realistic acoustic conditions.
Error source: Agent 90% · User 10%
Source                  Error type           Count  Share of cohort
Agent (43 of 48)        Logical              16     33%
Agent                   Transcription        16     33%
Agent                   Hallucination        6      12%
Agent                   VAD / unresponsive   4      8%
Agent                   Timeout              1      2%
User simulator (5 of 48)  Early termination  4      8%
User simulator          Logical              1      2%
Across both cohorts, 79–90% of failures are agent errors. In the Voice-Fragile cohort the most common single bucket is reasoning failure even when transcription is fine; in the Noise-Fragile cohort transcription failures and reasoning failures are tied. Authentication is the dominant bottleneck in both: agents fail to transcribe a name or email even when it's spelled letter by letter, and everything downstream falls over.
Source: τ-voice paper, Table 5 — manual error analysis on 91 simulations, two raters, 84% inter-rater agreement.
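As a quick consistency check, the per-type counts in Fig 5 can be tallied to recover the cohort sizes and the agent/user split. The dictionary below transcribes the counts from the figure; the structure and names are illustrative only.

```python
cohorts = {
    "Voice-Fragile": {
        "agent": {"Logical": 13, "Transcription": 10, "Hallucination": 6,
                  "Timeout": 4, "VAD / unresponsive": 1},
        "user": {"Logical": 9},
    },
    "Noise-Fragile": {
        "agent": {"Logical": 16, "Transcription": 16, "Hallucination": 6,
                  "VAD / unresponsive": 4, "Timeout": 1},
        "user": {"Early termination": 4, "Logical": 1},
    },
}


def source_shares(cohort):
    """Return (failure count, {source: share in percent}) for one cohort."""
    totals = {source: sum(types.values()) for source, types in cohort.items()}
    n = sum(totals.values())
    return n, {source: 100.0 * count / n for source, count in totals.items()}


for name, cohort in cohorts.items():
    n, shares = source_shares(cohort)
    print(f"{name}: {n} failures, "
          f"agent {shares['agent']:.0f}%, user {shares['user']:.0f}%")
# Voice-Fragile: 43 failures, agent 79%, user 21%
# Noise-Fragile: 48 failures, agent 90%, user 10%
```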

The four failure modes that matter most

Fig 6 Recurring failure patterns
Authentication transcription
The user spells m-e-i-p, the agent hears n-e-a-p, the lookup fails, and the agent loops. Voice-native models lose name/email letters even at the spelling step — the part designed to be unambiguous.
Lost-track-of-multi-step
The user asks for two things in one breath ("exchange this puzzle and change my address"). The agent handles the puzzle and never circles back. Reasoning errors dominate even when transcription is perfect.
Hallucinated completion
In one simulation the agent calmly tells the user "I've updated your shipping address" — with no tool call. In voice this is harder to catch than in text, because there is no visible trace.
Goes silent on you
After repeated authentication failures or under heavy interruption, voice agents sometimes simply stop responding. The conversation hangs — on a real call, the customer hangs up.

How much does each ingredient of "realistic" compound these?

To quantify how much each part of "realistic" contributes to the failure modes above, we ran ablations on the retail domain — adding background noise, diverse accents, and turn-taking dynamics one factor at a time, on top of an otherwise clean condition.

Fig 7 Ablation · impact of each factor on Retail pass@1 (paper · Feb 2026)
Values are pass@1 in %, with the change from Clean in pp in parentheses.
Condition         Google     OpenAI     xAI        Average
Clean             45         71         48         55
+ Noise           40 (−4)    67 (−4)    46 (−2)    51 (−4)
+ Accents         44 (−1)    60 (−11)   30 (−18)   44 (−10)
+ Turn-taking     33 (−11)   57 (−14)   52 (+4)    47 (−7)
Realistic (all)   30 (−15)   45 (−26)   39 (−10)   38 (−17)
Accents are the most damaging factor on average (−10 pp), with strong per-provider variance — xAI loses 18 pp (38% of its clean capability) to accents alone, while Google is essentially unaffected (−1 pp). The two effects map cleanly onto the failure modes above: accent perturbations drive the authentication-transcription failures (the agent mishears spelled names letter by letter), and turn-taking dynamics push agents into the goes-silent mode and into interruption-driven reasoning errors. Because we induce accents through TTS personas, treat the absolute numbers as indicative — but the per-provider variance is large enough to suggest a real accessibility concern.
Source: τ-voice paper, Table 4 (Retail, ablations).
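The relative losses quoted in the caption follow from dividing each listed change by the provider's Clean score. The sketch below uses the changes exactly as printed in Fig 7 rather than recomputing them from the rounded scores, since the two can differ by a point. Variable and function names are illustrative.

```python
clean = {"Google": 45, "OpenAI": 71, "xAI": 48}  # Retail pass@1, %

# Change from Clean in percentage points, as listed in Fig 7.
delta_pp = {
    "+ Noise":         {"Google": -4,  "OpenAI": -4,  "xAI": -2},
    "+ Accents":       {"Google": -1,  "OpenAI": -11, "xAI": -18},
    "+ Turn-taking":   {"Google": -11, "OpenAI": -14, "xAI": 4},
    "Realistic (all)": {"Google": -15, "OpenAI": -26, "xAI": -10},
}


def relative_change(condition, provider):
    """Change as a percentage of the provider's Clean score."""
    return 100.0 * delta_pp[condition][provider] / clean[provider]


for condition in delta_pp:
    row = ", ".join(
        f"{provider} {relative_change(condition, provider):+.1f}%"
        for provider in clean
    )
    print(f"{condition}: {row}")
```

For xAI under accents this gives −18 / 48 = −37.5%, which is the "38% of its clean capability" cited above.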
Listen to real τ-voice failures Same task, clean vs realistic conditions, side by side — with an annotated, playable speech-activity timeline.
Audio-native providers we've evaluated
OpenAI: gpt-realtime · 1.0 · 1.5
Google: Gemini Live 2.5 / 3.1 Flash
xAI: Grok Voice Fast · Think Fast 1.0
The framework is provider-agnostic: τ-voice talks to any voice agent through a thin adapter (we already ship adapters for OpenAI Realtime, Gemini Live, xAI Grok Voice, and LiveKit-orchestrated stacks). If you build a voice agent or run a voice platform, implement an adapter and reach out — we'd love to add your system to the leaderboard.

What τ-voice does not (yet) measure

We are deliberate about scope. A few things τ-voice does simplify, and where we plan to go next:

Open, reproducible, and yours to build on

τ-voice is part of the broader τ-bench framework. Tasks, environment, voice user simulator, audio effects, turn-taking policy, and evaluation are all open source. Every result here is reproducible from a fixed seed (LLM stochasticity aside), every audio sample on the examples page comes from a real τ-voice run, and every voice submission on the leaderboard ships with its trajectories so you can replay the conversations end-to-end.

The official voice personas are held out, but a one-command script generates equivalent ones via the same ElevenLabs Voice Design API, so external developers can iterate locally and expect improvements to carry over to the official eval.

If you train voice models, evaluate voice agents, or just want to understand where today's systems break down, we'd love your contributions — new audio-native providers, cascaded ASR→LLM→TTS baselines, and pull requests against the user simulator are all welcome. Voice agents will be in production whether or not we measure them carefully. We'd rather measure them carefully.

For full details, see our paper, the code, and the leaderboard. The framework was built by Soham Ray, Keshav Dhandhania, and Victor Barres at Sierra, with Karthik Narasimhan at Princeton.
