
τ-voice Paper

February 2026

τ-voice enables voice-based evaluation of conversational AI agent systems.

🔊 Sample Conversations  ·  Clean vs. Realistic
Same task, different conditions
Task 14 succeeds under clean audio but fails when realistic effects are applied — same task, same agent, different outcome.
Clean: Gemini, success
Realistic: Gemini, logical failure
Transcription failures
Both conversations fail due to transcription errors. In clean audio, verbally encoded characters trip up the agent; in realistic audio, accent and noise compound the problem.
Clean: xAI, transcription failure
Realistic: xAI, transcription failure
Logical failures
Both conversations fail due to reasoning errors — wrong policy application or missed constraints — independent of audio quality.
Clean: OpenAI, logical failure
Realistic: Gemini, logical failure

Annotated Speech Activity Timeline

The interactive visualization below annotates the realistic Task 14 audio with speech-activity markers — user & agent speech, interruptions, noise effects, backchannels, and more. Press play to step through the conversation with a synchronized playhead.

📊 Speech Activity Timeline — Retail, Gemini
[Interactive timeline. Tracks: User (Busy Street), Agent. Horizontal axis: Time (seconds).]
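
As a rough illustration of what an annotated speech-activity timeline might contain, the sketch below models each marker as a labeled time interval and flags an interruption as user speech that begins while the agent is still talking. The `Segment` type, its field names, the track and kind labels, and all timestamps are hypothetical; none of them are taken from τ-voice's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One annotated interval on the timeline (hypothetical schema)."""
    track: str    # "user" or "agent"
    kind: str     # e.g. "speech", "backchannel", "noise"
    start: float  # seconds from the start of the conversation
    end: float    # seconds from the start of the conversation


def find_interruptions(segments):
    """Return (agent, user) pairs where user speech starts during agent speech."""
    agent = [s for s in segments if s.track == "agent" and s.kind == "speech"]
    user = [s for s in segments if s.track == "user" and s.kind == "speech"]
    return [(a, u) for a in agent for u in user if a.start < u.start < a.end]


def active_at(segments, t):
    """Return the segments under the playhead at time t (seconds)."""
    return [s for s in segments if s.start <= t < s.end]


# Made-up example data, for illustration only.
timeline = [
    Segment("agent", "speech", 0.0, 4.0),
    Segment("user", "backchannel", 1.5, 1.9),
    Segment("user", "speech", 3.2, 6.0),
    Segment("user", "noise", 0.0, 6.0),
]

interruptions = find_interruptions(timeline)
under_playhead = [s.kind for s in active_at(timeline, 3.5)]
```

Treating backchannels as their own `kind` keeps short acknowledgements from being counted as interruptions in this sketch; a real annotation scheme may draw that line differently.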