τ-voice enables voice-based evaluation of conversational AI agent systems.
Sample Conversations · Clean vs. Realistic
Same task, different conditions
Task 14 succeeds under clean audio but fails when realistic effects are applied — same task, same agent, different outcome.
Clean
Realistic
Transcription failures
Both conversations fail due to transcription errors. In clean audio, verbally encoded characters trip up the agent; in realistic audio, accent and noise compound the problem.
Clean
Realistic
Logical failures
Both conversations fail due to reasoning errors — wrong policy application or missed constraints — independent of audio quality.
Clean
Realistic
Annotated Speech Activity Timeline
The interactive visualization below annotates the realistic Task 14 audio with speech-activity markers: user and agent speech, interruptions, noise effects, backchannels, and more. During playback, a synchronized playhead steps through the conversation.
Speech Activity Timeline — Retail, Gemini
[Interactive timeline. Tracks: User, Agent. Noise annotation: Busy Street. Horizontal axis: Time (seconds).]