VOICE IS THE MOST NATURAL HUMAN INTERFACE.
CODE SHOULD SPEAK.
CODE SHOULD LISTEN.
THESIS
Voice Mode transforms AI assistants from text-based tools into conversational partners.
Through the Model Context Protocol, we enable Claude, ChatGPT, and other LLMs to engage in natural voice interactions.
No more typing. No more reading. Just conversation.
PRINCIPLES
UNIVERSALITY
Works with any MCP-compatible client. No vendor lock-in.
SIMPLICITY
One command to install. One command to run. Zero configuration required.
LOCALITY
Your voice never leaves your machine unless you choose cloud services.
OPENNESS
MIT licensed. Fork it. Modify it. Make it yours.
ARCHITECTURE
LOCAL MICROPHONE → AUDIO CAPTURE → STT SERVICE → TEXT
TEXT → TTS SERVICE → AUDIO SYNTHESIS → SPEAKER OUTPUT
MCP CLIENT ↔ VOICE MODE SERVER ↔ OPENAI-COMPATIBLE API
WHISPER.CPP (STT) | KOKORO (TTS) | LIVEKIT (RTC)
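The two one-way flows above can be sketched as a pair of pluggable stages. This is an illustration only: `VoicePipeline`, `Transcriber`, `Synthesizer`, and the stand-in services are hypothetical names, not the server's actual internals.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage types; the real server's internals are not shown here.
Transcriber = Callable[[bytes], str]   # STT service: audio in, text out
Synthesizer = Callable[[str], bytes]   # TTS service: text in, audio out


@dataclass
class VoicePipeline:
    """Pairs the two one-way flows: microphone to text, text to speaker."""
    transcribe: Transcriber
    synthesize: Synthesizer

    def hear(self, captured_audio: bytes) -> str:
        # LOCAL MICROPHONE → AUDIO CAPTURE → STT SERVICE → TEXT
        return self.transcribe(captured_audio).strip()

    def speak(self, text: str) -> bytes:
        # TEXT → TTS SERVICE → AUDIO SYNTHESIS → SPEAKER OUTPUT
        return self.synthesize(text)


# Stand-in services so the sketch runs without audio hardware or network.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(transcribe=fake_stt, synthesize=fake_tts)
heard = pipeline.hear(b" hello there ")
spoken = pipeline.speak("General Kenobi")
```

Swapping a cloud service for a local one means replacing one callable. The rest of the pipeline stays the same.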
TECHNICAL SPECIFICATION
PLATFORM: Linux, macOS, Windows (WSL)
RUNTIME: Python 3.10+
MEMORY: 512MB minimum
NETWORK: Internet connection (required only for cloud services)
DEPENDENCIES: pyaudio >= 0.2.11, openai >= 1.0.0, mcp >= 1.0.0, livekit >= 0.17.5 (optional)
STT: OpenAI Whisper API v1
TTS: OpenAI TTS API v1
PROTOCOL: Model Context Protocol 2024.11
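A preflight check against the minimums above might look like this. It is a sketch, not part of Voice Mode: `preflight`, `meets_minimum`, and `version_tuple` are hypothetical helpers, and the version comparison is deliberately simplified (numeric segments only, not full PEP 440).

```python
import re
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)
# Minimums copied from the specification above; livekit is optional and skipped.
MIN_PACKAGES = {"pyaudio": "0.2.11", "openai": "1.0.0", "mcp": "1.0.0"}


def version_tuple(version: str) -> tuple:
    """Turn '0.17.5' into (0, 17, 5); stop at the first non-numeric segment."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric segments, padding the shorter version with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))

    def pad(parts: tuple) -> tuple:
        return parts + (0,) * (width - len(parts))

    return pad(a) >= pad(b)


def preflight() -> list:
    """Return human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.10+ required")
    for name, minimum in MIN_PACKAGES.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} not installed")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems
```

Call `preflight()` before starting the server and print whatever it returns.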
TOOL INTERFACE
converse(message, wait_for_response=True)
listen_for_speech(duration=15.0)
check_room_status()
check_audio_devices()
voice_status()
list_tts_voices(provider=None)
kokoro_start(models_dir=None)
kokoro_stop()
kokoro_status()
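Over MCP, each signature above becomes a `tools/call` request with JSON-RPC 2.0 framing. A minimal sketch of that mapping follows. `tool_call_request` is a hypothetical helper for illustration, not part of Voice Mode.

```python
import itertools
import json

# JSON-RPC requests need unique ids; a simple counter is enough for a sketch.
_request_ids = itertools.count(1)


def tool_call_request(tool: str, **arguments) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request for one tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


request = tool_call_request("converse", message="Hello", wait_for_response=False)
print(json.dumps(request, indent=2))
```

Arguments left out of the request fall back to the defaults in the signature, so `tool_call_request("voice_status")` is a complete call.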
CONFIGURATION VARIABLES
OPENAI_API_KEY # Required for cloud services
STT_BASE_URL # Custom STT endpoint
STT_API_KEY # STT authentication
STT_MODEL # Whisper model selection
TTS_BASE_URL # Custom TTS endpoint
TTS_API_KEY # TTS authentication
TTS_MODEL # TTS model selection
TTS_VOICE # Voice selection
VOICE_MODE_DEBUG # Enable debug logging
VOICE_MODE_SAVE_AUDIO # Save audio files
VOICE_MODE_AUDIO_DIR # Audio save directory
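One way to read these variables is sketched below, under stated assumptions: the default base URL is OpenAI's public endpoint, a service key falls back to OPENAI_API_KEY, and flags accept the spellings in `TRUTHY`. None of these defaults is confirmed by this document. `ServiceConfig`, `service_config`, and `flag` are hypothetical names, and the local port is made up for the example.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

OPENAI_BASE_URL = "https://api.openai.com/v1"  # assumed default
TRUTHY = {"1", "true", "yes", "on"}            # assumed accepted spellings


@dataclass(frozen=True)
class ServiceConfig:
    base_url: str
    api_key: Optional[str]
    model: Optional[str]


def service_config(prefix: str, env: Mapping[str, str]) -> ServiceConfig:
    """Read <PREFIX>_BASE_URL, <PREFIX>_API_KEY, and <PREFIX>_MODEL."""
    return ServiceConfig(
        base_url=env.get(f"{prefix}_BASE_URL", OPENAI_BASE_URL).rstrip("/"),
        # Assumption: the per-service key falls back to OPENAI_API_KEY.
        api_key=env.get(f"{prefix}_API_KEY") or env.get("OPENAI_API_KEY"),
        model=env.get(f"{prefix}_MODEL"),
    )


def flag(name: str, env: Mapping[str, str]) -> bool:
    """Interpret an on/off variable such as VOICE_MODE_DEBUG."""
    return env.get(name, "").strip().lower() in TRUTHY


# Example environment; the local port is hypothetical.
example_env = {
    "OPENAI_API_KEY": "sk-example",
    "STT_BASE_URL": "http://127.0.0.1:9000/v1/",
    "VOICE_MODE_SAVE_AUDIO": "true",
}
stt = service_config("STT", example_env)
tts = service_config("TTS", example_env)
save_audio = flag("VOICE_MODE_SAVE_AUDIO", example_env)
```

In real use, pass `os.environ` instead of the example mapping.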
INSTALLATION
Three methods. Choose one.
$ claude mcp add --scope user voice-mode uvx voice-mode
$ uvx voice-mode
$ pip install voice-mode
LOCAL VOICE STACK
Run everything on your machine. No cloud dependencies.
$ make whisper-start
Local speech-to-text with OpenAI-compatible API
$ make kokoro-start
Local text-to-speech with multiple voice options
$ make livekit-start
Real-time communication for room-based voice
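After starting the local stack, a quick TCP probe shows which services are accepting connections. This is a sketch: the port numbers are assumptions, not taken from this document, so substitute whatever your services actually bind to. `is_listening` and `local_stack_status` are hypothetical helpers.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed ports for illustration only; check your own service configuration.
LOCAL_SERVICES = {"whisper": 2022, "kokoro": 8880, "livekit": 7880}


def local_stack_status(host: str = "127.0.0.1") -> dict:
    """Map each local service name to whether its port is open."""
    return {name: is_listening(host, port) for name, port in LOCAL_SERVICES.items()}
```

An open port only proves that a process is listening. It does not prove the service is healthy.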
INTEGRATION
WITH CLAUDE:
1. Install Voice Mode via Claude Code
2. Start Claude Desktop
3. Use /converse command
WITH OTHER MCP CLIENTS:
1. Add voice-mode to MCP server list
2. Configure transport (stdio/sse)
3. Call voice tools over MCP
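Adding voice-mode to a server list usually means writing a small JSON entry. The sketch below assumes your client reads the common `mcpServers` layout (a command, its arguments, and optional environment variables) and launches the server over stdio. Check your client's documentation, since the layout is not specified here. `server_entry` is a hypothetical helper.

```python
import json


def server_entry(name="voice-mode", command="uvx", args=("voice-mode",), env=None):
    """Build an `mcpServers` entry that launches the server over stdio."""
    entry = {"command": command, "args": list(args)}
    if env:
        # Optional environment passed to the server process.
        entry["env"] = dict(env)
    return {"mcpServers": {name: entry}}


config = server_entry(env={"OPENAI_API_KEY": "sk-example"})
print(json.dumps(config, indent=2))
```

The command and argument mirror `uvx voice-mode` from the installation methods. The API key shown is a placeholder.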
USAGE PATTERNS
converse("Hello, how are you?")
# Speaks message, waits for response
converse("Goodbye!", wait_for_response=False)
# Speaks message, no waiting
response = listen_for_speech(duration=30)
# Pure listening, returns transcribed text
converse("Great job!",
         tts_model="gpt-4o-mini-tts",
         tts_instructions="Sound excited")
# Requires VOICE_ALLOW_EMOTIONS=true
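These patterns combine into a simple conversation loop. The sketch assumes `converse` returns the transcribed reply as a string, which this document does not confirm. `run_conversation` and `scripted_converse` are hypothetical, with the latter standing in for the real tool so the loop runs without audio.

```python
from typing import Callable, List


def run_conversation(
    converse: Callable[..., str],
    greeting: str,
    stop_words=("goodbye", "stop"),
    max_turns: int = 10,
) -> List[str]:
    """Speak, listen, and repeat until the user says a stop word."""
    transcript = []
    prompt = greeting
    for _ in range(max_turns):
        reply = converse(prompt)  # speaks the prompt, waits for a response
        transcript.append(reply)
        if any(word in reply.lower() for word in stop_words):
            # Final message: speak it, but do not wait for an answer.
            converse("Goodbye!", wait_for_response=False)
            break
        prompt = f"You said: {reply}"
    return transcript


# Scripted stand-in for the real tool.
replies = iter(["What's the weather?", "Okay, goodbye"])
spoken = []


def scripted_converse(message, wait_for_response=True):
    spoken.append(message)
    return next(replies) if wait_for_response else ""


transcript = run_conversation(scripted_converse, "Hello, how are you?")
```

The `max_turns` cap keeps a misheard stop word from trapping the loop forever.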
DIAGNOSTICS
voice_status()
# Returns comprehensive service health
check_audio_devices()
# Shows available input/output devices
export VOICE_MODE_DEBUG=true
# Enables verbose logging