VOICE MODE

NATURAL VOICE CONVERSATIONS FOR AI ASSISTANTS VIA MCP
VOICE IS THE MOST NATURAL HUMAN INTERFACE. CODE SHOULD SPEAK. CODE SHOULD LISTEN.

THESIS

Voice Mode transforms AI assistants from text-based tools into conversational partners. Through the Model Context Protocol (MCP), it lets Claude, ChatGPT, and other LLM-based assistants hold natural voice conversations.

No more typing. No more reading. Just conversation.

PRINCIPLES

UNIVERSALITY
Works with any MCP-compatible client. No vendor lock-in.
SIMPLICITY
One command to install. One command to run. No configuration beyond an API key for cloud services.
LOCALITY
Your voice never leaves your machine unless you choose cloud services.
OPENNESS
MIT licensed. Fork it. Modify it. Make it yours.

ARCHITECTURE

TRANSPORT LAYER
LOCAL MICROPHONE → AUDIO CAPTURE → STT SERVICE → TEXT
TEXT → TTS SERVICE → AUDIO SYNTHESIS → SPEAKER OUTPUT

PROTOCOL LAYER
MCP CLIENT ↔ VOICE MODE SERVER ↔ OPENAI-COMPATIBLE API

SERVICE LAYER
WHISPER.CPP (STT) | KOKORO (TTS) | LIVEKIT (RTC)
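
The two transport flows can be sketched as a small pipeline of pluggable stages. This is an illustration only, not Voice Mode's actual implementation: the class name VoicePipeline, the stage type aliases, and the fake stages are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical type aliases: each stage is a plain callable, so any
# OpenAI-compatible backend (or a test fake) can be plugged in.
CaptureFn = Callable[[float], bytes]   # seconds -> recorded audio
TranscribeFn = Callable[[bytes], str]  # audio -> text (STT)
SynthesizeFn = Callable[[str], bytes]  # text -> audio (TTS)
PlayFn = Callable[[bytes], None]       # audio -> speaker


@dataclass
class VoicePipeline:
    capture: CaptureFn
    transcribe: TranscribeFn
    synthesize: SynthesizeFn
    play: PlayFn

    def listen(self, duration: float = 15.0) -> str:
        """Microphone -> audio capture -> STT -> text."""
        return self.transcribe(self.capture(duration))

    def speak(self, text: str) -> None:
        """Text -> TTS -> audio synthesis -> speaker output."""
        self.play(self.synthesize(text))

    def converse(self, message: str, wait_for_response: bool = True,
                 duration: float = 15.0) -> Optional[str]:
        """Speak a message, then optionally listen for the reply."""
        self.speak(message)
        return self.listen(duration) if wait_for_response else None


# Fake stages so the sketch runs without audio hardware or a network.
played: list[bytes] = []
pipeline = VoicePipeline(
    capture=lambda seconds: b"fake-audio",
    transcribe=lambda audio: "hello there",
    synthesize=lambda text: text.encode("utf-8"),
    play=played.append,
)
```

Keeping each stage a callable means a local service and a cloud service can be swapped without touching the conversation logic.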

TECHNICAL SPECIFICATION

SYSTEM REQUIREMENTS
PLATFORM: Linux, macOS, Windows (WSL)
RUNTIME: Python 3.10+
MEMORY: 512MB minimum
NETWORK: Internet connection (cloud services only)

DEPENDENCIES
pyaudio >= 0.2.11
openai >= 1.0.0
mcp >= 1.0.0
livekit >= 0.17.5 (optional)
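
A quick way to inspect an environment against the lists above. A minimal sketch using only the standard library: the helper names are hypothetical, and it reports installed versions without comparing them to the stated minimums.

```python
import sys
from importlib import metadata

# Package names are taken from the dependency list above; livekit is optional.
REQUIRED = ("pyaudio", "openai", "mcp")
OPTIONAL = ("livekit",)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report() -> dict:
    """Summarize the Python version and which listed packages are installed."""
    return {
        "python_ok": sys.version_info >= (3, 10),
        "required": {name: installed_version(name) for name in REQUIRED},
        "optional": {name: installed_version(name) for name in OPTIONAL},
    }
```

Calling environment_report() returns a dictionary in which a None version marks a package that is not installed.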

API COMPATIBILITY
STT: OpenAI Whisper API v1
TTS: OpenAI TTS API v1
PROTOCOL: Model Context Protocol 2024-11-05
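
"OpenAI-compatible" means a service accepts the same HTTP requests as the OpenAI audio endpoints. The sketch below builds, but does not send, a text-to-speech request. The helper names are hypothetical; the base URL, key, model "tts-1", and voice "alloy" are example values, and a local service may expect different ones.

```python
import json
import urllib.request


def endpoint(base_url: str, path: str) -> str:
    """Join a base URL such as https://api.openai.com/v1 with an API path."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def build_speech_request(base_url: str, api_key: str, text: str,
                         model: str = "tts-1", voice: str = "alloy"):
    """Build (but do not send) a POST to the audio/speech endpoint."""
    body = json.dumps({"model": model, "input": text, "voice": voice})
    return urllib.request.Request(
        endpoint(base_url, "audio/speech"),
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request("https://api.openai.com/v1", "sk-example", "Hello")
# Speech-to-text uses the same base URL with a multipart upload to this path.
transcription_url = endpoint("https://api.openai.com/v1", "audio/transcriptions")
```

Because only the base URL changes, the same request shape can target a cloud or a local service.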

TOOL INTERFACE

converse(message, wait_for_response=True)
listen_for_speech(duration=15.0)
check_room_status()
check_audio_devices()
voice_status()
list_tts_voices(provider=None)
kokoro_start(models_dir=None)
kokoro_stop()
kokoro_status()
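
On the wire, an MCP client invokes each of these tools with a JSON-RPC 2.0 "tools/call" request. A minimal sketch of that message, assuming the tool and argument names listed above; the tool_call helper is hypothetical, and a real client library normally builds this for you.

```python
import itertools
import json

# Each JSON-RPC request needs a unique id within the session.
_request_ids = itertools.count(1)


def tool_call(name: str, /, **arguments) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


speak_only = tool_call("converse", message="Goodbye!", wait_for_response=False)
listen = tool_call("listen_for_speech", duration=30.0)
```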

CONFIGURATION VARIABLES

OPENAI_API_KEY # Required for cloud services
STT_BASE_URL # Custom STT endpoint
STT_API_KEY # STT authentication
STT_MODEL # Whisper model selection
TTS_BASE_URL # Custom TTS endpoint
TTS_API_KEY # TTS authentication
TTS_MODEL # TTS model selection
TTS_VOICE # Voice selection
VOICE_MODE_DEBUG # Enable debug logging
VOICE_MODE_SAVE_AUDIO # Save audio files
VOICE_MODE_AUDIO_DIR # Audio save directory
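
The variables above could be read into one typed object along these lines. A sketch, not Voice Mode's own loader: the VoiceConfig class is hypothetical, and two behaviors are assumptions, namely that STT_API_KEY and TTS_API_KEY fall back to OPENAI_API_KEY, and that flags accept "true", "1", "yes", or "on".

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _flag(value: Optional[str]) -> bool:
    """Interpret common truthy strings; unset or anything else is False."""
    return (value or "").strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class VoiceConfig:
    stt_base_url: Optional[str]
    stt_api_key: Optional[str]
    stt_model: Optional[str]
    tts_base_url: Optional[str]
    tts_api_key: Optional[str]
    tts_model: Optional[str]
    tts_voice: Optional[str]
    debug: bool
    save_audio: bool
    audio_dir: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> VoiceConfig:
    """Read the documented variables; unset values stay None."""
    shared_key = env.get("OPENAI_API_KEY")
    return VoiceConfig(
        stt_base_url=env.get("STT_BASE_URL"),
        # Assumption: service-specific keys fall back to OPENAI_API_KEY.
        stt_api_key=env.get("STT_API_KEY", shared_key),
        stt_model=env.get("STT_MODEL"),
        tts_base_url=env.get("TTS_BASE_URL"),
        tts_api_key=env.get("TTS_API_KEY", shared_key),
        tts_model=env.get("TTS_MODEL"),
        tts_voice=env.get("TTS_VOICE"),
        debug=_flag(env.get("VOICE_MODE_DEBUG")),
        save_audio=_flag(env.get("VOICE_MODE_SAVE_AUDIO")),
        audio_dir=env.get("VOICE_MODE_AUDIO_DIR"),
    )
```

Passing a plain dictionary instead of os.environ makes the loader easy to test.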

INSTALLATION

Three methods. Choose one.

METHOD 1: CLAUDE CODE
$ claude mcp add --scope user voice-mode uvx voice-mode

METHOD 2: UV
$ uvx voice-mode

METHOD 3: PIP
$ pip install voice-mode

LOCAL VOICE STACK

Run everything on your machine. No cloud dependencies.

WHISPER.CPP (PORT 2022)
$ make whisper-start
Local speech-to-text with OpenAI-compatible API

KOKORO TTS (PORT 8880)
$ make kokoro-start
Local text-to-speech with multiple voice options

LIVEKIT (PORT 7880)
$ make livekit-start
Real-time communication for room-based voice
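
To use the local services, point STT_BASE_URL and TTS_BASE_URL at them. A sketch under stated assumptions: the ports come from this section, but the host 127.0.0.1 and the "/v1" path prefix are guesses based on the usual OpenAI-compatible URL layout, and the helper names are hypothetical.

```python
import socket

# Ports come from the section above; the host and the "/v1" path prefix
# are assumptions based on the usual OpenAI-compatible URL layout.
LOCAL_PORTS = {"whisper": 2022, "kokoro": 8880, "livekit": 7880}


def local_endpoints(host: str = "127.0.0.1") -> dict:
    """Environment overrides that point STT and TTS at the local services."""
    return {
        "STT_BASE_URL": f"http://{host}:{LOCAL_PORTS['whisper']}/v1",
        "TTS_BASE_URL": f"http://{host}:{LOCAL_PORTS['kokoro']}/v1",
    }


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Checking port_open("127.0.0.1", 2022) before a session shows whether the local speech-to-text service is accepting connections.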

INTEGRATION

CLAUDE DESKTOP
1. Install Voice Mode via Claude Code
2. Start Claude Desktop
3. Use /converse command

CUSTOM MCP CLIENT
1. Add voice-mode to MCP server list
2. Configure transport (stdio/sse)
3. Call voice tools via MCP protocol

USAGE PATTERNS

CONVERSATIONAL MODE
converse("Hello, how are you?")
# Speaks message, waits for response

STATEMENT MODE
converse("Goodbye!", wait_for_response=False)
# Speaks message, no waiting

LISTENING MODE
response = listen_for_speech(duration=30)
# Pure listening, returns transcribed text

EMOTIONAL SPEECH
converse("Great job!",
  tts_model="gpt-4o-mini-tts",
  tts_instructions="Sound excited")
# Requires VOICE_ALLOW_EMOTIONS=true
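
A caller can guard emotional speech on its own side. This sketch builds the converse arguments and includes the emotion fields only when VOICE_ALLOW_EMOTIONS is "true". The converse_arguments helper is hypothetical, and silently dropping the instructions is an assumption; the server itself may behave differently.

```python
import os
from typing import Mapping, Optional


def converse_arguments(message: str, wait_for_response: bool = True,
                       tts_instructions: Optional[str] = None,
                       env: Mapping[str, str] = os.environ) -> dict:
    """Build converse() keyword arguments, adding emotion only when allowed."""
    arguments = {"message": message, "wait_for_response": wait_for_response}
    allowed = env.get("VOICE_ALLOW_EMOTIONS", "").lower() == "true"
    if tts_instructions and allowed:
        # Model name copied from the example above.
        arguments["tts_model"] = "gpt-4o-mini-tts"
        arguments["tts_instructions"] = tts_instructions
    return arguments
```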

DIAGNOSTICS

CHECK SYSTEM STATUS
voice_status()
# Returns comprehensive service health

LIST AUDIO DEVICES
check_audio_devices()
# Shows available input/output devices

DEBUG MODE
export VOICE_MODE_DEBUG=true
# Enables verbose logging

DEMONSTRATION

Watch Voice Mode in action: Demo Video

Read the complete documentation: GitHub Repository

Join the conversation: Discord Community