src — voice
The src/voice module provides the foundational capabilities for voice interaction within the application, encompassing wake word detection, speech-to-text (STT) recognition, voice activity detection (VAD), and a high-level pipeline for voice-to-code functionality. It is designed to be modular, allowing different providers and detection methods to be swapped or configured.
Core Concepts
The module is built around several key voice interaction concepts:
- Wake Word Detection (WWD): Identifies specific trigger phrases (e.g., "Hey Buddy") to activate the system.
- Speech Recognition (STT): Converts spoken audio into text. This module supports various cloud and local providers.
- Voice Activity Detection (VAD): Determines when speech is present in an audio stream, allowing the system to intelligently start and stop recording for STT.
- Voice-to-Code Pipeline: An orchestration layer that uses STT to transcribe speech, then analyzes the transcription to determine if it's a command or code dictation.
Module Structure
The src/voice directory is organized as follows:
- index.ts: The main entry point, exporting all public interfaces, classes, and factory functions from the sub-modules.
- types.ts: Centralized type definitions and default configurations for all voice-related features.
- wake-word.ts: Implements the WakeWordDetector for wake word detection.
- speech-recognition.ts: Implements the SpeechRecognizer for speech-to-text conversion.
- voice-activity.ts: Implements the VoiceActivityDetector for identifying speech segments.
- voice-to-code.ts: Provides the VoiceToCodePipeline for higher-level voice command and dictation processing.
Key Components
WakeWordDetector (src/voice/wake-word.ts)
The WakeWordDetector is responsible for identifying predefined wake words in an incoming audio stream. It prioritizes using the Picovoice Porcupine engine for robust, local wake word detection, but gracefully falls back to a text-matching approach if a Picovoice access key is not provided or if Porcupine initialization fails.
Key Features:
- Engine Selection: Automatically attempts to initialize Porcupine if PICOVOICE_ACCESS_KEY is available, otherwise defaults to text-match. The engine can also be explicitly configured.
- Porcupine Integration: Dynamically imports @picovoice/porcupine-node to process raw Int16Array audio frames. It maps configured wake words to Porcupine's built-in keywords or uses custom .ppn files.
- Text-Match Fallback: In text-match mode, it relies on transcribed text (from a SpeechRecognizer) to detect wake words by checking for substring matches.
- Cooldown Mechanism: Prevents rapid, repeated detections of the same wake word within a short period.
- Configuration: Managed via WakeWordConfig, allowing customization of wake words, sensitivity, and engine.
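The text-match fallback and the cooldown described above can be illustrated with a small standalone sketch. It is not the module's implementation: the class name TextMatchSketch, the Detection shape, and the 2000 ms default cooldown are assumptions chosen for illustration.

```typescript
// Hypothetical sketch of substring-based wake word matching with a per-word cooldown.
interface Detection {
  wakeWord: string;
  timestamp: number;
}

class TextMatchSketch {
  private readonly wakeWords: string[];
  private readonly cooldownMs: number;
  // Time of the last accepted detection for each wake word.
  private readonly lastDetectedAt = new Map<string, number>();

  constructor(wakeWords: string[], cooldownMs = 2000) {
    this.wakeWords = wakeWords.map((w) => w.toLowerCase());
    this.cooldownMs = cooldownMs;
  }

  // Returns a detection when the transcript contains a wake word whose
  // cooldown window has elapsed; otherwise returns null.
  detect(transcript: string, now: number = Date.now()): Detection | null {
    const text = transcript.toLowerCase();
    for (const wakeWord of this.wakeWords) {
      if (!text.includes(wakeWord)) continue;
      const last = this.lastDetectedAt.get(wakeWord);
      if (last !== undefined && now - last < this.cooldownMs) continue;
      this.lastDetectedAt.set(wakeWord, now);
      return { wakeWord, timestamp: now };
    }
    return null;
  }
}

const sketch = new TextMatchSketch(['hey buddy']);
const first = sketch.detect('Hey Buddy, how are you?', 1000); // accepted
const repeat = sketch.detect('hey buddy again', 1500); // suppressed by cooldown
const later = sketch.detect('hey buddy once more', 4000); // accepted again
```

Passing the clock in as a parameter keeps the cooldown logic deterministic and easy to test; a real detector would more likely read the time itself.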
Usage:
import { createWakeWordDetector, DEFAULT_WAKE_WORD_CONFIG } from './voice/index.js';

const detector = createWakeWordDetector({
  wakeWords: ['hey buddy'],
  accessKey: process.env.PICOVOICE_ACCESS_KEY,
});

detector.on('detected', (detection) => {
  console.log(`Wake word detected: ${detection.wakeWord} at ${detection.timestamp}`);
});

await detector.start();

// In Porcupine mode, feed raw audio frames:
// detector.processFrame(audioFrameInt16Array);

// In text-match mode, feed transcribed text:
// detector.detectWakeWordText("hey buddy, how are you?");
SpeechRecognizer (src/voice/speech-recognition.ts)
The SpeechRecognizer converts spoken audio into text using various backend providers. It acts as an EventEmitter, emitting transcript events when speech is recognized.
Key Features:
- Multi-Provider Support: Configurable to use whisper (OpenAI API or local CLI), google, azure, or deepgram for transcription.
- Local Whisper Integration:
  - Can execute the whisper CLI tool locally (requires whisper to be installed and in PATH).
  - Implements a dual-model strategy (dualModel config) to select between a fast, smaller model (e.g., base) for short utterances and a more accurate, larger model (e.g., medium or large) for longer recordings, optimizing for both speed and accuracy.
  - Falls back to the OpenAI Whisper API if the local CLI fails or is not preferred, and an apiKey is provided.
- Audio Buffering: Collects audio chunks between startListening() and stopListening() calls, then transcribes the accumulated audio.
- Configuration: Managed via SpeechRecognitionConfig, including provider, language, API keys, vocabulary hints, and duration limits.
- Events: Emits listening-started, listening-stopped, transcript (with TranscriptResult), error, and processing-complete.
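The dual-model strategy can be sketched as a simple duration check. The snippet below is a rough illustration, not the module's code: pickModelSketch and pcmDurationSeconds are hypothetical helpers, the config field names are borrowed from the dualModel options shown in this document, and the audio is assumed to be raw 16 kHz, 16-bit mono PCM.

```typescript
// Hypothetical config shape mirroring the dualModel options used in this document.
interface DualModelSketchConfig {
  enabled: boolean;
  durationThreshold: number; // seconds
  fastModel: string;
  accurateModel: string;
}

// Estimates the duration of raw PCM audio from its byte length.
// Assumption: 16 kHz, 16-bit (2 bytes per sample), mono unless overridden.
function pcmDurationSeconds(
  byteLength: number,
  sampleRate = 16000,
  bytesPerSample = 2,
  channels = 1,
): number {
  return byteLength / (sampleRate * bytesPerSample * channels);
}

// Picks the fast model for recordings at or below the threshold,
// and the accurate model for anything longer.
function pickModelSketch(
  durationSec: number,
  cfg: DualModelSketchConfig,
  defaultModel = 'base',
): string {
  if (!cfg.enabled) return defaultModel;
  return durationSec <= cfg.durationThreshold ? cfg.fastModel : cfg.accurateModel;
}

const dualModel: DualModelSketchConfig = {
  enabled: true,
  durationThreshold: 20,
  fastModel: 'base',
  accurateModel: 'medium',
};
const shortClip = pickModelSketch(pcmDurationSeconds(160000), dualModel); // 5 s of audio
const longClip = pickModelSketch(pcmDurationSeconds(960000), dualModel); // 30 s of audio
```

Whether the boundary case (exactly at the threshold) goes to the fast or the accurate model is a design choice; this sketch sends it to the fast model.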
Usage:
import { createSpeechRecognizer, DEFAULT_SPEECH_RECOGNITION_CONFIG } from './voice/index.js';

const recognizer = createSpeechRecognizer({
  provider: 'whisper',
  language: 'en-US',
  apiKey: process.env.OPENAI_API_KEY,
  dualModel: {
    enabled: true,
    durationThreshold: 20, // seconds
    fastModel: 'base',
    accurateModel: 'medium',
  },
});

recognizer.on('transcript', (result) => {
  if (result.isFinal) {
    console.log(`Final transcript: ${result.text}`);
  }
});

recognizer.on('error', (error) => {
  console.error('Speech recognition error:', error);
});

await recognizer.startListening();

// Feed audio chunks (e.g., from a microphone stream)
// recognizer.processAudio(audioBuffer);
// ...

await recognizer.stopListening(); // Triggers transcription of buffered audio
Speech Recognition Call Flow
The transcribe method is central to the SpeechRecognizer, dispatching to the appropriate provider. The transcribeLocal method further illustrates the dual-model strategy and fallback logic.
graph TD
A[SpeechRecognizer.transcribe(audio)] --> B{config.provider?};
B -- whisper --> C[transcribeWithWhisper(audio)];
B -- google --> D[transcribeWithGoogle(audio)];
B -- azure --> E[transcribeWithAzure(audio)];
B -- deepgram --> F[transcribeWithDeepgram(audio)];
B -- local/default --> G[transcribeLocal(audio)];
G --> H{isLikelyWav(audio)?};
H -- Yes --> I[transcribeWithLocalWhisperCli(audio)];
I -- Success --> J[Return TranscriptResult];
I -- Failure --> K{config.apiKey?};
K -- Yes --> C;
K -- No --> L[Return empty TranscriptResult];
H -- No --> K;
I --> M[selectModelForDuration(audio.length)];
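The local branch of the flow above can be approximated with a header check and a routing function. This is a sketch under assumptions: looksLikeWav and routeLocalSketch are hypothetical stand-ins for isLikelyWav and the fallback logic, and the check only inspects the standard RIFF/WAVE magic bytes (bytes 0 to 3 and 8 to 11 of a WAV file).

```typescript
type LocalRoute = 'local-cli' | 'openai-api' | 'empty';

// A WAV file starts with "RIFF", a 4-byte chunk size, then "WAVE".
function looksLikeWav(audio: Uint8Array): boolean {
  if (audio.length < 12) return false;
  const ascii = (offset: number): string =>
    String.fromCharCode(...Array.from(audio.subarray(offset, offset + 4)));
  return ascii(0) === 'RIFF' && ascii(8) === 'WAVE';
}

// Mirrors the diagram: WAV input goes to the local CLI first; if the input is
// not WAV or the CLI fails, use the API when a key exists, else return empty.
// cliSucceeds is a stand-in for the outcome of actually running the CLI.
function routeLocalSketch(
  audio: Uint8Array,
  apiKey: string | undefined,
  cliSucceeds: boolean,
): LocalRoute {
  if (looksLikeWav(audio) && cliSucceeds) return 'local-cli';
  return apiKey ? 'openai-api' : 'empty';
}

// "RIFF" + 4 size bytes + "WAVE"
const wavHeader = new Uint8Array([
  0x52, 0x49, 0x46, 0x46, 0, 0, 0, 0, 0x57, 0x41, 0x56, 0x45,
]);
const viaCli = routeLocalSketch(wavHeader, undefined, true);
const viaApi = routeLocalSketch(wavHeader, 'example-key', false);
const noResult = routeLocalSketch(new Uint8Array(16), undefined, true);
```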
VoiceActivityDetector (src/voice/voice-activity.ts)
The VoiceActivityDetector analyzes incoming audio frames to determine if speech is present. It uses an energy-based detection method with adaptive thresholding to distinguish speech from background noise.
Key Features:
- Energy-Based Detection: Calculates the Root Mean Square (RMS) energy of audio frames.
- Adaptive Thresholding: Maintains a history of audio energy to dynamically adjust noiseFloor and speechThreshold, making it more resilient to varying noise environments.
- Speech State Management: Tracks speechStartTime and silenceStartTime to confirm speech start and end events based on configured thresholds and durations (minSpeechDuration, maxSilenceDuration).
- Events: Emits speech-start and speech-end events when voice activity changes.
- Configuration: Managed via VADConfig, allowing fine-tuning of thresholds, padding, and durations.
Note: The current implementation uses a basic energy-based approach. For production-grade accuracy, integrating more advanced VAD libraries like WebRTC VAD or Silero VAD would be beneficial.
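As a rough illustration of energy-based detection with an adaptive threshold, the sketch below computes the RMS of a 16-bit PCM frame and derives a speech threshold from a rolling energy history. The class name, the 10th-percentile noise floor, the margin of 3, and the minimum threshold of 0.01 are all assumptions for illustration, not values taken from the module.

```typescript
// RMS energy of a 16-bit PCM frame, normalized to the range 0..1.
function rmsEnergy(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const sample = frame[i] / 32768;
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Hypothetical adaptive threshold: the noise floor is a low percentile of
// recent frame energies, and the speech threshold sits a margin above it.
class AdaptiveThresholdSketch {
  private history: number[] = [];
  private readonly historySize: number;
  private readonly margin: number;
  private readonly minThreshold: number;
  noiseFloor = 0;
  speechThreshold: number;

  constructor(historySize = 50, margin = 3, minThreshold = 0.01) {
    this.historySize = historySize;
    this.margin = margin;
    this.minThreshold = minThreshold;
    this.speechThreshold = minThreshold;
  }

  // Records the frame energy, refreshes both thresholds, and reports
  // whether this frame is above the speech threshold.
  update(energy: number): boolean {
    this.history.push(energy);
    if (this.history.length > this.historySize) this.history.shift();
    const sorted = [...this.history].sort((a, b) => a - b);
    this.noiseFloor = sorted[Math.floor(sorted.length * 0.1)];
    this.speechThreshold = Math.max(this.minThreshold, this.noiseFloor * this.margin);
    return energy > this.speechThreshold;
  }
}

const tracker = new AdaptiveThresholdSketch();
const quietFrame = new Int16Array(160).fill(100);
const loudFrame = new Int16Array(160).fill(10000);
let quietFlagged = false;
for (let i = 0; i < 20; i++) {
  quietFlagged = tracker.update(rmsEnergy(quietFrame)) || quietFlagged;
}
const loudFlagged = tracker.update(rmsEnergy(loudFrame));
```

A full detector would additionally require the energy to stay above the threshold for minSpeechDuration, and below it for maxSilenceDuration, before emitting speech-start or speech-end.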
Usage:
import { createVADDetector, DEFAULT_VAD_CONFIG } from './voice/index.js';

const vad = createVADDetector({
  enabled: true,
  speechStartThreshold: 0.6,
  maxSilenceDuration: 1000,
});

vad.on('speech-start', (event) => {
  console.log(`Speech started at ${event.positionMs}ms`);
});

vad.on('speech-end', (event) => {
  console.log(`Speech ended at ${event.positionMs}ms`);
});

// In a real scenario, audio frames would come from a microphone
// For example, a 16kHz, 16-bit mono audio frame
// vad.processFrame(audioFrameBuffer);
VoiceToCodePipeline (src/voice/voice-to-code.ts)
The VoiceToCodePipeline orchestrates the speech recognition process and adds an intent detection layer. It listens for transcriptions, then classifies them as either a "command" (e.g., "run tests") or "dictation" (e.g., code snippets).
Key Features:
- STT Orchestration: Dynamically imports and configures a SpeechRecognizer based on its own sttProvider setting.
- Intent Detection: Uses a set of COMMAND_PATTERNS (regular expressions) to classify transcribed text.
- Graceful Degradation: If the underlying STT modules (e.g., local Whisper CLI, Picovoice) are not available or fail to initialize, it emits an error event with helpful setup instructions.
- Events: Emits transcription (raw text), command (for detected commands), dictation (for code dictation), error, and status.
- Configuration: Managed via VoiceCodeConfig, including sttProvider, language, and autoExecute.
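Pattern-based intent detection of the kind described above might look like the following sketch. The patterns here are invented for illustration; the module's actual COMMAND_PATTERNS are not listed in this document, and detectIntentSketch is a hypothetical name.

```typescript
type Intent = 'command' | 'dictation';

// Hypothetical patterns. None use the `g` flag, so RegExp.test stays stateless.
const COMMAND_PATTERNS_SKETCH: RegExp[] = [
  /^(run|execute)\s+(the\s+)?tests?\b/i,
  /^(open|close|save)\s+(the\s+)?file\b/i,
  /^(undo|redo)\b/i,
];

// Anything matching a command pattern is a command; everything else is
// treated as dictation.
function detectIntentSketch(text: string): Intent {
  const trimmed = text.trim();
  return COMMAND_PATTERNS_SKETCH.some((pattern) => pattern.test(trimmed))
    ? 'command'
    : 'dictation';
}

const runTests = detectIntentSketch('Run the tests please');
const dictated = detectIntentSketch('const x equals five');
```

Treating dictation as the default keeps unrecognized speech from being executed, which matters most when autoExecute is enabled.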
Usage:
import { createVoiceToCodePipeline } from './voice/index.js';

const pipeline = createVoiceToCodePipeline({
  sttProvider: 'whisper',
  language: 'en-US',
  autoExecute: false,
});

pipeline.on('transcription', (text) => {
  console.log(`Raw transcription: "${text}"`);
});

pipeline.on('command', (commandText) => {
  console.log(`Detected command: "${commandText}" - Executing...`);
  // Logic to execute the command
});

pipeline.on('dictation', (dictationText) => {
  console.log(`Detected dictation: "${dictationText}" - Inserting into editor...`);
  // Logic to insert dictation into code editor
});

pipeline.on('error', (error) => {
  console.error('Voice-to-Code Pipeline Error:', error.message);
});

await pipeline.start();

// The pipeline internally manages the SpeechRecognizer and its audio input.
// You would typically feed audio to the underlying SpeechRecognizer instance
// or a higher-level audio input module that connects to it.
// For example, if a microphone stream is active, the SpeechRecognizer
// would receive audio and emit transcripts.
Voice-to-Code Pipeline Flow
graph TD
A[VoiceToCodePipeline.start()] --> B{Dynamic Import SpeechRecognizer};
B -- Success --> C[SpeechRecognizer.startListening()];
C --> D[SpeechRecognizer.on('transcript')];
D --> E{isFinal?};
E -- Yes --> F[emit 'transcription'];
F --> G[detectIntent(text)];
G -- 'command' --> H[emit 'command'];
G -- 'dictation' --> I[emit 'dictation'];
B -- Failure --> J[getSetupInstructions()];
J --> K[emit 'error'];
Integration & Usage
The components in the src/voice module are designed to be integrated into a larger voice control system. For example, src/input/voice-control.ts and src/input/voice-input.ts are known consumers of these modules.
A typical integration flow might look like this:
- Audio Input: An audio input module (e.g., from src/input/) captures raw microphone audio.
- VAD: The VoiceActivityDetector processes audio frames to identify speech segments.
- Wake Word: The WakeWordDetector (in Porcupine mode) continuously processes audio frames to detect a wake word. If in text-match mode, it would receive transcripts from the SpeechRecognizer.
- Speech Recognition: When a wake word is detected, or a push-to-talk button is pressed, or VAD indicates speech, the SpeechRecognizer is activated to transcribe the audio.
- Voice-to-Code Pipeline: The VoiceToCodePipeline consumes the SpeechRecognizer's output, classifies the intent, and triggers appropriate actions (e.g., executing a command, inserting code).
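One way to reason about this flow is as a small state machine driven by events from the components. The sketch below is purely illustrative: the state names, the event names, and the nextState function are assumptions, not part of the module or of its consumers in src/input/.

```typescript
type SessionState = 'idle' | 'listening' | 'transcribing';
type SessionEvent =
  | 'wake-word'
  | 'push-to-talk'
  | 'speech-start'
  | 'speech-end'
  | 'transcript';

// Hypothetical transition function for a voice session.
function nextState(state: SessionState, event: SessionEvent): SessionState {
  if (state === 'idle') {
    // Any of the three triggers activates the recognizer.
    const activates =
      event === 'wake-word' || event === 'push-to-talk' || event === 'speech-start';
    return activates ? 'listening' : 'idle';
  }
  if (state === 'listening') {
    // End of speech stops recording and hands the buffer to STT.
    return event === 'speech-end' ? 'transcribing' : 'listening';
  }
  // transcribing: a transcript completes the cycle.
  return event === 'transcript' ? 'idle' : 'transcribing';
}

// Replay a typical interaction and record each state along the way.
const events: SessionEvent[] = ['wake-word', 'speech-end', 'transcript'];
const trace: SessionState[] = [];
let current: SessionState = 'idle';
for (const event of events) {
  current = nextState(current, event);
  trace.push(current);
}
```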
Configuration (src/voice/types.ts)
The types.ts file defines all configuration interfaces and their default values, ensuring consistency and ease of customization across the voice module.
Key configuration interfaces include:
- WakeWordConfig: For WakeWordDetector.
- SpeechRecognitionConfig: For SpeechRecognizer.
- VADConfig: For VoiceActivityDetector.
- VoiceSessionConfig: A higher-level configuration that bundles the above for a complete voice session.
- AudioStreamConfig: Defines parameters for audio input streams.
Developers should refer to types.ts for a complete list of configurable options and their default values.
Error Handling & Fallbacks
The module is designed with robustness in mind:
- Dynamic Imports: Native dependencies like @picovoice/porcupine-node are dynamically imported, allowing the module to load even if these dependencies are not installed, and then provide specific error messages.
- Provider Fallbacks: WakeWordDetector falls back to text-match if Porcupine fails. SpeechRecognizer's local provider falls back to OpenAI API if the local Whisper CLI is unavailable.
- Setup Instructions: The VoiceToCodePipeline provides detailed setup instructions via its error event if required STT components are missing, guiding developers on how to resolve common issues.
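The dynamic-import fallback pattern can be sketched generically as follows. This is a minimal illustration under assumptions: resolveEngineSketch and its return shape are hypothetical, and the real detector's initialization involves more than a successful import (for example, an access key check).

```typescript
type Engine = 'porcupine' | 'text-match';

interface EngineChoice {
  engine: Engine;
  reason?: string; // why the fallback was chosen, if it was
}

// Tries to load an optional dependency at runtime. If the package is missing
// or fails to load, resolves to the text-match fallback instead of throwing.
async function resolveEngineSketch(moduleName: string): Promise<EngineChoice> {
  try {
    // A non-literal specifier keeps the import out of static resolution,
    // so the surrounding module still loads when the package is absent.
    await import(moduleName);
    return { engine: 'porcupine' };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { engine: 'text-match', reason };
  }
}

// Typical call site; resolves to the fallback when the package is not installed.
const enginePromise = resolveEngineSketch('@picovoice/porcupine-node');
```

Capturing the failure reason alongside the fallback choice makes it possible to surface a specific message to the developer rather than failing silently.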