src — voice

Module: src-voice Cohesion: 0.80 Members: 0

The src/voice module provides the foundational capabilities for voice interaction within the application, encompassing wake word detection, speech-to-text (STT) recognition, voice activity detection (VAD), and a high-level pipeline for voice-to-code functionality. It is designed to be modular, allowing different providers and detection methods to be swapped or configured.

Core Concepts

The module is built around several key voice interaction concepts:

Module Structure

The src/voice directory is organized as follows:

Key Components

WakeWordDetector (src/voice/wake-word.ts)

The WakeWordDetector is responsible for identifying predefined wake words in an incoming audio stream. It prioritizes using the Picovoice Porcupine engine for robust, local wake word detection, but gracefully falls back to a text-matching approach if a Picovoice access key is not provided or if Porcupine initialization fails.

Key Features:

Usage:

import { createWakeWordDetector, DEFAULT_WAKE_WORD_CONFIG } from './voice/index.js';

const detector = createWakeWordDetector({
  wakeWords: ['hey buddy'],
  accessKey: process.env.PICOVOICE_ACCESS_KEY,
});

detector.on('detected', (detection) => {
  console.log(`Wake word detected: ${detection.wakeWord} at ${detection.timestamp}`);
});

await detector.start();
// In Porcupine mode, feed raw audio frames:
// detector.processFrame(audioFrameInt16Array);
// In text-match mode, feed transcribed text:
// detector.detectWakeWordText("hey buddy, how are you?");
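
The text-matching fallback described above can be illustrated with a minimal sketch. This is not the module's actual implementation: the WakeWordMatch shape and the normalize and matchWakeWord helpers are hypothetical names introduced here, and the matching rule (case-insensitive, punctuation-insensitive, whole-word phrase match) is an assumption.

```typescript
// Hypothetical result shape for illustration; the real type is defined by the module.
interface WakeWordMatch {
  wakeWord: string;
  timestamp: number;
}

// Lowercase, replace punctuation with spaces, and collapse whitespace.
// Assumption: ASCII-only wake words; other scripts would need a different rule.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Return the first configured wake word that appears in the transcript as a
// whole-word phrase, or null when none is present.
function matchWakeWord(
  transcript: string,
  wakeWords: string[],
  now: number = Date.now(),
): WakeWordMatch | null {
  // Padding with spaces makes the substring test respect word boundaries.
  const haystack = " " + normalize(transcript) + " ";
  for (const wakeWord of wakeWords) {
    const needle = normalize(wakeWord);
    if (needle.length > 0 && haystack.indexOf(" " + needle + " ") !== -1) {
      return { wakeWord, timestamp: now };
    }
  }
  return null;
}
```

Matching on whole words keeps a phrase such as "they buddying up" from triggering "hey buddy", at the cost of missing transcripts where the recognizer misspells the wake word.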

SpeechRecognizer (src/voice/speech-recognition.ts)

The SpeechRecognizer converts spoken audio into text using various backend providers. It acts as an EventEmitter, emitting transcript events when speech is recognized.

Key Features:

Usage:

import { createSpeechRecognizer, DEFAULT_SPEECH_RECOGNITION_CONFIG } from './voice/index.js';

const recognizer = createSpeechRecognizer({
  provider: 'whisper',
  language: 'en-US',
  apiKey: process.env.OPENAI_API_KEY,
  dualModel: {
    enabled: true,
    durationThreshold: 20, // seconds
    fastModel: 'base',
    accurateModel: 'medium',
  },
});

recognizer.on('transcript', (result) => {
  if (result.isFinal) {
    console.log(`Final transcript: ${result.text}`);
  }
});

recognizer.on('error', (error) => {
  console.error('Speech recognition error:', error);
});

await recognizer.startListening();
// Feed audio chunks (e.g., from a microphone stream)
// recognizer.processAudio(audioBuffer);
// ...
await recognizer.stopListening(); // Triggers transcription of buffered audio

Speech Recognition Call Flow

The transcribe method is central to the SpeechRecognizer, dispatching to the appropriate provider. The transcribeLocal method further illustrates the dual-model strategy and fallback logic.

graph TD
    A["SpeechRecognizer.transcribe(audio)"] --> B{"config.provider?"};
    B -- whisper --> C["transcribeWithWhisper(audio)"];
    B -- google --> D["transcribeWithGoogle(audio)"];
    B -- azure --> E["transcribeWithAzure(audio)"];
    B -- deepgram --> F["transcribeWithDeepgram(audio)"];
    B -- local/default --> G["transcribeLocal(audio)"];

    G --> H{"isLikelyWav(audio)?"};
    H -- Yes --> I["transcribeWithLocalWhisperCli(audio)"];
    I -- Success --> J[Return TranscriptResult];
    I -- Failure --> K{"config.apiKey?"};
    K -- Yes --> C;
    K -- No --> L[Return empty TranscriptResult];
    H -- No --> K;

    I --> M["selectModelForDuration(audio.length)"];
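
As a rough illustration of the two checks named in the diagram, the sketch below shows one way a WAV sniff and a duration-based model choice could work. It reuses the names isLikelyWav and selectModelForDuration from the diagram, but the bodies are assumptions, not the module's code: the header check assumes a canonical 44-byte RIFF/WAVE header, estimateWavDurationSeconds is a hypothetical helper, and the rule "short audio uses the fast model, longer audio uses the accurate one" is a guess at the intent of durationThreshold. The diagram passes audio.length; this sketch derives seconds from the header instead.

```typescript
// Mirrors the dualModel option from the usage example above.
interface DualModelConfig {
  enabled: boolean;
  durationThreshold: number; // seconds
  fastModel: string;
  accurateModel: string;
}

// Read a 4-character ASCII tag from the buffer.
function tagAt(view: DataView, offset: number): string {
  let tag = "";
  for (let i = 0; i < 4; i++) {
    tag += String.fromCharCode(view.getUint8(offset + i));
  }
  return tag;
}

// A RIFF/WAVE file starts with "RIFF" at byte 0 and "WAVE" at byte 8.
function isLikelyWav(audio: Uint8Array): boolean {
  if (audio.length < 12) return false;
  const view = new DataView(audio.buffer, audio.byteOffset, audio.byteLength);
  return tagAt(view, 0) === "RIFF" && tagAt(view, 8) === "WAVE";
}

// Assumption: canonical 44-byte PCM header, with the byte rate stored as a
// little-endian uint32 at offset 28 and sample data starting at offset 44.
function estimateWavDurationSeconds(audio: Uint8Array): number {
  if (audio.length <= 44) return 0;
  const view = new DataView(audio.buffer, audio.byteOffset, audio.byteLength);
  const byteRate = view.getUint32(28, true);
  return byteRate > 0 ? (audio.length - 44) / byteRate : 0;
}

// Assumed policy: below the threshold use the fast model, otherwise the accurate one.
function selectModelForDuration(seconds: number, cfg: DualModelConfig): string {
  if (!cfg.enabled) return cfg.accurateModel;
  return seconds < cfg.durationThreshold ? cfg.fastModel : cfg.accurateModel;
}
```

Real WAV files may carry extra chunks before the sample data, so a production check would walk the chunk list rather than assume a fixed offset.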

VoiceActivityDetector (src/voice/voice-activity.ts)

The VoiceActivityDetector analyzes incoming audio frames to determine if speech is present. It uses an energy-based detection method with adaptive thresholding to distinguish speech from background noise.

Key Features:

Note: The current implementation uses a basic energy-based approach. For production-grade accuracy, integrating more advanced VAD libraries like WebRTC VAD or Silero VAD would be beneficial.

Usage:

import { createVADDetector, DEFAULT_VAD_CONFIG } from './voice/index.js';

const vad = createVADDetector({
  enabled: true,
  speechStartThreshold: 0.6,
  maxSilenceDuration: 1000,
});

vad.on('speech-start', (event) => {
  console.log(`Speech started at ${event.positionMs}ms`);
});

vad.on('speech-end', (event) => {
  console.log(`Speech ended at ${event.positionMs}ms`);
});

// In a real scenario, audio frames would come from a microphone
// For example, a 16kHz, 16-bit mono audio frame
// vad.processFrame(audioFrameBuffer);
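
To make the energy-based approach with adaptive thresholding more concrete, here is a minimal sketch of the general technique. It is not the VoiceActivityDetector itself: the EnergyVad class, its ratio, maxSilenceMs and adaptRate parameters, and the returned VadEvent shape are all hypothetical, and the real detector emits events rather than returning them.

```typescript
// Hypothetical event shape, loosely modelled on the events in the usage example.
interface VadEvent {
  type: "speech-start" | "speech-end";
  positionMs: number;
}

// Root-mean-square energy of a 16-bit PCM frame, scaled to the range 0..1.
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (let i = 0; i < frame.length; i++) {
    const sample = frame[i] / 32768;
    sum += sample * sample;
  }
  return Math.sqrt(sum / frame.length);
}

class EnergyVad {
  private noiseFloor = 0.01; // running estimate of background energy
  private inSpeech = false;
  private silenceMs = 0;
  private positionMs = 0;
  private readonly ratio: number;
  private readonly maxSilenceMs: number;
  private readonly adaptRate: number;

  constructor(ratio = 3, maxSilenceMs = 1000, adaptRate = 0.05) {
    this.ratio = ratio;
    this.maxSilenceMs = maxSilenceMs;
    this.adaptRate = adaptRate;
  }

  // Process one frame of frameMs milliseconds; returns an event on a state change.
  processFrame(frame: Int16Array, frameMs: number): VadEvent | null {
    const energy = rms(frame);
    // The threshold tracks the noise floor, so it adapts to the environment.
    const isSpeech = energy > this.noiseFloor * this.ratio;
    let event: VadEvent | null = null;

    if (isSpeech) {
      this.silenceMs = 0;
      if (!this.inSpeech) {
        this.inSpeech = true;
        event = { type: "speech-start", positionMs: this.positionMs };
      }
    } else {
      // Update the noise floor only on non-speech frames (exponential moving average).
      this.noiseFloor += this.adaptRate * (energy - this.noiseFloor);
      this.noiseFloor = Math.max(this.noiseFloor, 1e-4);
      if (this.inSpeech) {
        this.silenceMs += frameMs;
        if (this.silenceMs >= this.maxSilenceMs) {
          this.inSpeech = false;
          event = { type: "speech-end", positionMs: this.positionMs + frameMs };
        }
      }
    }
    this.positionMs += frameMs;
    return event;
  }
}
```

The silence counter acts as a hangover period, so short pauses between words do not end the speech segment.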

VoiceToCodePipeline (src/voice/voice-to-code.ts)

The VoiceToCodePipeline orchestrates the speech recognition process and adds an intent detection layer. It listens for transcriptions, then classifies them as either a "command" (e.g., "run tests") or "dictation" (e.g., code snippets).

Key Features:

Usage:

import { createVoiceToCodePipeline } from './voice/index.js';

const pipeline = createVoiceToCodePipeline({
  sttProvider: 'whisper',
  language: 'en-US',
  autoExecute: false,
});

pipeline.on('transcription', (text) => {
  console.log(`Raw transcription: "${text}"`);
});

pipeline.on('command', (commandText) => {
  console.log(`Detected command: "${commandText}" - Executing...`);
  // Logic to execute the command
});

pipeline.on('dictation', (dictationText) => {
  console.log(`Detected dictation: "${dictationText}" - Inserting into editor...`);
  // Logic to insert dictation into code editor
});

pipeline.on('error', (error) => {
  console.error('Voice-to-Code Pipeline Error:', error.message);
});

await pipeline.start();
// The pipeline internally manages the SpeechRecognizer and its audio input.
// You would typically feed audio to the underlying SpeechRecognizer instance
// or a higher-level audio input module that connects to it.
// For example, if a microphone stream is active, the SpeechRecognizer
// would receive audio and emit transcripts.

Voice-to-Code Pipeline Flow

graph TD
    A["VoiceToCodePipeline.start()"] --> B{Dynamic Import SpeechRecognizer};
    B -- Success --> C["SpeechRecognizer.startListening()"];
    C --> D["SpeechRecognizer.on('transcript')"];
    D --> E{"isFinal?"};
    E -- Yes --> F["emit 'transcription'"];
    F --> G["detectIntent(text)"];
    G -- command --> H["emit 'command'"];
    G -- dictation --> I["emit 'dictation'"];
    B -- Failure --> J["getSetupInstructions()"];
    J --> K["emit 'error'"];
Integration & Usage

The components in the src/voice module are designed to be integrated into a larger voice control system. For example, src/input/voice-control.ts and src/input/voice-input.ts are known consumers of these modules.

A typical integration flow might look like this:

  1. Audio Input: An audio input module (e.g., from src/input/) captures raw microphone audio.
  2. VAD: The VoiceActivityDetector processes audio frames to identify speech segments.
  3. Wake Word: The WakeWordDetector (in Porcupine mode) continuously processes audio frames to detect a wake word. If in text-match mode, it would receive transcripts from the SpeechRecognizer.
  4. Speech Recognition: When a wake word is detected, or a push-to-talk button is pressed, or VAD indicates speech, the SpeechRecognizer is activated to transcribe the audio.
  5. Voice-to-Code Pipeline: The VoiceToCodePipeline consumes the SpeechRecognizer's output, classifies the intent, and triggers appropriate actions (e.g., executing a command, inserting code).
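
The activation logic in step 4 can be sketched as a small state machine. This is only an illustration of the flow described above: the state names, event names, and the step function are hypothetical and do not correspond to identifiers in src/input or src/voice.

```typescript
type SessionState = "idle" | "listening";
type SessionEvent = "wake-word" | "push-to-talk" | "speech-start" | "speech-end";
type SessionAction = "start-recognizer" | "stop-recognizer" | "none";

interface Transition {
  state: SessionState;
  action: SessionAction;
}

// Pure transition function: given the current state and an incoming event,
// decide the next state and what the caller should do with the recognizer.
function step(state: SessionState, event: SessionEvent): Transition {
  if (state === "idle" && event !== "speech-end") {
    // Any of the three triggers (wake word, push-to-talk, VAD speech start)
    // activates the recognizer.
    return { state: "listening", action: "start-recognizer" };
  }
  if (state === "listening" && event === "speech-end") {
    // End of speech stops the recognizer, which transcribes the buffered audio.
    return { state: "idle", action: "stop-recognizer" };
  }
  // Repeated triggers while listening, or speech-end while idle, are ignored.
  return { state, action: "none" };
}
```

A caller would map "start-recognizer" to startListening() and "stop-recognizer" to stopListening() on the recognizer it owns, keeping the trigger policy separate from the audio plumbing.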

Configuration (src/voice/types.ts)

The types.ts file defines all configuration interfaces and their default values, ensuring consistency and ease of customization across the voice module.

Key configuration interfaces include:

Developers should refer to types.ts for a complete list of configurable options and their default values.

Error Handling & Fallbacks

The module is designed with robustness in mind: