src — talk-mode
The src/talk-mode module provides a Text-to-Speech (TTS) system that integrates multiple TTS providers and manages speech synthesis, caching, and queued playback. It acts as the central hub for all voice output within the application.
Module Purpose
The primary goals of the talk-mode module are:
- Abstract TTS Providers: Offer a unified interface for interacting with different TTS services (e.g., OpenAI, ElevenLabs, Edge TTS, local engines).
- Manage Voices: Discover and manage available voices across all integrated providers.
- Efficient Synthesis: Provide mechanisms for caching synthesized audio to reduce API calls and latency.
- Queued Playback: Handle a queue of speech requests, allowing for prioritized and sequential playback.
- Configurability: Allow flexible configuration of providers, default voices, synthesis options, and queue behavior.
Architecture Overview
The talk-mode module follows a provider-based architecture, centered around the TTSManager class.
graph TD
A[Application Code] --> B(getTTSManager())
B --> C(TTSManager)
C -- "Delegates synthesis & voice listing" --> D{ITTSProvider Interface}
D --> E(OpenAITTSProvider)
D --> F(ElevenLabsProvider)
D --> G(EdgeTTSProvider)
D --> H(AudioReaderTTSProvider)
D --> I(MockTTSProvider)
C -- "Manages queue, cache, config" --> C
C -- "Emits events (synthesis, playback, queue)" --> A
C -- "Uses types from" --> J(types.ts)
- TTSManager: The core class that orchestrates all TTS operations. It manages the lifecycle of TTS providers, selects the active provider, handles voice discovery, performs speech synthesis (including caching), and manages the playback queue. It also emits events for various stages of synthesis and playback.
- ITTSProvider: An interface that defines the contract for any TTS service integration. Each concrete TTS provider must implement this interface.
- Concrete TTS Providers: Classes like OpenAITTSProvider, ElevenLabsProvider, EdgeTTSProvider, and AudioReaderTTSProvider implement the ITTSProvider interface, providing specific logic to interact with their respective TTS APIs or local binaries. A MockTTSProvider is also included for testing and development.
- types.ts: This file centralizes all type definitions, configurations, and default values used throughout the module, ensuring consistency and clarity.
Key Components
TTSManager (src/talk-mode/tts-manager.ts)
The TTSManager is the central component of the talk-mode module. It extends EventEmitter to provide a rich set of events for monitoring its state and operations.
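Because the manager extends EventEmitter, callers can observe it with ordinary on() listeners. The following is a minimal sketch that uses a hypothetical stand-in class (SketchManager) rather than the real TTSManager; the queue-change event name comes from this document, while the payload shape is an assumption made for illustration.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-in for TTSManager: only the event plumbing is shown.
class SketchManager extends EventEmitter {
  private length = 0;

  // Assumed payload shape; the real TalkModeEvents signatures live in types.ts.
  enqueue(text: string): void {
    this.length += 1;
    this.emit("queue-change", { length: this.length, lastText: text });
  }
}

const manager = new SketchManager();
const seen: string[] = [];

// Subscribe before triggering work so that no event is missed.
manager.on("queue-change", (info: { length: number; lastText: string }) => {
  seen.push(`queue-change:${info.length}:${info.lastText}`);
});

manager.enqueue("hello");
```

The same pattern would apply to the synthesis and playback events, with whatever payloads the real TalkModeEvents type declares.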
Initialization and Lifecycle:
- constructor(config?: Partial<TalkModeConfig>): Initializes the manager with a given configuration, merging with DEFAULT_TALK_MODE_CONFIG.
- initialize():
  - Registers a MockTTSProvider by default if no other providers are explicitly registered.
  - Initializes all configured and enabled providers by calling their initialize() method.
  - Calls selectBestProvider() to determine which provider will be used for synthesis.
  - Calls loadVoices() to fetch and cache voices from the active provider.
- shutdown(): Stops any ongoing playback, clears the queue and cache, and calls shutdown() on all registered providers to release resources.
Provider Management:
- registerProvider(provider: ITTSProvider): Adds a new TTS provider to the manager.
- selectBestProvider(): Automatically selects an active provider based on its availability and configured priority.
- getActiveProvider(): ITTSProvider | null: Returns the currently active provider.
- setActiveProvider(providerId: TTSProvider): Manually sets the active provider if it's available. This will trigger a loadVoices() call and emit a provider-change event.
- listProviders(): Array<{ id: TTSProvider; available: boolean }>: Returns a list of all registered providers and their availability status.
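Selection by availability and priority can be sketched as a pure function. This is an illustration only: the ProviderCandidate shape and the pickBestProvider name are hypothetical, and the sketch assumes that a larger priority number wins, which the real selectBestProvider() may or may not do.

```typescript
// Shape mirrors the listProviders() output plus the configured priority.
interface ProviderCandidate {
  id: string;
  available: boolean;
  priority: number;
}

// Assumption: a higher priority number wins; the real ordering may be the reverse.
function pickBestProvider(candidates: ProviderCandidate[]): ProviderCandidate | null {
  const usable = candidates.filter((c) => c.available);
  if (usable.length === 0) return null;
  // Keep the first candidate seen when priorities tie.
  return usable.reduce((best, c) => (c.priority > best.priority ? c : best));
}

const chosen = pickBestProvider([
  { id: "openai", available: false, priority: 10 }, // skipped: not available
  { id: "edge", available: true, priority: 5 },
  { id: "mock", available: true, priority: 0 },
]);
```

Here the unavailable provider is ignored even though it has the highest priority, so the sketch falls back to the next usable candidate.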
Voice Management:
- loadVoices(): Fetches voices from the activeProvider and caches them internally.
- getVoices(): Voice[]: Returns all voices loaded from the active provider.
- getVoicesForLanguage(language: string): Voice[]: Filters voices by language.
- getVoice(id: string): Voice | undefined: Retrieves a specific voice by its ID.
- getDefaultVoice(): Voice | undefined: Returns the configured default voice or the first voice marked as default by the provider.
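Language filtering might look like the sketch below. The matching rule is an assumption: this document does not say whether getVoicesForLanguage() compares tags exactly or by prefix, so the sketch accepts both "en-US" and the bare prefix "en". SketchVoice is a reduced, hypothetical version of the Voice type.

```typescript
// Reduced voice shape; the full Voice type in types.ts has more fields.
interface SketchVoice {
  id: string;
  name: string;
  language: string;
}

// Assumption: "en" should match "en-US" and "en-GB"; comparison is case-insensitive.
function voicesForLanguage(voices: SketchVoice[], language: string): SketchVoice[] {
  const wanted = language.toLowerCase();
  return voices.filter((v) => {
    const lang = v.language.toLowerCase();
    return lang === wanted || lang.startsWith(wanted + "-");
  });
}

// Illustrative sample data, not real provider voices.
const sampleVoices: SketchVoice[] = [
  { id: "a", name: "Sample A", language: "en-US" },
  { id: "b", name: "Sample B", language: "fr-FR" },
  { id: "c", name: "Sample C", language: "en-GB" },
];

const english = voicesForLanguage(sampleVoices, "en");
```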
Speech Synthesis:
- synthesize(text: string, options?: SynthesisOptions): Promise<SynthesisResult>:
  - Delegates the actual synthesis to the activeProvider.
  - Caching: If config.cacheEnabled is true, it first checks an internal cache. If a cached result exists and is within cacheTTLMs, it's returned immediately. Otherwise, it performs synthesis and caches the result (evicting old entries if cacheMaxBytes is exceeded).
  - Uses crypto.createHash to generate cache keys based on text and options.
- clearCache(): Empties the synthesis cache.
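The caching behavior described above can be sketched as follows. Several details are assumptions, not facts about the module: the SHA-256 algorithm, JSON serialization of the options, oldest-first eviction, and the names cacheKey and SynthesisCacheSketch. Audio is held as a Uint8Array to keep the sketch small.

```typescript
import { createHash } from "node:crypto";

interface CacheEntry {
  audio: Uint8Array;
  storedAt: number;
}

// Hash text and options together so different voices or rates get distinct keys.
// Assumption: SHA-256 over a JSON string; option key order therefore matters.
function cacheKey(text: string, options: Record<string, unknown> = {}): string {
  return createHash("sha256").update(JSON.stringify({ text, options })).digest("hex");
}

class SynthesisCacheSketch {
  private entries = new Map<string, CacheEntry>();
  private totalBytes = 0;
  private ttlMs: number;
  private maxBytes: number;

  constructor(ttlMs: number, maxBytes: number) {
    this.ttlMs = ttlMs;
    this.maxBytes = maxBytes;
  }

  get(key: string, now: number = Date.now()): Uint8Array | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now - entry.storedAt > this.ttlMs) {
      this.remove(key); // entry is older than the TTL
      return undefined;
    }
    return entry.audio;
  }

  set(key: string, audio: Uint8Array, now: number = Date.now()): void {
    this.remove(key);
    this.entries.set(key, { audio, storedAt: now });
    this.totalBytes += audio.byteLength;
    // A Map iterates in insertion order, so earlier keys are older entries.
    for (const oldest of Array.from(this.entries.keys())) {
      if (this.totalBytes <= this.maxBytes) break;
      if (oldest !== key) this.remove(oldest);
    }
  }

  private remove(key: string): void {
    const entry = this.entries.get(key);
    if (!entry) return;
    this.totalBytes -= entry.audio.byteLength;
    this.entries.delete(key);
  }
}
```

Passing the current time as a parameter is a choice made for this sketch so that expiry can be exercised without waiting for real time to pass.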
Playback Queue Management:
- speak(text: string, options?: SynthesisOptions): Promise:
  - Creates a SpeechItem and adds it to the internal queue.
  - If config.queueConfig.preSynthesize is enabled, it attempts to synthesize the audio in the background.
  - If config.queueConfig.autoPlay is enabled and nothing is currently playing, it triggers playNext().
  - Emits a queue-change event.
- addToQueue(item: SpeechItem): Manages adding items to the queue, respecting maxSize and priority.
- preSynthesize(item: SpeechItem): Asynchronously synthesizes audio for a queue item, updating its status and emitting synthesis-start, synthesis-complete, or synthesis-error events.
- getQueue(): SpeechItem[]: Returns a copy of the current speech queue.
- clearQueue(): Empties the queue.
- removeFromQueue(id: string): Removes a specific item from the queue.
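One way to respect both maxSize and priority when enqueuing is sketched below. The policy is assumed for illustration: higher priority numbers go first, equal priorities keep arrival order, and a full queue rejects the new item. QueuedItem is a hypothetical, reduced form of SpeechItem, and the function is a standalone stand-in rather than the module's own addToQueue().

```typescript
interface QueuedItem {
  id: string;
  text: string;
  priority: number;
}

// Assumptions: higher priority goes first, ties keep arrival order,
// and a full queue rejects the new item instead of dropping an old one.
function addToQueue(queue: QueuedItem[], item: QueuedItem, maxSize: number): boolean {
  if (queue.length >= maxSize) return false;
  // Insert before the first item with strictly lower priority.
  let index = queue.findIndex((queued) => queued.priority < item.priority);
  if (index === -1) index = queue.length;
  queue.splice(index, 0, item);
  return true;
}

const demoQueue: QueuedItem[] = [];
addToQueue(demoQueue, { id: "low", text: "background notice", priority: 0 }, 10);
addToQueue(demoQueue, { id: "high", text: "urgent alert", priority: 5 }, 10);
// demoQueue now holds "high" ahead of "low".
```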
Playback (Simulated):
- playNext():
  - Dequeues the next SpeechItem.
  - Ensures the item's audio is synthesized (calling preSynthesize if needed).
  - Simulates playback: Uses setTimeout to advance positionMs and emit playback-progress events.
  - Emits playback-start, playback-progress, playback-complete, or playback-error events.
  - Includes a configurable gapMs between items.
  - Automatically plays the next item if autoPlay is enabled.
- stop(): Halts current playback, sets the item status back to 'pending', and re-adds it to the front of the queue.
- pause(): Pauses current playback.
- resume(): Resumes paused playback.
- getPlaybackState(): PlaybackState: Returns the current playback status, position, duration, etc.
- getCurrentItem(): SpeechItem | null: Returns the item currently being played.
- isCurrentlyPlaying(): boolean: Checks if audio is actively playing.
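The state transitions described above can be sketched without real timers. In this simplified, hypothetical model (PlaybackSketch), advance() stands in for one setTimeout callback, so the behavior of pause(), resume(), and stop() can be followed step by step. Event emission, gapMs, and autoPlay are left out, and the status names other than 'pending' are assumptions.

```typescript
type SketchStatus = "pending" | "playing" | "paused" | "completed";

interface PlayItem {
  id: string;
  durationMs: number;
  status: SketchStatus;
}

class PlaybackSketch {
  queue: PlayItem[] = [];
  current: PlayItem | null = null;
  positionMs = 0;

  playNext(): void {
    const next = this.queue.shift();
    if (!next) return;
    next.status = "playing";
    this.current = next;
    this.positionMs = 0;
  }

  // Stands in for one timer callback; a real loop would call this from setTimeout.
  advance(elapsedMs: number): void {
    if (!this.current || this.current.status !== "playing") return;
    this.positionMs = Math.min(this.positionMs + elapsedMs, this.current.durationMs);
    if (this.positionMs >= this.current.durationMs) {
      this.current.status = "completed";
      this.current = null;
    }
  }

  pause(): void {
    if (this.current && this.current.status === "playing") this.current.status = "paused";
  }

  resume(): void {
    if (this.current && this.current.status === "paused") this.current.status = "playing";
  }

  // The interrupted item becomes pending again and returns to the front of the queue.
  stop(): void {
    if (!this.current) return;
    this.current.status = "pending";
    this.queue.unshift(this.current);
    this.current = null;
    this.positionMs = 0;
  }
}
```

Because stop() puts the interrupted item back at the front, a later playNext() in this model restarts that item from position 0 rather than skipping it.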
Configuration and Stats:
- getConfig(): TalkModeConfig: Returns the current configuration.
- updateConfig(config: Partial<TalkModeConfig>): Updates the manager's configuration.
- getStats(): Provides statistics like provider count, voice count, queue length, and cache size.
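Merging a partial configuration over existing values might be sketched as below. The field subset, the default values, and the field-by-field merge of the nested queueConfig are all assumptions made for illustration; the real TalkModeConfig has more fields and may merge differently.

```typescript
interface SketchQueueConfig {
  maxSize: number;
  preSynthesize: boolean;
  autoPlay: boolean;
  gapMs: number;
}

interface SketchConfig {
  cacheEnabled: boolean;
  cacheTTLMs: number;
  queueConfig: SketchQueueConfig;
}

// Illustrative values only; the real defaults live in types.ts.
const SKETCH_DEFAULTS: SketchConfig = {
  cacheEnabled: true,
  cacheTTLMs: 60000,
  queueConfig: { maxSize: 100, preSynthesize: true, autoPlay: true, gapMs: 0 },
};

// Hypothetical patch type that also allows a partial nested queueConfig.
type ConfigPatch = Partial<Omit<SketchConfig, "queueConfig">> & {
  queueConfig?: Partial<SketchQueueConfig>;
};

// Assumption: queueConfig is merged field by field rather than replaced wholesale.
function mergeConfig(base: SketchConfig, patch: ConfigPatch): SketchConfig {
  return {
    ...base,
    ...patch,
    queueConfig: { ...base.queueConfig, ...patch.queueConfig },
  };
}

const merged = mergeConfig(SKETCH_DEFAULTS, {
  cacheEnabled: false,
  queueConfig: { gapMs: 250 },
});
```

The function returns a new object, so the defaults stay untouched and can be reused for other manager instances.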
ITTSProvider Interface (src/talk-mode/tts-manager.ts)
This interface defines the contract that all TTS provider implementations must adhere to.
- readonly id: TTSProvider: A unique identifier for the provider (e.g., 'openai', 'edge').
- isAvailable(): Promise<boolean>: Checks if the provider is operational and accessible (e.g., API key valid, local binary found, service reachable).
- listVoices(): Promise<Voice[]>: Retrieves a list of available voices from the provider.
- synthesize(text: string, options?: SynthesisOptions): Promise<SynthesisResult>: Performs the core text-to-speech synthesis, returning audio data and metadata.
- initialize(config: TTSProviderConfig): Promise: Sets up the provider with its specific configuration.
- shutdown(): Promise: Cleans up any resources held by the provider.
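A provider that satisfies this contract could look like the sketch below. All type names here (SketchTTSProvider, SketchVoice, SketchResult, SketchProviderConfig) are simplified, hypothetical stand-ins for the real types, the resolved Promise types are assumptions, and SilentProvider is an invented example that produces zero-filled audio rather than speech.

```typescript
// Simplified stand-ins for the real types in types.ts.
interface SketchVoice { id: string; name: string; language: string; }
interface SketchResult { audio: Uint8Array; format: string; durationMs: number; }
interface SketchProviderConfig { enabled: boolean; priority: number; }

// Mirrors the ITTSProvider members listed above; return types are assumed.
interface SketchTTSProvider {
  readonly id: string;
  isAvailable(): Promise<boolean>;
  listVoices(): Promise<SketchVoice[]>;
  synthesize(text: string): Promise<SketchResult>;
  initialize(config: SketchProviderConfig): Promise<void>;
  shutdown(): Promise<void>;
}

class SilentProvider implements SketchTTSProvider {
  readonly id = "silent";
  private ready = false;

  async initialize(config: SketchProviderConfig): Promise<void> {
    this.ready = config.enabled;
  }

  async isAvailable(): Promise<boolean> {
    return this.ready;
  }

  async listVoices(): Promise<SketchVoice[]> {
    return [{ id: "silent-1", name: "Silent", language: "en-US" }];
  }

  async synthesize(text: string): Promise<SketchResult> {
    // Assumed pacing of 50 ms per character, one zero byte per millisecond.
    const durationMs = text.length * 50;
    return { audio: new Uint8Array(durationMs), format: "raw", durationMs };
  }

  async shutdown(): Promise<void> {
    this.ready = false;
  }
}
```

A provider like this reports itself unavailable until initialize() has run with enabled set to true, which matches the order in which the manager is described as calling provider methods.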
Concrete TTS Providers (src/talk-mode/providers/)
The module includes several concrete implementations of ITTSProvider:
- OpenAITTSProvider (openai-tts.ts):
  - Integrates with the OpenAI TTS API (https://api.openai.com/v1/audio/speech).
  - Requires an apiKey (can be provided via config or process.env.OPENAI_API_KEY).
  - Supports OpenAI's predefined voices (alloy, echo, fable, onyx, nova, shimmer) and models (tts-1, tts-1-hd).
  - Handles speed and response_format options.
- ElevenLabsProvider (elevenlabs.ts):
  - Integrates with the ElevenLabs API (https://api.elevenlabs.io/v1).
  - Requires an apiKey (can be provided via config or process.env.ELEVENLABS_API_KEY).
  - Supports advanced features like stability, similarityBoost, style, and useSpeakerBoost.
  - Includes methods for cloneVoice() and deleteVoice() for managing custom voices.
  - Attempts to detect language and gender from ElevenLabs voice labels.
- EdgeTTSProvider (edge-tts.ts):
  - Leverages the edge-tts Python CLI tool.
  - Dependency: Requires edge-tts to be installed (e.g., pip install edge-tts).
  - Detection: detectEdgeTTSCommand() attempts to find the edge-tts executable or python -m edge_tts.
  - Uses child_process.spawn to execute the CLI for voice listing and synthesis.
  - Supports rate, volume, and pitch adjustments via CLI arguments.
- AudioReaderTTSProvider (audioreader-tts.ts):
  - Connects to a local AudioReader API (e.g., Kokoro-82M engine) that exposes an OpenAI-compatible REST API.
  - Configurable baseURL, model, defaultVoice, speed, and format.
  - Includes a mapping for known Kokoro voices and OpenAI voice names for compatibility.
- MockTTSProvider (tts-manager.ts):
  - A simple, in-memory provider for testing and development.
  - Simulates synthesis with a configurable delay and generates dummy audio buffers.
  - Provides mock word timings.
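As an illustration of the HTTP-based providers, the sketch below builds (but does not send) a request for the OpenAI speech endpoint named above. The function name buildSpeechRequest, the SpeechRequestOptions shape, and the fallback values are assumptions and not taken from OpenAITTSProvider; the body fields (model, input, voice, speed, response_format) follow the public OpenAI API.

```typescript
// Hypothetical options shape for this sketch.
interface SpeechRequestOptions {
  apiKey: string;
  model?: "tts-1" | "tts-1-hd";
  voice?: string;
  speed?: number;
  responseFormat?: string;
}

// Builds the request without sending it; pass the result to fetch(url, init).
function buildSpeechRequest(text: string, opts: SpeechRequestOptions) {
  return {
    url: "https://api.openai.com/v1/audio/speech",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${opts.apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: opts.model ?? "tts-1",
        input: text,
        voice: opts.voice ?? "alloy",
        speed: opts.speed ?? 1.0,
        response_format: opts.responseFormat ?? "mp3",
      }),
    },
  };
}

// "sk-example" is a placeholder, not a working key.
const speechRequest = buildSpeechRequest("Hello there", { apiKey: "sk-example", voice: "nova" });
```

Keeping request construction separate from the network call makes this part easy to check without an API key; a provider would then read the response body as binary audio.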
types.ts (src/talk-mode/types.ts)
This file defines all the essential data structures and interfaces:
- TTSProvider: Union type for all supported provider IDs.
- TTSProviderConfig: Base configuration for any provider, including enabled status and priority.
- Specific Provider Configs: Interfaces like OpenAITTSConfig, ElevenLabsConfig, EdgeTTSConfig, AudioReaderTTSConfig, PiperConfig, CoquiConfig, ESpeakConfig, SystemTTSConfig define provider-specific settings.
- Voice: Describes a TTS voice, including id, name, language, gender, provider, providerId, quality, and sampleRate.
- SynthesisOptions: Parameters for a synthesis request (e.g., voice, rate, format).
- SynthesisResult: The output of a synthesis operation, containing audio data (as Buffer), format, durationMs, and optional wordTimings.
- SpeechItem: Represents an item in the playback queue, including its text, options, status, and audio (once synthesized).
- QueueConfig: Configuration for the speech queue (e.g., maxSize, preSynthesize, autoPlay, gapMs).
- PlaybackState: Describes the current state of audio playback.
- TalkModeConfig: The top-level configuration for the entire TTSManager, encompassing provider configs, default options, queue config, and caching settings.
- TalkModeEvents: Defines the event signatures emitted by TTSManager, allowing external components to subscribe to synthesis, playback, and queue updates.
- Default Configurations: DEFAULT_QUEUE_CONFIG and DEFAULT_TALK_MODE_CONFIG provide sensible defaults.
Integration with the Codebase
The talk-mode module is designed to be a core utility for any part of the application requiring spoken output.
Incoming Calls:
- commands/cli/speak-command.ts: A CLI command likely uses getTTSManager() to obtain the TTS instance and then calls speak() to vocalize text provided by the user. It might also register specific providers like AudioReaderTTSProvider or OpenAITTSProvider if they are not part of the default configuration.
- tests/talk-mode/tts.test.ts: This module is heavily tested, with unit tests covering TTSManager's functionality, provider interactions, caching, queue management, and playback simulation.
- tests/features/plugins-commands-summarize.test.ts: Feature tests might interact with listProviders() to ensure TTS capabilities are correctly reported.
Outgoing Calls:
- External APIs: Providers make HTTP requests to services like OpenAI and ElevenLabs using fetch.
- Local Processes: EdgeTTSProvider uses child_process.spawn to interact with the edge-tts Python CLI.
- Node.js Core Modules:
  - events (EventEmitter) for internal eventing.
  - crypto for generating cache keys.
  - Buffer for handling raw audio data.
- console.warn: Used by EdgeTTSProvider to inform developers if the edge-tts executable is not found.
This module provides a comprehensive and flexible foundation for integrating various text-to-speech capabilities into the application, abstracting away the complexities of individual providers and offering robust management of speech synthesis and playback.