src — inference

The src/inference module provides tools for optimizing local Large Language Model (LLM) inference. It covers two areas: managing the Key-Value (KV) cache efficiently and accelerating generation through speculative decoding.

This module aims to give developers fine-grained control over inference parameters, enabling them to maximize throughput and minimize latency when running LLMs on local hardware, particularly with backends like llama.cpp or LM Studio.

Module Structure

The inference module is composed of two primary sub-modules:

graph TD
    A[src/inference/index.ts] --> B[src/inference/kv-cache-config.ts]
    A --> C[src/inference/speculative-decoding.ts]

KV-Cache Configuration

The src/inference/kv-cache-config.ts module manages the Key-Value (KV) cache configuration and estimates its memory requirements. The KV cache stores the key and value tensors that each attention layer has already computed for earlier tokens, so they are not recomputed at every generation step. This speeds up token generation, especially for long contexts, at the cost of memory that grows linearly with context length.

This module provides tools to configure parameters such as context length, quantization, and memory offloading, and to estimate their impact on memory usage.
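
To make the memory impact concrete, here is a minimal sketch of the standard KV cache size calculation. It is not the module's actual estimator: the names ArchSketch, BYTES_PER_ELEMENT, and estimateKvCacheBytes are hypothetical, and the example shape (32 layers, 8 KV heads, head dimension 128) is an assumed 8B-class model.

```typescript
// Hypothetical shapes for illustration; not the module's actual types.
type KvQuant = "f16" | "q8_0" | "q4_0";

interface ArchSketch {
  layers: number;  // number of transformer layers
  kvHeads: number; // key/value heads (fewer than attention heads under GQA)
  headDim: number; // dimension of each head
}

// Average bytes per cached element. ggml's q8_0 packs 32 values into 34 bytes
// and q4_0 packs 32 values into 18 bytes (payload plus an fp16 scale).
const BYTES_PER_ELEMENT: Record<KvQuant, number> = {
  f16: 2,
  q8_0: 34 / 32,
  q4_0: 18 / 32,
};

function estimateKvCacheBytes(
  arch: ArchSketch,
  contextLength: number,
  quant: KvQuant = "f16",
  batchSize = 1,
): number {
  // Factor 2: one tensor for keys and one for values in every layer.
  const elements =
    2 * arch.layers * arch.kvHeads * arch.headDim * contextLength * batchSize;
  return elements * BYTES_PER_ELEMENT[quant];
}

const MIB = 1024 * 1024;
const exampleArch: ArchSketch = { layers: 32, kvHeads: 8, headDim: 128 };

console.log(estimateKvCacheBytes(exampleArch, 8192, "f16") / MIB);  // 1024
console.log(estimateKvCacheBytes(exampleArch, 8192, "q8_0") / MIB); // 544
```

Under these assumptions, switching the cache from f16 to q8_0 at an 8192-token context saves roughly 480 MiB, which is the kind of trade-off the configuration options are meant to expose.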

Key Concepts

KVCacheManager Class

The KVCacheManager class is the central component for KV cache management. It extends EventEmitter to allow for configuration change notifications.

Initialization

import { KVCacheManager, DEFAULT_KV_CACHE_CONFIG, getKVCacheManager } from './kv-cache-config.js';

// Create a dedicated instance, overriding selected defaults
const manager = new KVCacheManager({
  contextLength: 8192,
  kvQuantization: 'q8_0',
});

// Or use the singleton instance
const sharedManager = getKVCacheManager();

The constructor accepts a Partial<KVCacheConfig> whose fields override DEFAULT_KV_CACHE_CONFIG. A singleton instance can be retrieved using getKVCacheManager().

Core Functionality

  1. setArchitecture(arch: ModelArchitecture | string): void

  2. estimateMemory(contextLength?: number, batchSize?: number): KVCacheEstimate
    graph TD
        A["KVCacheManager.estimateMemory()"] --> B{Architecture Known?}
        B -- Yes --> C[Calculate based on ModelArchitecture]
        B -- No --> D["estimateGeneric()"]
        C --> E[Calculate GPU/CPU Memory]
        D --> E
        E --> F["generateRecommendation()"]
        F --> G[Return KVCacheEstimate]

  3. optimizeForVRAM(availableVRAMMB: number, modelSizeMB: number): KVCacheConfig

  4. generateLlamaCppArgs(): string[]

  5. generateLMStudioConfig(): Record
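
The source does not describe how optimizeForVRAM() chooses its settings, so the following is only one plausible strategy, sketched under stated assumptions: subtract the model weights and a fixed reserve from the available VRAM, then pick the longest context and the highest-precision cache type that still fits. The function fitKvCacheToVram, the FitSketch type, the candidate lists, and the 512 MB reserve are all hypothetical, and MB is treated as MiB for simplicity.

```typescript
// Hypothetical sketch; not the module's optimizeForVRAM() implementation.
type KvQuant = "f16" | "q8_0" | "q4_0";

interface FitSketch {
  contextLength: number;
  kvQuantization: KvQuant;
  offloadKvToCpu: boolean;
}

const MIB = 1024 * 1024;
// Average bytes per element for ggml block layouts (q8_0: 34 bytes per 32 values).
const BYTES_PER_ELEMENT: Record<KvQuant, number> = {
  f16: 2,
  q8_0: 34 / 32,
  q4_0: 18 / 32,
};
const QUANT_ORDER: KvQuant[] = ["f16", "q8_0", "q4_0"];      // highest precision first
const CONTEXT_CANDIDATES = [32768, 16384, 8192, 4096, 2048]; // longest first

function fitKvCacheToVram(
  availableVRAMMB: number,
  modelSizeMB: number,
  kvElementsPerToken: number, // 2 * layers * kvHeads * headDim
  reserveMB = 512,            // assumed headroom for compute buffers
): FitSketch {
  const budgetMB = availableVRAMMB - modelSizeMB - reserveMB;
  for (const contextLength of CONTEXT_CANDIDATES) {
    for (const kvQuantization of QUANT_ORDER) {
      const cacheMB =
        (kvElementsPerToken * contextLength * BYTES_PER_ELEMENT[kvQuantization]) / MIB;
      if (cacheMB <= budgetMB) {
        return { contextLength, kvQuantization, offloadKvToCpu: false };
      }
    }
  }
  // Nothing fits on the GPU: use the smallest setting and keep the cache in system RAM.
  return { contextLength: 2048, kvQuantization: "q4_0", offloadKvToCpu: true };
}

// 12288 MB of VRAM, 8000 MB of weights, assumed shape: 32 layers, 8 KV heads, head dim 128.
const fit = fitKvCacheToVram(12288, 8000, 2 * 32 * 8 * 128);
console.log(fit.contextLength, fit.kvQuantization, fit.offloadKvToCpu); // 32768 q8_0 false
```

This ordering favors context length over cache precision. A strategy that prefers f16 and shortens the context instead would swap the two loops.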

Configuration Management

Types and Constants

Singleton Access

The module provides getKVCacheManager() to ensure a single instance of KVCacheManager is used throughout the application, and resetKVCacheManager() for testing or re-initialization.
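
The accessor pair follows a common lazy-singleton pattern. The sketch below illustrates that pattern with a hypothetical stand-in class. Whether the real getKVCacheManager() accepts a configuration argument is an assumption made here for illustration.

```typescript
// Hypothetical stand-in for the real manager; only the accessor pattern matters.
class ManagerSketch {
  constructor(readonly config: { contextLength: number }) {}
}

let instance: ManagerSketch | null = null;

function getManagerSketch(config = { contextLength: 4096 }): ManagerSketch {
  // Create the instance lazily on first access, then keep returning it.
  if (instance === null) instance = new ManagerSketch(config);
  return instance;
}

function resetManagerSketch(): void {
  // Drop the shared instance so the next access builds a fresh one.
  instance = null;
}

const first = getManagerSketch({ contextLength: 8192 });
const second = getManagerSketch({ contextLength: 2048 }); // ignored: instance already exists
console.log(first === second, second.config.contextLength); // true 8192

resetManagerSketch();
console.log(getManagerSketch() === first); // false
```

Calling the reset function between test cases keeps state from one test from leaking into the next.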


Speculative Decoding

The src/inference/speculative-decoding.ts module implements speculative decoding, a technique for accelerating LLM inference. A smaller, faster "draft" model proposes a short sequence of tokens, and the larger, more accurate "target" model verifies the whole sequence at once. Because every accepted draft token saves a sequential step of the target model, this can substantially reduce the total time of autoregressive generation when the draft model's proposals are usually accepted.
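
The source does not state which acceptance rule the module applies. As an illustration, the simplest variant is greedy verification: keep draft tokens for as long as they match the target model's own choices, then append one token from the target. The function verifyGreedy and its result shape are hypothetical.

```typescript
// Greedy verification sketch (an assumed rule, not confirmed by the module).
// targetTokens holds the target model's own choice at each draft position,
// plus one extra token for the position after the last draft token.
function verifyGreedy(
  draftTokens: number[],
  targetTokens: number[],
): { accepted: number; tokens: number[] } {
  let accepted = 0;
  while (
    accepted < draftTokens.length &&
    draftTokens[accepted] === targetTokens[accepted]
  ) {
    accepted++;
  }
  // Keep the matching prefix, then add the target's token at the first
  // mismatch (or its bonus token when every draft token matched).
  const tokens = draftTokens.slice(0, accepted);
  const next = targetTokens[accepted];
  if (next !== undefined) tokens.push(next);
  return { accepted, tokens };
}

console.log(verifyGreedy([5, 9, 2, 7], [5, 9, 4, 1, 3])); // { accepted: 2, tokens: [ 5, 9, 4 ] }
```

Each round therefore yields at least one token from the target model, so a poor draft costs extra work but never stalls generation.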

Key Concepts

SpeculativeDecoder Class

The SpeculativeDecoder class orchestrates the speculative decoding process. It also extends EventEmitter to provide updates on the generation process.

Initialization

import { SpeculativeDecoder, DEFAULT_SPECULATIVE_CONFIG, getSpeculativeDecoder } from './speculative-decoding.js';

// Create a dedicated instance, overriding selected defaults
const decoder = new SpeculativeDecoder({
  speculationLength: 5,
  minAcceptanceRate: 0.6,
});

// Or use the singleton instance
const sharedDecoder = getSpeculativeDecoder();

The constructor takes a partial configuration object to customize behavior. Any field that is not specified falls back to DEFAULT_SPECULATIVE_CONFIG. A singleton instance is available via getSpeculativeDecoder().

Core Functionality

  1. generate(prompt, maxTokens, draftCallback, targetCallback, onToken?): Promise<{ tokens: number[]; stats: SpeculativeStats }>
    graph TD
        A["SpeculativeDecoder.generate()"] --> B{Loop until maxTokens or EOS}
        B --> C["draftCallback(currentPrompt, currentSpecLength)"]
        C --> D["targetCallback(currentPrompt, draftTokens)"]
        D --> E["updateStats(proposed, accepted)"]
        E --> F["updatePrompt(currentPrompt, finalTokens)"]
        F --> G{adaptiveLength enabled?}
        G -- Yes --> H["adaptSpeculationLength(accepted, proposed)"]
        H --> B
        G -- No --> B
        B --> I[Return generated tokens and stats]

  2. updateStats(proposed: number, accepted: number): void

  3. adaptSpeculationLength(accepted: number, proposed: number): void

  4. shouldUseSpeculation(): boolean

  5. generateLlamaCppArgs(): string[]
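
The bookkeeping methods listed above can be sketched as follows. The method names and the 0.6 default for minAcceptanceRate come from this document. The class name AdaptiveLengthSketch, the 0.8 and 0.4 thresholds, and the length bounds of 1 and 8 are assumptions chosen for illustration, not the module's actual values.

```typescript
// Hypothetical sketch of acceptance statistics and adaptive speculation length.
class AdaptiveLengthSketch {
  stats = { proposed: 0, accepted: 0, acceptanceRate: 0 };

  constructor(
    public speculationLength = 5,
    private readonly minLength = 1, // assumed lower bound
    private readonly maxLength = 8, // assumed upper bound
  ) {}

  updateStats(proposed: number, accepted: number): void {
    this.stats.proposed += proposed;
    this.stats.accepted += accepted;
    this.stats.acceptanceRate =
      this.stats.proposed === 0 ? 0 : this.stats.accepted / this.stats.proposed;
  }

  adaptSpeculationLength(accepted: number, proposed: number): void {
    if (proposed === 0) return;
    const rate = accepted / proposed;
    // Assumed rule: speculate further after a good round, less after a poor one.
    if (rate >= 0.8) {
      this.speculationLength = Math.min(this.maxLength, this.speculationLength + 1);
    } else if (rate < 0.4) {
      this.speculationLength = Math.max(this.minLength, this.speculationLength - 1);
    }
  }

  shouldUseSpeculation(minAcceptanceRate = 0.6): boolean {
    // With no data yet, give speculation a chance.
    return this.stats.proposed === 0 || this.stats.acceptanceRate >= minAcceptanceRate;
  }
}

const tracker = new AdaptiveLengthSketch();
tracker.updateStats(5, 5);
tracker.adaptSpeculationLength(5, 5); // all accepted: length grows to 6
tracker.updateStats(6, 1);
tracker.adaptSpeculationLength(1, 6); // mostly rejected: length shrinks to 5
console.log(tracker.speculationLength, tracker.shouldUseSpeculation()); // 5 false
```

After these two rounds the cumulative acceptance rate is 6 of 11, which is below 0.6, so this sketch would recommend falling back to plain decoding.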

Configuration and Statistics

Static Utility

Types and Constants

Mock Implementations

The module includes createMockDraftCallback() and createMockTargetCallback() for testing and development purposes. These functions simulate the behavior of actual LLM calls, allowing for easy testing of the SpeculativeDecoder logic without needing a live LLM server.
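
To show what such mocks might look like, here is a self-contained sketch of a deterministic draft/target pair and a small loop that drives them. The callback signatures follow the argument order shown in the generate() flow, but the return shapes, the function names with the Sketch suffix, and the use of a token array as the prompt are assumptions. The real createMockDraftCallback() and createMockTargetCallback() may behave differently.

```typescript
// Hypothetical callback shapes; the module's real types may differ.
type DraftCallback = (prompt: number[], length: number) => Promise<number[]>;
type TargetCallback = (
  prompt: number[],
  draft: number[],
) => Promise<{ accepted: number; tokens: number[] }>;

// In this toy setup the "correct" token at position p is simply p.
function createMockDraftSketch(errorEvery = 3): DraftCallback {
  return async (prompt, length) =>
    Array.from({ length }, (_, i) => {
      const position = prompt.length + i;
      // Emit a deliberately wrong token (-1) at every errorEvery-th position.
      return position % errorEvery === errorEvery - 1 ? -1 : position;
    });
}

function createMockTargetSketch(): TargetCallback {
  return async (prompt, draft) => {
    let accepted = 0;
    while (accepted < draft.length && draft[accepted] === prompt.length + accepted) {
      accepted++;
    }
    // Accepted prefix plus the target's own token for the next position.
    const tokens = [...draft.slice(0, accepted), prompt.length + accepted];
    return { accepted, tokens };
  };
}

async function runMockLoop(maxTokens: number, speculationLength = 4) {
  const draft = createMockDraftSketch();
  const target = createMockTargetSketch();
  let tokens: number[] = [];
  let proposed = 0;
  let accepted = 0;
  while (tokens.length < maxTokens) {
    const draftTokens = await draft(tokens, speculationLength);
    const result = await target(tokens, draftTokens);
    proposed += draftTokens.length;
    accepted += result.accepted;
    tokens = [...tokens, ...result.tokens].slice(0, maxTokens);
  }
  return { tokens, acceptanceRate: accepted / proposed };
}

runMockLoop(9).then((r) => console.log(r.tokens.length, r.acceptanceRate)); // 9 0.5
```

Because the mocks are deterministic, a test can assert exact token sequences and acceptance rates instead of relying on a live model.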

Singleton Access

Similar to KVCacheManager, getSpeculativeDecoder() provides a singleton instance, and resetSpeculativeDecoder() allows for cleanup and re-initialization. The dispose() method is called during resetSpeculativeDecoder() to clean up event listeners.


Integration and Usage

The inference module is designed to be integrated into applications that manage local LLM inference.

Both KVCacheManager and SpeculativeDecoder extend EventEmitter, allowing external components to subscribe to configuration updates (configUpdated) or generation events (draft, verify, complete). The logger utility from src/utils/logger.js is used internally for debugging and informational messages.
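
The subscription pattern can be sketched with Node's built-in EventEmitter. The event name configUpdated comes from this document. The class ConfigEmitterSketch, its updateConfig() method, and the event payload are hypothetical stand-ins, since the real payload shape is not described here.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-in; the real classes and event payloads may differ.
class ConfigEmitterSketch extends EventEmitter {
  private config = { contextLength: 4096 };

  updateConfig(patch: Partial<{ contextLength: number }>): void {
    this.config = { ...this.config, ...patch };
    // Notify subscribers with the merged configuration (assumed payload).
    this.emit("configUpdated", this.config);
  }
}

const seenContextLengths: number[] = [];
const emitter = new ConfigEmitterSketch();

emitter.on("configUpdated", (config: { contextLength: number }) => {
  seenContextLengths.push(config.contextLength);
});

emitter.updateConfig({ contextLength: 8192 });
console.log(seenContextLengths); // [ 8192 ]
```

A UI or monitoring component could use the same approach to react to the draft, verify, and complete events emitted during generation.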

Used together, KV cache configuration keeps memory use within hardware limits and speculative decoding reduces generation latency, which gives local LLM applications a practical basis for tuning performance.