src — hardware

The src/hardware module monitors GPUs, with a focus on VRAM usage during local Large Language Model (LLM) inference. Its core component, GPUMonitor, reports GPU memory, utilization, and temperature in real time and supports several GPU vendors. By recommending how many model layers to offload based on available VRAM, the module helps prevent Out-Of-Memory (OOM) errors.

Module Overview

The src/hardware module exports the GPUMonitor class and related utility functions and types. Its primary goal is to abstract away the complexities of querying different GPU hardware (NVIDIA, AMD, Apple Silicon, Intel) and provide a unified interface for monitoring and making informed decisions about LLM layer offloading.

Key Responsibilities:

Core Concepts

The module defines several interfaces and types to structure the data it handles:
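The shapes below are a sketch of what these types might look like, not the module's actual definitions. The GPUMonitorConfig and OffloadRecommendation fields are taken from the usage example at the end of this page, and the vendor values from the getStats flow diagram. The GPUInfo fields, and every VRAMStats field other than usagePercent, are assumptions.

```typescript
// Vendor values mirror the branches of the getStats flow diagram.
type GPUVendor = "nvidia" | "amd" | "apple" | "intel" | "unknown";

// Per-device reading. All field names here are hypothetical.
interface GPUInfo {
  index: number;
  name: string;
  totalMB: number;
  usedMB: number;
  utilizationPercent?: number; // not every vendor tool reports this
  temperatureC?: number;
}

// Aggregate view. Only usagePercent is confirmed by the usage example.
interface VRAMStats {
  gpus: GPUInfo[];
  totalMB: number;
  usedMB: number;
  usagePercent: number;
}

// Field names come from the usage example; units are inferred from its comments.
interface GPUMonitorConfig {
  autoPoll: boolean;
  pollInterval: number; // milliseconds
  warningThreshold: number; // percent
  criticalThreshold: number; // percent
  safeBuffer: number; // MB kept free
}

// Field names come from the usage example.
interface OffloadRecommendation {
  shouldOffload: boolean;
  suggestedGpuLayers: number;
  maxGpuLayers: number;
  reason: string;
  estimatedVRAMUsage: number; // MB
  safeVRAMLimit: number; // MB
}

// Values copied from the usage example, not the module's defaults.
const exampleConfig: GPUMonitorConfig = {
  autoPoll: true,
  pollInterval: 2000,
  warningThreshold: 70,
  criticalThreshold: 90,
  safeBuffer: 1024,
};

const exampleVendor: GPUVendor = "nvidia";

const exampleStats: VRAMStats = {
  gpus: [{ index: 0, name: "example-gpu", totalMB: 8192, usedMB: 2048 }],
  totalMB: 8192,
  usedMB: 2048,
  usagePercent: 25,
};
```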

The GPUMonitor Class

The GPUMonitor class is the central component of this module. It extends EventEmitter, allowing other parts of the application to subscribe to VRAM status updates and warnings.

Initialization and Vendor Detection

Before monitoring can begin, the GPUMonitor must be initialized to detect the available GPU hardware.

  1. constructor(config?: Partial<GPUMonitorConfig>): Initializes the monitor with the default configuration, overridden by any values provided.
  2. async initialize(): Promise: This is the entry point for setting up the monitor. It calls detectGPUVendor() to identify the GPU type. If autoPoll is enabled in the config, it will then call startPolling().
  3. private async detectGPUVendor(): Promise: This method attempts to identify the GPU vendor by executing various system commands:

It prioritizes NVIDIA and AMD (ROCm) as they are common for ML workloads.
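The detection order can be sketched as a chain of probes. This is an illustration under stated assumptions, not the module's implementation: the function name detectVendor, the Probe type, the specific commands, and the platform check for Apple Silicon are all hypothetical. Only the NVIDIA-then-AMD priority comes from the text above.

```typescript
type GPUVendor = "nvidia" | "amd" | "apple" | "intel" | "unknown";

// A probe resolves to true when the given command runs successfully.
// It is injected so the detection order can be tested without real hardware.
type Probe = (command: string) => Promise<boolean>;

async function detectVendor(
  probe: Probe,
  platform: string,
  arch: string,
): Promise<GPUVendor> {
  // NVIDIA and AMD (ROCm) are checked first, matching the stated priority.
  if (await probe("nvidia-smi -L")) return "nvidia";
  if (await probe("rocm-smi --showid")) return "amd";
  // Assumption: Apple Silicon is inferred from the platform, not a command.
  if (platform === "darwin" && arch === "arm64") return "apple";
  // Assumption: a hypothetical check for an Intel display adapter on Linux.
  if (await probe("lspci | grep -i 'vga.*intel'")) return "intel";
  return "unknown";
}
```

In a Node.js process, a probe could wrap child_process.exec and resolve to true when the command exits without an error. The platform and arch arguments would typically come from process.platform and process.arch.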

VRAM Monitoring and Data Collection

Once initialized, the monitor can query GPU statistics.

  1. async getStats(): Promise<VRAMStats>: This is the primary method for retrieving current VRAM statistics. It runs the vendor-specific queries and aggregates the results into a VRAMStats object. It also calls checkThresholds() and caches the result in lastStats.
  2. private async queryGPUs(): Promise: This internal method acts as a dispatcher, calling the appropriate vendor-specific query function based on the detectedVendor.
  3. Vendor-Specific Query Methods:
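As one illustration of what a vendor-specific query might involve, the sketch below parses the CSV output of nvidia-smi. It assumes the command `nvidia-smi --query-gpu=index,name,memory.total,memory.used,utilization.gpu,temperature.gpu --format=csv,noheader,nounits`; whether queryNVIDIA() uses this exact query is not stated here. The function name parseNvidiaSmiCsv and the GPUInfo fields are hypothetical.

```typescript
// Hypothetical per-device shape; field names are assumptions.
interface GPUInfo {
  index: number;
  name: string;
  totalMB: number;
  usedMB: number;
  utilizationPercent: number | undefined;
  temperatureC: number | undefined;
}

// Converts a CSV field to a number, or undefined for values such as "[N/A]".
function toNumber(field: string | undefined): number | undefined {
  if (field === undefined || field.trim() === "") return undefined;
  const value = Number(field);
  return Number.isFinite(value) ? value : undefined;
}

// Parses one line per GPU, with fields in the order given in the query above.
function parseNvidiaSmiCsv(output: string): GPUInfo[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line, position) => {
      const [index, name, total, used, util, temp] = line
        .split(",")
        .map((field) => field.trim());
      return {
        index: toNumber(index) ?? position,
        name: name ?? "unknown",
        totalMB: toNumber(total) ?? 0,
        usedMB: toNumber(used) ?? 0,
        utilizationPercent: toNumber(util),
        temperatureC: toNumber(temp),
      };
    });
}
```

With `nounits`, memory values are reported in MiB and temperature in degrees Celsius. Fields the driver cannot report are left undefined instead of being coerced to NaN.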

getStats Execution Flow

graph TD
    A["GPUMonitor.getStats()"] --> B{Detected Vendor?};
    B -- nvidia --> C["queryNVIDIA()"];
    B -- amd --> D["queryAMD()"];
    B -- apple --> E["queryApple()"];
    B -- intel --> F["queryIntel()"];
    B -- unknown --> G["queryGeneric()"];
    C & D & E & F & G --> H[Aggregate GPUInfo into VRAMStats];
    H --> I[Check Thresholds];
    I --> J["Emit Events (vram:warning, vram:critical)"];
    J --> K[Return VRAMStats];
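The dispatch and aggregation steps in this flow can be sketched as a lookup table keyed by vendor, followed by a reduction over the per-GPU readings. The names aggregate, QueryFn, and getStatsFor are hypothetical, as are the GPUInfo and VRAMStats fields other than usagePercent. Threshold checks and event emission are left out to keep the sketch short.

```typescript
type GPUVendor = "nvidia" | "amd" | "apple" | "intel" | "unknown";

// Hypothetical shapes; only usagePercent appears elsewhere on this page.
interface GPUInfo {
  name: string;
  totalMB: number;
  usedMB: number;
}

interface VRAMStats {
  gpus: GPUInfo[];
  totalMB: number;
  usedMB: number;
  usagePercent: number;
}

type QueryFn = () => Promise<GPUInfo[]>;

// Sums memory across all GPUs and derives a single usage percentage.
function aggregate(gpus: GPUInfo[]): VRAMStats {
  const totalMB = gpus.reduce((sum, gpu) => sum + gpu.totalMB, 0);
  const usedMB = gpus.reduce((sum, gpu) => sum + gpu.usedMB, 0);
  // Guard against division by zero when no GPU was found.
  const usagePercent = totalMB > 0 ? (usedMB / totalMB) * 100 : 0;
  return { gpus, totalMB, usedMB, usagePercent };
}

// Picks the query function for the detected vendor, then aggregates.
async function getStatsFor(
  vendor: GPUVendor,
  queries: Record<GPUVendor, QueryFn>,
): Promise<VRAMStats> {
  const gpus = await queries[vendor]();
  return aggregate(gpus);
}
```

Using Record<GPUVendor, QueryFn> means the compiler rejects a table that omits a vendor, which is one way to keep the dispatcher exhaustive.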

Automatic Polling and Events

The monitor can be configured to automatically poll for VRAM updates at a set interval.

  1. startPolling(): void: Initiates a setInterval timer that periodically calls getStats() and emits a vram:update event with the latest VRAMStats.
  2. stopPolling(): void: Clears the polling timer, stopping automatic updates.
  3. private checkThresholds(stats: VRAMStats): void: Called by getStats(), this method compares the current usagePercent against warningThreshold and criticalThreshold from the configuration, emitting vram:warning or vram:critical events as appropriate.
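The polling and threshold behavior described above can be sketched with Node's EventEmitter. This is a minimal illustration, not the module's code: the ThresholdWatcher class and the standalone startPolling function are hypothetical, and the rule that a critical reading emits only vram:critical (not both events) is an assumption.

```typescript
import { EventEmitter } from "node:events";

interface VRAMStats {
  usagePercent: number;
}

interface Thresholds {
  warningThreshold: number; // percent
  criticalThreshold: number; // percent
}

class ThresholdWatcher extends EventEmitter {
  private readonly thresholds: Thresholds;

  constructor(thresholds: Thresholds) {
    super();
    this.thresholds = thresholds;
  }

  // Assumption: the critical check takes precedence over the warning check.
  checkThresholds(stats: VRAMStats): void {
    if (stats.usagePercent >= this.thresholds.criticalThreshold) {
      this.emit("vram:critical", stats);
    } else if (stats.usagePercent >= this.thresholds.warningThreshold) {
      this.emit("vram:warning", stats);
    }
  }
}

// Polls on a fixed interval and returns a function that stops the timer.
function startPolling(
  getStats: () => Promise<VRAMStats>,
  intervalMs: number,
  onUpdate: (stats: VRAMStats) => void,
  onError: (error: unknown) => void,
): () => void {
  const timer = setInterval(() => {
    getStats().then(onUpdate, onError);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Returning a stop function from startPolling keeps the timer handle private, so a caller cannot clear it twice or leak it by losing the reference.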

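Emitted Events: vram:update (emitted on each poll with the latest VRAMStats), vram:warning (emitted when usagePercent crosses warningThreshold), and vram:critical (emitted when usagePercent crosses criticalThreshold).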

Offloading Recommendations

A key feature for LLM inference is the ability to recommend how many model layers can safely reside on the GPU.

  1. calculateOffloadRecommendation(modelSizeMB: number, totalLayers: number, contextSize: number): OffloadRecommendation: This method takes model parameters (size, total layers, context size) and calculates an OffloadRecommendation. It estimates VRAM per layer (considering model weights and KV cache) and determines how many layers can fit within the safeVRAMLimit (total VRAM minus a configured safeBuffer).
  2. async getRecommendedLayers(modelSize: "3b" | "7b" | "13b" | "30b" | "70b"): Promise<number>: A convenience method that uses predefined approximate model sizes and layer counts for common LLM sizes (e.g., "7b", "13b") to return a suggested number of GPU layers.
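The per-layer estimate described for calculateOffloadRecommendation can be sketched as follows. The arithmetic here is an assumption: the module's real formula for the KV cache is not given on this page, so the sketch uses a hypothetical constant, kvCacheMBPerTokenPerLayer, and takes total VRAM and the safe buffer as explicit inputs. The function name recommendOffload and the meaning assigned to maxGpuLayers (the model's total layer count) are also hypothetical.

```typescript
// Field names come from the usage example on this page.
interface OffloadRecommendation {
  shouldOffload: boolean;
  suggestedGpuLayers: number;
  maxGpuLayers: number;
  reason: string;
  estimatedVRAMUsage: number; // MB
  safeVRAMLimit: number; // MB
}

interface OffloadInput {
  modelSizeMB: number;
  totalLayers: number; // must be greater than zero
  contextSize: number; // tokens
  totalVRAMMB: number;
  safeBufferMB: number;
  kvCacheMBPerTokenPerLayer: number; // hypothetical tuning constant
}

function recommendOffload(input: OffloadInput): OffloadRecommendation {
  // Per-layer cost = share of the weights + KV cache for the full context.
  const weightsPerLayerMB = input.modelSizeMB / input.totalLayers;
  const kvPerLayerMB = input.contextSize * input.kvCacheMBPerTokenPerLayer;
  const perLayerMB = weightsPerLayerMB + kvPerLayerMB;

  // Safe limit = total VRAM minus the configured buffer, never negative.
  const safeVRAMLimit = Math.max(0, input.totalVRAMMB - input.safeBufferMB);
  const layersThatFit = Math.floor(safeVRAMLimit / perLayerMB);
  const suggestedGpuLayers = Math.min(input.totalLayers, layersThatFit);
  const shouldOffload = suggestedGpuLayers < input.totalLayers;

  return {
    shouldOffload,
    suggestedGpuLayers,
    maxGpuLayers: input.totalLayers,
    reason: shouldOffload
      ? `Only ${suggestedGpuLayers} of ${input.totalLayers} layers fit in ${safeVRAMLimit}MB`
      : "All layers fit within the safe VRAM limit",
    estimatedVRAMUsage: suggestedGpuLayers * perLayerMB,
    safeVRAMLimit,
  };
}
```

As a worked case under these assumptions: a 4000 MB model with 32 layers costs 125 MB per layer in weights. At a context of 4096 tokens and 1/64 MB per token per layer, the KV cache adds 64 MB, giving 189 MB per layer. On a 4096 MB GPU with a 1024 MB buffer, the safe limit is 3072 MB, so floor(3072 / 189) = 16 layers fit and the remaining 16 stay off the GPU.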

Reporting and Utilities

The monitor also provides methods for displaying its status.

  1. formatStats(): string: Generates a human-readable string summary of the last VRAM statistics, including a progress bar for each GPU.
  2. private createProgressBar(percent: number, width: number): string: An internal helper that renders a text progress bar, with an emoji whose color reflects the usage thresholds.
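A progress bar of this kind can be sketched in a few lines. The exact characters, emojis, and output layout used by the module are not shown on this page, so everything below is an assumption, including the extra warning and critical parameters, which the real two-argument method presumably reads from its configuration.

```typescript
// Renders e.g. "🟢 [█████░░░░░] 50.0%". Characters and layout are assumptions.
function createProgressBar(
  percent: number,
  width: number,
  warning = 70,
  critical = 90,
): string {
  // Clamp so out-of-range readings cannot produce a negative repeat count.
  const clamped = Math.min(100, Math.max(0, percent));
  const filled = Math.round((clamped * width) / 100);
  const marker = clamped >= critical ? "🔴" : clamped >= warning ? "🟡" : "🟢";
  const bar = "█".repeat(filled) + "░".repeat(width - filled);
  return `${marker} [${bar}] ${clamped.toFixed(1)}%`;
}

console.log(createProgressBar(50, 10)); // prints "🟢 [█████░░░░░] 50.0%"
```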

Configuration and Lifecycle Management

  1. updateConfig(config: Partial<GPUMonitorConfig>): void: Allows runtime modification of the monitor's configuration.
  2. getConfig(): GPUMonitorConfig: Returns the current configuration.
  3. getVendor(): GPUVendor: Returns the detected GPU vendor.
  4. getLastStats(): VRAMStats | null: Returns the last cached VRAM statistics.
  5. dispose(): void: Cleans up the monitor by stopping polling and removing all event listeners.

Singleton Management

The module provides helper functions, such as initializeGPUMonitor and getGPUMonitor, to manage a singleton instance of GPUMonitor, so that all callers share the same monitor state.
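A module-level singleton of this kind might be structured as sketched below. The names initializeGPUMonitor and getGPUMonitor appear in the usage example's import, but their bodies here are assumptions. The GPUMonitor class is a stand-in stub, and resetGPUMonitor is a hypothetical helper added for illustration.

```typescript
interface GPUMonitorConfig {
  autoPoll: boolean;
  pollInterval: number;
}

// Stand-in for the real class; only what the sketch needs.
class GPUMonitor {
  readonly config: Partial<GPUMonitorConfig>;
  initialized = false;

  constructor(config: Partial<GPUMonitorConfig> = {}) {
    this.config = config;
  }

  async initialize(): Promise<void> {
    this.initialized = true;
  }

  dispose(): void {
    this.initialized = false;
  }
}

// The single shared instance, created lazily on first access.
let instance: GPUMonitor | null = null;

function getGPUMonitor(config?: Partial<GPUMonitorConfig>): GPUMonitor {
  if (instance === null) {
    instance = new GPUMonitor(config);
  }
  return instance;
}

async function initializeGPUMonitor(
  config?: Partial<GPUMonitorConfig>,
): Promise<GPUMonitor> {
  const monitor = getGPUMonitor(config);
  await monitor.initialize();
  return monitor;
}

// Hypothetical helper: disposes the shared instance so the next call creates a new one.
function resetGPUMonitor(): void {
  instance?.dispose();
  instance = null;
}
```

In this sketch, a config passed after the instance already exists is ignored. Whether the real helpers behave the same way, or apply the new config, is not stated on this page.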

Integration with Other Modules

The GPUMonitor is designed to be a foundational service for other parts of the application that need hardware awareness, particularly for LLM inference.

src/models/model-hub.ts

The model-hub.ts module, responsible for managing LLM models, directly interacts with the GPUMonitor to make intelligent decisions:

This integration ensures that LLM loading and execution are optimized for the specific hardware environment, reducing the risk of OOM errors and improving performance.

Usage Example

import { initializeGPUMonitor, getGPUMonitor, GPUMonitorConfig } from "./hardware/gpu-monitor.js";

async function main() {
  // Initialize the monitor (detects GPU, starts polling if autoPoll is true)
  const monitor = await initializeGPUMonitor({
    autoPoll: true,
    pollInterval: 2000, // Poll every 2 seconds
    warningThreshold: 70,
    criticalThreshold: 90,
    safeBuffer: 1024, // Keep 1GB free
  });

  console.log(`Detected GPU Vendor: ${monitor.getVendor()}`);

  // Subscribe to VRAM updates
  monitor.on("vram:update", (stats) => {
    console.log(`VRAM Update: ${stats.usagePercent.toFixed(1)}% used`);
    // console.log(monitor.formatStats()); // Uncomment for detailed output
  });

  // Subscribe to warning/critical events
  monitor.on("vram:warning", (stats) => {
    console.warn(`🚨 VRAM Warning: ${stats.usagePercent.toFixed(1)}% used!`);
  });

  monitor.on("vram:critical", (stats) => {
    console.error(`🔥 VRAM CRITICAL: ${stats.usagePercent.toFixed(1)}% used! Immediate action needed.`);
  });

  // Get current stats immediately
  const currentStats = await monitor.getStats();
  console.log("\nInitial GPU Status:");
  console.log(monitor.formatStats());

  // Calculate offloading recommendation for a 7B model (approx 4000MB, 32 layers)
  const modelSizeMB = 4000; // e.g., 7B Q4
  const totalLayers = 32;
  const contextSize = 4096;

  const recommendation = monitor.calculateOffloadRecommendation(modelSizeMB, totalLayers, contextSize);
  console.log("\nOffloading Recommendation for 7B Model:");
  console.log(`  Should Offload: ${recommendation.shouldOffload}`);
  console.log(`  Suggested GPU Layers: ${recommendation.suggestedGpuLayers}/${recommendation.maxGpuLayers}`);
  console.log(`  Reason: ${recommendation.reason}`);
  console.log(`  Estimated VRAM Usage: ${recommendation.estimatedVRAMUsage.toFixed(0)}MB`);
  console.log(`  Safe VRAM Limit: ${recommendation.safeVRAMLimit}MB`);

  // Get recommended layers for a common model size
  const recommendedLayers7B = await monitor.getRecommendedLayers("7b");
  console.log(`\nRecommended layers for a '7b' model: ${recommendedLayers7B}`);

  // Simulate some work...
  await new Promise(resolve => setTimeout(resolve, 10000));

  // Stop polling and dispose when done
  monitor.dispose();
  console.log("\nGPU Monitor disposed.");
}

main().catch(console.error);