src — hardware
The src/hardware module provides GPU monitoring, focused primarily on VRAM usage for local Large Language Model (LLM) inference. Its core component, GPUMonitor, reports GPU memory, utilization, and temperature in real time across several GPU vendors. The module helps avoid Out-Of-Memory (OOM) errors by recommending how many model layers to offload to the GPU.
Module Overview
The src/hardware module exports the GPUMonitor class and related utility functions and types. Its primary goal is to abstract away the complexities of querying different GPU hardware (NVIDIA, AMD, Apple Silicon, Intel) and provide a unified interface for monitoring and making informed decisions about LLM layer offloading.
Key Responsibilities:
- GPU Vendor Detection: Automatically identifies the underlying GPU hardware.
- VRAM Monitoring: Gathers real-time statistics on total, used, and free VRAM.
- Performance Metrics: Collects GPU utilization, temperature, and power draw where available.
- Threshold Alerts: Emits events when VRAM usage crosses warning or critical levels.
- Offloading Recommendations: Calculates optimal GPU layer counts for LLMs based on available VRAM.
- Multi-GPU Support: Aggregates statistics across multiple detected GPUs.
Core Concepts
The module defines several interfaces and types to structure the data it handles:
- GPUVendor: A union type ("nvidia" | "amd" | "intel" | "apple" | "unknown") representing the detected GPU manufacturer.
- GPUInfo: Detailed information for a single GPU, including id, name, vendor, vramTotal, vramUsed, vramFree, utilization, temperature, and powerDraw.
- VRAMStats: Aggregated VRAM statistics across all detected GPUs, including totalVRAM, usedVRAM, freeVRAM, usagePercent, gpuCount, and an array of GPUInfo objects.
- OffloadRecommendation: Provides guidance on how many LLM layers to offload to the GPU, including shouldOffload, suggestedGpuLayers, maxGpuLayers, reason, estimatedVRAMUsage, and safeVRAMLimit.
- GPUMonitorConfig: Configuration options for the monitor, such as pollInterval, warningThreshold, criticalThreshold, autoPoll, and safeBuffer. The DEFAULT_GPU_MONITOR_CONFIG provides sensible defaults.
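The shapes listed above might be declared roughly as follows. This is a sketch: the field names come from the descriptions, while the field types, the nullable performance metrics, and the name of the GPUInfo array field (gpus) are assumptions. The units follow the usage example (milliseconds for pollInterval, MB for safeBuffer), and exampleConfig is a hypothetical value set, not DEFAULT_GPU_MONITOR_CONFIG.

```typescript
// Sketch only: names follow the descriptions; types and units are assumptions.
type GPUVendor = "nvidia" | "amd" | "intel" | "apple" | "unknown";

interface GPUInfo {
  id: number;
  name: string;
  vendor: GPUVendor;
  vramTotal: number; // assumed MB
  vramUsed: number; // assumed MB
  vramFree: number; // assumed MB
  utilization: number | undefined; // assumed percent; not every vendor reports it
  temperature: number | undefined; // assumed degrees Celsius
  powerDraw: number | undefined; // assumed watts
}

interface VRAMStats {
  totalVRAM: number;
  usedVRAM: number;
  freeVRAM: number;
  usagePercent: number;
  gpuCount: number;
  gpus: GPUInfo[]; // field name is an assumption
}

interface OffloadRecommendation {
  shouldOffload: boolean;
  suggestedGpuLayers: number;
  maxGpuLayers: number;
  reason: string;
  estimatedVRAMUsage: number; // assumed MB
  safeVRAMLimit: number; // assumed MB
}

interface GPUMonitorConfig {
  pollInterval: number; // milliseconds, per the usage example
  warningThreshold: number; // percent of VRAM in use
  criticalThreshold: number; // percent of VRAM in use
  autoPoll: boolean;
  safeBuffer: number; // MB kept free, per the usage example
}

// Hypothetical values, not the module's DEFAULT_GPU_MONITOR_CONFIG.
const exampleConfig: GPUMonitorConfig = {
  pollInterval: 2000,
  warningThreshold: 70,
  criticalThreshold: 90,
  autoPoll: true,
  safeBuffer: 1024,
};
```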
The GPUMonitor Class
The GPUMonitor class is the central component of this module. It extends EventEmitter, allowing other parts of the application to subscribe to VRAM status updates and warnings.
Initialization and Vendor Detection
Before monitoring can begin, the GPUMonitor must be initialized to detect the available GPU hardware.
- constructor(config?: Partial<GPUMonitorConfig>): Initializes the monitor with default or provided configuration.
- async initialize(): Promise: This is the entry point for setting up the monitor. It calls detectGPUVendor() to identify the GPU type. If autoPoll is enabled in the config, it then calls startPolling().
- private async detectGPUVendor(): Promise: This method attempts to identify the GPU vendor by executing various system commands:
  - nvidia-smi --version for NVIDIA.
  - rocm-smi --version or ls /sys/class/drm/card*/device/vendor for AMD.
  - ls /sys/class/drm/card*/device/vendor for Intel.
  - sysctl -n machdep.cpu.brand_string for Apple Silicon.
It prioritizes NVIDIA and AMD (ROCm) as they are common for ML workloads.
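The probing order can be sketched as below. This is not the module's implementation: detectVendor and the RunCommand probe are hypothetical names, the probe is synchronous for brevity (the real method is async), and the sysfs step reads the vendor files with cat on the assumption that the PCI vendor IDs (0x1002 for AMD, 0x8086 for Intel) are what distinguish the two.

```typescript
type GPUVendor = "nvidia" | "amd" | "intel" | "apple" | "unknown";

// Hypothetical probe: returns the command's stdout, or null if the command failed.
type RunCommand = (command: string) => string | null;

function detectVendor(run: RunCommand): GPUVendor {
  // Dedicated vendor tools are checked first, since they cover the common ML setups.
  if (run("nvidia-smi --version") !== null) return "nvidia";
  if (run("rocm-smi --version") !== null) return "amd";

  // Linux sysfs exposes the PCI vendor ID of each DRM card.
  // Assumption: the file contents are read, because the IDs are needed to tell vendors apart.
  const vendorIds = (run("cat /sys/class/drm/card*/device/vendor") ?? "").toLowerCase();
  if (vendorIds.includes("0x1002")) return "amd"; // AMD PCI vendor ID
  if (vendorIds.includes("0x8086")) return "intel"; // Intel PCI vendor ID

  // On macOS, Apple Silicon reports a brand string such as "Apple M1".
  const brand = run("sysctl -n machdep.cpu.brand_string") ?? "";
  if (brand.includes("Apple")) return "apple";

  return "unknown";
}
```

Injecting the probe keeps the ordering logic testable without any GPU tooling installed; in a Node.js process the probe could wrap child_process.execSync in a try/catch.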
VRAM Monitoring and Data Collection
Once initialized, the monitor can query GPU statistics.
- async getStats(): Promise<VRAMStats>: This is the primary method to retrieve current VRAM statistics. It orchestrates the vendor-specific queries and aggregates the results into a VRAMStats object. It also calls checkThresholds() and caches the result in lastStats.
- private async queryGPUs(): Promise: This internal method acts as a dispatcher, calling the appropriate vendor-specific query function based on the detectedVendor.
- Vendor-Specific Query Methods:
  - private async queryNVIDIA(): Promise: Uses nvidia-smi with a specific query format (--query-gpu=... --format=csv,noheader,nounits) to parse detailed GPU information.
  - private async queryAMD(): Promise: First attempts to use rocm-smi --showmeminfo vram --json. If rocm-smi is not available or fails, it falls back to reading VRAM information directly from /sys/class/drm/card*/device/ files (Linux sysfs).
  - private async queryApple(): Promise: For Apple Silicon, which uses unified memory, it estimates GPU VRAM by querying hw.memsize (total RAM) and memory_pressure to gauge overall memory usage. It assumes the GPU can utilize a significant portion of system RAM.
  - private async queryIntel(): Promise: For Intel integrated GPUs, it estimates available VRAM based on system RAM (/proc/meminfo) and a conservative assumption of 1-2GB dedicated to the iGPU.
  - private async queryGeneric(): Promise: A fallback method that returns a conservative estimate (e.g., 4GB total VRAM) if no specific vendor is detected or queried successfully.
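As an illustration of the NVIDIA path, the sketch below parses csv,noheader,nounits output into per-GPU records. The module's actual --query-gpu field list is not shown in this document, so the field order here (index, name, memory.total, memory.used, memory.free, utilization.gpu, temperature.gpu, power.draw) is an assumption, as are the names parseNvidiaSmiCsv and ParsedGPU.

```typescript
// Hypothetical record type; mirrors the GPUInfo fields named in this document.
interface ParsedGPU {
  id: number;
  name: string;
  vramTotal: number; // MiB, as reported with nounits
  vramUsed: number;
  vramFree: number;
  utilization: number | undefined;
  temperature: number | undefined;
  powerDraw: number | undefined;
}

// Assumed field order: index, name, memory.total, memory.used, memory.free,
// utilization.gpu, temperature.gpu, power.draw (one GPU per line).
function parseNvidiaSmiCsv(output: string): ParsedGPU[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const fields = line.split(",").map((f) => f.trim());
      // Unsupported metrics come back as text such as "[N/A]"; map those to undefined.
      const num = (i: number): number | undefined => {
        const raw = fields[i];
        if (raw === undefined || raw === "") return undefined;
        const value = Number(raw);
        return Number.isFinite(value) ? value : undefined;
      };
      return {
        id: num(0) ?? 0,
        name: fields[1] ?? "Unknown GPU",
        vramTotal: num(2) ?? 0,
        vramUsed: num(3) ?? 0,
        vramFree: num(4) ?? 0,
        utilization: num(5),
        temperature: num(6),
        powerDraw: num(7),
      };
    });
}
```

The sketch splits on commas, so it assumes GPU names do not themselves contain a comma.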
getStats Execution Flow
graph TD
    A["GPUMonitor.getStats()"] --> B{Detected Vendor?};
    B -- nvidia --> C["queryNVIDIA()"];
    B -- amd --> D["queryAMD()"];
    B -- apple --> E["queryApple()"];
    B -- intel --> F["queryIntel()"];
    B -- unknown --> G["queryGeneric()"];
    C & D & E & F & G --> H[Aggregate GPUInfo into VRAMStats];
    H --> I[Check Thresholds];
    I --> J["Emit Events (vram:warning, vram:critical)"];
    J --> K[Return VRAMStats];
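The aggregation step in the flow above might look like the following sketch. The function name aggregateStats, the trimmed-down GPUInfo shape, and the gpus field name are assumptions; usagePercent is computed here as used over total, with a guard for the case where no VRAM was reported.

```typescript
// Trimmed, hypothetical shapes: only the fields the aggregation needs.
interface GPUInfo {
  id: number;
  vramTotal: number;
  vramUsed: number;
  vramFree: number;
}

interface VRAMStats {
  totalVRAM: number;
  usedVRAM: number;
  freeVRAM: number;
  usagePercent: number;
  gpuCount: number;
  gpus: GPUInfo[];
}

function aggregateStats(gpus: GPUInfo[]): VRAMStats {
  // Sum one numeric field across all GPUs.
  const sum = (pick: (g: GPUInfo) => number): number =>
    gpus.reduce((acc, g) => acc + pick(g), 0);

  const totalVRAM = sum((g) => g.vramTotal);
  const usedVRAM = sum((g) => g.vramUsed);
  const freeVRAM = sum((g) => g.vramFree);

  return {
    totalVRAM,
    usedVRAM,
    freeVRAM,
    // Avoid dividing by zero when no GPU reported any memory.
    usagePercent: totalVRAM > 0 ? (usedVRAM / totalVRAM) * 100 : 0,
    gpuCount: gpus.length,
    gpus,
  };
}
```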
Automatic Polling and Events
The monitor can be configured to automatically poll for VRAM updates at a set interval.
- startPolling(): void: Initiates a setInterval timer that periodically calls getStats() and emits a vram:update event with the latest VRAMStats.
- stopPolling(): void: Clears the polling timer, stopping automatic updates.
- private checkThresholds(stats: VRAMStats): void: Called by getStats(), this method compares the current usagePercent against warningThreshold and criticalThreshold from the configuration, emitting vram:warning or vram:critical events as appropriate.
Emitted Events:
- vram:update (stats: VRAMStats): Emitted on every successful poll.
- vram:warning (stats: VRAMStats): Emitted when VRAM usage exceeds the warningThreshold.
- vram:critical (stats: VRAMStats): Emitted when VRAM usage exceeds the criticalThreshold.
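The threshold comparison can be sketched as a pure function that decides which event, if any, a given usage level would trigger. The name classifyUsage is hypothetical, and two behaviors are assumptions: "exceeds" is read as a strict greater-than, and only the most severe event is reported when both thresholds are crossed.

```typescript
type ThresholdEvent = "vram:warning" | "vram:critical";

// Returns the event that a usage level would trigger, or null if none applies.
function classifyUsage(
  usagePercent: number,
  warningThreshold: number,
  criticalThreshold: number,
): ThresholdEvent | null {
  // Check the critical level first so it takes precedence over the warning level.
  if (usagePercent > criticalThreshold) return "vram:critical";
  if (usagePercent > warningThreshold) return "vram:warning";
  return null;
}
```

In an EventEmitter subclass, the result would typically be passed straight to emit together with the current stats.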
Offloading Recommendations
A key feature for LLM inference is the ability to recommend how many model layers can safely reside on the GPU.
- calculateOffloadRecommendation(modelSizeMB: number, totalLayers: number, contextSize: number): OffloadRecommendation: This method takes model parameters (size, total layers, context size) and calculates an OffloadRecommendation. It estimates VRAM per layer (considering model weights and KV cache) and determines how many layers can fit within the safeVRAMLimit (total VRAM minus a configured safeBuffer).
- async getRecommendedLayers(modelSize: "3b" | "7b" | "13b" | "30b" | "70b"): Promise<number>: A convenience method that uses predefined approximate model sizes and layer counts for common LLM sizes (e.g., "7b", "13b") to return a suggested number of GPU layers.
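The layer-fitting arithmetic can be sketched as follows. The module's exact formula is not given here, so this is one plausible reading: per-layer cost is the weight share (modelSizeMB / totalLayers) plus a KV cache term, and the layer count is whatever fits under total VRAM minus safeBuffer. The function name recommendOffload, the explicit totalVRAMMB and safeBufferMB parameters, the reason strings, maxGpuLayers being equal to totalLayers, and the default of 0.015625 MB of KV cache per token per layer (fp16 keys and values at hidden size 4096) are all assumptions.

```typescript
interface OffloadRecommendation {
  shouldOffload: boolean;
  suggestedGpuLayers: number;
  maxGpuLayers: number;
  reason: string;
  estimatedVRAMUsage: number; // MB
  safeVRAMLimit: number; // MB
}

function recommendOffload(
  totalVRAMMB: number,
  safeBufferMB: number,
  modelSizeMB: number,
  totalLayers: number,
  contextSize: number,
  kvMBPerTokenPerLayer = 0.015625, // assumption: 2 * 4096 * 2 bytes per token per layer
): OffloadRecommendation {
  // Usable VRAM after reserving the safety buffer.
  const safeVRAMLimit = Math.max(0, totalVRAMMB - safeBufferMB);
  if (totalLayers <= 0) {
    return { shouldOffload: false, suggestedGpuLayers: 0, maxGpuLayers: 0,
      reason: "Model has no layers", estimatedVRAMUsage: 0, safeVRAMLimit };
  }

  // Per-layer cost: share of the weights plus this layer's KV cache at full context.
  const perLayerMB = modelSizeMB / totalLayers + contextSize * kvMBPerTokenPerLayer;
  const fitting = perLayerMB > 0 ? Math.floor(safeVRAMLimit / perLayerMB) : totalLayers;
  const suggestedGpuLayers = Math.min(totalLayers, fitting);

  const reason =
    suggestedGpuLayers === totalLayers ? "All layers fit within the safe VRAM limit"
    : suggestedGpuLayers > 0 ? "Only part of the model fits within the safe VRAM limit"
    : "Not enough VRAM for a single layer";

  return {
    shouldOffload: suggestedGpuLayers > 0,
    suggestedGpuLayers,
    maxGpuLayers: totalLayers,
    reason,
    estimatedVRAMUsage: suggestedGpuLayers * perLayerMB,
    safeVRAMLimit,
  };
}
```

With a 4000 MB model, 32 layers, and a 4096-token context, each layer costs 125 + 64 = 189 MB under these assumptions, so a 4096 MB card with a 1024 MB buffer fits floor(3072 / 189) = 16 layers.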
Reporting and Utilities
The monitor also provides methods for displaying its status.
- formatStats(): string: Generates a human-readable string summary of the last VRAM statistics, including a progress bar for each GPU.
- private createProgressBar(percent: number, width: number): string: An internal helper to generate an ASCII progress bar with color-coded emojis based on usage thresholds.
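A progress bar helper of this kind could be sketched as below. The glyphs, the bracket layout, and the extra warning and critical parameters are assumptions made for illustration; the module's helper takes only percent and width and presumably reads the thresholds from its configuration.

```typescript
// Renders a fixed-width bar with a colored status emoji in front.
// Assumes width is a non-negative integer.
function renderProgressBar(
  percent: number,
  width: number,
  warning = 70, // hypothetical default thresholds
  critical = 90,
): string {
  // Clamp so out-of-range readings cannot produce a negative repeat count.
  const clamped = Math.min(100, Math.max(0, percent));
  const filled = Math.round((clamped / 100) * width);
  const icon = clamped > critical ? "🔴" : clamped > warning ? "🟡" : "🟢";
  return `${icon} [${"█".repeat(filled)}${"░".repeat(width - filled)}] ${clamped.toFixed(1)}%`;
}
```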
Configuration and Lifecycle Management
- updateConfig(config: Partial<GPUMonitorConfig>): void: Allows runtime modification of the monitor's configuration.
- getConfig(): GPUMonitorConfig: Returns the current configuration.
- getVendor(): GPUVendor: Returns the detected GPU vendor.
- getLastStats(): VRAMStats | null: Returns the last cached VRAM statistics.
- dispose(): void: Cleans up the monitor by stopping polling and removing all event listeners.
Singleton Management
The module provides helper functions to manage a singleton instance of GPUMonitor, ensuring consistent state across the application.
- getGPUMonitor(config?: Partial<GPUMonitorConfig>): GPUMonitor: Returns the singleton GPUMonitor instance. If one doesn't exist, it creates it.
- async initializeGPUMonitor(config?: Partial<GPUMonitorConfig>): Promise<GPUMonitor>: A convenience function to get the singleton instance and then call its initialize() method. This is the recommended way to start the monitor.
- resetGPUMonitor(): void: Disposes of the current singleton instance and sets it to null, allowing a new instance to be created on the next getGPUMonitor call. This is useful for testing or re-initializing with different configurations.
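The singleton helpers follow a common lazy-initialization pattern, sketched here with a stand-in class. StubMonitor, StubConfig, STUB_DEFAULTS, getStubMonitor, and resetStubMonitor are hypothetical names; in this sketch a config passed after the instance already exists is ignored, which may or may not match the module's behavior.

```typescript
interface StubConfig {
  pollInterval: number;
  autoPoll: boolean;
}

// Hypothetical defaults for the stand-in class.
const STUB_DEFAULTS: StubConfig = { pollInterval: 5000, autoPoll: false };

class StubMonitor {
  config: StubConfig;
  disposed = false;

  constructor(config: StubConfig) {
    this.config = config;
  }

  dispose(): void {
    this.disposed = true;
  }
}

let instance: StubMonitor | null = null;

// Creates the instance on first use; later calls return the same object.
function getStubMonitor(config?: Partial<StubConfig>): StubMonitor {
  if (instance === null) {
    instance = new StubMonitor({ ...STUB_DEFAULTS, ...config });
  }
  return instance;
}

// Disposes the current instance so the next get call builds a fresh one.
function resetStubMonitor(): void {
  instance?.dispose();
  instance = null;
}
```

Resetting between test cases keeps state such as cached stats or listeners from leaking from one test into the next.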
Integration with Other Modules
The GPUMonitor is designed to be a foundational service for other parts of the application that need hardware awareness, particularly for LLM inference.
src/models/model-hub.ts
The model-hub.ts module, responsible for managing LLM models, directly interacts with the GPUMonitor to make intelligent decisions:
- getRecommendedModel: Uses getGPUMonitor, initialize, and getStats to understand available VRAM and recommend suitable models or configurations.
- selectQuantization: Leverages getGPUMonitor, initialize, and getStats to help determine the optimal quantization level for a model based on the system's VRAM capacity.
- formatRecommendations: Likely uses getGPUMonitor, initialize, and getStats to present hardware-aware recommendations to the user.
This integration ensures that LLM loading and execution are optimized for the specific hardware environment, reducing the risk of OOM errors and improving performance.
Usage Example
import { initializeGPUMonitor, getGPUMonitor, GPUMonitorConfig } from "./hardware/gpu-monitor.js";

async function main() {
  // Initialize the monitor (detects GPU, starts polling if autoPoll is true)
  const monitor = await initializeGPUMonitor({
    autoPoll: true,
    pollInterval: 2000, // Poll every 2 seconds
    warningThreshold: 70,
    criticalThreshold: 90,
    safeBuffer: 1024, // Keep 1GB free
  });
  console.log(`Detected GPU Vendor: ${monitor.getVendor()}`);

  // Subscribe to VRAM updates
  monitor.on("vram:update", (stats) => {
    console.log(`VRAM Update: ${stats.usagePercent.toFixed(1)}% used`);
    // console.log(monitor.formatStats()); // Uncomment for detailed output
  });

  // Subscribe to warning/critical events
  monitor.on("vram:warning", (stats) => {
    console.warn(`🚨 VRAM Warning: ${stats.usagePercent.toFixed(1)}% used!`);
  });
  monitor.on("vram:critical", (stats) => {
    console.error(`🔥 VRAM CRITICAL: ${stats.usagePercent.toFixed(1)}% used! Immediate action needed.`);
  });

  // Get current stats immediately
  const currentStats = await monitor.getStats();
  console.log("\nInitial GPU Status:");
  console.log(monitor.formatStats());

  // Calculate offloading recommendation for a 7B model (approx 4000MB, 32 layers)
  const modelSizeMB = 4000; // e.g., 7B Q4
  const totalLayers = 32;
  const contextSize = 4096;
  const recommendation = monitor.calculateOffloadRecommendation(modelSizeMB, totalLayers, contextSize);
  console.log("\nOffloading Recommendation for 7B Model:");
  console.log(` Should Offload: ${recommendation.shouldOffload}`);
  console.log(` Suggested GPU Layers: ${recommendation.suggestedGpuLayers}/${recommendation.maxGpuLayers}`);
  console.log(` Reason: ${recommendation.reason}`);
  console.log(` Estimated VRAM Usage: ${recommendation.estimatedVRAMUsage.toFixed(0)}MB`);
  console.log(` Safe VRAM Limit: ${recommendation.safeVRAMLimit}MB`);

  // Get recommended layers for a common model size
  const recommendedLayers7B = await monitor.getRecommendedLayers("7b");
  console.log(`\nRecommended layers for a '7b' model: ${recommendedLayers7B}`);

  // Simulate some work...
  await new Promise(resolve => setTimeout(resolve, 10000));

  // Stop polling and dispose when done
  monitor.dispose();
  console.log("\nGPU Monitor disposed.");
}

main().catch(console.error);