src — agents

Module: src-agents Cohesion: 0.80 Members: 0

The src/agents/model-failover.ts module manages an ordered chain of Large Language Model (LLM) providers and fails over automatically when a provider becomes unavailable or unhealthy. This keeps applications that depend on external LLM services running when an individual provider suffers an outage or hits a rate limit.

Overview

This module defines the ModelFailoverChain class, which maintains an ordered list of LLM providers. When an application needs to interact with an LLM, it requests the "next available" provider from the chain. If a provider fails, it can be marked as unhealthy, and the chain will automatically attempt to use the next healthy provider. Unhealthy providers are put into a cooldown period before being re-evaluated for health.

Core Concepts

FailoverEntry

The FailoverEntry interface defines the structure for each individual LLM provider in the failover chain. It includes both configuration details and dynamic health status:

export interface FailoverEntry {
  provider: string; // Unique identifier for the provider (e.g., 'grok', 'claude')
  model: string;    // The specific model to use with this provider (e.g., 'grok-3', 'gpt-4o')
  apiKey?: string;  // API key for authentication (can be a direct key or an env var name)
  baseURL?: string; // Optional custom base URL for the API endpoint

  // Internal state for failover logic
  healthy: boolean;
  lastError?: string;
  lastChecked?: number; // Timestamp of last status update (healthy or failed)
  consecutiveFailures: number;
}

FailoverConfig

The FailoverConfig interface allows customization of the failover behavior:

export interface FailoverConfig {
  maxRetries: number;            // (Currently external) Max consecutive failures before a provider is considered truly down.
  cooldownMs: number;            // Time (in milliseconds) an unhealthy provider waits before being eligible for re-evaluation.
  healthCheckIntervalMs: number; // (Currently external) Interval for proactive health checks.
}

It's important to note that maxRetries and healthCheckIntervalMs are currently intended for external orchestration. The ModelFailoverChain itself tracks consecutiveFailures but doesn't automatically transition a provider to unhealthy based on maxRetries. Similarly, healthCheckIntervalMs is a configuration hint for an external system to perform proactive health checks, rather than an internal timer within ModelFailoverChain. The cooldownMs is actively used by getNextProvider to re-enable providers.
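Because maxRetries is left to external orchestration, a caller has to apply the threshold itself. The following is a minimal sketch of one way to do that, working only from the array shape that getStatus() is documented to return. The helper name providersOverRetryLimit is hypothetical, and the sketch assumes the failures field mirrors consecutiveFailures.

```typescript
// Shape of each element returned by getStatus(), as documented for ModelFailoverChain.
interface ProviderStatus {
  provider: string;
  model: string;
  healthy: boolean;
  failures: number; // assumed here to mirror consecutiveFailures
}

// Hypothetical helper (not part of the module): returns the names of providers
// whose failure count has reached the externally enforced maxRetries threshold.
function providersOverRetryLimit(statuses: ProviderStatus[], maxRetries: number): string[] {
  return statuses
    .filter((s) => s.failures >= maxRetries)
    .map((s) => s.provider);
}

// Illustrative data only; in practice this would come from chain.getStatus().
const sampleStatuses: ProviderStatus[] = [
  { provider: 'grok', model: 'grok-3', healthy: false, failures: 3 },
  { provider: 'my-fallback', model: 'model-b', healthy: true, failures: 0 },
];

console.log(providersOverRetryLimit(sampleStatuses, 3));
```

An orchestrator could use the returned names to alert an operator or to stop routing traffic to those providers; what it does with them is outside the scope of ModelFailoverChain.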

ModelFailoverChain Class

The ModelFailoverChain class is the central component for managing the failover logic.

Constructor

constructor(chain?: Partial<FailoverEntry>[], config?: Partial<FailoverConfig>)

Initializes a new ModelFailoverChain instance.

Key Methods

addProvider(entry: Omit): void

Adds a new provider to the end of the failover chain. The healthy status is set to true and consecutiveFailures to 0 by default.

getNextProvider(): FailoverEntry | null

This is the core method for retrieving an LLM provider. It iterates through the chain to find the next available provider:

  1. It returns the first healthy provider it encounters.
  2. If an unhealthy provider's cooldownMs has elapsed since its lastChecked timestamp, it is automatically marked healthy again, its consecutiveFailures reset, and then returned.
  3. If no provider is healthy or past its cooldown, it returns null.
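The three rules above can be sketched as a standalone function. This is an illustration of the described behavior, not the module's actual source: the name pickNextProvider, the trimmed-down ChainEntry type, the explicit now parameter, the use of >= for the cooldown comparison, and the choice to skip unhealthy entries that have no lastChecked timestamp are all assumptions.

```typescript
// Trimmed-down stand-in for FailoverEntry, keeping only the fields the rules use.
interface ChainEntry {
  provider: string;
  healthy: boolean;
  lastChecked?: number;
  consecutiveFailures: number;
}

// Walk the chain in order. Return the first entry that is healthy, or that is
// unhealthy but past its cooldown (reviving it first). Return null otherwise.
function pickNextProvider(
  chain: ChainEntry[],
  cooldownMs: number,
  now: number = Date.now(),
): ChainEntry | null {
  for (const entry of chain) {
    if (entry.healthy) return entry;
    const cooledDown =
      entry.lastChecked !== undefined && now - entry.lastChecked >= cooldownMs;
    if (cooledDown) {
      entry.healthy = true;
      entry.consecutiveFailures = 0;
      return entry;
    }
  }
  return null;
}

// Illustrative chain: the primary failed at t = 1000 ms, the fallback is healthy.
const demoChain: ChainEntry[] = [
  { provider: 'my-primary', healthy: false, lastChecked: 1000, consecutiveFailures: 2 },
  { provider: 'my-fallback', healthy: true, consecutiveFailures: 0 },
];

// With a 60 s cooldown, at t = 2000 ms the primary is still cooling down,
// so the fallback is selected.
console.log(pickNextProvider(demoChain, 60000, 2000)?.provider);
```

Note that a revived provider is returned on the strength of elapsed time alone; nothing has verified that it actually recovered, so the caller still needs to report the outcome of the next call.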

markFailed(provider: string, error: string): void

Updates the status of a specified provider to unhealthy.

markHealthy(provider: string): void

Resets the status of a specified provider to healthy.
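The state changes behind markFailed and markHealthy can be sketched as two small functions over a single entry. This is a hedged illustration built from the field descriptions in FailoverEntry: the names applyFailure and applyRecovery are hypothetical, and clearing lastError on recovery is an assumption rather than documented behavior.

```typescript
// Stand-in for the health-related fields of FailoverEntry.
interface HealthState {
  provider: string;
  healthy: boolean;
  lastError?: string;
  lastChecked?: number;
  consecutiveFailures: number;
}

// Sketch of what markFailed does to the matching entry.
function applyFailure(entry: HealthState, error: string, now: number = Date.now()): void {
  entry.healthy = false;
  entry.lastError = error;
  entry.lastChecked = now; // the cooldown is measured from this timestamp
  entry.consecutiveFailures += 1;
}

// Sketch of what markHealthy does to the matching entry.
function applyRecovery(entry: HealthState, now: number = Date.now()): void {
  entry.healthy = true;
  delete entry.lastError; // assumption: the stale error message is cleared
  entry.lastChecked = now;
  entry.consecutiveFailures = 0;
}

// Two failures in a row followed by a recovery.
const demoEntry: HealthState = { provider: 'grok', healthy: true, consecutiveFailures: 0 };
applyFailure(demoEntry, 'rate limited', 100);
applyFailure(demoEntry, 'rate limited', 200);
console.log(demoEntry.consecutiveFailures, demoEntry.healthy);
applyRecovery(demoEntry, 300);
console.log(demoEntry.consecutiveFailures, demoEntry.healthy);
```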

resetAll(): void

Resets the health status of all providers in the chain to healthy, clearing all failure-related state.

getStatus(): Array<{ provider: string; model: string; healthy: boolean; failures: number }>

Returns a simplified array of objects representing the current health status of each provider in the chain, useful for monitoring or debugging.

Static Factory Method: fromEnvironment(): ModelFailoverChain

This static method provides a convenient way to initialize a ModelFailoverChain based on environment variables. It checks for common API keys and automatically adds corresponding providers to the chain.

The current implementation checks for:

This method simplifies setup by allowing developers to configure their LLM providers purely through environment variables.
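The general pattern behind an environment-driven factory can be sketched as follows. This is not the module's fromEnvironment implementation, and it does not reproduce the variables that method checks. The function entriesFromEnv, the EnvProviderSpec type, and the variable names PRIMARY_API_KEY and FALLBACK_API_KEY (borrowed from the usage example in this document) are illustrative assumptions.

```typescript
// Hypothetical description of one provider that can be enabled via an env var.
interface EnvProviderSpec {
  envVar: string;
  provider: string;
  model: string;
}

// Minimal entry shape produced by the sketch.
interface EnvEntry {
  provider: string;
  model: string;
  apiKey: string;
}

// Build chain entries, in spec order, for every spec whose env var is set
// to a non-empty value. Unset or empty variables are skipped.
function entriesFromEnv(
  specs: EnvProviderSpec[],
  env: Record<string, string | undefined>,
): EnvEntry[] {
  const entries: EnvEntry[] = [];
  for (const spec of specs) {
    const key = env[spec.envVar];
    if (key) {
      entries.push({ provider: spec.provider, model: spec.model, apiKey: key });
    }
  }
  return entries;
}

// Illustrative specs; the real set of variables is defined by the module.
const demoSpecs: EnvProviderSpec[] = [
  { envVar: 'PRIMARY_API_KEY', provider: 'my-primary', model: 'model-a' },
  { envVar: 'FALLBACK_API_KEY', provider: 'my-fallback', model: 'model-b' },
];

console.log(entriesFromEnv(demoSpecs, { PRIMARY_API_KEY: 'example-key' }));
```

Passing the environment in as a plain object, rather than reading process.env inside the function, keeps the sketch easy to test; in an application the second argument would typically be process.env.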

How it Works

The ModelFailoverChain operates as a stateful manager for a list of LLM providers.

  1. Initialization: A chain is created, either empty or pre-populated with FailoverEntry objects.
  2. Provider Request: When an LLM call is needed, the application calls getNextProvider().
  3. Health Check: getNextProvider() iterates through the configured providers in order and returns the first one that is healthy or whose cooldown has elapsed, or null if none qualifies.
  4. Failure Reporting: If an API call to the returned provider fails, the application must call markFailed(providerName, errorMessage) to update the provider's status in the chain. This marks the provider as unhealthy, increments its consecutiveFailures, and starts its cooldown timer.
  5. Recovery: Once its cooldownMs period has elapsed, an unhealthy provider becomes eligible again: the next getNextProvider() call that reaches it marks it healthy, resets its consecutiveFailures, and returns it.

Failover Flow

graph TD
    A[Application Needs LLM] --> B{"Call getNextProvider()"};
    B --> C{Is current provider healthy?};
    C -- Yes --> D[Return Provider];
    C -- No --> E{Has cooldownMs passed for current provider?};
    E -- Yes --> F[Mark Provider Healthy, Reset Failures];
    F --> D;
    E -- No --> G{More providers in chain?};
    G -- Yes --> C;
    G -- No --> H[Return null];

    D --> I[Attempt LLM Call];
    I -- Success --> J[Continue Application];
    I -- Failure --> K["Call markFailed(provider, error)"];
    K --> L[Provider marked Unhealthy, Cooldown starts];
    L --> A;

Integration and Usage

This module is designed to be integrated into an LLM orchestration layer or agent system.

  1. Set Up the Chain:
    import { ModelFailoverChain, FailoverEntry } from './src/agents/model-failover';

    // Option 1: From environment variables
    const failoverChain = ModelFailoverChain.fromEnvironment();

    // Option 2: Manually
    const customChain = new ModelFailoverChain([
      { provider: 'my-primary', model: 'model-a', apiKey: process.env.PRIMARY_API_KEY },
      { provider: 'my-fallback', model: 'model-b', apiKey: process.env.FALLBACK_API_KEY },
    ], { cooldownMs: 30000 }); // 30-second cooldown

  2. Get a Provider and Make a Call:
    async function makeLLMCall(prompt: string): Promise<string | null> {
      const providerEntry = failoverChain.getNextProvider();

      if (!providerEntry) {
        console.error('No healthy LLM providers available.');
        return null;
      }

      try {
        // Example: Use providerEntry.provider, providerEntry.model, providerEntry.apiKey
        // to initialize an LLM client and make a call.
        console.log(`Attempting call with ${providerEntry.provider} (${providerEntry.model})...`);
        // const llmClient = new LLMClient(providerEntry); // Hypothetical client
        // const response = await llmClient.generate(prompt);
        const response = await simulateLLMCall(providerEntry, prompt); // Placeholder

        failoverChain.markHealthy(providerEntry.provider); // Mark healthy on success
        return response;

      } catch (error: unknown) {
        const message = error instanceof Error ? error.message : String(error);
        console.error(`Call to ${providerEntry.provider} failed: ${message}`);
        failoverChain.markFailed(providerEntry.provider, message); // Mark failed on error
        // Optionally, retry with the next provider immediately or let the next call handle it.
        // The recursion ends once getNextProvider() returns null.
        return makeLLMCall(prompt); // Recursive retry for demonstration
      }
    }

    // Placeholder for actual LLM interaction
    async function simulateLLMCall(entry: FailoverEntry, prompt: string): Promise<string> {
      // Simulate success or failure based on some condition
      if (entry.provider === 'grok' && Math.random() < 0.3) { // Grok fails 30% of the time
        throw new Error('Simulated Grok API error');
      }
      if (entry.provider === 'claude' && Math.random() < 0.1) { // Claude fails 10% of the time
        throw new Error('Simulated Claude API error');
      }
      return `Response from ${entry.provider} using model ${entry.model} for prompt: "${prompt}"`;
    }

    // Example usage
    (async () => {
      console.log('Initial status:', failoverChain.getStatus());
      for (let i = 0; i < 10; i++) {
        const result = await makeLLMCall('Tell me a short story.');
        console.log(`Attempt ${i + 1}:`, result);
        console.log('Current status:', failoverChain.getStatus());
        await new Promise(resolve => setTimeout(resolve, 1000)); // Wait a bit
      }
    })();
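The recursive retry above is convenient for a demonstration, but production code may prefer an explicit cap on attempts. The following is a minimal sketch of an iterative alternative. It is written against a small structural interface rather than the real class so that it stands alone; FailoverChainLike, ProviderRef, callWithFailover, and the maxAttempts parameter are all hypothetical names, and only the three methods documented above are assumed.

```typescript
// Minimal structural view of a provider entry and of the chain methods used here.
interface ProviderRef {
  provider: string;
  model: string;
}

interface FailoverChainLike {
  getNextProvider(): ProviderRef | null;
  markFailed(provider: string, error: string): void;
  markHealthy(provider: string): void;
}

// Try providers in failover order, up to maxAttempts calls in total.
// Resolves to the first successful result, or null if every attempt fails
// or the chain runs out of available providers.
async function callWithFailover<T>(
  chain: FailoverChainLike,
  call: (entry: ProviderRef) => Promise<T>,
  maxAttempts: number,
): Promise<T | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const entry = chain.getNextProvider();
    if (entry === null) return null; // nothing healthy or cooled down
    try {
      const result = await call(entry);
      chain.markHealthy(entry.provider); // report success back to the chain
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      chain.markFailed(entry.provider, message); // next iteration skips this provider
    }
  }
  return null;
}
```

Because ModelFailoverChain exposes these three methods, an instance should be usable wherever a FailoverChainLike is expected, for example callWithFailover(failoverChain, (entry) => simulateLLMCall(entry, prompt), 3), provided simulateLLMCall is typed to accept the narrower ProviderRef shape. The cap bounds the work done per request even when providers come back from cooldown while the loop is running.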

Connections to the Codebase

While the provided call graph primarily shows usage from test files (model-failover.test.ts, fallback-chain.test.ts), this module is designed as a foundational utility for any part of the application that needs to interact with external LLM providers.