tests — agents

Module: tests-agents Cohesion: 0.80 Members: 0


This document describes the ModelFailoverChain module in src/agents/model-failover.ts. It was generated from the test file tests/agents/model-failover.test.ts, but it documents the ModelFailoverChain class and the related types that the test suite exercises.

Model Failover Chain Module (src/agents/model-failover.ts)

This module manages failover between multiple Large Language Model (LLM) providers. It tracks the health and failure status of each configured provider, so an application can switch to an alternative when the primary provider has problems (e.g., rate limits or service outages).

Purpose

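The ModelFailoverChain class keeps an ordered list of providers, records failures and recoveries for each one, and selects which provider an application should call next.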

Key Components

FailoverEntry (Type)

Represents the status of a single LLM provider within the failover chain. This type is used internally by ModelFailoverChain to track the state of each provider.

interface FailoverEntry {
  provider: string;      // The identifier for the LLM provider (e.g., 'grok', 'claude', 'chatgpt', 'gemini')
  model: string;         // The specific model being used (e.g., 'grok-3', 'claude-sonnet-4-20250514')
  healthy: boolean;      // True if the provider is currently considered healthy
  failures: number;      // Consecutive failures recorded for this provider
  lastChecked?: number;  // Timestamp (ms) of the last time this provider was checked or failed
}

ModelFailoverChain (Class)

The central class managing the failover logic. It maintains an ordered list of FailoverEntry objects and provides methods to interact with their status.

classDiagram
    class ModelFailoverChain {
        -chain: FailoverEntry[]
        -options: { cooldownMs: number }
        +constructor(initialProviders?: FailoverEntry[], options?: object)
        +addProvider(providerConfig: { provider: string, model: string }): void
        +getStatus(): FailoverEntry[]
        +getNextProvider(): FailoverEntry | null
        +markFailed(providerName: string, reason: string): void
        +markHealthy(providerName: string): void
        +resetAll(): void
        +static fromEnvironment(): ModelFailoverChain
    }
    class FailoverEntry {
        +provider: string
        +model: string
        +healthy: boolean
        +failures: number
        +lastChecked?: number
    }
    ModelFailoverChain "1" *-- "0..*" FailoverEntry : manages

Core Functionality

  1. constructor(initialProviders?: FailoverEntry[], options?: { cooldownMs?: number })
  2. addProvider(providerConfig: { provider: string, model: string }): void
  3. getStatus(): FailoverEntry[]
  4. getNextProvider(): FailoverEntry | null
  5. markFailed(providerName: string, reason: string): void
  6. markHealthy(providerName: string): void
  7. resetAll(): void
  8. static fromEnvironment(): ModelFailoverChain
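The methods listed above can be illustrated with a minimal, self-contained sketch. It is not the code in src/agents/model-failover.ts; several details are assumptions: the 60 000 ms default cooldown, the hypothetical `now` clock option (added only to make cooldowns testable), the selection policy in getNextProvider(), and the choice to accept but not store `reason`. fromEnvironment() is omitted because the environment variables it reads are not documented here.

```typescript
// Status of one provider, mirroring the documented FailoverEntry shape.
interface FailoverEntry {
  provider: string;
  model: string;
  healthy: boolean;
  failures: number;
  lastChecked?: number;
}

// cooldownMs is documented; `now` is a hypothetical clock hook added only
// so the cooldown behaviour can be exercised deterministically.
interface FailoverOptions {
  cooldownMs?: number;
  now?: () => number;
}

class ModelFailoverChain {
  private chain: FailoverEntry[];
  private cooldownMs: number;
  private now: () => number;

  constructor(initialProviders: FailoverEntry[] = [], options: FailoverOptions = {}) {
    // Copy entries so later mutation by the caller cannot change internal state.
    this.chain = initialProviders.map((e) => ({ ...e }));
    this.cooldownMs = options.cooldownMs ?? 60_000; // assumed default, not from the source
    this.now = options.now ?? (() => Date.now());
  }

  // New providers start healthy with no recorded failures.
  addProvider(config: { provider: string; model: string }): void {
    this.chain.push({ provider: config.provider, model: config.model, healthy: true, failures: 0 });
  }

  // Returns copies so callers cannot edit the chain through the result.
  getStatus(): FailoverEntry[] {
    return this.chain.map((e) => ({ ...e }));
  }

  // Assumed policy: walk the chain in order and return the first entry that is
  // healthy, or unhealthy but past its cooldown (eligible for a retry).
  getNextProvider(): FailoverEntry | null {
    const t = this.now();
    for (const entry of this.chain) {
      const last = entry.lastChecked;
      const cooledDown = last === undefined || t - last >= this.cooldownMs;
      if (entry.healthy || cooledDown) return { ...entry };
    }
    return null;
  }

  // Records a failure and starts (or restarts) the provider's cooldown.
  markFailed(providerName: string, reason: string): void {
    const entry = this.find(providerName);
    if (!entry) return; // unknown names are ignored in this sketch
    entry.healthy = false;
    entry.failures += 1;
    entry.lastChecked = this.now();
    void reason; // accepted to match the documented signature; not stored here
  }

  // Restores a provider so getNextProvider() can pick it immediately.
  markHealthy(providerName: string): void {
    const entry = this.find(providerName);
    if (!entry) return;
    entry.healthy = true;
    entry.failures = 0;
    entry.lastChecked = this.now();
  }

  // Clears all failure state.
  resetAll(): void {
    for (const entry of this.chain) {
      entry.healthy = true;
      entry.failures = 0;
      delete entry.lastChecked;
    }
  }

  private find(providerName: string): FailoverEntry | undefined {
    return this.chain.find((e) => e.provider === providerName);
  }
}
```

Under this assumed policy, a failed primary is retried as soon as its cooldown expires, even if a later provider is healthy. Preferring any healthy entry before a cooled-down one would be an equally plausible design; the tests in tests/agents/model-failover.test.ts are the place to confirm which one the real class uses.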

How it Works (Execution Flow)

When an application needs to make an LLM call:

  1. The application calls chain.getNextProvider() to get a suitable provider.
  2. The ModelFailoverChain iterates through its internal list of FailoverEntry objects, applying its health and cooldown logic.
  3. If a provider is returned, the application attempts to make an LLM call using that provider.
  4. If the LLM call succeeds, no further action on the failover chain is needed for that call.
  5. If the LLM call fails, the application should call chain.markFailed(providerName, reason) for the provider that failed. This updates the provider's status, making it less likely to be chosen immediately again, and starts its cooldown period.
  6. If a previously failed provider starts working again, the application can call chain.markHealthy(providerName) to restore its healthy status, making it immediately available for selection by getNextProvider().
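The six steps above map onto a small retry loop. This is a hedged sketch rather than code from the module: `callWithFailover`, `callModel`, `FailoverChainLike`, `ProviderChoice`, and `maxAttempts` are hypothetical names, and only the two chain methods the loop needs are modelled, so any object with getNextProvider() and markFailed() fits. `maxAttempts` bounds the loop in case the chain keeps offering the same provider (for example with a zero cooldown).

```typescript
// The part of a chain entry the caller needs to make a request.
interface ProviderChoice {
  provider: string;
  model: string;
}

// Minimal structural view of ModelFailoverChain used by the loop.
interface FailoverChainLike {
  getNextProvider(): ProviderChoice | null;
  markFailed(providerName: string, reason: string): void;
}

// Hypothetical helper: keeps asking the chain for a provider until a call
// succeeds, the chain has nothing available, or the attempt budget runs out.
// `callModel` stands in for whatever LLM client the application uses.
async function callWithFailover(
  chain: FailoverChainLike,
  callModel: (choice: ProviderChoice) => Promise<string>,
  maxAttempts = 4,
): Promise<string> {
  let lastError: unknown = "no provider available";
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const choice = chain.getNextProvider(); // steps 1-2: pick a provider
    if (choice === null) break; // every provider is unhealthy and cooling down
    try {
      return await callModel(choice); // steps 3-4: success needs no bookkeeping
    } catch (err) {
      lastError = err;
      chain.markFailed(choice.provider, String(err)); // step 5: record the failure
    }
  }
  throw new Error(`All providers failed: ${String(lastError)}`);
}
```

Step 6 is left to the application: a health check or a later successful call would be the natural place to call markHealthy().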

Together, these methods give LLM agent code one place to decide which provider to call and to route around transient or persistent failures of individual model providers.