tests — agents
This document describes the ModelFailoverChain module, located at src/agents/model-failover.ts. While the provided source code is a test file (tests/agents/model-failover.test.ts), this documentation focuses on the core ModelFailoverChain class and its associated components, which are thoroughly tested by the provided suite.
Model Failover Chain Module (src/agents/model-failover.ts)
This module provides a robust mechanism for managing and failing over between multiple Large Language Model (LLM) providers. It tracks the health and failure status of each configured provider, allowing applications to gracefully switch to an alternative when a primary provider experiences issues (e.g., rate limits, service outages).
Purpose
The ModelFailoverChain class is designed to:
- Maintain a list of available LLM providers and their current operational status.
- Automatically select the next healthy provider in a round-robin fashion.
- Mark providers as unhealthy upon failure and re-evaluate their health after a configurable cooldown period.
- Provide a clear API for updating provider status and retrieving the current chain state.
- Support convenient initialization from environment variables for common LLM providers.
Key Components
FailoverEntry (Type)
Represents the status of a single LLM provider within the failover chain. This type is used internally by ModelFailoverChain to track the state of each provider.
interface FailoverEntry {
  provider: string;     // The identifier for the LLM provider (e.g., 'grok', 'claude', 'chatgpt', 'gemini')
  model: string;        // The specific model being used (e.g., 'grok-3', 'claude-sonnet-4-20250514')
  healthy: boolean;     // True if the provider is currently considered healthy
  failures: number;     // Consecutive failures recorded for this provider
  lastChecked?: number; // Timestamp (ms) of the last time this provider was checked or failed
}
ModelFailoverChain (Class)
The central class managing the failover logic. It maintains an ordered list of FailoverEntry objects and provides methods to interact with their status.
classDiagram
class ModelFailoverChain {
-chain: FailoverEntry[]
-options: { cooldownMs: number }
+constructor(initialProviders?: FailoverEntry[], options?: object)
+addProvider(providerConfig: { provider: string, model: string }): void
+getStatus(): FailoverEntry[]
+getNextProvider(): FailoverEntry | null
+markFailed(providerName: string, reason: string): void
+markHealthy(providerName: string): void
+resetAll(): void
+static fromEnvironment(): ModelFailoverChain
}
class FailoverEntry {
+provider: string
+model: string
+healthy: boolean
+failures: number
+lastChecked?: number
}
ModelFailoverChain "1" *-- "0..*" FailoverEntry : manages
Core Functionality
constructor(initialProviders?: FailoverEntry[], options?: { cooldownMs?: number })
- Initializes a new ModelFailoverChain instance.
- Can optionally be given an array of FailoverEntry objects to pre-populate the chain.
- options.cooldownMs: Configures the duration (in milliseconds) a failed provider remains unhealthy before getNextProvider attempts to re-check it. Defaults to 5 * 60 * 1000 (5 minutes).
addProvider(providerConfig: { provider: string, model: string }): void
- Adds a new LLM provider to the failover chain.
- The provider is initially marked as healthy: true with failures: 0.
getStatus(): FailoverEntry[]
- Returns a copy of the current state of all providers in the chain.
- Each element in the array is a FailoverEntry object, reflecting its current health, failure count, and other metadata.
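The constructor default, addProvider, and getStatus can be pictured with a small stand-in class. This is a minimal sketch, not the module's code: MiniFailoverChain and getCooldownMs are hypothetical names, and copying each entry individually in getStatus is an assumption, since the text only says that a copy is returned.

```typescript
interface FailoverEntry {
  provider: string;
  model: string;
  healthy: boolean;
  failures: number;
  lastChecked?: number;
}

// Hypothetical miniature of the chain; it only models construction,
// addProvider, and getStatus as described in the text.
class MiniFailoverChain {
  private readonly chain: FailoverEntry[];
  private readonly cooldownMs: number;

  constructor(initial: FailoverEntry[] = [], options: { cooldownMs?: number } = {}) {
    this.chain = initial.map((entry) => ({ ...entry }));
    // Documented default: 5 minutes.
    this.cooldownMs = options.cooldownMs ?? 5 * 60 * 1000;
  }

  addProvider(config: { provider: string; model: string }): void {
    // New providers start healthy with no recorded failures.
    this.chain.push({ ...config, healthy: true, failures: 0 });
  }

  getStatus(): FailoverEntry[] {
    // Assumption: each entry is copied, so callers cannot mutate internal state.
    return this.chain.map((entry) => ({ ...entry }));
  }

  // Hypothetical accessor, present only so the demo can show the default.
  getCooldownMs(): number {
    return this.cooldownMs;
  }
}

const mini = new MiniFailoverChain();
mini.addProvider({ provider: "grok", model: "grok-3" });

// Mutating the returned snapshot leaves the chain's own state untouched.
const snapshot = mini.getStatus();
snapshot.forEach((entry) => {
  entry.healthy = false;
});
const fresh = mini.getStatus();
```

Under that copying assumption, fresh still reports the provider as healthy even though the snapshot was modified.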
getNextProvider(): FailoverEntry | null
- This is the primary method for obtaining an LLM provider to use.
- It iterates through the configured providers to find the next healthy one in a round-robin fashion.
- Cooldown Logic: If a provider is currently healthy: false, getNextProvider checks whether its lastChecked timestamp plus the configured cooldownMs has passed. If the cooldown has expired, the provider is temporarily marked healthy: true and returned, giving it another chance. If the subsequent call fails, it will be marked healthy: false again.
- Returns null if all providers are currently unhealthy and within their cooldown period.
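The selection and cooldown rules above can be sketched as a stand-alone function. pickNext and its start parameter are hypothetical. How the real method tracks its round-robin position is not described here, so this sketch simply scans once from a caller-supplied index and takes the current time as an argument to keep the example deterministic.

```typescript
interface FailoverEntry {
  provider: string;
  model: string;
  healthy: boolean;
  failures: number;
  lastChecked?: number;
}

// Hypothetical helper: scans the chain once, starting at `start`, and returns
// the first entry that is healthy or whose cooldown has elapsed.
function pickNext(
  chain: FailoverEntry[],
  cooldownMs: number,
  now: number,
  start = 0,
): FailoverEntry | null {
  const rotated = [...chain.slice(start), ...chain.slice(0, start)];
  for (const entry of rotated) {
    if (entry.healthy) return entry;
    const cooledDown = now - (entry.lastChecked ?? 0) >= cooldownMs;
    if (cooledDown) {
      entry.healthy = true; // give the provider another chance
      return entry;
    }
  }
  return null; // every provider is unhealthy and still cooling down
}

const cooldownMs = 5 * 60 * 1000;
const chain: FailoverEntry[] = [
  { provider: "grok", model: "grok-3", healthy: false, failures: 1, lastChecked: 1000 },
  { provider: "claude", model: "claude-sonnet-4-20250514", healthy: true, failures: 0 },
];

// One second after the failure, grok is still cooling down, so claude is chosen.
const duringCooldown = pickNext(chain, cooldownMs, 2000);
// Once the cooldown has elapsed, grok is marked healthy again and returned.
const afterCooldown = pickNext(chain, cooldownMs, 1000 + cooldownMs);
```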
markFailed(providerName: string, reason: string): void
- Updates the status of a specific provider, indicating a failure.
- Locates the provider by providerName.
- Increments its failures count.
- Sets healthy: false.
- Updates lastChecked to the current timestamp, initiating the cooldown period.
- The reason parameter is for logging or debugging purposes and is not stored in the FailoverEntry.
markHealthy(providerName: string): void
- Resets the status of a specific provider, marking it as fully operational.
- Locates the provider by providerName.
- Resets its failures count to 0.
- Sets healthy: true.
resetAll(): void
- Resets the status of all providers in the chain.
- Marks every provider as healthy: true and sets their failures count to 0.
- Useful for recovering from a widespread outage or for resetting the state during testing.
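The three status updates (markFailed, markHealthy, resetAll) amount to small state transitions on FailoverEntry objects. The sketch below writes them as free functions over an array rather than as class methods, which is a simplification for illustration. Ignoring unknown provider names and the summarize helper are both assumptions.

```typescript
interface FailoverEntry {
  provider: string;
  model: string;
  healthy: boolean;
  failures: number;
  lastChecked?: number;
}

function markFailed(chain: FailoverEntry[], providerName: string, now: number): void {
  const entry = chain.find((e) => e.provider === providerName);
  if (!entry) return; // assumption: unknown names are silently ignored
  entry.failures += 1;
  entry.healthy = false;
  entry.lastChecked = now; // starts the cooldown window
}

function markHealthy(chain: FailoverEntry[], providerName: string): void {
  const entry = chain.find((e) => e.provider === providerName);
  if (!entry) return;
  entry.failures = 0;
  entry.healthy = true;
}

function resetAll(chain: FailoverEntry[]): void {
  for (const entry of chain) {
    entry.failures = 0;
    entry.healthy = true;
  }
}

// Hypothetical helper that renders "provider:healthy:failures" for each entry.
function summarize(chain: FailoverEntry[]): string {
  return chain.map((e) => `${e.provider}:${e.healthy}:${e.failures}`).join(" ");
}

const entries: FailoverEntry[] = [
  { provider: "gemini", model: "demo-model", healthy: true, failures: 0 },
  { provider: "chatgpt", model: "demo-model", healthy: true, failures: 0 },
];

markFailed(entries, "gemini", 1000);
markFailed(entries, "gemini", 2000);
markFailed(entries, "chatgpt", 3000);
const afterFailures = summarize(entries); // "gemini:false:2 chatgpt:false:1"

markHealthy(entries, "gemini");
const afterRecovery = summarize(entries); // "gemini:true:0 chatgpt:false:1"

resetAll(entries);
const afterReset = summarize(entries); // "gemini:true:0 chatgpt:true:0"
```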
static fromEnvironment(): ModelFailoverChain
- A static factory method that constructs a ModelFailoverChain instance by checking common environment variables for API keys.
- It automatically adds providers like 'grok', 'claude', 'chatgpt', and 'gemini' if their respective API keys (GROK_API_KEY, ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY) are present in process.env.
- This provides a convenient way to configure the failover chain without explicit code, especially for common LLM integrations.
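Environment-based detection of this kind might look like the following sketch. The variable names and provider identifiers come from the list above; providersFromEnv, the ordering of the table, and every model identifier are illustrative assumptions (the 'chatgpt' and 'gemini' models are explicit placeholders, since the text does not name them). The environment is passed in as a plain object so the example runs without touching the real process.env.

```typescript
type ProviderConfig = { provider: string; model: string };

// Variable names are taken from the text; the order and model IDs are assumptions.
const ENV_PROVIDERS: Array<{ envVar: string; config: ProviderConfig }> = [
  { envVar: "GROK_API_KEY", config: { provider: "grok", model: "grok-3" } },
  { envVar: "ANTHROPIC_API_KEY", config: { provider: "claude", model: "claude-sonnet-4-20250514" } },
  { envVar: "OPENAI_API_KEY", config: { provider: "chatgpt", model: "placeholder-openai-model" } },
  { envVar: "GOOGLE_API_KEY", config: { provider: "gemini", model: "placeholder-gemini-model" } },
];

// Hypothetical helper: keeps only providers whose key is set and non-empty.
function providersFromEnv(env: Record<string, string | undefined>): ProviderConfig[] {
  return ENV_PROVIDERS
    .filter(({ envVar }) => Boolean(env[envVar]))
    .map(({ config }) => config);
}

// Only ANTHROPIC_API_KEY has a value here; an empty string counts as absent.
const detected = providersFromEnv({ ANTHROPIC_API_KEY: "sk-test", GOOGLE_API_KEY: "" });
```

In an application, the same helper would be called with process.env, and each detected configuration would be passed to addProvider.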
How it Works (Execution Flow)
When an application needs to make an LLM call:
- The application calls chain.getNextProvider() to get a suitable provider.
- The ModelFailoverChain iterates through its internal list of FailoverEntry objects, applying its health and cooldown logic.
- If a provider is returned, the application attempts to make an LLM call using that provider. If null is returned, no provider is currently available.
- If the LLM call succeeds: No further action is needed regarding the failover chain for that specific call.
- If the LLM call fails: The application should call chain.markFailed(providerName, reason) for the provider that failed. This marks the provider unhealthy and starts its cooldown period, so it is skipped until the cooldown expires.
- If a previously failed provider starts working again: The application can call chain.markHealthy(providerName) to restore its full health status, making it immediately available for selection by getNextProvider().
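The flow above can be wrapped in a retry loop. The sketch below is one possible shape, not part of the module: callWithFailover, ChainLike, maxAttempts, and the in-memory stubChain are hypothetical, and calling markHealthy after every success is an optional choice that clears the failure count of a provider that has just recovered.

```typescript
interface ProviderRef {
  provider: string;
  model: string;
}

// Minimal surface of the chain used by this sketch (method names from the text).
interface ChainLike {
  getNextProvider(): ProviderRef | null;
  markFailed(providerName: string, reason: string): void;
  markHealthy(providerName: string): void;
}

// Hypothetical wrapper: tries providers until one succeeds or none are left.
async function callWithFailover<T>(
  chain: ChainLike,
  callLLM: (p: ProviderRef) => Promise<T>,
  maxAttempts = 4,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const next = chain.getNextProvider();
    if (next === null) break; // everything is unhealthy and cooling down
    try {
      const result = await callLLM(next);
      chain.markHealthy(next.provider); // optional: clear failures after a success
      return result;
    } catch (err) {
      chain.markFailed(next.provider, err instanceof Error ? err.message : String(err));
    }
  }
  throw new Error("No healthy provider available");
}

// In-memory stand-in for the real chain, used only to make the demo runnable.
const stubEntries = [
  { provider: "grok", model: "demo-model", healthy: true },
  { provider: "claude", model: "demo-model", healthy: true },
];
const stubChain: ChainLike = {
  getNextProvider: () => stubEntries.find((e) => e.healthy) ?? null,
  markFailed: (name) => stubEntries.forEach((e) => { if (e.provider === name) e.healthy = false; }),
  markHealthy: (name) => stubEntries.forEach((e) => { if (e.provider === name) e.healthy = true; }),
};

// The fake call fails for grok, so the wrapper falls over to claude.
const attempted: string[] = [];
const demo = callWithFailover(stubChain, async (p) => {
  attempted.push(p.provider);
  if (p.provider === "grok") throw new Error("rate limited");
  return `answered by ${p.provider}`;
});
```

Capping the number of attempts keeps the loop bounded even if a chain implementation keeps returning providers that fail immediately.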
This module provides the foundational logic for building resilient LLM agent systems that can gracefully handle transient or persistent issues with individual model providers, improving the overall reliability of LLM-powered applications.