In the first quarter of 2026, the three largest LLM providers — Anthropic, OpenAI, and Google — collectively reported over 100 service incidents. Some lasted minutes. Some lasted hours. All of them caused engineering teams downstream to scramble.
Most teams responded the same way: they opened the provider's status page, refreshed it repeatedly, and waited. Their agents were down. Their users were getting errors. There was nothing they could do except wait for a company they have no contract with, no SLA from, and no influence over to restore service.
This is the single-provider problem. And it has a straightforward fix — one that most teams don't implement until after their first serious incident.
The incident data you should know
Independent monitoring services tracking Anthropic, OpenAI, and Google API health show a consistent picture heading into 2026:
| Provider | API uptime (90 days) | Downtime implied per month | Notable recent incidents |
|---|---|---|---|
| Anthropic | 99.09% (source: status.claude.com) | ~6.6 hours | Elevated errors on Opus 4.7 (May 25, 28, 30 2026) |
| OpenAI | 99.83% (source: status.openai.com) | ~1.2 hours | Global outage April 20 2026 (8,700+ user reports) |
| Google (Gemini) | Recurring degraded periods | Variable | Intermittent hanging requests recurring since Feb 2026 |
The uptime percentages look high until you do the math: 99.09% uptime means Anthropic's API was unavailable or degraded for roughly 6.6 hours every month on average (0.91% of a 730-hour month). For teams running AI agents in production, this is not theoretical. These are real hours where your agents are failing, retrying, or returning errors to users.
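The conversion from an uptime percentage to hours of downtime is simple arithmetic. Here is a minimal sketch, assuming an average month of 730 hours (365 × 24 / 12); the function name and that constant are illustrative choices, not taken from any provider's reporting.

```python
HOURS_PER_MONTH = 730  # assumption: average month, 365 * 24 / 12


def monthly_downtime_hours(uptime_percent: float) -> float:
    """Hours per month a service is unavailable at the given uptime percentage."""
    return (1 - uptime_percent / 100) * HOURS_PER_MONTH


for provider, uptime in [("Anthropic", 99.09), ("OpenAI", 99.83)]:
    print(f"{provider}: {monthly_downtime_hours(uptime):.1f} hours/month")
```

A 30-day month (720 hours) gives nearly the same numbers, so the choice of constant matters less than the uptime figure itself.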
Outages are visible. Latency degradation is invisible. Providers frequently enter degraded states where they don't return errors — they just respond in 25-40 seconds instead of the usual 3-8 seconds. Your agent doesn't fail. It just hangs, burning tokens and frustrating users while your monitoring shows green.
What a fallback chain actually is
A fallback chain is a prioritized sequence of LLM providers your agent tries in order. The primary provider handles all requests under normal conditions. When it fails — or becomes too slow — the runtime automatically routes to the next provider in the chain, without any change to your agent logic and without any manual intervention.
The key word is automatically. A fallback chain is not a runbook. It is not a Slack message telling someone to switch the API key. It is not a feature flag you toggle at 2am. It is a runtime capability that fires before your users experience a failure.
In practice, the vast majority of requests hit the primary provider and never touch the fallback chain. The chain exists for the 0.5-2% of requests where the primary is unavailable or degraded, and serving that slice from a fallback is the difference between roughly 98-99.5% availability and something close to 100% for your users.
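To see why a second provider moves availability so much, it helps to multiply out the failure probabilities. The sketch below assumes provider failures are statistically independent, which is optimistic: shared cloud infrastructure, DNS, or your own network can take several providers down at once. The function name is hypothetical, and the inputs are the 90-day uptime figures quoted earlier.

```python
def chain_availability(availabilities: list) -> float:
    """Probability that at least one provider is up, assuming independent failures."""
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1 - availability  # every provider must be down at once
    return 1 - all_down


primary_only = chain_availability([0.9909])
with_one_fallback = chain_availability([0.9909, 0.9983])
print(f"primary only:           {primary_only:.4%}")
print(f"primary + one fallback: {with_one_fallback:.4%}")
```

Under the independence assumption the two-provider chain is unavailable only when both providers are down together, so treat the result as an upper bound rather than a forecast.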
The two types of fallback triggers
Most teams who implement fallback chains only handle the obvious case: hard errors. A 500 from the API, a connection timeout, a rate limit response. These are easy to catch and easy to route around.
The more important case — the one that actually catches most real incidents — is latency-based failover.
Error-based failover
The provider returns an error response. Your runtime catches it and retries against the next provider in the chain. This handles:
- HTTP 500, 502, 503 server errors
- HTTP 429 rate limit responses (with brief backoff)
- Connection timeouts — provider unreachable entirely
- Authentication errors after API key rotation issues
- Model-specific errors (model overloaded, model deprecated)
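The list above can be reduced to a small classification function. This is a sketch only: the status codes for server errors and rate limits come from the list, while mapping authentication failures to 401/403 and the name should_failover are assumptions made for illustration.

```python
from typing import Optional

FAILOVER_STATUS_CODES = {429, 500, 502, 503}  # server errors and rate limits
AUTH_STATUS_CODES = {401, 403}  # assumption: how auth failures usually surface


def should_failover(status_code: Optional[int], timed_out: bool = False) -> bool:
    """Return True when the outcome should send the request to the next provider."""
    if timed_out or status_code is None:
        return True  # provider unreachable, or no response within the timeout
    return status_code in FAILOVER_STATUS_CODES or status_code in AUTH_STATUS_CODES


print(should_failover(503))                   # True
print(should_failover(200))                   # False
print(should_failover(None, timed_out=True))  # True
```

Client errors such as 400 are deliberately excluded: a malformed request will usually fail on the next provider too, so failing over only adds latency.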
Latency-based failover
The provider responds — but too slowly. Your runtime measures time-to-first-token and routes away from providers operating above your latency threshold. This is harder to implement correctly but catches far more real-world degradation events.
```yaml
fallback_policy:
  primary: "claude-opus-4-7"
  fallback_1: "gpt-5.4"
  fallback_2: "gemini-3.1-pro-preview"

  # Error-based triggers
  failover_on_errors: [500, 502, 503, 429, timeout]

  # Latency-based triggers
  latency_threshold_ms: 8000    # time to first token
  latency_window: 5             # consecutive slow responses before switch
  latency_recovery_min: 10      # minutes before trying primary again

  # Circuit breaker
  circuit_open_after: 3         # failures before opening circuit
  circuit_half_open_after: 60   # seconds before probing primary again
```
Don't fail over on a single slow response, because providers have natural variance. Set a window of 3-5 consecutive slow responses before switching. This prevents thrashing between providers during normal variance while still catching genuine degradation within a handful of requests.
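The consecutive-slow-response window can be tracked with a single counter. The class below is a hypothetical helper written for this article, not part of any library; its defaults mirror latency_threshold_ms and latency_window from the configuration above.

```python
class LatencyTracker:
    """Counts consecutive slow responses for one provider (illustrative helper)."""

    def __init__(self, threshold_ms: float = 8000, window: int = 5):
        self.threshold_ms = threshold_ms
        self.window = window
        self.consecutive_slow = 0

    def record(self, latency_ms: float) -> None:
        if latency_ms > self.threshold_ms:
            self.consecutive_slow += 1
        else:
            self.consecutive_slow = 0  # one normal response resets the streak

    def should_failover(self) -> bool:
        return self.consecutive_slow >= self.window


tracker = LatencyTracker(threshold_ms=8000, window=3)
for latency in [9000, 4000, 9500, 12000, 30000]:
    tracker.record(latency)
print(tracker.should_failover())  # True: the last three responses were all slow
```

Resetting the counter on any fast response is what keeps an isolated slow request from triggering a switch.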
The circuit breaker pattern
A naive fallback chain has a problem: every request to a failed primary adds latency. The runtime has to wait for the primary to timeout before trying the fallback. If your timeout is 10 seconds and the primary is down, every user request takes 10 seconds before it gets a response from the fallback.
The circuit breaker pattern solves this. After N consecutive failures, the circuit "opens" — the runtime stops sending requests to the failed provider entirely and routes directly to the fallback without waiting. After a recovery window, it sends a single probe request to check if the primary has recovered. If it succeeds, the circuit closes and traffic returns to the primary.
```python
import time


class CircuitBreaker:
    def __init__(self, provider, threshold=3, recovery_s=60):
        self.provider = provider
        self.threshold = threshold
        self.recovery_s = recovery_s
        self.failures = 0
        self.state = "closed"  # closed | open | half-open
        self.opened_at = None

    def should_route(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open":
            # Check if recovery window has passed (monotonic clock: immune to
            # wall-clock adjustments)
            elapsed = time.monotonic() - self.opened_at
            if elapsed > self.recovery_s:
                self.state = "half-open"
                return True  # Allow probe request
            return False  # Skip this provider entirely
        return True  # half-open: allow the probe

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```
With a circuit breaker in place, a fully down primary provider adds almost no latency for your users. While the circuit is open, the runtime routes directly to the fallback without attempting the primary at all; only the occasional probe request pays the cost of a failed call.
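The CircuitBreaker class above lets every request through once the recovery window has passed, so a burst of traffic can hit a provider that is still down. A variant that admits exactly one probe at a time could look like the sketch below. The class name, the probe_in_flight flag, and the injectable clock parameter are all assumptions introduced here; the code is not thread-safe and assumes a single event loop.

```python
import time
from typing import Callable


class ProbingCircuitBreaker:
    """Circuit breaker variant that admits one probe at a time when half-open."""

    def __init__(self, threshold: int = 3, recovery_s: float = 60,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.recovery_s = recovery_s
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.state = "closed"  # closed | open | half-open
        self.opened_at = 0.0
        self.probe_in_flight = False

    def should_route(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open" and self.clock() - self.opened_at >= self.recovery_s:
            self.state = "half-open"
            self.probe_in_flight = False
        if self.state == "half-open" and not self.probe_in_flight:
            self.probe_in_flight = True  # this caller becomes the probe
            return True
        return False  # open, or a probe is already running

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"
        self.probe_in_flight = False

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe reopens the circuit immediately
        if self.state == "half-open" or self.failures >= self.threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.probe_in_flight = False
```

Passing the clock in as a parameter keeps the recovery logic testable without sleeping for the full recovery window.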
The model compatibility problem
Here is the part most fallback implementations get wrong: different models have different capabilities, tool calling schemas, and output token limits. You cannot always swap claude-opus-4-7 for gpt-5.4 without adjustment.
Specific differences that matter in practice:
- Tool calling schemas differ: Anthropic uses tools with input_schema. OpenAI uses tools with function. Gemini uses tools with functionDeclarations. Your runtime must translate between these schemas transparently.
- Output token limits vary significantly: claude-opus-4-7 supports up to 128K output tokens. gpt-5.4 supports 128K. gemini-3.1-pro-preview is capped at 65K output tokens. For agents generating long outputs, a fallback to Gemini may silently truncate responses.
- Tokenization differs across providers: claude-opus-4-7's updated tokenizer produces up to 35% more tokens than 4.6 for identical input. Cross-provider variance is larger still. A prompt consuming 80K tokens on Claude may use a different count on GPT — affecting both cost and context window usage.
- Structured output reliability varies: All three major models support JSON mode, but reliability on complex nested schemas differs. What works consistently on claude-opus-4-7 may produce occasional malformed output on the fallback provider.
- Reasoning/thinking token handling: claude-opus-4-7 and gemini-3.1-pro-preview both support extended thinking modes. gpt-5.4 handles this differently via its o-series reasoning architecture. If your agent uses thinking tokens, this needs explicit handling in your adapter layer.
The cleanest approach is a provider adapter layer — a translation interface that normalizes the request and response format for each provider. Your agent logic talks to the adapter. The adapter translates to each provider's native format. When the fallback fires, only the adapter changes — your agent code is untouched.
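For the tool schema part of the adapter layer, the translation is mostly a matter of renaming and nesting fields. The sketch below uses a provider-neutral ToolSpec type that is hypothetical and defined only for this example. The output shapes follow the publicly documented tool formats as I understand them (Chat Completions style for OpenAI), so verify them against each provider's current API reference before relying on them.

```python
from dataclasses import dataclass


@dataclass
class ToolSpec:
    """Provider-neutral tool definition (hypothetical type for this sketch)."""
    name: str
    description: str
    parameters: dict  # JSON Schema describing the tool's arguments


def to_anthropic(tool: ToolSpec) -> dict:
    # Anthropic puts the JSON Schema under "input_schema"
    return {"name": tool.name, "description": tool.description,
            "input_schema": tool.parameters}


def to_openai(tool: ToolSpec) -> dict:
    # OpenAI Chat Completions wraps the definition in a "function" object
    return {"type": "function",
            "function": {"name": tool.name, "description": tool.description,
                         "parameters": tool.parameters}}


def to_gemini(tools: list) -> dict:
    # Gemini groups all declarations under one "functionDeclarations" list
    return {"functionDeclarations": [
        {"name": t.name, "description": t.description, "parameters": t.parameters}
        for t in tools
    ]}


weather = ToolSpec(
    name="get_weather",
    description="Look up the current weather for a city",
    parameters={"type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]},
)
print(to_anthropic(weather))
print(to_openai(weather))
print(to_gemini([weather]))
```

A real adapter also has to translate tool call responses back into one shape, and providers do not accept identical JSON Schema subsets, so schemas that pass on one may need trimming for another.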
Cost implications of fallback routing
One concern teams raise: fallback providers may be more expensive. This is worth thinking through clearly.
Here are the actual verified 2026 pricing numbers for the three major fallback chain models, sourced from official provider pricing pages:
| Model | Input (per M tokens) | Output (per M tokens) | Context window |
|---|---|---|---|
| claude-opus-4-7 | $5.00 | $25.00 | 1M input / 128K output |
| gpt-5.4 | $2.50 | $15.00 | 1.05M tokens |
| gemini-3.1-pro-preview | $2.00 | $12.00 | 1M tokens |
Note: OpenAI released GPT-5.5 in April 2026 at $5/$30 per M tokens — their newest flagship. GPT-5.4 at $2.50/$15 remains the recommended production workhorse for most teams due to its better cost-to-quality ratio. For a fallback chain, GPT-5.4 is the right choice for fallback 1.
The cost math: if 98% of requests hit your primary provider and 2% hit fallbacks, the fallback chain changes your total LLM spend by well under 3%. Only the rerouted 2% is priced differently, and at the list prices above both fallbacks cost less per token than the primary. The cost of not having a fallback (engineer time during incidents, user churn, SLA penalties) is usually far higher.
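The blended cost can be checked with a few lines of arithmetic. This sketch uses the per-million-token prices from the table above. The request size (2,000 input and 500 output tokens) and the assumption that every provider counts the same number of tokens for the same request are illustrative simplifications, and the function names are made up for the example.

```python
# Prices in USD per million tokens (input, output), taken from the table above
PRICES = {
    "claude-opus-4-7": (5.00, 25.00),
    "gpt-5.4": (2.50, 15.00),
    "gemini-3.1-pro-preview": (2.00, 12.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request on the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def blended_cost(traffic_share: dict, input_tokens: int, output_tokens: int) -> float:
    """Expected cost per request given the share of traffic each model serves."""
    return sum(share * request_cost(model, input_tokens, output_tokens)
               for model, share in traffic_share.items())


# Assumption: 2,000 input and 500 output tokens per request on every provider
primary_only = request_cost("claude-opus-4-7", 2000, 500)
with_fallback = blended_cost({"claude-opus-4-7": 0.98, "gpt-5.4": 0.02}, 2000, 500)
print(f"primary only:  ${primary_only:.4f} per request")
print(f"with fallback: ${with_fallback:.4f} per request")
print(f"change in spend: {with_fallback / primary_only - 1:+.2%}")
```

With these inputs the blended cost comes out slightly below the primary-only cost, because the rerouted 2% lands on a cheaper model. Tokenizer differences between providers would shift the result, but not by much when only 2% of traffic is affected.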
What this looks like in production
A real fallback chain in a production agent runtime needs to handle all of this automatically — error detection, latency measurement, circuit breaking, schema translation, cost attribution per provider. Here is a minimal but production-grade implementation pattern:
```python
import time

# circuit_breakers, latency_trackers, adapters, call_provider, log,
# LATENCY_THRESHOLD_MS, and the LLMResponse / ProviderError /
# AllProvidersFailedError types are assumed to be defined elsewhere in the runtime.
DEFAULT_CHAIN = ("claude-opus-4-7", "gpt-5.4", "gemini-3.1-pro-preview")


async def call_with_fallback(
    prompt: str,
    tools: list,
    chain: tuple = DEFAULT_CHAIN,  # immutable default avoids shared mutable state
) -> LLMResponse:
    for model in chain:
        breaker = circuit_breakers[model]

        # Skip if circuit is open
        if not breaker.should_route():
            log(f"Circuit open for {model}, skipping")
            continue

        try:
            start = time.monotonic()

            # Translate prompt + tools to provider format
            request = adapters[model].build_request(prompt, tools)

            # Call with latency tracking. This measures the full response time;
            # measuring time to first token requires a streaming call.
            response = await call_provider(model, request)
            latency_ms = (time.monotonic() - start) * 1000

            # Check latency threshold
            if latency_ms > LATENCY_THRESHOLD_MS:
                latency_trackers[model].record_slow()
                if latency_trackers[model].should_failover():
                    log(f"{model} too slow ({latency_ms:.0f}ms), failing over")
                    continue

            # Success: reset circuit and return
            breaker.record_success()
            return adapters[model].normalize_response(response, model)

        except (ProviderError, TimeoutError) as e:
            breaker.record_failure()
            log(f"{model} failed: {e}, trying next provider")
            continue

    raise AllProvidersFailedError("All providers in chain exhausted")
```
This pattern gives you automatic failover, latency-based routing, circuit breaking, and normalized responses — all transparent to your agent logic. The agent calls call_with_fallback() and gets a response. It never knows which provider served it.
The operational reality
Teams that implement fallback chains well report similar outcomes: provider incidents become background noise instead of all-hands emergencies. The on-call engineer doesn't get paged because the agent rerouted automatically within milliseconds. Nobody sits refreshing the status page.
The teams that don't implement fallback chains discover the need for them during incidents — usually at the worst possible time, when they're already dealing with a customer escalation, an angry Slack thread, and a CEO asking why the product is down.
The implementation complexity is real but bounded. The provider adapter layer takes a day or two to build correctly. The circuit breaker is another few hours. The latency tracking is straightforward. Total engineering investment: 3-5 days for a production-grade implementation.
The alternative — a manual runbook that someone has to execute while half-asleep at 3am — is not a production-grade reliability strategy. It is a temporary arrangement that eventually fails at exactly the wrong moment.
Velorith implements all of this at the runtime layer — error-based failover, latency-based routing, circuit breaking, provider adapter translation, and per-provider cost attribution — out of the box, configured in a single YAML block. Your agent logic doesn't change. You just get 99.97%+ availability regardless of which provider is having a bad day.