Your LLM Provider Will Go Down. The Question Is Whether Your Agent Goes With It.
On April 20, 2026, OpenAI's routing layer ran out of memory. Readiness checks started failing, capacity collapsed under the inbound traffic, and ChatGPT, Codex, and the API went dark for two hours and thirty-five minutes. The public incident post-mortem tells the story in the dry register of a status page. The story your agents tell is louder. They retry. They retry again. They start filling tool-call slots with timed-out futures. They escalate to a human queue that is now flooded. Your customer-facing copy says "we're having trouble — please try again in a moment" for one hundred and fifty-five minutes.
This is the wrong failure mode.
Hosted LLM APIs do not have the uptime profile of traditional cloud infrastructure. Universal Cloud's analysis measured 99.3% uptime on a representative LLM API — over five hours of downtime per month, roughly a factor of seventy worse than the four-nines numbers operators are conditioned to expect from their database, cache, or object store. The same analysis put median time-to-repair for the OpenAI API at 1.23 hours and the Anthropic API at 0.77 hours. The corollary that matters more than the absolute numbers: LLM outages are global. There is no us-east-1-to-us-west-2 failover. When the provider goes, the provider goes everywhere.
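The arithmetic behind those figures is quick to verify:

```python
# Back-of-the-envelope check on the uptime numbers above. 730 is the
# average number of hours in a month.
HOURS_PER_MONTH = 730

def monthly_downtime_hours(uptime: float) -> float:
    return (1 - uptime) * HOURS_PER_MONTH

llm_api = monthly_downtime_hours(0.993)      # ~5.1 hours/month
four_nines = monthly_downtime_hours(0.9999)  # ~4.4 minutes/month

print(f"99.3%  -> {llm_api:.1f} h/month")            # 5.1 h/month
print(f"99.99% -> {four_nines * 60:.1f} min/month")  # 4.4 min/month
print(f"ratio  -> {llm_api / four_nines:.0f}x")      # 70x
```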
If you treat your agent as a thin wrapper around a single hosted model, every one of those minutes is a minute your agent is degraded. The pattern your platform team needs is older than agents and well-understood in SRE: a circuit breaker per provider, an error budget per tenant, and a policy that decides what an agent is allowed to do when the budget runs out.
The Three-State Pattern, Applied to a Provider
A circuit breaker is not exotic. Google's SRE workbook describes the primitive, Netdata's overview walks through the math, and the Hystrix-era pattern has been around long enough that every backend engineer has implemented it at least once. What makes the LLM application interesting is the variable being protected. Traditional breakers protect downstream services from cascading load. LLM breakers also protect the agent's behavior — because an agent that keeps retrying a degraded model does not just waste budget, it produces lower-quality output that flows into trust scores, audit logs, and downstream tool calls before anyone notices the provider was the problem.
Cortex's circuit breaker is the standard three-state machine, scoped per provider-and-model pair so a partial outage on one model does not blackhole traffic on another (a minimal sketch follows the list):
Closed. Normal operation. Every request flows, and every success and failure is counted. Consecutive failures are tolerated up to a configurable threshold (the default is five); reaching the threshold trips the breaker.
Open. Tripped. Requests fail fast — no call to the provider, no waiting on a 30-second timeout, no retry storm. The breaker stays open for a configurable cooldown; the default is sixty seconds. This is the state that prevents the cascade. A failing provider does not get re-hammered by every agent at once.
Half-open. After the cooldown, exactly one request gets a probe. If it succeeds — and then succeeds again — the breaker closes and traffic resumes. If it fails, the breaker re-opens for another sixty seconds. The half-open state is what lets the system notice the provider is back without staying down a second longer than the cooldown requires.
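Here is a minimal in-memory sketch of that state machine, assuming nothing about Cortex's internals beyond the three states and the defaults described above; the class and method names are illustrative, not the actual API:

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    """Three-state breaker for a single provider/model pair (illustrative)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0,
                 successes_to_close: int = 2):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.successes_to_close = successes_to_close
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.probe_successes = 0
        self.probe_in_flight = False
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.state is State.OPEN:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                return False                  # fail fast: no provider call at all
            self.state = State.HALF_OPEN      # cooldown elapsed: permit one probe
            self.probe_successes = 0
            self.probe_in_flight = True
            return True
        if self.state is State.HALF_OPEN:
            if self.probe_in_flight:
                return False                  # one probe at a time
            self.probe_in_flight = True
            return True
        return True                           # CLOSED: everything flows

    def record_success(self) -> None:
        self.consecutive_failures = 0
        if self.state is State.HALF_OPEN:
            self.probe_in_flight = False
            self.probe_successes += 1
            if self.probe_successes >= self.successes_to_close:
                self.state = State.CLOSED     # succeeded twice: resume traffic

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()                      # failed probe: re-open for another cooldown
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = State.OPEN
        self.opened_at = time.monotonic()
        self.consecutive_failures = 0
        self.probe_in_flight = False
```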
Across providers, the registry is keyed by provider/model, persisted to Redis so a multi-worker fleet shares state instead of each worker re-learning the outage independently, and exposed via a monitoring endpoint that returns the current state and recent failure rate per breaker. An admin can also force-trip or manually reset — useful when you know the provider is down before the breaker has caught up, and useful when you know the provider is back before the cooldown has expired.
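How that shared state gets persisted is an implementation detail, but a sketch with redis-py shows the shape; the key format and field names here are assumptions, not Cortex's schema:

```python
import time
import redis

r = redis.Redis(decode_responses=True)

def breaker_key(provider: str, model: str) -> str:
    return f"breaker:{provider}/{model}"   # one breaker per provider/model pair

def load_breaker(provider: str, model: str) -> dict:
    # Every worker reads the same record instead of re-learning the outage.
    return r.hgetall(breaker_key(provider, model)) or {
        "state": "closed", "consecutive_failures": "0", "opened_at": "0",
    }

def save_breaker(provider: str, model: str, state: str,
                 failures: int, opened_at: float = 0.0) -> None:
    r.hset(breaker_key(provider, model), mapping={
        "state": state,
        "consecutive_failures": failures,
        "opened_at": opened_at,
    })

def force_trip(provider: str, model: str) -> None:
    # Admin override: open the breaker before it catches up on its own.
    save_breaker(provider, model, "open", 0, time.time())

def force_reset(provider: str, model: str) -> None:
    # Admin override: close the breaker before the cooldown expires.
    save_breaker(provider, model, "closed", 0)
```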
Error Budgets Decide What Happens Next
A breaker tells you what to do this second. An error budget tells you what to do this hour.
The error budget is the operator-facing version of the SLO. For an availability target of 99.9%, the budget is 0.1% of total requests — in a 24-hour rolling window with ten thousand requests, that's ten allowable failures. Burn through them in the first hour and you have a problem the breaker alone won't solve. The breaker will eventually re-close. The budget says the model is failing badly enough that something upstream has to change.
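The budget and burn-rate arithmetic is worth writing out once:

```python
# Reproducing the numbers from the paragraph above.
slo = 0.999
window_requests = 10_000
budget = (1 - slo) * window_requests   # 10 allowable failures in the window

# Burn rate: share of budget consumed relative to share of window elapsed.
failures_so_far = 10
hours_elapsed = 1.0
window_hours = 24.0
burn_rate = (failures_so_far / budget) / (hours_elapsed / window_hours)
print(burn_rate)   # 24.0 -- a full day's budget gone in the first hour
```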
The default action when a tenant's budget is exhausted is restrict. Agents do not stop running, but their autonomy is downgraded — long-running multi-step plans are paused, tool calls that require high confidence are routed to human review, and the trust tier of every agent in the affected tenant moves toward the conservative end of its scoring profile. The alternative action is alert only: log the burn, page the operator, and leave the autonomy unchanged. Both have legitimate uses. A customer-support agent on a Pro tenant probably wants restrict. A backoffice agent on a Max tenant with a human reviewer already in the loop probably wants alert only and a Slack ping. The choice is in the tenant's LLM configuration, alongside the availability target, the latency p95 target, and the rolling window length.
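A hypothetical shape for that tenant configuration, and the policy decision as a function of it; the field names and action strings are illustrative, not Cortex's schema:

```python
# Illustrative per-tenant SLO configuration (assumed field names).
tenant_llm_config = {
    "availability_target": 0.999,
    "latency_p95_target_ms": 2000,
    "window_hours": 24,
    "budget_exhausted_action": "restrict",   # or "alert_only"
}

def on_budget_exhausted(config: dict) -> list[str]:
    if config["budget_exhausted_action"] == "restrict":
        return [                               # agents keep running; autonomy shrinks
            "pause_multi_step_plans",
            "route_high_confidence_tool_calls_to_human_review",
            "downgrade_trust_tier",
        ]
    return ["log_burn", "page_operator"]       # alert_only: autonomy unchanged
```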
There is a second-order benefit that is easy to miss the first time you read the math. With two independent LLM providers each operating at 99.3% uptime, the probability of both being down simultaneously is roughly 0.0049% — an effective uptime around 99.995%. The breaker is what makes multi-provider fallback possible. Without it, a request that times out on provider A and retries on provider B has already spent the agent's latency budget on A. With it, the breaker for A is open, the call falls through to B in milliseconds, and the agent meets its SLA even when the provider doesn't.
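A sketch of breaker-aware fallback, reusing the CircuitBreaker from the earlier sketch; send stands in for whatever transport your provider SDKs give you:

```python
from typing import Callable

class AllProvidersUnavailable(Exception):
    pass

def call_with_fallback(
    breakers: dict,     # (provider, model) -> CircuitBreaker
    chain: list,        # e.g. [("provider_a", "model_x"), ("provider_b", "model_y")]
    prompt: str,
    send: Callable[[str, str, str], str],
) -> str:
    for provider, model in chain:
        breaker = breakers[(provider, model)]
        if not breaker.allow():
            continue    # open breaker: skipped in microseconds, no timeout spent
        try:
            reply = send(provider, model, prompt)
        except Exception:
            breaker.record_failure()
            continue    # this provider is degrading; try the next hop
        breaker.record_success()
        return reply
    raise AllProvidersUnavailable("every provider in the chain is open or failing")
```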
What This Looks Like When You Wire It In
You instrument the LLM call site. Before each request, you ask that provider-and-model's breaker whether the request is allowed. If yes, you make the call. On success you record the latency. On failure — timeout, 5xx, content-policy rejection that should not have happened, malformed response — you record the failure. The breaker does the rest. A monitoring endpoint exposes the current state of every breaker, the failure rate over the window, and the time remaining before an open breaker tries its half-open probe. The error budget endpoint exposes the same data at the SLO layer: total requests, failures, budget consumed, burn rate, and an estimated time-to-exhaustion at the current rate.
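In code, the call-site wiring might look like the following; the exception type and the validation helper are illustrative stand-ins for whatever your client raises and checks:

```python
import time

class ProviderUnavailable(Exception):
    """The structured 'known degraded state' an agent can branch on."""

def is_well_formed(response: dict) -> bool:
    return bool(response.get("content"))   # stand-in for real validation

def instrumented_call(breaker, send, request) -> tuple:
    if not breaker.allow():
        raise ProviderUnavailable("breaker open: fail fast, no timeout spent")
    start = time.monotonic()
    try:
        response = send(request)           # timeouts and 5xx raise here
    except Exception:
        breaker.record_failure()
        raise
    # A 200 can still be a failure: malformed output or a spurious
    # content-policy rejection counts against the breaker too.
    if not is_well_formed(response):
        breaker.record_failure()
        raise ProviderUnavailable("degraded response from provider")
    breaker.record_success()
    return response, (time.monotonic() - start) * 1000   # latency in ms
```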
The pattern composes with the rest of the trust layer. When a breaker trips, the event is recorded to Vault — every transition between closed, open, and half-open is part of the chain, which means an auditor can reconstruct exactly which provider failed when and how the system responded. When the error budget is exhausted and restrict fires, the autonomy downgrade is a scored event in Gate — the same trust scoring that catches behavioral drift catches infrastructure-driven drift. When the breaker is forced open by an admin, that action is signed by the operator's credential. None of this is bolted on. It is the same trust surface used for every other policy decision, with the provider as the subject of the decision instead of the agent.
The Failure Mode You Want
The right failure mode looks like this. An LLM provider has an outage. Within five consecutive failures — typically under thirty seconds — the breaker opens. New requests for the affected model fail fast and either fall through to the configured fallback chain or return a structured provider unavailable response that the calling agent treats as a known degraded state rather than an exception to retry. The tenant's error budget begins to deplete; if the outage is short, the budget absorbs it and nothing else changes. If the outage is long enough to exhaust the budget, autonomy auto-downgrades and human reviewers are pulled in for the work that requires high confidence. The status of all of this is visible on a dashboard that does not require reading the provider's status page to interpret.
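One plausible shape for that structured provider-unavailable response; this is illustrative, not Cortex's wire format:

```python
provider_unavailable = {
    "error": "provider_unavailable",
    "provider": "provider_a",      # placeholder name
    "model": "model_x",
    "breaker_state": "open",
    "retry_after_s": 42,           # time until the next half-open probe
    "fallback_exhausted": True,    # the whole chain failed, not just one hop
}
```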
Two hours and thirty-five minutes is a long time. It is too long to spend silently retrying. The pattern that survives it is older than the model your agent is talking to.
Get started: Cortex circuit breakers are on by default for every tenant. The SLO target, error budget window, and budget-exhausted action are configurable per tenant in your LLM configuration. The SRE dashboard at /v1/analytics/sre/dashboard exposes current breaker state, error budgets, and burn rates. Sign up for the free tier to instrument your agents — the same Gate events that drive trust scoring drive the breaker state, which means you get this for free the moment you start sending events.