Intelligent LLM routing with safety controls. Cortex decides how agent work should run once trust has been established -- routing queries to the right model, managing provider failures, enforcing budgets, and keeping Guard active throughout execution.
Plan requirement: Cortex is available on all plans. Some features (budget enforcement, advanced observability) may vary by plan tier.
Cortex supports six LLM providers out of the box:
| Provider | Models | Auth |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4-turbo, o1, o3-mini | API key |
| Anthropic | Claude Sonnet 4, Claude Haiku 3.5 | API key |
| Google | Gemini 2.5 Pro, Gemini 2.5 Flash | API key |
| Azure OpenAI | GPT-4o, GPT-4-turbo (via Azure deployments) | API key + endpoint |
| Mistral | Mistral Large, Mistral Small | API key |
| LiteLLM / Ollama | Any model supported by LiteLLM or local Ollama instance | API key or local |
Configure providers in the agent's LLM settings (Manage Agent > Settings) or via the agent config API.
Cortex classifies each incoming query as simple or complex and routes it to the appropriate model. Each agent has a simple model and a complex model configured; the classifier evaluates each query and selects between them, as sketched below. This reduces cost by 40-70% for typical conversational workloads while maintaining quality for complex tasks.
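The classifier's internal features aren't documented here, so this is a minimal sketch of the routing decision only; `classify_complexity` is a hypothetical stand-in for the real classifier:

```python
# Minimal sketch of complexity-based routing. `classify_complexity` is a
# hypothetical stand-in; the production classifier's features are not
# documented here.

def classify_complexity(query: str) -> str:
    # Placeholder heuristic for illustration only: short queries count
    # as simple.
    return "simple" if len(query.split()) < 20 else "complex"

def select_model(query: str, llm_config: dict) -> str:
    """Pick the configured model based on query complexity."""
    if classify_complexity(query) == "simple":
        # simple_model defaults to the primary model when unset
        return llm_config.get("simple_model") or llm_config["model"]
    return llm_config["model"]

print(select_model("What are your hours?",
                   {"model": "gpt-4o", "simple_model": "gpt-4o-mini"}))
# -> gpt-4o-mini
```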
When a provider fails (timeout, rate limit, server error), Cortex automatically falls through to the next available option:
Fallback is transparent to the user. The response includes metadata indicating which model actually served it.
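A sketch of the failover loop, assuming an ordered list of (provider, model) candidates; `call_provider` is a stand-in for the real provider client, and retry/backoff details are omitted:

```python
import random

class ProviderError(Exception):
    """Timeout, rate limit, or server error from a provider."""

def call_provider(provider: str, model: str, message: str) -> dict:
    # Stand-in for a real provider call; fails randomly for illustration.
    if random.random() < 0.3:
        raise ProviderError(f"{provider} unavailable")
    return {"response": f"answer from {model}"}

def chat_with_fallback(message: str, candidates: list[tuple[str, str]]) -> dict:
    last_error = None
    for provider, model in candidates:
        try:
            result = call_provider(provider, model, message)
            result["model_used"] = model  # surfaced in response metadata
            return result
        except ProviderError as exc:
            last_error = exc  # fall through to the next candidate
    raise RuntimeError("all providers failed") from last_error

print(chat_with_fallback(
    "hello",
    [("openai", "gpt-4o"), ("anthropic", "claude-sonnet-4-20250514")],
))
```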
Cortex monitors provider health in real time:
| State | Meaning |
|---|---|
| `healthy` | Provider is responding normally |
| `degraded` | Elevated error rate, still accepting requests |
| `cooldown` | Temporarily removed from rotation after repeated failures |
| `recovering` | Cooldown expired, probe request in flight |
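A sketch of how the four states relate; the error threshold, failure limit, and cooldown duration below are assumptions, not documented values:

```python
# Sketch of the provider health state machine described above.
import time

FAILURE_LIMIT = 5      # assumed: consecutive failures before cooldown
COOLDOWN_SECONDS = 60  # assumed cooldown duration

class ProviderHealth:
    def __init__(self):
        self.state = "healthy"
        self.consecutive_failures = 0
        self.cooldown_until = 0.0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_LIMIT:
            self.state = "cooldown"
            self.cooldown_until = time.time() + COOLDOWN_SECONDS
        else:
            self.state = "degraded"

    def record_success(self):
        self.consecutive_failures = 0
        self.state = "healthy"

    def available(self) -> bool:
        if self.state == "cooldown" and time.time() >= self.cooldown_until:
            self.state = "recovering"  # next request acts as the probe
        return self.state != "cooldown"
```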
Before sending a request, Cortex checks whether the conversation context fits the target model's context window. If the context would overflow, the request is upgraded to a model with a larger window (the `context_upgrade` hook described below), as sketched here:
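```python
# Sketch of the pre-flight context check. Window sizes and the
# 4-chars-per-token estimate are placeholders; real counting uses the
# provider's tokenizer.

CONTEXT_WINDOWS = {"small-model": 16_000, "large-model": 128_000}

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic, not a real tokenizer

def select_for_context(context: str, model: str, larger_model: str,
                       max_output_tokens: int = 4096) -> str:
    # Reserve room for the response itself.
    needed = estimate_tokens(context) + max_output_tokens
    if needed <= CONTEXT_WINDOWS[model]:
        return model
    # Overflow: upgrade to the larger model (the context_upgrade hook fires).
    return larger_model

print(select_for_context("long conversation..." * 5000,
                         "small-model", "large-model"))
# -> large-model
```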
Cortex estimates and records cost for every LLM request. Each model has a registered cost per input token and per output token; the registry is updated when providers change pricing. After each request, Cortex records the model used, input and output token counts, latency, and the estimated cost.
Cost data is aggregated per agent, per tenant, and per time period. View cost breakdowns on the Manage Agent page or query them via the API.
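A sketch of how a request's cost falls out of the pricing registry; the per-token rates below are illustrative, not a statement of current provider pricing:

```python
# Sketch of per-request cost estimation from a pricing registry. Rates
# are illustrative examples (USD per token), not current pricing.

PRICE_REGISTRY = {
    "gpt-4o":      {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_REGISTRY[model]
    return input_tokens * price["input"] + output_tokens * price["output"]

# Cost for a 2340-input / 512-output request at the illustrative rates.
print(f"{estimate_cost_usd('gpt-4o', 2340, 512):.4f}")
```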
Cortex caches LLM responses to reduce cost and latency for repeated queries. The cache is semantic: a query that closely matches one answered before can be served from the cache (firing the `cache_hit` hook described below), as sketched here:
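```python
# Sketch of a semantic cache lookup. The embedding function and the
# similarity threshold are stand-ins, not Cortex's actual implementation.
import math

SIMILARITY_THRESHOLD = 0.95  # assumed cutoff for "same question"
_cache: list[tuple[list[float], dict]] = []  # (query embedding, response)

def embed(text: str) -> list[float]:
    # Stand-in embedding: character-frequency vector. A real system
    # would use a proper embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cached_response(query: str) -> dict | None:
    q = embed(query)
    for vec, response in _cache:
        if cosine(q, vec) >= SIMILARITY_THRESHOLD:
            return {**response, "cached": True}  # cache_hit hook fires
    return None

def store_response(query: str, response: dict) -> None:
    _cache.append((embed(query), response))
```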
Set per-tenant cost limits to prevent runaway spending. When a limit is reached, the configured enforcement mode determines what happens:

- `block`: deny all LLM requests
- `degrade`: route only to the cheapest models
- `alert`: send a notification but continue

Configure budgets via the API or dashboard settings. A sketch of the three modes follows.
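This is a sketch of how the three modes could gate a request; `BudgetExceeded` and `notify_admins` are hypothetical names:

```python
# Sketch of budget enforcement for the three documented modes.

class BudgetExceeded(Exception):
    pass

def notify_admins(message: str) -> None:
    print(f"[budget alert] {message}")  # stand-in for a real notifier

def enforce_budget(spent_usd: float, limit_usd: float, mode: str,
                   requested_model: str, cheapest_model: str) -> str:
    """Return the model to use, or raise/alert according to the mode."""
    if spent_usd < limit_usd:
        return requested_model
    if mode == "block":
        raise BudgetExceeded("tenant budget exhausted; request denied")
    if mode == "degrade":
        return cheapest_model  # route only to the cheapest model
    if mode == "alert":
        notify_admins("budget exceeded")
        return requested_model
    raise ValueError(f"unknown budget mode: {mode}")
```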
The `llm` config object in agent settings controls Cortex behavior:

```json
{
  "llm": {
    "provider": "openai",
    "model": "gpt-4o",
    "simple_model": "gpt-4o-mini",
    "fallback_provider": "anthropic",
    "fallback_model": "claude-sonnet-4-20250514",
    "temperature": 0.7,
    "max_tokens": 4096,
    "system_prompt": "You are a helpful support agent.",
    "context_window_limit": 128000
  }
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `provider` | string | Required | Primary LLM provider |
| `model` | string | Required | Primary model for complex queries |
| `simple_model` | string | Same as `model` | Model for simple queries |
| `fallback_provider` | string | None | Fallback provider |
| `fallback_model` | string | None | Fallback model |
| `temperature` | float | `0.7` | Sampling temperature (0.0-2.0) |
| `max_tokens` | int | `4096` | Maximum output tokens |
| `system_prompt` | string | Template default | System prompt prepended to all requests |
| `context_window_limit` | int | Model default | Override context window limit |
POST /v1/agents/llm/chat
Send a message and receive a complete response.
```bash
curl -X POST https://api.veriswarm.ai/v1/agents/llm/chat \
  -H "x-api-key: vsk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "agt_123",
    "message": "Summarize our Q1 sales performance",
    "conversation_id": "conv_456"
  }'
```
Response:
```json
{
  "response": "Based on your Q1 data...",
  "model_used": "gpt-4o",
  "complexity": "complex",
  "tokens": {"input": 2340, "output": 512},
  "cost_usd": 0.0284,
  "conversation_id": "conv_456",
  "cached": false
}
```
POST /v1/agents/llm/stream
Same parameters as /chat, but returns a Server-Sent Events (SSE) stream for real-time token delivery.
```bash
curl -X POST https://api.veriswarm.ai/v1/agents/llm/stream \
  -H "x-api-key: vsk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "agt_123",
    "message": "Explain our refund policy",
    "conversation_id": "conv_456"
  }'
```
Each SSE event contains a `delta` field with the next token(s). The final event includes `model_used`, `tokens`, and `cost_usd`.
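A sketch of consuming the stream from Python with the `requests` library; treating the event without a `delta` field as the final one is our assumption, not documented behavior:

```python
# Sketch of an SSE consumer for the /stream endpoint.
import json
import requests

resp = requests.post(
    "https://api.veriswarm.ai/v1/agents/llm/stream",
    headers={"x-api-key": "vsk_your_key", "Content-Type": "application/json"},
    json={"agent_id": "agt_123", "message": "Explain our refund policy",
          "conversation_id": "conv_456"},
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue  # skip blank separators and keep-alives
    event = json.loads(line[len("data:"):])
    if "delta" in event:
        print(event["delta"], end="", flush=True)
    else:
        # Assumed: the final event carries model_used, tokens, cost_usd.
        print("\n", event)
```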
Cortex emits structured events at each stage of the request lifecycle. These feed into the agent's conversation logs, Vault audit trail, and any external observability systems you configure.
| Hook | Triggered When |
|---|---|
| `pre_call` | Before sending the request to the LLM provider |
| `success` | LLM response received successfully |
| `failure` | LLM request failed (timeout, rate limit, server error) |
| `retry` | Retrying a failed request (before fallback) |
| `fallback` | Switching to fallback provider after primary failure |
| `cooldown` | Provider placed into cooldown state |
| `cache_hit` | Response served from semantic cache |
| `budget_exceeded` | Request blocked or degraded due to budget limit |
| `context_upgrade` | Model upgraded due to context window overflow |
Each hook includes the agent ID, conversation ID, model, latency, token counts, and cost estimate.
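An illustrative payload built from the fields listed above; the exact field names are assumptions, only the set of fields is documented:

```python
# Illustrative hook event. Field names are assumed; values reuse the
# /chat response example above.
example_hook_event = {
    "hook": "success",
    "agent_id": "agt_123",
    "conversation_id": "conv_456",
    "model": "gpt-4o",
    "latency_ms": 1840,  # assumed unit (milliseconds)
    "tokens": {"input": 2340, "output": 512},
    "cost_usd": 0.0284,
}
```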