Intelligent LLM routing with safety controls. Cortex decides how agent work should run once trust has been established -- routing queries to the right model, managing provider failures, enforcing budgets, and keeping Guard active throughout execution.
Plan requirement: Cortex is available on all plans. Some features (budget enforcement, advanced observability) may vary by plan tier.
Cortex supports six LLM providers out of the box:
| Provider | Models | Auth |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, GPT-4-turbo, o1, o3-mini | API key |
| Anthropic | Claude Sonnet 4, Claude Haiku 3.5 | API key |
| Google | Gemini 2.5 Pro, Gemini 2.5 Flash | API key |
| Azure OpenAI | GPT-4o, GPT-4-turbo (via Azure deployments) | API key + endpoint |
| Mistral | Mistral Large, Mistral Small | API key |
| LiteLLM / Ollama | Any model supported by LiteLLM or local Ollama instance | API key or local |
Configure providers in the agent's LLM settings (Manage Agent > Settings) or via the agent config API.
Cortex classifies each incoming query as simple or complex and routes it to the appropriate model. Each agent has a simple model and a complex model configured; the classifier evaluates each query and selects between them, as sketched below. This reduces cost by 40-70% for typical conversational workloads while maintaining quality for complex tasks.
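The classifier's internal features aren't documented here, so this is a minimal sketch of the routing decision only; `classify_complexity` is a hypothetical stand-in for the real classifier:

```python
# Minimal sketch of complexity-based routing. `classify_complexity` is a
# hypothetical stand-in; the production classifier's features are not
# documented here.

def classify_complexity(query: str) -> str:
    # Placeholder heuristic for illustration only: short queries count
    # as simple.
    return "simple" if len(query.split()) < 20 else "complex"

def select_model(query: str, llm_config: dict) -> str:
    """Pick the configured model based on query complexity."""
    if classify_complexity(query) == "simple":
        # simple_model defaults to the primary model when unset
        return llm_config.get("simple_model") or llm_config["model"]
    return llm_config["model"]

print(select_model("What are your hours?",
                   {"model": "gpt-4o", "simple_model": "gpt-4o-mini"}))
# -> gpt-4o-mini
```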
When a provider fails (timeout, rate limit, server error), Cortex automatically falls through to the next available option:
Fallback is transparent to the user. The response includes metadata indicating which model actually served it.
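A sketch of the failover loop, assuming an ordered list of (provider, model) candidates; `call_provider` is a stand-in for the real provider client, and retry/backoff details are omitted:

```python
import random

class ProviderError(Exception):
    """Timeout, rate limit, or server error from a provider."""

def call_provider(provider: str, model: str, message: str) -> dict:
    # Stand-in for a real provider call; fails randomly for illustration.
    if random.random() < 0.3:
        raise ProviderError(f"{provider} unavailable")
    return {"response": f"answer from {model}"}

def chat_with_fallback(message: str, candidates: list[tuple[str, str]]) -> dict:
    last_error = None
    for provider, model in candidates:
        try:
            result = call_provider(provider, model, message)
            result["model_used"] = model  # surfaced in response metadata
            return result
        except ProviderError as exc:
            last_error = exc  # fall through to the next candidate
    raise RuntimeError("all providers failed") from last_error

print(chat_with_fallback(
    "hello",
    [("openai", "gpt-4o"), ("anthropic", "claude-sonnet-4-20250514")],
))
```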
Cortex monitors provider health in real time:
| State | Meaning |
|---|---|
| `healthy` | Provider is responding normally |
| `degraded` | Elevated error rate, still accepting requests |
| `cooldown` | Temporarily removed from rotation after repeated failures |
| `recovering` | Cooldown expired, probe request in flight |
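A sketch of how the four states relate; the error threshold, failure limit, and cooldown duration below are assumptions, not documented values:

```python
# Sketch of the provider health state machine described above.
import time

FAILURE_LIMIT = 5      # assumed: consecutive failures before cooldown
COOLDOWN_SECONDS = 60  # assumed cooldown duration

class ProviderHealth:
    def __init__(self):
        self.state = "healthy"
        self.consecutive_failures = 0
        self.cooldown_until = 0.0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_LIMIT:
            self.state = "cooldown"
            self.cooldown_until = time.time() + COOLDOWN_SECONDS
        else:
            self.state = "degraded"

    def record_success(self):
        self.consecutive_failures = 0
        self.state = "healthy"

    def available(self) -> bool:
        if self.state == "cooldown" and time.time() >= self.cooldown_until:
            self.state = "recovering"  # next request acts as the probe
        return self.state != "cooldown"
```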
Before sending a request, Cortex checks whether the conversation context fits the target model's context window. If the context would overflow, the request is upgraded to a model with a larger window (the `context_upgrade` hook described below), as sketched here:
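```python
# Sketch of the pre-flight context check. Window sizes and the
# 4-chars-per-token estimate are placeholders; real counting uses the
# provider's tokenizer.

CONTEXT_WINDOWS = {"small-model": 16_000, "large-model": 128_000}

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic, not a real tokenizer

def select_for_context(context: str, model: str, larger_model: str,
                       max_output_tokens: int = 4096) -> str:
    # Reserve room for the response itself.
    needed = estimate_tokens(context) + max_output_tokens
    if needed <= CONTEXT_WINDOWS[model]:
        return model
    # Overflow: upgrade to the larger model (the context_upgrade hook fires).
    return larger_model

print(select_for_context("long conversation..." * 5000,
                         "small-model", "large-model"))
# -> large-model
```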
Cortex estimates and records cost for every LLM request. Each model has a registered cost per input token and per output token; the registry is updated when providers change pricing. After each request, Cortex records the model used, input and output token counts, latency, and the estimated cost.
Cost data is aggregated per agent, per tenant, and per time period. View cost breakdowns on the Manage Agent page or query them via the API.
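A sketch of how a request's cost falls out of the pricing registry; the per-token rates below are illustrative, not a statement of current provider pricing:

```python
# Sketch of per-request cost estimation from a pricing registry. Rates
# are illustrative examples (USD per token), not current pricing.

PRICE_REGISTRY = {
    "gpt-4o":      {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICE_REGISTRY[model]
    return input_tokens * price["input"] + output_tokens * price["output"]

# Cost for a 2340-input / 512-output request at the illustrative rates.
print(f"{estimate_cost_usd('gpt-4o', 2340, 512):.4f}")
```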
Cortex caches LLM responses to reduce cost and latency for repeated queries. The cache is semantic: a query that closely matches one answered before can be served from the cache (firing the `cache_hit` hook described below), as sketched here:
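```python
# Sketch of a semantic cache lookup. The embedding function and the
# similarity threshold are stand-ins, not Cortex's actual implementation.
import math

SIMILARITY_THRESHOLD = 0.95  # assumed cutoff for "same question"
_cache: list[tuple[list[float], dict]] = []  # (query embedding, response)

def embed(text: str) -> list[float]:
    # Stand-in embedding: character-frequency vector. A real system
    # would use a proper embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cached_response(query: str) -> dict | None:
    q = embed(query)
    for vec, response in _cache:
        if cosine(q, vec) >= SIMILARITY_THRESHOLD:
            return {**response, "cached": True}  # cache_hit hook fires
    return None

def store_response(query: str, response: dict) -> None:
    _cache.append((embed(query), response))
```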
Set per-tenant cost limits to prevent runaway spending. When a limit is reached, the configured enforcement mode determines what happens:

- `block`: deny all LLM requests
- `degrade`: route only to the cheapest models
- `alert`: send a notification but continue

Configure budgets via the API or dashboard settings. A sketch of the three modes follows.
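This is a sketch of how the three modes could gate a request; `BudgetExceeded` and `notify_admins` are hypothetical names:

```python
# Sketch of budget enforcement for the three documented modes.

class BudgetExceeded(Exception):
    pass

def notify_admins(message: str) -> None:
    print(f"[budget alert] {message}")  # stand-in for a real notifier

def enforce_budget(spent_usd: float, limit_usd: float, mode: str,
                   requested_model: str, cheapest_model: str) -> str:
    """Return the model to use, or raise/alert according to the mode."""
    if spent_usd < limit_usd:
        return requested_model
    if mode == "block":
        raise BudgetExceeded("tenant budget exhausted; request denied")
    if mode == "degrade":
        return cheapest_model  # route only to the cheapest model
    if mode == "alert":
        notify_admins("budget exceeded")
        return requested_model
    raise ValueError(f"unknown budget mode: {mode}")
```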
The `llm` config object in agent settings controls Cortex behavior:

```json
{
  "llm": {
    "provider": "openai",
    "model": "gpt-4o",
    "simple_model": "gpt-4o-mini",
    "fallback_provider": "anthropic",
    "fallback_model": "claude-sonnet-4-20250514",
    "temperature": 0.7,
    "max_tokens": 4096,
    "system_prompt": "You are a helpful support agent.",
    "context_window_limit": 128000
  }
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `provider` | string | Required | Primary LLM provider |
| `model` | string | Required | Primary model for complex queries |
| `simple_model` | string | Same as `model` | Model for simple queries |
| `fallback_provider` | string | None | Fallback provider |
| `fallback_model` | string | None | Fallback model |
| `temperature` | float | `0.7` | Sampling temperature (0.0-2.0) |
| `max_tokens` | int | `4096` | Maximum output tokens |
| `system_prompt` | string | Template default | System prompt prepended to all requests |
| `context_window_limit` | int | Model default | Override context window limit |
POST /v1/agents/llm/chat
Send a message and receive a complete response.
```bash
curl -X POST https://api.veriswarm.ai/v1/agents/llm/chat \
  -H "x-api-key: vsk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "agt_123",
    "message": "Summarize our Q1 sales performance",
    "conversation_id": "conv_456"
  }'
```
Response:
```json
{
  "response": "Based on your Q1 data...",
  "model_used": "gpt-4o",
  "complexity": "complex",
  "tokens": {"input": 2340, "output": 512},
  "cost_usd": 0.0284,
  "conversation_id": "conv_456",
  "cached": false
}
```
POST /v1/agents/llm/stream
Same parameters as /chat, but returns a Server-Sent Events (SSE) stream for real-time token delivery.
```bash
curl -X POST https://api.veriswarm.ai/v1/agents/llm/stream \
  -H "x-api-key: vsk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "agent_id": "agt_123",
    "message": "Explain our refund policy",
    "conversation_id": "conv_456"
  }'
```
Each SSE event contains a `delta` field with the next token(s). The final event includes `model_used`, `tokens`, and `cost_usd`.
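A sketch of consuming the stream from Python with the `requests` library; treating the event without a `delta` field as the final one is our assumption, not documented behavior:

```python
# Sketch of an SSE consumer for the /stream endpoint.
import json
import requests

resp = requests.post(
    "https://api.veriswarm.ai/v1/agents/llm/stream",
    headers={"x-api-key": "vsk_your_key", "Content-Type": "application/json"},
    json={"agent_id": "agt_123", "message": "Explain our refund policy",
          "conversation_id": "conv_456"},
    stream=True,
)
for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data:"):
        continue  # skip blank separators and keep-alives
    event = json.loads(line[len("data:"):])
    if "delta" in event:
        print(event["delta"], end="", flush=True)
    else:
        # Assumed: the final event carries model_used, tokens, cost_usd.
        print("\n", event)
```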
Cortex emits structured events at each stage of the request lifecycle. These feed into the agent's conversation logs, Vault audit trail, and any external observability systems you configure.
| Hook | Triggered When |
|---|---|
| `pre_call` | Before sending the request to the LLM provider |
| `success` | LLM response received successfully |
| `failure` | LLM request failed (timeout, rate limit, server error) |
| `retry` | Retrying a failed request (before fallback) |
| `fallback` | Switching to fallback provider after primary failure |
| `cooldown` | Provider placed into cooldown state |
| `cache_hit` | Response served from semantic cache |
| `budget_exceeded` | Request blocked or degraded due to budget limit |
| `context_upgrade` | Model upgraded due to context window overflow |
Each hook includes the agent ID, conversation ID, model, latency, token counts, and cost estimate.
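An illustrative payload built from the fields listed above; the exact field names are assumptions, only the set of fields is documented:

```python
# Illustrative hook event. Field names are assumed; values reuse the
# /chat response example above.
example_hook_event = {
    "hook": "success",
    "agent_id": "agt_123",
    "conversation_id": "conv_456",
    "model": "gpt-4o",
    "latency_ms": 1840,  # assumed unit (milliseconds)
    "tokens": {"input": 2340, "output": 512},
    "cost_usd": 0.0284,
}
```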