Cortex is VeriSwarm's intelligent runtime layer for LLM operations. It provides cost analytics, smart routing, semantic caching, token compression, and an OpenAI-compatible proxy -- everything you need to control LLM spend and performance without changing your application code.
Track LLM spend across models, agents, and time periods. Cost analytics are read-only and available on all plans, including Free.
GET /v1/analytics/costs?period=month|week|day
Returns total cost, token counts, request counts, and a per-model breakdown for the given period.
curl -H "x-api-key: YOUR_API_KEY" \
"https://veriswarm.ai/v1/analytics/costs?period=month"
GET /v1/analytics/costs/trend?days=30
Daily cost time series. Useful for dashboards and burn-rate alerts.
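As a sketch of the burn-rate use case, the daily series can be averaged and extrapolated client-side. The response shape is not specified here, so this assumes you have already extracted the per-day cost values into a plain list:

```python
# Sketch: project month-end spend from the daily cost trend.
# Assumes `daily_costs` was extracted from the /costs/trend response;
# the exact response fields are not documented in this section.
def projected_monthly_spend(daily_costs: list[float], days_in_month: int = 30) -> float:
    """Average the recent daily burn rate and extrapolate to a full month."""
    burn_rate = sum(daily_costs) / len(daily_costs)
    return burn_rate * days_in_month

trend = [1.2, 0.8, 1.0, 1.4, 0.6]  # last five days of spend, in dollars
print(round(projected_monthly_spend(trend), 2))  # 30.0
```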
GET /v1/analytics/costs/agents?days=30
Cost ranked by agent. Identifies which agents are driving spend.
GET /v1/analytics/costs/budget
Current spend vs configured budget limits. Returns utilization percentage and remaining budget.
GET /v1/analytics/costs/savings?days=30
Savings attributed to routing rules, semantic caching, and token compression over the given window.
Route prompts to different models based on task patterns -- useful for steering low-stakes tasks to cheaper models while keeping quality-sensitive work on premium models.
GET /v1/analytics/routing-rules
POST /v1/analytics/routing-rules
curl -X POST -H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Summaries to GPT-4o-mini",
"task_pattern": "summarize-*",
"preferred_model": "gpt-4o-mini",
"fallback_model": "gpt-4o",
"optimization_mode": "cost",
"priority": 10
}' \
"https://veriswarm.ai/v1/analytics/routing-rules"
Fields:
| Field | Type | Description |
|---|---|---|
| name | string | Human-readable rule name |
| task_pattern | string | Glob-style pattern matched against prompt text |
| preferred_model | string | Model to route matching prompts to |
| fallback_model | string | Model to use if the preferred model is unavailable |
| optimization_mode | string | cost, quality, or balanced |
| priority | integer | Lower numbers evaluate first |
DELETE /v1/analytics/routing-rules/{rule_id}
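To illustrate how glob-style rules with priorities might be evaluated, here is a minimal client-side sketch. The rule fields mirror the POST body above; the evaluation order (lower priority number first) follows the field table, and the final gpt-4o fallback is an assumption, not documented behavior:

```python
import fnmatch

# Two rules mirroring the example above: summaries go to gpt-4o-mini,
# everything else falls through to a catch-all rule.
rules = [
    {"task_pattern": "summarize-*", "preferred_model": "gpt-4o-mini", "priority": 10},
    {"task_pattern": "*", "preferred_model": "gpt-4o", "priority": 100},
]

def route(task: str, rules: list[dict]) -> str:
    # Lower priority numbers evaluate first; first match wins.
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if fnmatch.fnmatch(task, rule["task_pattern"]):
            return rule["preferred_model"]
    return "gpt-4o"  # assumed default when no rule matches

print(route("summarize-news", rules))  # gpt-4o-mini
print(route("translate-doc", rules))   # gpt-4o
```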
Cortex caches LLM responses using TF-IDF vectorization with cosine-similarity matching. No external embedding API is required -- all computation is local.
Cache entries are tenant-scoped and respect a configurable TTL.
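The TF-IDF-plus-cosine approach can be sketched in pure Python. Cortex's actual tokenizer, weighting, and similarity threshold are not specified here; this is only a minimal illustration of the matching idea:

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[str]) -> list[dict]:
    """Build sparse TF-IDF vectors (term -> weight) for each document."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["summarize this article", "summarize this report", "translate to french"]
vecs = tfidf_vectors(docs)
# The two summarization prompts score closer to each other than to the
# translation prompt, so a cached answer for one could serve the other.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```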
GET /v1/analytics/cache?days=30
curl -H "x-api-key: YOUR_API_KEY" \
"https://veriswarm.ai/v1/analytics/cache?days=30"
Returns hit rate, total lookups, tokens saved, and estimated cost saved.
Reduce token usage without sacrificing output quality. Three layers of compression work together to trim prompt size and context window usage.
| Level | What it does |
|---|---|
| light | Whitespace normalization |
| medium | Light + filler phrase removal |
| aggressive | Medium + sentence deduplication via Jaccard overlap |
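Sentence deduplication via Jaccard overlap, as used by the aggressive level, can be sketched as follows. The 0.8 threshold and the naive sentence splitting are illustrative assumptions, not Cortex's actual parameters:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two token sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup_sentences(text: str, threshold: float = 0.8) -> str:
    """Drop sentences whose token overlap with a kept sentence exceeds the threshold."""
    kept, kept_tokens = [], []
    for sentence in text.split(". "):
        tokens = set(sentence.lower().split())
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(sentence)
            kept_tokens.append(tokens)
    return ". ".join(kept)

text = "The cache is tenant scoped. The cache is tenant scoped. TTL is configurable."
print(dedup_sentences(text))  # The cache is tenant scoped. TTL is configurable.
```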
Preserves the system message and the most recent N messages. Older turns are summarized into a compact representation, freeing context window space for new content.
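The keep-system-plus-recent-N shape can be sketched like this. How Cortex actually summarizes older turns is not specified, so a placeholder stub stands in for the summary step:

```python
# Sketch of context compression: preserve the system message and the most
# recent N turns, collapsing older turns into a single summary message.
def compress_context(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return messages  # nothing to compress
    older, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = {"role": "system",
               "content": f"[Summary of {len(older)} earlier messages]"}  # placeholder
    return system + [summary] + recent

msgs = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": f"turn {i}"} for i in range(10)
]
print(len(compress_context(msgs)))  # 6: system + summary + 4 recent turns
```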
Removes near-duplicate messages within the conversation using a configurable similarity threshold. Prevents repetitive context from inflating token counts.
GET /v1/analytics/compression-config
PUT /v1/analytics/compression-config
curl -X PUT -H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt_compression": "medium",
"context_compression_enabled": true,
"max_context_tokens": 4096,
"semantic_dedup_enabled": true,
"semantic_dedup_threshold": 0.85
}' \
"https://veriswarm.ai/v1/analytics/compression-config"
Config fields:
| Field | Type | Default | Description |
|---|---|---|---|
| prompt_compression | string | disabled | disabled, light, medium, or aggressive |
| context_compression_enabled | boolean | false | Summarize older conversation turns |
| max_context_tokens | integer | 4096 | Max tokens to keep before summarizing |
| semantic_dedup_enabled | boolean | false | Remove near-duplicate messages |
| semantic_dedup_threshold | float | 0.85 | Similarity threshold for dedup (0-1) |
All compression is opt-in and disabled by default.
A drop-in replacement for OpenAI's API. Point any OpenAI-compatible SDK at VeriSwarm's proxy endpoint and all requests flow through Cortex's routing rules, caching, and compression pipeline.
GET /v1/proxy/models
POST /v1/proxy/chat/completions
curl -X POST -H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [
{"role": "user", "content": "Explain agent trust scoring in one paragraph."}
]
}' \
"https://veriswarm.ai/v1/proxy/chat/completions"
Point any OpenAI SDK at the proxy base URL:
Python:
from openai import OpenAI
client = OpenAI(
base_url="https://veriswarm.ai/v1/proxy",
api_key="YOUR_API_KEY", # Your VeriSwarm API key
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)
Node:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://veriswarm.ai/v1/proxy",
apiKey: "YOUR_API_KEY", // Your VeriSwarm API key
});
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
});
| Feature | Free | Pro | Max | Enterprise |
|---|---|---|---|---|
| Cost analytics | Read-only | Full | Full | Full |
| Routing rules | -- | Up to 5 | Unlimited | Unlimited |
| Transform pipeline | -- | Up to 5 | Unlimited | Unlimited |
| Semantic cache | -- | Yes | Yes | Yes |
| Token compression | -- | Yes | Yes | Yes |
| LLM proxy | -- | Yes | Yes | Yes |
All Cortex endpoints accept either authentication method:
x-api-key header -- your platform API key
x-account-access-token header -- user session token from login
Both resolve to your tenant. Use whichever fits your integration pattern.