Cortex is VeriSwarm's intelligent runtime layer for LLM operations. It provides cost analytics, smart routing, semantic caching, token compression, and an OpenAI-compatible proxy -- everything you need to control LLM spend and performance without changing your application code.
Track LLM spend across models, agents, and time periods. Cost analytics are read-only and available on all plans, including Free.
GET /v1/analytics/costs?period=month|week|day
Returns total cost, token counts, request counts, and a per-model breakdown for the given period.
curl -H "x-api-key: YOUR_API_KEY" \
"https://veriswarm.ai/v1/analytics/costs?period=month"
GET /v1/analytics/costs/trend?days=30
Daily cost time series. Useful for dashboards and burn-rate alerts.
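As a sketch of the burn-rate use case, the daily series can be averaged and extrapolated client-side. The response shape is not specified here, so this assumes you have already extracted the per-day cost values into a plain list:

```python
# Sketch: project month-end spend from the daily cost trend.
# Assumes `daily_costs` was extracted from the /costs/trend response;
# the exact response fields are not documented in this section.
def projected_monthly_spend(daily_costs: list[float], days_in_month: int = 30) -> float:
    """Average the recent daily burn rate and extrapolate to a full month."""
    burn_rate = sum(daily_costs) / len(daily_costs)
    return burn_rate * days_in_month

trend = [1.2, 0.8, 1.0, 1.4, 0.6]  # last five days of spend, in dollars
print(round(projected_monthly_spend(trend), 2))  # 30.0
```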
GET /v1/analytics/costs/agents?days=30
Cost ranked by agent. Identifies which agents are driving spend.
GET /v1/analytics/costs/budget
Current spend vs configured budget limits. Returns utilization percentage and remaining budget.
GET /v1/analytics/costs/savings?days=30
Savings attributed to routing rules, semantic caching, and token compression over the given window.
Route prompts to different models based on task patterns -- useful for steering low-stakes tasks to cheaper models while keeping quality-sensitive work on premium models.
GET /v1/analytics/routing-rules
POST /v1/analytics/routing-rules
curl -X POST -H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Summaries to GPT-4o-mini",
"task_pattern": "summarize-*",
"preferred_model": "gpt-4o-mini",
"fallback_model": "gpt-4o",
"optimization_mode": "cost",
"priority": 10
}' \
"https://veriswarm.ai/v1/analytics/routing-rules"
Fields:
| Field | Type | Description |
|---|---|---|
| name | string | Human-readable rule name |
| task_pattern | string | Glob-style pattern matched against prompt text |
| preferred_model | string | Model to route matching prompts to |
| fallback_model | string | Model to use if the preferred model is unavailable |
| optimization_mode | string | cost, quality, or balanced |
| priority | integer | Lower numbers evaluate first |
DELETE /v1/analytics/routing-rules/{rule_id}
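To illustrate how glob-style rules with priorities might be evaluated, here is a minimal client-side sketch. The rule fields mirror the POST body above; the evaluation order (lower priority number first) follows the field table, and the final gpt-4o fallback is an assumption, not documented behavior:

```python
import fnmatch

# Two rules mirroring the example above: summaries go to gpt-4o-mini,
# everything else falls through to a catch-all rule.
rules = [
    {"task_pattern": "summarize-*", "preferred_model": "gpt-4o-mini", "priority": 10},
    {"task_pattern": "*", "preferred_model": "gpt-4o", "priority": 100},
]

def route(task: str, rules: list[dict]) -> str:
    # Lower priority numbers evaluate first; first match wins.
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if fnmatch.fnmatch(task, rule["task_pattern"]):
            return rule["preferred_model"]
    return "gpt-4o"  # assumed default when no rule matches

print(route("summarize-news", rules))  # gpt-4o-mini
print(route("translate-doc", rules))   # gpt-4o
```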
Cortex caches LLM responses using TF-IDF vectorization with cosine-similarity matching. No external embedding API is required -- all computation is local.
Cache entries are tenant-scoped and respect a configurable TTL.
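The TF-IDF-plus-cosine approach can be sketched in pure Python. Cortex's actual tokenizer, weighting, and similarity threshold are not specified here; this is only a minimal illustration of the matching idea:

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[str]) -> list[dict]:
    """Build sparse TF-IDF vectors (term -> weight) for each document."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["summarize this article", "summarize this report", "translate to french"]
vecs = tfidf_vectors(docs)
# The two summarization prompts score closer to each other than to the
# translation prompt, so a cached answer for one could serve the other.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```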
GET /v1/analytics/cache?days=30
curl -H "x-api-key: YOUR_API_KEY" \
"https://veriswarm.ai/v1/analytics/cache?days=30"
Returns hit rate, total lookups, tokens saved, and estimated cost saved.
Reduce token usage without sacrificing output quality. Three layers of compression work together to trim prompt size and context window usage.
| Level | What it does |
|---|---|
| light | Whitespace normalization |
| medium | Light + filler phrase removal |
| aggressive | Medium + sentence deduplication via Jaccard overlap |
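Sentence deduplication via Jaccard overlap, as used by the aggressive level, can be sketched as follows. The 0.8 threshold and the naive sentence splitting are illustrative assumptions, not Cortex's actual parameters:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two token sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup_sentences(text: str, threshold: float = 0.8) -> str:
    """Drop sentences whose token overlap with a kept sentence exceeds the threshold."""
    kept, kept_tokens = [], []
    for sentence in text.split(". "):
        tokens = set(sentence.lower().split())
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(sentence)
            kept_tokens.append(tokens)
    return ". ".join(kept)

text = "The cache is tenant scoped. The cache is tenant scoped. TTL is configurable."
print(dedup_sentences(text))  # The cache is tenant scoped. TTL is configurable.
```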
Preserves the system message and the most recent N messages. Older turns are summarized into a compact representation, freeing context window space for new content.
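The keep-system-plus-recent-N shape can be sketched like this. How Cortex actually summarizes older turns is not specified, so a placeholder stub stands in for the summary step:

```python
# Sketch of context compression: preserve the system message and the most
# recent N turns, collapsing older turns into a single summary message.
def compress_context(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return messages  # nothing to compress
    older, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = {"role": "system",
               "content": f"[Summary of {len(older)} earlier messages]"}  # placeholder
    return system + [summary] + recent

msgs = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user", "content": f"turn {i}"} for i in range(10)
]
print(len(compress_context(msgs)))  # 6: system + summary + 4 recent turns
```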
Removes near-duplicate messages within the conversation using a configurable similarity threshold. Prevents repetitive context from inflating token counts.
GET /v1/analytics/compression-config
PUT /v1/analytics/compression-config
curl -X PUT -H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"prompt_compression": "medium",
"context_compression_enabled": true,
"max_context_tokens": 4096,
"semantic_dedup_enabled": true,
"semantic_dedup_threshold": 0.85
}' \
"https://veriswarm.ai/v1/analytics/compression-config"
Config fields:
| Field | Type | Default | Description |
|---|---|---|---|
| prompt_compression | string | disabled | disabled, light, medium, or aggressive |
| context_compression_enabled | boolean | false | Summarize older conversation turns |
| max_context_tokens | integer | 4096 | Max tokens to keep before summarizing |
| semantic_dedup_enabled | boolean | false | Remove near-duplicate messages |
| semantic_dedup_threshold | float | 0.85 | Similarity threshold for dedup (0-1) |
All compression is opt-in and disabled by default.
A drop-in replacement for OpenAI's API. Point any OpenAI-compatible SDK at VeriSwarm's proxy endpoint and all requests flow through Cortex's routing rules, caching, and compression pipeline.
GET /v1/proxy/models
POST /v1/proxy/chat/completions
curl -X POST -H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [
{"role": "user", "content": "Explain agent trust scoring in one paragraph."}
]
}' \
"https://veriswarm.ai/v1/proxy/chat/completions"
Point any OpenAI SDK at the proxy base URL:
Python:
from openai import OpenAI
client = OpenAI(
base_url="https://veriswarm.ai/v1/proxy",
api_key="YOUR_API_KEY", # Your VeriSwarm API key
)
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Hello"}],
)
Node:
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://veriswarm.ai/v1/proxy",
apiKey: "YOUR_API_KEY", // Your VeriSwarm API key
});
const response = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Hello" }],
});
| Feature | Free | Pro | Max | Enterprise |
|---|---|---|---|---|
| Cost analytics | Read-only | Full | Full | Full |
| Routing rules | -- | Up to 5 | Unlimited | Unlimited |
| Transform pipeline | -- | Up to 5 | Unlimited | Unlimited |
| Semantic cache | -- | Yes | Yes | Yes |
| Token compression | -- | Yes | Yes | Yes |
| LLM proxy | -- | Yes | Yes | Yes |
All Cortex endpoints accept either authentication method:
x-api-key header -- your platform API key
x-account-access-token header -- user session token from login
Both resolve to your tenant. Use whichever fits your integration pattern.