Agent Scoring Is Not LLM Evaluation. Here's the Difference.
Your eval suite said the model scored 94% on BoolQ. Your agent still leaked a customer's SSN on Tuesday.
Those two sentences aren't contradictory. They're the gap the industry keeps pretending doesn't exist.
There's a growing pile of tools that claim to "evaluate" AI agents — Braintrust, Patronus, Arize, Langfuse, the now-Anthropic-owned Humanloop, plus roughly a dozen observability platforms that sprouted over the last eighteen months. They're all useful. They're also all solving a different problem from the one your agent is actually going to fail on. Lumping them together with runtime agent trust is how enterprises end up with beautiful eval dashboards and zero ability to stop an agent mid-transaction when it goes sideways.
So let's pull them apart.
What evaluations actually do
An LLM evaluation is a controlled, offline measurement. You pick a fixed dataset — BoolQ, MMLU, HumanEval, a custom golden set — feed it through the model, and compute a score. Did it answer correctly? Was the JSON well-formed? Did an LLM-as-judge think the response was coherent?
Evals are excellent at one thing: telling you whether your model's capability surface is where you need it before you ship. They're how you decide whether Claude 4.6 beats GPT-5 for your use case, whether your fine-tune regressed, whether the new retrieval chain actually grounds answers better than the old one. Braintrust, Patronus, Arize, Langfuse, Confident AI — the crowded middle of this market — all compete on some version of this loop: CI-style eval runs, golden datasets, LLM-as-judge graders, regression tracking between model versions.
All of that happens before, or parallel to, production. It's lab work.
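The loop at the core of every one of these platforms fits in a few lines. Here's a deliberately toy version — the two-row dataset, the stand-in model call, and the exact-match grader are illustrations, not any vendor's API:

```python
# A minimal offline eval: fixed dataset in, one score out, before shipping.
# Everything here is a stand-in, not a real benchmark or vendor SDK.

golden_set = [
    {"question": "is the sky blue on a clear day", "expected": "yes"},
    {"question": "is water dry", "expected": "no"},
]

def model_answer(question: str) -> str:
    # Stand-in for a real model call.
    return "yes" if "sky" in question else "no"

def run_eval(dataset) -> float:
    # Exact-match grading; real suites swap in JSON checks or LLM-as-judge.
    correct = sum(
        1 for row in dataset
        if model_answer(row["question"]) == row["expected"]
    )
    return correct / len(dataset)

print(f"accuracy: {run_eval(golden_set):.0%}")  # → accuracy: 100%
```

Notice what the loop never sees: a live request, a tool call, or a decision about whether anything should happen. It produces a number.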
What observability adds — and where it stops
The newer move, which the same vendors have mostly pivoted toward, is production tracing. Spans per agent turn. Tool-call graphs. Token accounting. Latency per hop. Post-hoc replay of what the agent did across a multi-step session. This is genuinely useful — the 2026 buyer's guides from Braintrust and others all correctly flag that single-turn eval doesn't capture compounding errors across agent loops — and the industry has converged on OpenTelemetry-style tracing to handle it.
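To make "retrospective by construction" concrete, here is a toy span recorder in the shape of that tracing. The class and field names are assumptions for illustration — real deployments use the OpenTelemetry SDK, not this:

```python
# Illustrative only: a toy span recorder mimicking per-turn agent tracing.
# Field names are assumptions, not the OpenTelemetry API.
import time
import uuid

trace_log = []

class Span:
    """Records name, parentage, attributes, and duration of one step."""
    def __init__(self, name, parent_id=None, **attrs):
        self.record = {
            "span_id": uuid.uuid4().hex[:8],
            "parent_id": parent_id,
            "name": name,
            "attrs": attrs,
        }

    def __enter__(self):
        self.record["start"] = time.monotonic()
        return self

    def __exit__(self, *exc):
        self.record["duration_s"] = time.monotonic() - self.record["start"]
        trace_log.append(self.record)  # written only on exit: after the fact

# One agent turn with a nested tool call.
with Span("agent_turn", turn=1) as turn:
    with Span("tool_call", parent_id=turn.record["span_id"],
              tool="search_kb", tokens=412):
        pass  # the tool would execute here

# The trace exists only once the turn is over — you can replay it,
# but you could not have blocked the tool call with it.
print([s["name"] for s in trace_log])  # → ['tool_call', 'agent_turn']
```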
But observability is, by construction, retrospective. Braintrust's own comparison write-up says it plainly: "the platform enables teams to trace agent decisions, monitor latency, and review logs after the fact, but it lacks intervention capabilities for complex multi-step workflows." That's the whole category. Trace it. Log it. Score it after. Pull a dashboard. File a ticket.
Which is fine if the worst outcome is a bad answer. It isn't fine if the worst outcome is a tool call that just exfiltrated PII, wired money, or posted to a customer channel on your behalf. You don't want to read about that in the Monday retro. You want the action to have been denied at 09:04:17.3 on Tuesday.
That is not what an eval or observability platform is architected to do.
Where scoring sits — and why it's different
VeriSwarm Gate isn't an eval tool. It isn't an observability tool either, though it emits a stream that looks superficially similar. It's an authorization layer driven by continuous behavioral scoring.
Gate keeps four scores live per agent: identity (is this the agent it claims to be, via Passport), risk (the aggregate signal right now — injection flags, anomaly scores, tool-call patterns), reliability (how grounded its outputs are, how often it succeeds on routine actions), and autonomy (the scope it's operating in). Those scores update from a 22-event taxonomy emitted by the SDK, and feed a policy engine — Cedar for custom rules, a sensible default matrix otherwise — that returns Allow, Review, or Deny at decision time, per action, in single-digit milliseconds.
The same agent can be Allow for read-KB at 09:00, Review for refund-issue at 09:02 after a reliability dip, and Deny for external-send at 09:04 the moment an injection signal crosses its threshold. Nobody edited a role. Nobody opened a dashboard. The authorization surface moved because the behavior moved.
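That flow can be sketched in a few lines. Every name here — the score fields, the thresholds, the action risk classes — is a hypothetical simplification for illustration, not Gate's actual SDK, taxonomy, or default matrix:

```python
# A sketch of score-driven authorization. Thresholds and action classes
# are invented for illustration; a real policy engine (e.g. Cedar) is richer.
from dataclasses import dataclass

ALLOW, REVIEW, DENY = "Allow", "Review", "Deny"

@dataclass
class AgentScores:
    risk: float         # 0.0 (calm) .. 1.0 (injection signals firing)
    reliability: float  # 0.0 .. 1.0 (grounded, succeeding on routine actions)

# How dangerous is the action class itself.
ACTION_RISK = {"read-kb": 0, "refund-issue": 1, "external-send": 2}

def decide(action: str, s: AgentScores) -> str:
    danger = ACTION_RISK[action]
    if s.risk > 0.8:                         # injection signal over threshold
        return DENY if danger > 0 else REVIEW
    if danger == 2 and s.risk > 0.4:         # external effects need low risk
        return DENY
    if danger >= 1 and s.reliability < 0.7:  # reliability dip → human review
        return REVIEW
    return ALLOW

agent = AgentScores(risk=0.1, reliability=0.9)
print(decide("read-kb", agent))        # 09:00 → Allow

agent.reliability = 0.6                # reliability dips
print(decide("refund-issue", agent))   # 09:02 → Review

agent.risk = 0.9                       # injection flag crosses threshold
print(decide("external-send", agent))  # 09:04 → Deny
```

The decision function is stateless and cheap; the scores carry the state. That's what lets the same identity get three different answers in four minutes without anyone touching a role.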
The eval platforms can't do this because they weren't built to. Their hot path is grading outputs after they've been produced. Gate's hot path is gatekeeping actions before they execute.
How they fit together
This isn't a "replace your eval vendor" pitch. A mature agent stack needs all three layers, and they don't overlap:
- Evaluation is pre-production. It answers "is the model good enough to ship?" Braintrust, Patronus, and the eval vendors own this.
- Observability is post-production. It answers "what did my agent actually do across this multi-turn session?" Arize, Langfuse, Braintrust's tracing, and OpenTelemetry-based tools own this.
- Behavioral scoring with authorization is in-production, in-line. It answers "should this specific action, from this specific agent, with this specific risk posture, be permitted right now?" That's what Gate exists for, and it's the layer almost no one else ships — Microsoft's Agent Governance Toolkit is the closest parallel, with its 0–1000 trust score feeding sub-millisecond policy decisions.
The eval industry has correctly identified that single-number scores don't capture agent behavior. The conclusion most of them drew was: add more tracing. The conclusion that matters for anyone shipping agents to production is: add decisions. Scoring without enforcement is a graph; enforcement without scoring is a static firewall. You need both, co-located, fast enough to sit in the request path, with an audit trail that survives regulatory scrutiny.
Why this matters before August
EU AI Act Article 14 enforcement begins August 2, 2026. "Demonstrable human oversight commensurate with risk" is not satisfied by an eval score from the pre-deployment phase. It's not satisfied by a retrospective trace dashboard, either. What an auditor needs, concretely, is: what the agent tried to do, what its score was at that instant, which tier the policy returned, whether the action was allowed or blocked, and — if it was reviewed by a human — who, when, and what they said. That stream exists in Gate, written to Vault's hash-chained ledger so it can't be retroactively cleaned up. It's the shape of an oversight record that holds up.
The short version
Evals grade the model. Observability traces the session. Scoring authorizes the action. They're three different jobs. If you only buy the first two, your production failure mode is a well-instrumented incident.
If you want to see the scoring-plus-authorization layer live against your own agents: Gate's event ingestion, scoring, and default policy tiers are on the free plan. Cedar policy editing and Vault export are what you turn on when the auditor shows up.