The Fifth Dimension: Measuring Whether an Agent Knew What It Didn't Know
Published June 4, 2026
Most trust systems measure outcomes. We're adding a dimension that measures whether the agent knew what it didn't know.
Here is the failure mode that no outcome-based score catches. Two agents run the same hundred tasks. Both complete eighty-three of them. On the seventeen they get wrong, one agent flagged low confidence on fourteen — it knew it was on shaky ground and said so. The other reported high confidence on all seventeen failures. It was wrong and certain, every time.
Reliability scores them identically. Both are at 83%. But one of them is safe to hand a clinical-triage workflow and the other is a liability waiting for an incident report. The dimension that tells them apart is calibration, and as of 2026-Q3 it is the fifth dimension VeriSwarm Gate scores.
The problem is structural, not anecdotal
This isn't a corner case. Overconfidence is the default behavior of the models these agents run on, and the research is unambiguous about it.
The ICLR 2026 work on verbalized confidence found that LLM-generated confidence scores are systematically miscalibrated: models report high confidence on instances where they have low accuracy, which is precisely the direction that harms trust and safety. In a zero-shot setting, models "predominantly assign high confidence scores (8 or above)," a pattern attributed to supervised pretraining that rewards confident-sounding output. The FermiEval calibration study put the same finding in its title: "LLMs are Overconfident." And the mechanistic work in "Wired for Overconfidence" traced the inflation signal to a specific set of MLP blocks and attention heads in the middle-to-late layers, meaning overconfidence is written into the model at the architecture level and is present across all model sizes.
The crucial detail for anyone running production agents: the same body of research notes that most calibration improvement comes from increased accuracy, not from reduced overconfidence. In other words, a better model is not a better-calibrated model. You can swap in a stronger LLM, watch your task success rate climb, and still be flying an agent that is confidently wrong on the tasks it fails. Accuracy and calibration are different axes. You have to measure them separately, or you won't see the gap between them.
That is the entire argument for a fifth dimension.
What calibration actually measures
The four dimensions Gate has scored since launch — identity, risk, reliability, autonomy — all answer questions about what happened. Did the agent complete the task? Did it trip a security signal? Has it earned the right to act unsupervised? Calibration asks a different question: when the agent told you how sure it was, was it right to be that sure?
The mechanism is the Brier score, the standard proper scoring rule for probabilistic forecasts. For each task, the agent reports a confidence p between 0 and 1, and the outcome o resolves to 1 (success) or 0 (failure). The per-task Brier contribution is (p − o)² — the squared gap between how sure the agent was and what actually happened. An agent that says 0.95 and succeeds scores a tiny penalty (0.0025); an agent that says 0.95 and fails takes a large one (0.9025). Confident-and-wrong is the most expensive thing you can do, which is exactly the behavior we want the score to punish.
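To make the arithmetic concrete, here is a minimal Python sketch of the per-task Brier contribution, applied to the two worked cases above and to the two 83%-reliable agents from the opening. The function names and the specific confidence values assigned to each agent are illustrative assumptions, not Gate's implementation or real data.

```python
def brier_contribution(confidence: float, outcome: int) -> float:
    """Squared gap between reported confidence p and resolved outcome o."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 (failure) or 1 (success)")
    return (confidence - outcome) ** 2


def brier_average(pairs) -> float:
    """Mean Brier contribution over (confidence, outcome) pairs."""
    pairs = list(pairs)
    return sum(brier_contribution(p, o) for p, o in pairs) / len(pairs)


# The two worked cases from the text.
print(round(brier_contribution(0.95, 1), 4))  # 0.0025
print(round(brier_contribution(0.95, 0), 4))  # 0.9025

# Two agents, both 83/100. The confidence values are assumed for illustration.
successes = [(0.90, 1)] * 83
agent_a = successes + [(0.30, 0)] * 14 + [(0.90, 0)] * 3  # flags 14 of 17 failures
agent_b = successes + [(0.90, 0)] * 17                    # confident on every failure

print(round(brier_average(agent_a), 4))  # 0.0452
print(round(brier_average(agent_b), 4))  # 0.146
```

Under these assumed confidences, the two agents share a reliability of 83% but the second agent's Brier average is roughly three times worse, which is the gap an outcome-only score cannot see.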
Gate maintains a rolling Brier average over a sliding window of recent confidence-outcome pairs, then surfaces it as calibration_score = round((1 − brier_avg) × 100) so it reads on the same 0–100 scale as the other dimensions. Under the hood, the engine also retains the Murphy 1973 decomposition — splitting the Brier score into reliability, resolution, and uncertainty components — so the diagnosis can distinguish an agent that is uniformly overconfident from one whose confidence carries no discriminating signal at all. Same headline number, different remediation.
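The rolling window and the decomposition can be sketched in a few lines of Python. This is a sketch under stated assumptions: the class name, the default window size, and the choice to group forecasts by their exact confidence value are all hypothetical, and Gate's engine may bin and window differently.

```python
from collections import deque


class RollingCalibration:
    """Rolling Brier average over the most recent confidence-outcome pairs."""

    def __init__(self, window_size: int = 200):  # window size is an assumption
        self.pairs = deque(maxlen=window_size)   # oldest pairs fall off automatically

    def add(self, confidence: float, outcome: int) -> None:
        self.pairs.append((confidence, outcome))

    def brier_avg(self) -> float:
        if not self.pairs:
            raise ValueError("no confidence-outcome pairs recorded yet")
        return sum((p - o) ** 2 for p, o in self.pairs) / len(self.pairs)

    def calibration_score(self) -> int:
        # Same formula as in the text: round((1 - brier_avg) * 100).
        return round((1 - self.brier_avg()) * 100)


def murphy_decomposition(pairs):
    """Return (reliability, resolution, uncertainty) for (confidence, outcome) pairs.

    Forecasts are grouped by exact confidence value, so the identity
    brier = reliability - resolution + uncertainty holds exactly.
    """
    pairs = list(pairs)
    n = len(pairs)
    base_rate = sum(o for _, o in pairs) / n
    groups = {}
    for p, o in pairs:
        groups.setdefault(p, []).append(o)
    reliability = sum(
        len(outs) * (p - sum(outs) / len(outs)) ** 2 for p, outs in groups.items()
    ) / n
    resolution = sum(
        len(outs) * (sum(outs) / len(outs) - base_rate) ** 2 for outs in groups.values()
    ) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty


# An agent that reports 0.90 on everything and succeeds 83 times out of 100.
tracker = RollingCalibration(window_size=100)
for p, o in [(0.90, 1)] * 83 + [(0.90, 0)] * 17:
    tracker.add(p, o)

print(tracker.calibration_score())  # 85
rel, res, unc = murphy_decomposition(tracker.pairs)
print(round(rel, 4), round(res, 4), round(unc, 4))  # 0.0049 0.0 0.1411
```

The zero resolution term is the diagnostic: this agent's confidence never varies, so it carries no information about which tasks will fail, even though its headline score looks respectable.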
Two new events, no new instrumentation burden
Calibration runs on two additions to the standardized event taxonomy, which grew from 22 to 24 types to support it:
- agent.confidence_reported: the agent declares its confidence on a task before (or as) it acts.
- agent.task_outcome: the ground-truth resolution of that task, success or failure.
Gate pairs them by task identifier, computes the Brier contribution, and folds it into the rolling window. The SDKs emit both with a single call each, and an agent that already reports task completion is most of the way there. Crucially, the agent doesn't change what work it does — it just annotates that work with a confidence signal and a resolution. The instrumentation is the same shape as everything else in the taxonomy: emit once, score continuously.
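The pairing step described above might look roughly like the following Python sketch. Only the two event type names come from the taxonomy; the payload fields (`task_id`, `confidence`, `success`) and the `CalibrationPairer` class are hypothetical stand-ins, not the SDK's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class CalibrationPairer:
    """Pairs confidence and outcome events by task ID (payload fields are assumed)."""

    pending: dict = field(default_factory=dict)        # task_id -> reported confidence
    contributions: list = field(default_factory=list)  # per-task Brier contributions

    def handle(self, event: dict) -> None:
        task_id = event["task_id"]
        if event["type"] == "agent.confidence_reported":
            self.pending[task_id] = float(event["confidence"])
        elif event["type"] == "agent.task_outcome":
            confidence = self.pending.pop(task_id, None)
            if confidence is None:
                return  # outcome with no reported confidence: nothing to score
            outcome = 1 if event["success"] else 0
            self.contributions.append((confidence - outcome) ** 2)


pairer = CalibrationPairer()
events = [
    {"type": "agent.confidence_reported", "task_id": "t1", "confidence": 0.95},
    {"type": "agent.confidence_reported", "task_id": "t2", "confidence": 0.40},
    {"type": "agent.task_outcome", "task_id": "t1", "success": True},
    {"type": "agent.task_outcome", "task_id": "t2", "success": False},
    {"type": "agent.task_outcome", "task_id": "t3", "success": True},  # unpaired
]
for event in events:
    pairer.handle(event)

print([round(c, 4) for c in pairer.contributions])  # [0.0025, 0.16]
```

In this sketch an outcome that arrives without a matching confidence report is simply skipped, which is one possible policy; a real pipeline would also need to decide what to do with confidence reports that never resolve.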
Calibration is weighted per industry
A fifth dimension only earns its place if it bends decisions, and the weight it carries is not uniform across verticals — because the cost of confident-wrongness isn't either. Calibration is part of the same tenant-scoped scoring-profile system that already lets a healthcare agent and an e-commerce agent run opposite trust physics on one configuration surface.
The per-vertical calibration weights that ship with the calibration-aware profiles:
| Vertical | Calibration Weight |
|---|---|
| Healthcare | 0.35 |
| Legal | 0.30 |
| Financial services | 0.25 |
| Security | 0.25 |
| Software | 0.20 |
| General | 0.20 |
| E-commerce | 0.15 |
Healthcare leans on calibration hardest, at 0.35 — a confidently-wrong clinical decision is the precise failure the dimension exists to catch. E-commerce sits lowest at 0.15, where transaction-scope risk and identity confidence dominate over the agent's self-assessment. Same dimension, seven weights, one configuration surface. Eleven scoring profiles ship in total, seven of them calibration-aware verticals.
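To see how the weight bends a decision, consider this Python sketch. The weights are copied from the table above, but the blending rule is purely an assumption made for illustration: the post does not specify how the composite is computed, so here calibration takes its vertical weight and the remaining weight is spread evenly across the other four dimensions.

```python
# Per-vertical calibration weights, copied from the table above.
CALIBRATION_WEIGHTS = {
    "healthcare": 0.35,
    "legal": 0.30,
    "financial_services": 0.25,
    "security": 0.25,
    "software": 0.20,
    "general": 0.20,
    "e_commerce": 0.15,
}


def composite_score(vertical: str, calibration_score: float, other_scores: dict) -> float:
    """Hypothetical blend: calibration gets its vertical weight, and the rest
    is split evenly across the other dimensions. Not Gate's actual formula."""
    weight = CALIBRATION_WEIGHTS.get(vertical, CALIBRATION_WEIGHTS["general"])
    others_mean = sum(other_scores.values()) / len(other_scores)
    return weight * calibration_score + (1 - weight) * others_mean


# Same agent in two verticals: strong on the original four, poorly calibrated.
others = {"identity": 90, "risk": 90, "reliability": 90, "autonomy": 90}
print(round(composite_score("healthcare", 50, others), 1))  # 76.0
print(round(composite_score("e_commerce", 50, others), 1))  # 84.0
```

Under this assumed blend, the same poorly calibrated agent loses eight more points in healthcare than in e-commerce, which is the kind of divergence the per-vertical weights are meant to produce.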
Where this is going: confidence as a routing signal
Scoring calibration is the foundation. The next move — landing in the sprint that follows GA — is to make confidence actionable at decision time, not just observable after the fact. The model for high-stakes verticals like healthcare routes on the reported confidence itself:
- Below 0.50 — autonomous execution denied outright.
- 0.50 to 0.75 — human review required before the action proceeds.
- 0.75 to 0.90 — enhanced validation path.
- Above 0.90 — standard workflow.
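As a sketch, the four bands above map onto a small routing function. The `Route` enum and function name are hypothetical, and the handling of the exact boundary values is an assumption, since the bands as listed do not say which side 0.50, 0.75, and 0.90 fall on.

```python
from enum import Enum


class Route(Enum):
    DENY = "autonomous execution denied"
    HUMAN_REVIEW = "human review required"
    ENHANCED_VALIDATION = "enhanced validation path"
    STANDARD = "standard workflow"


def route_on_confidence(confidence: float) -> Route:
    """Map a reported confidence to a routing decision for a high-stakes vertical.

    Boundary assumptions: 0.50 goes to human review, 0.75 goes to enhanced
    validation, and 0.90 itself stays in enhanced validation ("above 0.90"
    is read as strictly greater).
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < 0.50:
        return Route.DENY
    if confidence < 0.75:
        return Route.HUMAN_REVIEW
    if confidence <= 0.90:
        return Route.ENHANCED_VALIDATION
    return Route.STANDARD


for value in (0.30, 0.60, 0.82, 0.97):
    print(value, route_on_confidence(value).name)
```

A threshold function like this is only as good as the confidence fed into it, so it routes well only for an agent whose reported confidence has been scored against its outcomes.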
The calibration score is what makes that routing trustworthy. A confidence threshold is only meaningful if the agent's confidence is honest — and the only way to know whether it's honest is to have scored, over a rolling history, how well its past confidence predicted its past outcomes. Calibration the dimension is what licenses confidence the gate.
Why it's a dimension, not a feature
Calibration is orthogonal to the other four. An agent can have impeccable identity, clean risk, high reliability, and earned autonomy, and still be badly calibrated — confidently wrong on the exact tasks it fails. That orthogonality is the test for whether something deserves to be its own dimension rather than a sub-signal folded into reliability. It passes: you cannot reconstruct the calibration score from the other four, and the cases where it diverges from reliability are exactly the cases that produce incidents.
It also doesn't replace anything. The four-dimension model that shipped through Q2 2026 remains accurate; calibration is additive. An agent's composite trust now reflects whether it completes its work, whether it stays clean, whether it's earned independence — and whether it knows the difference between what it knows and what it's guessing.
That last one is the difference between an agent you can supervise and an agent you can trust.
Calibration scoring is live on the free Gate tier alongside the other four dimensions. The two new taxonomy events emit from the Python and Node SDKs with one call each.
Sources cited:
- "Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs" — arxiv 2604.01457
- "LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval" — arxiv 2510.26995
- "Calibrating Verbalized Confidence with Self-Generated Distractors" (ICLR 2026) — arxiv 2503.02623
Related Reading
- Pillar: Agent Trust Scoring — A Technical Guide — the full technical reference covering the five dimensions, the 24-event taxonomy, scoring profiles, and the policy-tier decision flow.
- Background: Identity, Risk, Reliability, Autonomy: Why One Trust Score Isn't Enough — the original four-dimension framing this post extends.
- Configuration: One Size Doesn't Fit All: Trust Thresholds for Healthcare vs. E-Commerce — how per-industry scoring profiles, including calibration weight, collapse into one tenant-scoped configuration.