Verifying a Vault Chain: A Runbook for the Day Integrity Breaks
Hash-chained audit logs are widely written about and rarely operated. Most teams stop at the architecture diagram. The harder question — the one that actually matters when an auditor or a paranoid customer is on the phone — is the operational one: what does the verifier do, what does it return, and what do you do when the answer comes back wrong?
This post walks Vault's chain verification end to end: request shape, response payload, break detection logic, and the runbook for the day a break shows up. No regulatory framing, no "why tamper-evidence matters" detour. Those posts have already been written. This is the operator post.
The endpoint
Vault chain verification lives at a single HTTP endpoint:
GET /v1/suite/vault/verify
It accepts either of the two standard VeriSwarm auth headers — a tenant API key (x-api-key) or a user session token (x-account-access-token). It is rate-limited at 10 requests per minute per tenant per scope, plan-gated to Vault entitlement, and takes no query parameters. By default it verifies the entire chain for the calling tenant from the earliest event forward. That default is deliberate: a partial verification cannot detect tampering of older history, which defeats the point of the exercise.
A minimum-viable invocation:
curl -H "x-api-key: $VERISWARM_KEY" \
https://api.veriswarm.ai/v1/suite/vault/verify
That's the whole verification protocol from the caller's side. The cryptographic work happens server-side; the result is a JSON document an SRE can ingest, an auditor can paste into a report, or a CI job can assert against.
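For a CI job, that call-and-assert step can be sketched in a few lines of Python. The URL path and header name come from the post; the function names, base URL handling, and exit behavior are illustrative, not part of any official SDK.

```python
import json
import urllib.request


def fetch_verification(base_url: str, api_key: str) -> dict:
    """Call the verify endpoint and return the parsed JSON body.

    Endpoint path and x-api-key header are from the post; everything
    else here is an illustrative sketch.
    """
    req = urllib.request.Request(
        f"{base_url}/v1/suite/vault/verify",
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def ci_gate(result: dict) -> None:
    """Fail the build on anything short of a full-chain pass."""
    if not result["ok"]:
        # Surface the first few errors; the full array is capped anyway.
        raise SystemExit(f"chain verification failed: {result['errors'][:3]}")
    if result["partial"]:
        raise SystemExit("partial verification is not a full-chain attestation")
```

A nightly job that runs `ci_gate(fetch_verification(...))` turns the verifier into a standing alarm rather than something you remember to run when an auditor calls.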
The response shape
A clean run returns five fields:
{
"ok": true,
"checked_count": 42317,
"total_count": 42317,
"partial": false,
"errors": []
}
Read ok first; it is the binary integrity verdict. Read checked_count against total_count next; if they diverge, partial: true will explain why and the result is not a full-chain attestation. Read errors last; on a clean chain it will always be empty.
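That reading order can be folded into a small triage helper. The field names match the response shape above; the verdict strings and function name are illustrative.

```python
def triage(resp: dict) -> str:
    """Classify a verification response in the reading order the post
    gives: ok first, coverage second, errors last."""
    if not resp["ok"]:
        first = resp["errors"][0]["event_id"]
        return f"BREAK: {len(resp['errors'])} error(s), first at {first}"
    if resp["partial"] or resp["checked_count"] != resp["total_count"]:
        return "PARTIAL: not a full-chain attestation"
    return f"CLEAN: {resp['checked_count']} events verified"
```

This is the kind of one-liner that belongs in a monitoring pipeline: the string goes straight into an alert title, and the full JSON goes into the alert body.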
The same shape returns on failure. Only the values change:
{
"ok": false,
"checked_count": 42317,
"total_count": 42317,
"partial": false,
"errors": [
{
"event_id": "evt_evd_a1b2c3d4e5f6…",
"reason": "chain_link_mismatch",
"expected_previous": "9f3a…b21c",
"actual_previous": "8e2d…c33a"
}
]
}
The errors array is capped at fifty entries to keep failure responses bounded — but the verifier does not stop at the first break. It keeps iterating to surface the full extent of corruption up to the cap, which matters because real-world tampering often manifests as a contiguous run of broken links rather than a single edited row. Seeing two breaks vs. two hundred is a different incident.
What a break actually looks like
The verification logic is small enough to summarize in one paragraph. Each Vault event stores its own content_hash and the previous_event_hash it was chained to at write time. The verifier walks events in monotonic id order, holds the previous entry's content_hash in memory, and checks it against the current entry's recorded previous_event_hash. A mismatch is a chain_link_mismatch. That is the only failure mode the verifier reports, because it is the only failure mode that hash-chained integrity is designed to catch.
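A minimal sketch of that walk, including the keep-going-past-the-first-break behavior and the fifty-entry cap described below. The field names (content_hash, previous_event_hash) follow the description above; integer ids stand in for the real event_id strings, and the Event dataclass is illustrative, not the actual storage schema.

```python
from dataclasses import dataclass

ERROR_CAP = 50  # response cap stated in the post


@dataclass
class Event:
    id: int
    content_hash: str
    previous_event_hash: str


def verify_chain(events: list) -> dict:
    """Walk events in monotonic id order, carrying the previous entry's
    content_hash, and compare it to the current entry's recorded
    previous_event_hash. Continue past breaks, up to the error cap."""
    errors = []
    prev_hash = None  # the earliest event has no predecessor to check
    for ev in sorted(events, key=lambda e: e.id):
        if prev_hash is not None and ev.previous_event_hash != prev_hash:
            errors.append({
                "event_id": ev.id,
                "reason": "chain_link_mismatch",
                "expected_previous": prev_hash,
                "actual_previous": ev.previous_event_hash,
            })
            if len(errors) >= ERROR_CAP:
                break
        prev_hash = ev.content_hash
    return {"ok": not errors, "errors": errors}
```

Note that after a mismatch the walk keeps carrying the stored content_hash forward, which is why a single edited row produces one broken link while a reordered or partially-restored run produces a contiguous streak of them.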
What produces one in practice: a direct UPDATE on the events table that bypassed the application layer (the ORM blocks mutation, but a database admin with a psql session is outside that perimeter); a failed restore from backup that reconstituted some events but not their predecessors; a migration script that touched event payloads without recomputing hashes; or a malicious actor with database write access.
The verifier cannot tell you which of those happened. It can tell you exactly which event_id is the first break and what the expected and recorded previous-hash values were at that point. From there, a human walks the database and the application logs to figure out the root cause.
Verifying at scale
For tenants with millions of events the verifier iterates in batches of 1,000 rather than loading the chain into memory. Complexity is linear in chain length — there is no inclusion proof shortcut today, and the tradeoff is deliberate. Merkle inclusion proofs reduce single-event verification to O(log N) sibling hashes, which is the right shape for a public certificate transparency log where any visitor might want to cheaply prove one entry was logged. (Trillian's verifiable data structures documentation describes the canonical construction.)
Audit verification is a different shape. An auditor or a regulator wants a full-chain integrity statement, not a single-event inclusion proof. A linear forward walk gives that, with deterministic ordering by primary key id. Teams that need cheaper single-event proofs can layer them on top — additive, not a replacement.
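The batched walk only needs constant state across batch boundaries: the last content_hash seen and the last id fetched. A sketch of that keyset-paginated iteration, where fetch_batch is an assumed data-access helper returning events ordered by id, and the Event shape is illustrative:

```python
from dataclasses import dataclass


@dataclass
class Event:
    id: int
    content_hash: str
    previous_event_hash: str


def verify_batched(fetch_batch, batch_size: int = 1000) -> list:
    """Linear walk without loading the chain into memory.

    fetch_batch(after_id, limit) is an assumed helper that returns up to
    `limit` events with id > after_id, ordered by id. Only the previous
    content_hash and the last id cross batch boundaries.
    """
    broken_ids = []
    prev_hash = None
    last_id = 0
    while True:
        batch = fetch_batch(after_id=last_id, limit=batch_size)
        if not batch:
            break
        for ev in batch:
            if prev_hash is not None and ev.previous_event_hash != prev_hash:
                broken_ids.append(ev.id)
            prev_hash = ev.content_hash
            last_id = ev.id
    return broken_ids
```

Keyset pagination on the primary key is what makes the deterministic ordering claim hold: OFFSET-based paging can skip or repeat rows under concurrent writes, while `id > last_id` cannot.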
The runbook for ok: false
A failed verification is not a routine event, but it is a foreseeable one, and an operator should know what the next thirty minutes look like before they happen. The sequence:
- Capture the response. Store the full JSON immediately, including the errors array. The response is the only authoritative statement of which events broke and what hashes were observed at break time. Do not re-run the verifier yet — a second run on tampered data may produce a different errors window depending on the cap.
- Freeze writes. Block new event ingestion for the affected tenant. Continuing to write while a break is unresolved chains the new events to a corrupted prefix.
- Locate the first break. The first entry in the errors array is the earliest detected failure. The event_id and the expected_previous vs actual_previous hashes are the forensic anchors — those are what go into the incident timeline.
- Pull the database row. Read the actual stored row for that event_id and its predecessor. Compare against the most recent backup that predates the break. If the row was edited, the diff is now your investigation scope.
- Walk the application logs. Cross-reference the event_id against API access logs, recent migration timestamps, and any direct-database access events. The 2026 Verizon DBIR puts third-party involvement at 30% of breaches — double the prior year — so the third-party access surface is a high-prior place to look first.
- Document the incident. What verification ran, when it failed, what was found, what was restored. For high-risk systems under EU AI Act Annex III, the break itself is a record-keeping event auditors will eventually ask about.
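The capture step is the one most often skipped under pressure, so it is worth automating. A minimal sketch that persists the failing response verbatim before anything else runs; the filename scheme and function name are illustrative:

```python
import json
import pathlib
import time


def capture_evidence(response: dict, directory: str = ".") -> str:
    """Step one of the runbook: store the full verification response,
    errors array included, as an immutable timestamped artifact.
    Returns the path written, for the incident timeline."""
    path = pathlib.Path(directory) / f"vault-verify-{int(time.time())}.json"
    path.write_text(json.dumps(response, indent=2, sort_keys=True))
    return str(path)
```

Wiring this in before any alerting or re-run logic means the authoritative errors window survives even if a later verifier run reports a different one.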
A clean verification run is not exciting. That's the goal. The work is making sure that when the verifier does fail, every step from there is deterministic.
Try it
Vault is on the Max plan; the verification endpoint ships with it. The free Gate tier records policy decisions to the same hash-chained structure, which means you can run a verification today against your decision log and see the response shape end to end before Vault's full event scope becomes relevant. The signup flow takes about a minute. Run a verification immediately after — empty chains return ok: true, total_count: 0, which is the correct baseline.
A chain you can verify is a chain you can answer questions about. That's the whole proposition.