Prompt Injection Doesn't Stop at the LLM. It Flows Through Tool Calls.
Published April 30, 2026
You hardened the prompt. You added system instructions telling the model to ignore embedded directives. You ran red-team evals. The model passes them.
Then your agent reads a support ticket, and a hidden line in the ticket body tells it to query the user's billing record and post the result to a webhook the attacker controls. The model never violated its system prompt. It called a tool. The tool ran. The data left.
This is the part of prompt injection that prompt-layer defenses do not see. The payload doesn't change the model's behavior in any way that looks like a jailbreak. It changes which tool gets called, with which arguments, on whose data. And by the time the call reaches the tool server, the model is no longer in the loop — the protocol is.
The attack does not live in the prompt
The clean mental model for prompt injection — attacker writes a malicious instruction, model obeys it — was always a partial picture. The version that has matured in 2026 is structural.
In indirect prompt injection, the malicious instruction is not in the user's message. It is hidden inside an artifact the model is asked to read: a web page, a PDF, a README, a code comment, package metadata, the description field of an MCP tool. Once the model ingests that artifact, the embedded instructions can cause it to run a tool, exfiltrate data, or modify files — without the operator ever seeing a suspicious prompt (Lakera).
A specific variant — tool poisoning — buries the payload in the tool descriptions the MCP server publishes to the agent. The agent reads the tool catalog at startup. The poisoned description tells the model that calling a particular tool requires also calling another tool first, with the user's secrets attached. To the model, this looks like documentation. To the operator, the resulting tool call looks routine (OWASP MCP Tool Poisoning).
The pattern survived peer review. A March 2026 meta-analysis of 78 studies on prompt injection found that attack success rates against state-of-the-art defenses exceed 85% when adaptive strategies are used, and that even Claude 3.7-Sonnet — the most refusal-prone model in the study — refused these attacks less than 3% of the time (MDPI, Information 17(1):54).
The numbers say the prompt layer is not where the fight is being lost or won.
Why the fix doesn't live at the prompt either
There are three reasons hardening the prompt cannot finish this job.
The first is that the model is doing what it was asked. From the model's perspective, "the tool description says to also call this other tool first" is just instruction-following. There is no malicious intent to refuse. The pathology is in the tool catalog, not the user message.
The second is the rug-pull. A tool can pass review on Monday and ship a malicious description update on Friday. Most agent clients accept whatever metadata the server returns — a recent academic survey found that five of seven evaluated MCP clients implemented no static validation of tool descriptions at all (arxiv 2603.22489). Approving a tool once does not approve every future version of it.
The third is the receipt. Even when prompt-layer scanners flag a suspicious instruction, they do not produce evidence an auditor can use. The model "decided" to call the tool. The tool ran. There is no signed record of what payload arrived at the tool server, what came back, or what was redacted on either side. The EU AI Act's August 2026 obligations on traceability ask for that record. Prompt scanners do not produce it.
The defense has to live somewhere downstream of the model and upstream of the tool server. That is the only point in the path where the actual call — the one that touches data — can be inspected, modified, and recorded.
The interception layer
VeriSwarm Guard Proxy is built around that exact location. It is a transparent layer between an agent and its MCP tool servers. Every request and every response passes through it. Three deployment modes — cloud-hosted, Docker, or a local stdio binary — let the same policy surface follow the agent into customer VPCs, on-prem environments, or a developer's laptop without changing what runs inside the proxy.
What runs inside the proxy is the relevant part for this attack class.
Tool descriptions are validated, not trusted. Guard scans tool definitions for known poisoning patterns — embedded instructions, hidden directives, schema fields that smuggle behavior into metadata — before the agent is allowed to use them. When a known-good tool ships a new description, the change is rescanned, not waved through. This closes the rug-pull window that academic surveys flagged this year.
Tool-call payloads are scanned for injection markers. The proxy inspects argument values for the patterns indirect injection uses to chain tool calls — encoded instructions, hidden delimiters, attempts to redirect output to attacker-controlled destinations. PII inside those arguments is tokenized in the same pass, so even when a call passes the injection check, it is not also a data exfiltration channel. Both checks run in a deterministic pipeline, in a fixed order, on every call.
Risk events feed the trust score. When Guard flags an injection attempt, that event is ingested by Gate. The agent's risk score moves. If the score crosses a tenant-defined threshold, the agent's policy tier auto-demotes from allow to review or deny — the next tool call doesn't run on a human's say-so, the next tool call doesn't run at all. The kill switch is the floor, not the ceiling.
Every step is recorded immutably. The original payload, the redacted payload, the policy decision, the tool response, and any subsequent score change are all written to the Vault hash-chained ledger. Six months from now, when a regulator or an enterprise security team asks what the agent did when it received that ticket, there is a single, integrity-verifiable record to produce. Logs are evidence; tamper-evident logs are proof.
What this looks like for a builder today
Put Guard Proxy between your agents and your tools. Run it cloud-hosted while you evaluate, then move it into a customer VPC or a local binary as you scale. Turn on PII tokenization and tool-description scanning on day one. Wire Gate's risk score into the policy tier so a flagged injection attempt costs the agent its allow privilege automatically.
The CVEs are real, the tool poisoning research is peer-reviewed, and the EU AI Act deadline is on the calendar. Prompt-layer hardening was always going to be necessary and was never going to be sufficient. The injection pivots to the tool call. The defense has to meet it there.
VeriSwarm Guard Proxy is part of the Max plan; PII tokenization and tool-call scanning are included. Gate trust scoring is on the free tier. Start at veriswarm.ai.