Skip to main content
AI Security & GovernanceJun 15, 2026 · 7 min read

148x Slowdown, 85% Hijack Rate: Agent Trust Layers Are Now Attack Targets

Three attacks in 72 hours show how agent safety layers, tool integrations, and supply chain provenance can all become attack surfaces.

By SpringVanta

One poisoned document slowed a LangGraph agent system by 148 times. The attack didn't target the model, the API, or the network. It targeted the safety guardrail itself — the LLM-powered reasoning layer that's supposed to catch dangerous agent actions before they execute.

That research, published June 15 by a team from Hong Kong University of Science and Technology and collaborators, landed in the same 72-hour window as two other attacks that share an uncomfortable pattern. Each one turns a trust mechanism into an attack surface. The safety layer becomes the DoS weapon. The error tracker becomes the injection point. The package provenance system becomes the credential theft pipeline.

Three attacks, one structural problem. The systems designed to make agents trustworthy are themselves made of the same unreliable stuff as the agents they protect.

The safety layer becomes the weapon

The HKUST researchers call it a reasoning-extension denial-of-service attack. Here's the mechanism: a crafted document, placed where an agent will read it (a web page, a repo comment, a tool result), mimics the internal schema that reasoning-based guardrails use to assess risk — risk enumerations, assessment matrices, the kind of structured templates a safety model fills out when evaluating whether an action is safe.

The guardrail model encounters this content and starts filling the template. It reasons. And reasons. And keeps reasoning. A single guardrail call stretched to 730 seconds in a LangGraph deployment. The attack achieved 13 to 63 times token amplification in isolation, and 148 times end-to-end latency in the full multi-agent system.

LangGraph took the worst hit at 148x slowdown. BrowserGym followed at 131x, OpenHands at 36.3x, and OSWorld at 18x. The attack worked across eight different LLM families, and prompts designed for one open-source model were effective against others — so attackers don't need inside knowledge of your specific guardrail implementation.

Standard detection failed. The payload is fluent natural language, not a binary exploit. An injection classifier scored it below 0.001 probability of being malicious. It sailed through.

Here's the part that should make you uncomfortable: every mitigation the researchers tested either fails or creates a new problem. A token-budget cutoff just relocates the failure. Fail-open lets the attack bypass safety entirely. Fail-closed converts it into agent-level DoS that starves co-located agents on shared guardrail infrastructure. A more capable guardrail model performs worse, because stronger reasoning produces longer loops. "A guardrail that can reason can be made to reason forever," as one independent analysis of the paper put it.

Sakshi Grover, senior research manager for cybersecurity services at IDC Asia/Pacific, framed the operational risk: "A successful guardrail DoS doesn't need to breach anything; it just needs to make the system unusable at a critical moment." For automated claims processing, AI-assisted incident response, or real-time fraud detection, that kind of latency spike is a material business event, not a security footnote.

The tool layer becomes the injection point

While the HKUST paper was making the rounds, Tenet Security disclosed a different kind of trust exploitation. They call it Agentjacking, and it needs no malware, no stolen password, no breach.

The vector is Sentry, a widely-used error tracking tool. Sentry lets applications send error reports using a public key (a DSN) that sits openly in website code by design. An attacker posts a fake error report to that endpoint — no authentication required — with a "Resolution" section formatted to look like Sentry's own troubleshooting advice. When a developer asks their coding agent to fix unresolved Sentry issues, the agent reads the planted resolution through the Model Context Protocol and executes the attacker's command.

The agent can't tell a real crash report from a planted one. It treats the tool output as trusted.

The attack worked against Claude Code, Cursor, and Codex with an 85 percent success rate in controlled tests. Tenet identified 2,388 organizations exposed, ranging from a company worth $250 billion down to solo developers, and including at least one cloud-security vendor.

What doesn't catch it is the list of things enterprises spend the most money on. The attack slips past EDR, firewalls, IAM, and VPNs, because nothing in the chain is unauthorized. The agent has legitimate access to Sentry. The error report is a legitimate Sentry feature. The execution happens with the developer's own privileges. Tenet calls this the "Authorized Intent Chain" — the system is working exactly as designed, and that's why it's dangerous.

Sentry acknowledged the problem but declined to fix it at the root, calling it "technically not defensible." They added a filter to block one specific payload string. The broader flaw — that agents treat tool output as trusted input — runs through support tickets, GitHub issues, documentation, and any external system an agent reads.

The supply chain becomes the credential pipeline

On the same day the guardrail research went public, JFrog Security Research published their analysis of IronWorm, a supply chain attack that compromised 37 npm packages through the asteroiddao account.

Three things make IronWorm worth paying attention to, and none of them involve a novel exploit technique.

First, every malicious commit IronWorm pushes uses the author identity claude@users.noreply.github.com. Commit messages are routine: "fix: resolve lint warnings," "test: add missing edge case," "ci: update workflow configuration." Some timestamps are backdated 13 years. In a repository where AI-generated commits are common, these are invisible. A code reviewer seeing a commit from "claude" would assume it came from an AI coding assistant doing its job. The attacker is pretending to be the tool your team already trusts.

Second, IronWorm targets AI credentials specifically. Beyond the standard targets (AWS keys, SSH keys, Docker tokens), it hunts for OpenAI API keys, Anthropic API keys, Claude authentication files, Cursor authentication files, and npm Trusted Publishing OIDC tokens. Stolen AI keys have immediate operational value — an uncapped OpenAI key runs thousands of dollars before anyone notices, and an Anthropic key can power agents that escalate the attack further.

Third, IronWorm propagates through npm's Trusted Publishing mechanism, which was designed to be more secure than stored credentials. It modifies GitHub Actions workflows to request OIDC tokens at runtime, then publishes trojanized packages with valid provenance attestations. The packages pass npm audit signatures. Provenance says "published through a verified CI pipeline." It doesn't say "the CI pipeline was hijacked."

This is the same fundamental gap that Miasma exploited the week before with Red Hat's SLSA provenance. Two independent attacks, one week apart, both defeating provenance through different mechanisms. The trust signal that was supposed to make software supply chains safe is itself made of software, and software can be manipulated.

The pattern: trust layers are made of the same stuff they protect

What connects these three attacks is not a shared vulnerability or a common protocol. It's a structural property.

Reasoning-based guardrails are LLMs. They can be trapped in reasoning loops by crafted text. Error tracking integrations are trusted data sources. They can be poisoned with unauthenticated input. Package provenance systems are software pipelines. They can be hijacked through their own automation.

In each case, the trust mechanism inherits the weaknesses of the technology it was built to govern. A guardrail made of an LLM has all the vulnerabilities of an LLM. A provenance system built on CI/CD has all the vulnerabilities of CI/CD. You're not building a fence around the problem. You're building a fence out of the same material as the problem.

Gartner's Apeksha Kaushik, senior principal analyst, predicts that through 2029, more than 50 percent of successful cybersecurity attacks against AI agents will exploit access control issues using prompt injection as the vector. That prediction is already running ahead of schedule.

Timeline of three agent trust layer attacks from June 12-15, 2026: Guardrail DoS, Agentjacking, and IronWorm

What to do this week

The mitigations from the research aren't silver bullets, but they shift the odds.

Decouple guardrail infrastructure from agent compute. If your safety layer runs on the same resources as your agents, a DoS against the guardrail takes down everything. Run guardrail checks asynchronously or on separate infrastructure so a stalled safety model doesn't starve the agents it protects.

Red-team for availability failures, not just correctness. Most agent security testing checks whether the guardrail blocks dangerous actions. Almost none test what happens when the guardrail itself is overloaded. Add availability failure modes to your red-teaming checklist.

Treat tool output as untrusted input. Agentjacking works because agents treat data from Sentry, GitHub issues, and other tools as reliable. If your agent reads external data through MCP or any tool integration, that data needs the same validation you'd apply to user input on a web form.

Monitor for anomalous reasoning depth. The guardrail DoS attack produces a measurable signal: reasoning traces that are dramatically longer than baseline. Set alerts for guardrail calls that exceed normal latency by an order of magnitude.

Verify package provenance beyond the attestation. npm's Trusted Publishing is only as trustworthy as the CI pipeline behind it. Check whether the publishing workflow itself has been modified recently, not just whether the signature is valid.

The organizations that treat agent infrastructure the way they treat critical application infrastructure — with redundancy, monitoring, and active testing — will weather these attacks. The ones that bolted a reasoning model onto their agents and called it governance will find out what "the safety layer is the attack surface" means the hard way.

Sources

Read more

Like this kind of writing?

One email when something good ships — usually once or twice a month.