Skip to main content
AI Security & GovernanceJun 3, 2026 · 6 min read

The Agentic Reckoning: Models Are Ready. The Pipes Are Not.

Three events in 72 hours confirm the same diagnosis: AI agent failures trace to runtime infrastructure and data context, not model quality.

By SpringVanta

Three things happened between June 1 and June 3, 2026. Separately, they're product announcements. Together, they describe the actual problem enterprises are hitting with AI agents, and it's not what most of the marketing says.

At Snowflake Summit in San Francisco, the company launched Horizon Context and Cortex Sense, a two-layer system built to stop AI agents from giving confident wrong answers because they're reasoning over different definitions of the same business data.

At Microsoft Build, the company open-sourced ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) and the Agent Control Specification (ACS), a portable runtime control standard that lets teams place hard policy checks at every point where an agent can go off the rails.

And VentureBeat published its "Agentic Reckoning" pulse survey of 132 enterprise technology leaders. The finding: 84% say agent failures trace back to infrastructure and governance, not the model. Only 17% blame the model itself.

Three different vendors, three different angles, same diagnosis. The models are ready enough. The pipes are not.

The runtime problem, by the numbers

The VentureBeat survey is worth reading in full. A few numbers that stood out:

  • 77% of engineering teams are spending meaningful weekly capacity on infrastructure plumbing (retry logic, state persistence, checkpointing) rather than agent intelligence.
  • 29% say the ROI ceiling is the primary reason agent projects don't reach production. Token costs and orchestration overhead eat the business case before the agent gets good enough.
  • 24% report hallucination propagation as the leading technical obstacle: small reasoning errors in early steps that compound into total system failure by step 10 or 12.
  • 20% cite ghost failures, silent API timeouts where the agent hangs without a traceback. The researchers note these are probably undercounted because they're invisible by definition.

The money quote, from a Director of Engineering at a financial services company with more than 10,000 employees: "The models are smart enough, but our stateless infrastructure is too fragile to manage long-running, multi-step agentic processes."

One more data point that should make buyers pay attention: 59% of respondents are either actively migrating to durable execution frameworks or evaluating them specifically to enforce data boundaries and deterministic fallbacks. Another 20% are still trying to patch structural problems with better prompts. That's the RPA trap replaying. Brittle automations held together by increasingly elaborate rules, waiting to fail at scale.

Snowflake's context layer: fixing the "same data, different answers" problem

Here's a failure mode that doesn't get enough attention: an enterprise runs three agents against the same underlying data, and gets three different answers. Revenue means one thing in the BI dashboard, something different in the SQL table, and something else again in the agent's instruction set.

Christian Kleinerman, EVP of Product at Snowflake, described it plainly: "There are a lot of tools out there that you can ask questions, you get a very confident answer, but whether it's correct or not is different."

Snowflake's answer is a two-layer system. Horizon Context is the customer-managed layer (built on their Select Star acquisition) that pulls metadata from Postgres, SQL Server, Tableau, and Power BI into a shared catalog. Cortex Sense is the platform-derived layer that automatically enriches context from usage patterns and data shape, before any manual curation happens.

The split is deliberate. Horizon Context is what you declare. Cortex Sense is what the platform infers. You can't trust those two things the same way, so treating them differently is the right architectural call. Mike Leone, VP at Moor Insights and Strategy, made this point directly.

IDC research director Devin Pratt: "The context layer is the real battleground for agentic AI. An agent is only as trustworthy as the data and semantics behind it."

Snowflake isn't alone here. Microsoft opened its Fabric IQ business ontology via MCP. Redis launched Iris, a context and memory platform. Pinecone is repositioning from vector database to knowledge engine. The context layer is becoming infrastructure, not a nice-to-have.

But Leone's warning is the one that matters for buyers: "Most vendors selling a drop-in fix are overpromising. Drop one into a real enterprise and it mostly exposes how messy your data and definitions already are."

Microsoft's ASSERT and ACS: policy as code for agent behavior

Microsoft Build's announcements attack a different part of the same problem. Once your agents have good data, how do you make sure they actually do what they're supposed to do?

ASSERT takes your organizational policies and requirements as input, generates targeted evaluation scenarios, and surfaces safety and quality defects before production. It's built on Microsoft Research. The idea: generic benchmarks don't catch company-specific failures. A support agent that should issue refunds below a threshold and escalate likely fraud needs tests written around those exact rules, not some general helpfulness metric.

ACS (Agent Control Specification) is the runtime piece. It defines standard control points where policy checks run: before the agent receives input, before it calls a tool, after the tool returns a result, and before the final response goes to the user. At each point, the policy can allow the action, block it, redact sensitive information, or require human approval.

ACS interception points: where runtime controls fire in the agent lifecycle

The key design decision: ACS policies are single files that travel with the agent. They work across frameworks (LangChain, CrewAI, OpenAI Agents SDK, Anthropic Agents SDK, AutoGen, Semantic Kernel). That matters because the survey found that enterprises don't trust any single provider enough to hand over full control, yet they lack the engineering capacity to build from scratch.

TechCrunch noted that ACS shipped as an SDK with plug-ins for more than 10 frameworks. Winbuzzer reported that Microsoft claims 80-90% judge-human agreement on ASSERT evaluations. Those figures are first-party and should be read with that in mind.

What to actually do with this

Three events in one week, same problem from three angles. Here's what I'd take away:

Audit your data definitions before you audit your agents. Snowflake's context layer and the competing offerings from Microsoft, Redis, and Pinecone all target the same gap: your agents are only as reliable as the shared semantics they operate on. If "revenue" means six different things across six systems, no model improvement fixes that. IDC's Pratt: "Enterprises don't need another silo of semantics. They need a context layer that's governed, portable, and trustworthy enough to audit."

Test your policies, not just your prompts. Microsoft's ASSERT is built on the premise that generic evaluations miss company-specific failures. If your agent handles intake forms or lead qualification, write tests around the exact decisions it makes, not around abstract quality metrics. The framework converts natural-language policies into executable test suites.

Put runtime controls where agents can fail, not at the perimeter. ACS's interception-point model (input, tool call, tool response, output) catches failures mid-process. For teams running agents that touch CRM data, process intake forms, or execute workflows in production systems, this is the layer that prevents a small reasoning error from becoming a customer-facing mistake.

Watch the ROI ceiling. VentureBeat's survey found that cost overruns are now the number one reason agent projects get killed, ahead of hallucination, ahead of state loss. Token economics and orchestration overhead are consuming enough business value that sponsors pull the plug before engineering teams can solve the durability problem. If your agent project doesn't have a cost model with a ceiling, it probably won't survive the pilot.

The 20% still patching with prompts are the ones most likely to fail. The survey data is consistent: teams trying to solve structural infrastructure problems with better prompting are hitting the same wall that killed RPA deployments a decade ago. The 59% actively migrating to durable runtimes or evaluating governance-first architectures have the right read on where this is going.

Sources:

Read more

Like this kind of writing?

One email when something good ships — usually once or twice a month.