The Agent Cost Stack

What it actually costs to run an AI agent in production.

Apr 05, 2026

2026 is the year AI agents have to prove ROI. The prototypes shipped. Now the question is whether these things actually pay for themselves at scale.

I did a deep research dive into the economics. I expected the story to be about expensive tokens and infrastructure bills. It is not. The story is about a trap.

Everything that makes agents cheaper to build is making them more expensive to run.

I am calling it the Deflation Trap. It runs through every layer of what I have been mapping as the Agent Cost Stack.

The Jevons Layer

Per-token costs have dropped roughly 1,000× in three years. GPT-3-equivalent inference fell from $60 to $0.06 per million tokens. The decline runs at roughly 10× per year, faster than Moore’s Law.

And yet total AI spending surged 320% in 2025. Average enterprise LLM spending nearly tripled from $2.5 million to $7 million per company.

This is Jevons paradox applied to intelligence. When inference gets cheaper, usage does not stay flat — it explodes. Agentic workflows with dozens of reasoning steps per task. AI on every customer interaction. Use cases that were uneconomical last quarter become default this quarter. Demand is super-elastic.

The math of agentic loops makes it concrete. Agents operate in multi-turn reasoning where every turn reprocesses the entire conversation history. Turn 10 costs 10× what a single generation does. Cumulative spend across 10 turns is 55×. Reasoning models generating thousands of internal thinking tokens can require up to 100× more compute than a single pass. One complex agentic task can burn $5–$8 in inference alone.
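The 10× and 55× figures fall out of simple arithmetic. Here is a minimal sketch, assuming every turn adds the same number of tokens and nothing is cached. The function names and the unit of "one generation" are mine, for illustration only.

```python
def turn_costs(turns: int, tokens_per_turn: int = 1) -> list[int]:
    """Tokens processed at each turn when the full history is re-read.

    Assumption: every turn adds the same number of tokens and no
    caching is in play, so turn k reprocesses k units of context.
    """
    return [turn * tokens_per_turn for turn in range(1, turns + 1)]


def cumulative_cost(turns: int, tokens_per_turn: int = 1) -> int:
    """Total tokens processed across all turns: n * (n + 1) / 2 units."""
    return sum(turn_costs(turns, tokens_per_turn))


if __name__ == "__main__":
    print(turn_costs(10)[-1])   # cost of turn 10, in single-generation units
    print(cumulative_cost(10))  # cumulative cost of all 10 turns
```

Growth is quadratic in the number of turns. Doubling a loop from 10 to 20 turns takes the cumulative figure from 55 to 210 units, not 110.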

Mitigations exist. Context caching from major providers cuts input costs 75–90%. Model routing — sending most queries to cheap models and only complex ones to frontier — cuts costs 35–85%. Semantic caching reduces redundant calls 20–73%. These make the growth curve manageable. They do not flatten it.
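To see why mitigations bend the curve without flattening it, here is a rough sketch of a blended cost under routing and semantic caching. All prices, shares, and the function name are hypothetical inputs I made up for the example, not provider figures.

```python
def blended_cost_per_1k_requests(
    cheap_cost: float,
    frontier_cost: float,
    frontier_share: float,
    cache_hit_rate: float = 0.0,
) -> float:
    """Expected cost of 1,000 requests under routing plus semantic caching.

    cheap_cost / frontier_cost: assumed cost per request on each model.
    frontier_share: fraction of uncached requests routed to the frontier model.
    cache_hit_rate: fraction of requests answered from cache (treated as free).
    """
    if not 0.0 <= frontier_share <= 1.0 or not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("shares and rates must be between 0 and 1")
    per_request = frontier_share * frontier_cost + (1 - frontier_share) * cheap_cost
    return 1000 * (1 - cache_hit_rate) * per_request


if __name__ == "__main__":
    baseline = blended_cost_per_1k_requests(0.002, 0.05, frontier_share=1.0)
    routed = blended_cost_per_1k_requests(0.002, 0.05, frontier_share=0.2)
    print(f"all-frontier: ${baseline:.2f}, routed: ${routed:.2f}")
    print(f"routed at 5x volume: ${routed * 5:.2f}")
```

With these made-up prices, routing 80% of traffic to the cheap model takes 1,000 requests from $50.00 to $11.60, a cut of about 77%. Five times the volume brings the bill to $58.00, above where it started.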

The cheaper the tokens get, the more tokens get used.

The Infrastructure Layer

The floor has collapsed. Serverless vector databases start under $25 a month. Object-storage-first options run at $70 per terabyte versus $1,600–$3,600 for RAM-based incumbents. Production RAG systems have been documented running for under $10 a month.

The $7,000–$21,000 monthly figure in enterprise analyses reflects enterprise procurement, not technical necessity. The actual range by agent complexity:

Simple FAQ or support bot: $50–$500/mo
RAG-based knowledge agent: $200–$2,000/mo
Multi-step agent with tool calling: $500–$5,000/mo
Enterprise multi-agent system with compliance: $5,000–$30,000/mo

A collapsed floor means more projects get greenlit. More greenlit projects mean more production systems to maintain. The infrastructure cost per agent went down. The total infrastructure bill went up.

The trap here is premature optimization — architecting for scale that does not exist yet. Know the actual volume before committing to architecture.

The Drift Layer

This is the layer most teams underestimate because it does not show up in a pricing calculator.

Agents are nondeterministic. A demo that works 80% of the time is impressive. A production system that fails 20% of the time is a liability. And agents drift. Prompt behavior decays. Model providers ship silent updates that change output quality. In long-lived sessions, system prompts receive roughly 1% of attention weight in large context windows, causing agents to gradually ignore their own instructions. Retrieval accuracy drops 15–30% as context stretches.

Budget 10–20 hours of engineering time per month on prompt tuning and behavior testing. That is $1,000–$2,500 a month per agent. This cost holds regardless of how cheap the tokens get. Automated evaluation frameworks agree with human judgment at only fair-to-moderate levels — they are force multipliers for human review, not replacements.

The cost here is not compute. It is attention. Human attention to a system that will not stay where you put it.

The Migration Layer

Between early 2025 and early 2026, major providers executed 15+ distinct model deprecation events. Major agent frameworks shipped multiple architecturally breaking versions. A survey of 1,837 engineering leaders found that 70% of regulated enterprises rebuild their agent stack every three months or faster.

AI coding tools make individual migrations 10–20× faster. But migrations now happen quarterly instead of annually. Total maintenance spend stays stubbornly similar. The composition shifts from manual labor to AI tooling costs plus expanded review cycles. And the productivity gains themselves may be illusory: METR’s randomized controlled trial found that experienced developers using AI coding tools perceived themselves as 20% faster but were actually 19% slower on real-world tasks — a 39-point perception gap. If the tool that is supposed to make migrations cheap is not actually making them faster, the math gets worse.

This is where the trap is most visible. More gets built on shifting ground because teams believe AI tools make it cheap to migrate. The per-migration cost feels lower. The number of migrations rises. Net cost: flat at best. Net complexity: up.

The defense: architect for replaceability from day one. Abstract the model layer. Budget 1–2 model migrations per year, each consuming 1–2 engineering weeks.
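What "abstract the model layer" can look like in practice, as a minimal sketch. The `ChatModel` interface, the `EchoModel` stand-in, and the registry keys are all hypothetical names. A real adapter would wrap a provider SDK behind the same interface.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class ChatModel(Protocol):
    """Hypothetical interface the agent depends on instead of a vendor SDK."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in backend for this sketch; it just echoes the prompt."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


# Registry mapping config names to adapter factories (names are made up).
MODEL_REGISTRY: dict[str, Callable[[], ChatModel]] = {
    "model-a": lambda: EchoModel("model-a"),
    "model-b": lambda: EchoModel("model-b"),
}


def load_model(config: dict[str, str]) -> ChatModel:
    """Pick the backend from configuration, so a migration is a config edit."""
    key = config["model"]
    if key not in MODEL_REGISTRY:
        raise KeyError(f"unknown model {key!r}; add an adapter to MODEL_REGISTRY")
    return MODEL_REGISTRY[key]()


def summarize(model: ChatModel, text: str) -> str:
    """Agent logic sees only the ChatModel interface, never the vendor."""
    return model.complete(f"Summarize: {text}")


if __name__ == "__main__":
    print(summarize(load_model({"model": "model-a"}), "quarterly report"))
```

Swapping `"model-a"` for `"model-b"` in the config changes the backend without touching `summarize`. The prompt re-tuning and behavior testing that a new model demands still have to be budgeted separately.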

The Governance Layer

The EU AI Act is enforced starting August 2, 2026. It is real. But the cost most teams should worry about is not compliance — it is paralysis.

The Act uses a risk-tiered structure. The heavy burden (risk management systems, conformity assessments, documentation maintained for ten years) applies exclusively to high-risk systems under Annex III: HR screening, credit scoring, healthcare decision support, law enforcement. Only 5–15% of AI systems qualify. Annual compliance costs for those: roughly €29,000 per system. A Jira summarizer, a customer support bot, an internal knowledge agent: none of these are high-risk.

Minimal-risk systems face no specific obligations beyond AI literacy. Limited-risk systems, including chatbots and virtual assistants, face only transparency requirements. One requirement is non-negotiable across all tiers: Article 50 requires that any system interacting with people discloses it is AI. A disclosure banner. A first-message label. Low-cost design change, easy fine if you miss it. Bake it in from day one.
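A first-message label can be a few lines of code. This is a sketch only: the disclosure wording and the function name are placeholders I chose, and the exact text a given deployment needs is a question for whoever owns compliance.

```python
AI_DISCLOSURE = "You are chatting with an AI assistant."  # placeholder wording


def with_disclosure(reply: str, is_first_message: bool) -> str:
    """Prefix the disclosure label to the first reply of a session only."""
    if is_first_message:
        return f"{AI_DISCLOSURE}\n\n{reply}"
    return reply


if __name__ == "__main__":
    print(with_disclosure("How can I help with your order?", is_first_message=True))
```

Putting the label in the response path, rather than in the prompt, means it does not depend on the model remembering to say it.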

The real governance cost for most teams is not the regulation — it is the months of delayed launches while legal reviews whether a low-risk agent needs high-risk treatment. If you are building in Annex III domains, budget for governance. If you are not, do not let regulatory anxiety kill a high-ROI project.

On the US side, govern agents with the same access controls you apply to employees. That principle holds regardless of jurisdiction.

The Pattern

The cost stack varies by an order of magnitude depending on agent complexity. A support bot on serverless can run for a few hundred dollars a month. An enterprise multi-agent system with compliance can run $20,000+. But in both cases, operations (running the agent, not building it) account for roughly 65–75% of total spend. The absolute numbers change. The ratio does not.

Every reduction in the cost of building creates more systems that need the expensive work of running. Tokens get cheaper, so architectures consume more. Infrastructure gets cheaper, so more projects ship. Migrations get cheaper per instance, so they happen more often.

Klarna learned this the hard way. The build was a success story. The run was an overcorrection that required hiring humans back. The long tail of nuanced, high-empathy interactions (refund disputes, billing confusion, frustrated customers who needed to feel heard) could not be cost-optimized away. The agent was not the problem. The budget that ignored the run was.

What to Do Before You Build

Five things to budget for before writing the first prompt:

Model the run cost, not the build cost. If 65–75% of your total spend will be operational, your business case should be mostly about ops — not the sprint to ship.
Measure your actual token volume before optimizing. Context caching and model routing are powerful, but premature optimization is its own cost. Know the volume first.
Budget human hours for drift. 10–20 hours per agent per month. Automated evals help. They do not replace the human in the loop.
Abstract the model layer on day one. You will migrate at least twice a year. Make it a configuration change, not a rewrite.
Classify your risk tier early. Five minutes with the Annex III list can save months of legal review. If you are not high-risk, move.
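The five items above can be folded into one back-of-the-envelope model. A minimal sketch: the `AgentBudget` class, its field names, and every input value in the example are hypothetical. Replace them with measured numbers before trusting the output.

```python
from dataclasses import dataclass


@dataclass
class AgentBudget:
    """Rough build-versus-run model for one agent; all inputs are assumptions."""

    build_cost: float              # one-time cost to ship
    inference_per_month: float     # token spend
    infra_per_month: float         # vector DB, hosting, storage
    drift_hours_per_month: float   # prompt tuning and behavior testing
    hourly_rate: float             # loaded engineering rate
    migrations_per_year: int
    weeks_per_migration: float
    hours_per_week: float = 40.0

    def monthly_run_cost(self) -> float:
        """Inference + infrastructure + drift labor + amortized migrations."""
        drift = self.drift_hours_per_month * self.hourly_rate
        migration_hours = (
            self.migrations_per_year * self.weeks_per_migration * self.hours_per_week
        )
        migration = migration_hours * self.hourly_rate / 12
        return self.inference_per_month + self.infra_per_month + drift + migration

    def operational_share(self, months: int) -> float:
        """Fraction of total spend over `months` that is run, not build."""
        run = self.monthly_run_cost() * months
        return run / (run + self.build_cost)


if __name__ == "__main__":
    budget = AgentBudget(
        build_cost=40_000, inference_per_month=1_000, infra_per_month=500,
        drift_hours_per_month=15, hourly_rate=100.0,
        migrations_per_year=2, weeks_per_migration=1.5,
    )
    print(f"run cost per month: ${budget.monthly_run_cost():,.0f}")
    print(f"operational share over 24 months: {budget.operational_share(24):.0%}")
```

With these made-up inputs, the run cost is $4,000 a month and operations come to about 71% of total spend over 24 months. The inputs were picked for illustration, so the useful part is the structure: human hours for drift and migration sit in the same formula as tokens and infrastructure.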

Every efficiency in building is a tax on running. The only way through the Deflation Trap is to budget for the run before you start the build.

The views expressed here are my own and are not related to or reflective of my work or any organization I am affiliated with.

Thinking Through AI
