How to Control AI Agent Costs Before They Control You
AI agents are powerful and expensive. Without token budgets and provider quotas, a single runaway agent can burn your monthly LLM budget in hours.
Here is a scenario every team running AI agents will face: a developer deploys an agent on Friday afternoon. The agent enters a retry loop, making thousands of LLM calls over the weekend. By Monday morning, the monthly budget is gone. The agent did not malfunction — it was working exactly as coded. The problem was that nobody set a limit. Cost control for AI agents is not a nice-to-have. It is the difference between a sustainable AI practice and a budget emergency.
The challenge is unique to AI workloads. Traditional software costs are largely predictable — a server costs a fixed amount per hour, a database charges per query at a known rate. AI agents introduce variable costs that depend on the task complexity, the model used, the number of retries, and whether the agent decides to call other agents or tools along the way. A simple summarization task might cost a fraction of a cent. A multi-step research task with several LLM calls and tool invocations might cost several dollars. Multiply that by dozens of agents running continuously, and costs become significant fast.
Why Agent Costs Are Hard to Predict
Traditional software has predictable costs: a database query costs roughly the same every time. AI agents are different. An agent deciding between two approaches might make 5 LLM calls or 50, depending on the complexity of the task. Different models have wildly different pricing — GPT-4o costs more than 10x as much per token as GPT-4o-mini. And agents that use tool calls can trigger cascading costs: one agent calls another, which calls an LLM, which triggers a RAG lookup. The cost graph is a tree, not a line.
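Because one call can fan out into many, pricing a task means summing over a tree of calls. A minimal sketch — the agent names and dollar amounts below are entirely illustrative:

```python
# Hypothetical cost tree for one task: the root agent triggered a sub-agent,
# which triggered a RAG lookup, plus a tool call. All figures are made up.
task = {
    "agent": "research-agent", "cost_usd": 0.04, "children": [
        {"agent": "summarizer-agent", "cost_usd": 0.01, "children": [
            {"agent": "rag-lookup", "cost_usd": 0.002, "children": []},
        ]},
        {"agent": "tool:web-search", "cost_usd": 0.005, "children": []},
    ],
}

def total_cost(call):
    """Recursively sum a call's cost and everything it triggered downstream."""
    return call["cost_usd"] + sum(total_cost(c) for c in call["children"])

print(f"total task cost: ${total_cost(task):.3f}")
```

The root call alone looks cheap; the tree is what shows up on the invoice.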
There is also the hidden cost of retries. When an LLM returns a malformed response, many agent frameworks automatically retry. If the prompt is ambiguous or the task is inherently difficult for the model, those retries can multiply costs by 3x to 5x without any visible indication in the application logs. The agent appears to be working normally — it just costs five times more than expected.
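The retry effect can be approximated with simple arithmetic. This sketch assumes each attempt fails independently with probability p_fail — real failure modes are messier, but the multiplier is the point:

```python
def expected_attempts(p_fail, max_retries):
    # Attempt i happens only if all previous attempts failed: probability p_fail**i.
    return sum(p_fail ** i for i in range(max_retries + 1))

base_cost = 0.02  # illustrative cost of a single attempt, in dollars
for p in (0.1, 0.5, 0.8):
    mult = expected_attempts(p, max_retries=5)
    print(f"p_fail={p}: ~{mult:.2f}x expected cost (${base_cost * mult:.3f} per task)")
```

A task the model handles reliably barely costs more than one attempt; a task it struggles with quietly approaches the full retry multiple.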
The most expensive AI agent is not the one using the most powerful model. It is the one running without any cost limits at all.
Three Layers of Cost Control
Effective agent cost control operates at three levels, each catching what the others miss. Think of them as nested safety nets — if a cost spike gets past one layer, the next one catches it.
1. Token Budgets (Per Agent)
Every agent gets a token budget — a hard limit on how many tokens it can consume per day, week, or month. When the budget is exhausted, the agent stops making LLM calls. No exceptions. This prevents the Friday-afternoon scenario entirely. The budget should be based on observed usage during development plus a reasonable buffer, not on theoretical maximums.
Setting the right budget requires a brief observation period. Run the agent for two weeks, measure its actual token consumption across various task types, then set the budget at 1.5x to 2x the observed average. This gives enough headroom for natural variation while catching genuine anomalies. Review and adjust budgets monthly as agent usage patterns evolve.
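The budgeting rule above can be sketched in a few lines. The daily usage numbers here are illustrative:

```python
# Two weeks of observed daily token usage for one agent (illustrative data).
daily_tokens = [14_200, 9_800, 21_500, 11_300, 18_900, 12_400, 16_700,
                10_100, 15_600, 19_200, 13_800, 17_400, 12_900, 14_600]

avg = sum(daily_tokens) / len(daily_tokens)
monthly_budget = int(avg * 2 * 30)  # 2x headroom over a 30-day month

print(f"observed daily average: {avg:,.0f} tokens")
print(f"monthly budget: {monthly_budget:,} tokens")
```

Using 1.5x instead of 2x tightens the net; the right multiplier depends on how bursty the agent's workload is.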
2. Provider Quotas (Per Organization)
Provider quotas set limits at the organization level — total spend per LLM provider per month. Even if individual agents stay within their budgets, the aggregate can exceed what the organization intended to spend. Provider quotas are the ceiling that catches this. Alerts fire at 80, 90, and 100 percent of the quota, giving teams time to react before the limit hits.
Provider quotas also help with capacity planning. If your organization has a $5,000 monthly budget for OpenAI and a $2,000 budget for Anthropic, provider quotas enforce those limits automatically. When the 80 percent alert fires, the operations team can decide whether to increase the quota, shift workloads to a cheaper provider, or throttle non-essential agents. These are business decisions that should happen proactively, not in response to a surprise invoice.
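The alerting logic is a threshold-crossing check. A hedged sketch — the function and state here are illustrative, not a real gateway API:

```python
THRESHOLDS = (0.80, 0.90, 1.00)

def crossed_thresholds(prev_spend, new_spend, quota):
    """Return the alert thresholds crossed by moving from prev_spend to new_spend."""
    return [t for t in THRESHOLDS if prev_spend < quota * t <= new_spend]

quota = 5_000.00  # monthly provider quota, matching the example above
print(crossed_thresholds(3_900.00, 4_100.00, quota))  # crosses the 80% mark
print(crossed_thresholds(4_100.00, 5_050.00, quota))  # crosses 90% and 100%
```

Comparing against the previous spend ensures each threshold fires exactly once rather than on every call.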
3. Model Restrictions (Per Policy)
Not every agent needs GPT-4o. Most routine tasks — summarization, classification, simple extraction — work fine with smaller, cheaper models. Model restrictions let you specify which models each agent is allowed to use. Your security agent can use Claude Opus for nuanced analysis, while your logging agent is restricted to GPT-4o-mini. This alone can cut costs by 60 to 80 percent for many workloads.
Model restrictions also prevent accidental upgrades. Without restrictions, a developer testing an agent might switch it to a more expensive model for better results during development and forget to switch it back before deployment. Model restrictions at the policy level ensure that production agents always use the approved model, regardless of what the code specifies.
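Such a restriction is typically a small piece of policy configuration. The shape below is a hypothetical illustration, not Dobby's actual schema:

```json
{
  "agent": "logging-agent",
  "allowed_models": ["gpt-4o-mini"],
  "on_violation": "reject"
}
```

With a policy like this enforced at the gateway, a request from the logging agent specifying a pricier model is rejected before it ever reaches the provider.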
The Gateway as Cost Enforcer
An Agentic Gateway is the natural enforcement point for cost controls. Since every LLM call flows through the gateway, it can check budgets in real time, enforce model restrictions, and log costs per agent, per user, and per organization. Without a gateway, cost control requires each agent framework to implement its own limits — which means inconsistent enforcement and blind spots.
// Gateway rejects calls when budget is exhausted
// Response: 429 Too Many Requests
{
  "error": "BUDGET_EXHAUSTED",
  "message": "Agent 'code-review' has exceeded its monthly token budget",
  "budget_limit": 500000,
  "budget_used": 512340,
  "resets_at": "2026-04-01T00:00:00Z"
}
The gateway also provides the data needed for cost optimization. By logging every LLM call with the requesting agent, model, token count, and cost, it builds a complete picture of where money is going. This data feeds the FinOps dashboard, which shows cost trends, forecasts, and anomalies — the information teams need to make informed decisions about budgets and model assignments.
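Server-side, the check is conceptually simple. A minimal Python sketch — not Dobby's actual implementation; the state and function names are illustrative:

```python
import json

# Illustrative in-memory budget state; a real gateway would persist this.
BUDGETS = {"code-review": {"limit": 500_000, "used": 512_340}}

def check_budget(agent_id):
    """Return (allowed, error_payload). The payload mirrors the 429 body above."""
    b = BUDGETS.get(agent_id)
    if b is None or b["used"] < b["limit"]:
        return True, None
    return False, {
        "error": "BUDGET_EXHAUSTED",
        "message": f"Agent '{agent_id}' has exceeded its monthly token budget",
        "budget_limit": b["limit"],
        "budget_used": b["used"],
        "resets_at": "2026-04-01T00:00:00Z",
    }

allowed, payload = check_budget("code-review")
print(allowed)
print(json.dumps(payload, indent=2))
```

Running the check before forwarding the call means an exhausted agent never reaches the provider, so a runaway loop stops at the gateway rather than on the invoice.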
How Dobby Enforces Cost Controls
Dobby implements all three layers of cost control through its Agentic Gateway. Token budgets are configured per agent through the dashboard or API, and the gateway checks them on every LLM call. Provider quotas are set at the organization level, with automatic Slack alerts at 80, 90, and 100 percent thresholds. Model restrictions are part of the governance policy system, enforced at the gateway before the call reaches the LLM provider.
The FinOps dashboard provides five views of cost data: overview with daily trends, cost by provider and model, cost by agent with drill-down to individual requests, cost by user, and budget management with forecasting. Monthly forecasts extrapolate from the current run rate, so teams can see whether they are on track to stay within budget before the month ends. Every data point comes from real gateway traffic — there is no estimation or sampling.
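The run-rate forecast is straightforward linear extrapolation. A sketch with illustrative numbers:

```python
def forecast_month_end(spend_to_date, day_of_month, days_in_month):
    """Extrapolate the current run rate to a full month."""
    return spend_to_date / day_of_month * days_in_month

spend = 1_840.00  # illustrative spend through day 12
forecast = forecast_month_end(spend, day_of_month=12, days_in_month=30)
print(f"on track to spend ~${forecast:,.2f} this month")
```

Linear extrapolation assumes a steady run rate; a workload that ramps up mid-month will overshoot the forecast, which is exactly what the anomaly views are there to catch.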
FinOps Dashboard
Cost control without visibility is guesswork. A FinOps dashboard shows cost by agent, by provider, by model, by user, and by department — in real time. Daily trend charts reveal patterns: which agents cost the most, which models drive spending, and where costs are growing fastest. Monthly forecasts based on current run rate help teams plan budgets before they are exceeded, not after.
The most actionable insight from a FinOps dashboard is usually the cost-per-task metric. If your code review agent costs $0.15 per review with GPT-4o but $0.02 per review with GPT-4o-mini — and the quality difference is negligible for routine reviews — that is a 7.5x cost reduction by changing one configuration. These optimizations are invisible without per-agent, per-model cost tracking. They are obvious with it.
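The comparison above, as arithmetic — the per-review costs are the example figures from this article, and the monthly volume is an assumption:

```python
cost_gpt4o = 0.15          # $ per review with GPT-4o (article's example figure)
cost_mini = 0.02           # $ per review with GPT-4o-mini (article's example figure)
reviews_per_month = 2_000  # illustrative review volume

ratio = cost_gpt4o / cost_mini
savings = (cost_gpt4o - cost_mini) * reviews_per_month
print(f"{ratio:.1f}x cheaper, ${savings:,.2f}/month saved")
```

At any nontrivial volume, the per-task delta compounds into a line item worth acting on.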
Getting Started
Start by routing all LLM calls through a gateway. This gives you cost visibility without changing agent behavior. Next, set conservative token budgets based on two weeks of observed usage. Then, review model assignments — you will likely find agents using expensive models for tasks that cheaper models handle equally well. Finally, set provider quotas at the organization level with alerts at 80 percent. In most cases, these four steps reduce agent costs by 40 to 60 percent with no loss in capability.
The key principle is to measure before you optimize. Many teams guess at where their costs are and apply controls in the wrong places. Two weeks of data from a gateway will show you exactly where money is going, and the optimization path becomes obvious. Cost control is not about restricting agents — it is about making sure every dollar spent on LLM calls delivers value to the organization.
Ready to take control of your AI agents?
Start free with Dobby AI — connect, monitor, and govern agents from any framework.
Get Started Free