
AI Agent Cost Control: Budgets, Quotas & Alerts (2026)

Track AI costs per agent, set token budgets, configure provider quotas, and get Slack alerts before you overspend. Reduce LLM spend 20-40% with caching.

10 min read · Gil Kal · Mar 25, 2026

What you will learn

  • Track AI costs per agent, per provider, and per user
  • Set up token budgets with automatic alerts at 80%, 90%, 100%
  • Configure provider quotas to prevent overspend
  • Use the FinOps dashboard to optimize cost allocation
  • Apply 4 quick wins that typically save 20-40% on day one

TL;DR — AI costs are invisible until the invoice arrives. Fix that with three layers: per-call tracking, per-agent budgets with alerts, and provider quotas that prevent 429s. Add semantic caching and model tiering and most teams cut spend 20-40% in the first month.

The AI Cost Problem

AI agent costs are invisible until the bill arrives. A single agent running a loop with GPT-4 can burn $500 in an afternoon. Multiply that by 10 agents across 3 providers, and your monthly AI spend becomes unpredictable and uncontrollable.

The solution is not to stop using AI — it is to instrument every call, set limits, and get alerts before the damage is done.

Without Dobby

End-of-month surprise: $12,000 OpenAI bill. Nobody knows which agent caused it. The team scrambles to check logs across 3 provider dashboards.

With Dobby

Real-time dashboard shows cost per agent. Budget alert fired at $500 (80% of $625 monthly limit). The responsible agent was automatically throttled before reaching the cap.

Three Layers of Cost Control

  • Layer 1: Tracking — Every LLM call is metered with tokens consumed, cost calculated, and attributed to an agent, user, and provider
  • Layer 2: Budgets — Token budgets per agent, per tenant, per organization. Daily and monthly limits with configurable thresholds
  • Layer 3: Quotas — Provider-level quotas that sync with your actual provider limits (requests per day, tokens per minute)
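Layer 1 can be sketched as a per-call record that attributes tokens and cost to an agent, user, and provider. The pricing table and field names below are illustrative assumptions, not the platform's actual schema; real per-model prices vary by provider.

```python
from dataclasses import dataclass

# Hypothetical pricing table (USD per 1M tokens) -- real prices vary by provider.
PRICING = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

@dataclass
class CallRecord:
    """One metered LLM call, attributed for later aggregation."""
    agent_id: str
    user_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        p = PRICING[self.model]
        return (self.input_tokens * p["input"]
                + self.output_tokens * p["output"]) / 1_000_000

rec = CallRecord("agent_backend_001", "user_42", "openai", "gpt-4o", 1200, 300)
```

Aggregating these records by `agent_id` or `provider` is what drives the per-agent and per-provider dashboard views described below.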

Setting Up Token Budgets

```json
// Example: Create a monthly budget for an agent
POST /api/v1/tenants/{tenantId}/budgets
{
  "name": "Backend Agent Monthly",
  "agent_id": "agent_backend_001",
  "budget_type": "monthly",
  "token_limit": 500000,
  "cost_limit_usd": 50.00,
  "alert_thresholds": [80, 90, 100],
  "action_on_limit": "block"  // or "warn" or "throttle"
}
```

When a budget threshold is hit, Dobby sends a Slack alert to #dobby-alerts with the agent name, current spend, and budget limit. At 100%, the agent is automatically blocked from making further LLM calls.
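The threshold-and-block behavior boils down to a simple check. This is a minimal sketch of that logic, not a platform API; `budget_actions` and its signature are assumptions for illustration.

```python
def budget_actions(spent_usd: float, limit_usd: float,
                   thresholds=(80, 90, 100),
                   action_on_limit: str = "block"):
    """Return (alerts_to_fire, blocked) for the current spend.

    alerts_to_fire: every configured threshold the spend has crossed.
    blocked: True once spend hits 100% and the budget action is "block".
    """
    pct = 100 * spent_usd / limit_usd
    alerts = [t for t in thresholds if pct >= t]
    blocked = action_on_limit == "block" and pct >= 100
    return alerts, blocked
```

For example, with the $625 monthly limit from the scenario above, $500 of spend crosses exactly the 80% threshold and fires the first alert without blocking.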

Provider Quotas

Provider quotas are different from budgets. Budgets control how much you want to spend. Quotas reflect how much your provider allows you to spend — API rate limits, requests per day, tokens per minute.

Sync your provider quotas into the platform so agents are automatically throttled before hitting provider rate limits. This prevents 429 errors and wasted retry cycles.
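A common way to throttle ahead of a tokens-per-minute limit is a token bucket. The sketch below is a simplified, single-process illustration under assumed semantics, not how the platform implements quota sync.

```python
import time

class TokenBucket:
    """Throttle requests against a tokens-per-minute quota (simplified)."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.tokens = float(tokens_per_minute)
        self.refill_rate = tokens_per_minute / 60.0  # tokens refilled per second
        self.last = time.monotonic()

    def try_acquire(self, n: int) -> bool:
        # Refill based on elapsed time, capped at the bucket capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        # Caller should wait and retry locally instead of hitting the
        # provider and burning a request on a 429.
        return False

bucket = TokenBucket(60_000)  # e.g. a 60k TPM provider limit
```

When `try_acquire` returns `False`, the agent sleeps until enough tokens refill, so the provider never sees the over-limit request at all.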

The FinOps Dashboard

The FinOps dashboard provides 5 views of your AI spend: Overview (total cost, daily trend, forecast), Cost by Agent (which agents cost the most), Cost by Provider (OpenAI vs Anthropic vs Google), Cost by User (who is consuming the most), and Cost by Department.

  • Daily trend chart with 30-day forecast
  • Per-agent drill-down with provider and model breakdown
  • Budget tracking with visual progress bars
  • Monthly comparison and week-over-week delta

4 Quick Wins (Usually 20-40% Savings)

1. Route simple tasks to cheaper models

Classification, extraction, and short summaries do not need GPT-4o. Gemini Flash and GPT-4o-mini deliver 10-20x cost reduction with negligible quality loss on these tasks.
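Model tiering can start as a simple routing table. The task categories and model names below are examples only; pick the cheap tier that matches your own quality bar.

```python
# Hypothetical routing: tasks that tolerate a small, cheap model.
CHEAP_TASKS = {"classification", "extraction", "short_summary"}

def pick_model(task_type: str) -> str:
    """Route simple tasks to a cheap model, everything else to the default."""
    if task_type in CHEAP_TASKS:
        return "gpt-4o-mini"  # roughly 10-20x cheaper on these tasks
    return "gpt-4o"
```

Even this crude rule captures most of the savings; you can refine it later with per-task quality evaluations.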

2. Turn on semantic caching

The Gateway's semantic cache detects near-duplicate prompts and returns the previous response in under 1ms. For repetitive workloads (FAQ bots, classification pipelines) it eliminates 20-40% of LLM calls outright.
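The core idea of a semantic cache is to embed each prompt and return a stored response when a new prompt is close enough in embedding space. This sketch uses a toy character-bigram "embedding" purely for the demo; a real cache (including the Gateway's) would use a proper sentence-embedding model and a vector index.

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in embedding: hash character bigrams into a small vector.
    # Real systems use a sentence-embedding model instead.
    vec = [0.0] * 64
    t = text.lower()
    for a, b in zip(t, t[1:]):
        vec[(ord(a) * 31 + ord(b)) % 64] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, prompt: str):
        q = embed(prompt)
        for vec, response in self.entries:
            if cosine(q, vec) >= self.threshold:
                return response  # cache hit: skip the LLM call entirely
        return None

    def put(self, prompt: str, response: str):
        self.entries.append((embed(prompt), response))
```

The `threshold` is the key tuning knob: too low and unrelated prompts collide, too high and near-duplicates miss the cache.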

3. Restrict models per agent

A compromised agent that can call Claude Opus is much more expensive than one restricted to Haiku. Model allow-lists cap both cost and blast radius.
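An allow-list check is a one-liner gate in front of every LLM call. Agent IDs and model names here are hypothetical; deny-by-default is the important property.

```python
# Hypothetical per-agent model allow-lists.
ALLOWED_MODELS = {
    "agent_faq_bot":   {"claude-haiku", "gpt-4o-mini"},
    "agent_architect": {"claude-opus", "gpt-4o"},
}

def model_allowed(agent_id: str, model: str) -> bool:
    """Deny by default: unknown agents get no models at all."""
    return model in ALLOWED_MODELS.get(agent_id, set())
```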

4. Review the By Agent page every Monday

Most cost surprises trace back to one agent that slowly drifted. A 5-minute weekly review catches the drift before it compounds.


Frequently Asked Questions

How accurate is per-request cost attribution?

Costs are calculated from the provider's official per-model pricing plus input/output tokens returned by the API. Typical accuracy is within 1-2% of the provider invoice. Reconciliation views show any delta so you can trust the dashboard for monthly close.
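Reconciliation amounts to comparing the sum of metered costs against the invoice and flagging the delta. This is an illustrative sketch of that check, not the platform's reconciliation view.

```python
def reconcile(tracked_costs: list[float], invoice_usd: float,
              tolerance_pct: float = 2.0) -> dict:
    """Compare metered spend to the provider invoice."""
    tracked = sum(tracked_costs)
    delta_pct = 100 * abs(tracked - invoice_usd) / invoice_usd
    return {
        "tracked": round(tracked, 2),
        "invoice": invoice_usd,
        "delta_pct": round(delta_pct, 2),
        "within_tolerance": delta_pct <= tolerance_pct,
    }
```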

Can I set budgets that throttle instead of block?

Yes. Budgets support three action modes: warn (Slack alert only), throttle (rate-limit the agent), and block (stop all LLM calls until the next period).

How do I chargeback cost across departments?

Tag agents with department or cost-center metadata at registration. The Cost by Department view aggregates and exports to CSV for finance, with filterable date ranges and per-tenant breakdown.
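The chargeback aggregation is essentially a group-by over department tags with a CSV export. A minimal sketch, assuming records arrive as (department, cost) pairs:

```python
import csv
import io
from collections import defaultdict

def chargeback_csv(records: list[tuple[str, float]]) -> str:
    """Aggregate (department, cost_usd) pairs into a CSV for finance."""
    totals: dict[str, float] = defaultdict(float)
    for dept, cost in records:
        totals[dept] += cost
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["department", "cost_usd"])
    for dept in sorted(totals):
        writer.writerow([dept, f"{totals[dept]:.2f}"])
    return buf.getvalue()
```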

Ready to try this yourself?

Start free — no credit card required.
