Architecture · Gateway · LLM · MCP

The Agentic Gateway: Control All LLM Traffic

A deep dive into the Agentic Gateway pattern -- a unified proxy that authenticates, meters, and enforces governance policies on every LLM and MCP request.

Gil Kal · March 15, 2026 · 8 min read

Every organization using AI agents has the same problem: LLM API calls are scattered across services, each with its own API key, no consistent cost tracking, and no way to enforce usage policies. A developer on the backend team uses GPT-4 directly. The data team has a LangChain pipeline calling Claude. The DevOps agent hits Gemini through a custom wrapper. Each integration is a silo -- separate credentials, separate billing, separate rules.

The Agentic Gateway pattern solves this by routing all LLM and MCP traffic through a single authenticated proxy. Instead of each agent managing its own provider connections, every request flows through one gateway that handles authentication, cost tracking, policy enforcement, and observability. It is the same pattern that API gateways like Kong and Apigee brought to REST APIs -- but purpose-built for the unique requirements of AI agent traffic.

What Is an Agentic Gateway?

Think of it as an API gateway specifically designed for AI agent traffic. It sits between your agents and the LLM providers, handling four critical functions. Authentication through a three-tier key system ensures every request is tied to a known identity. Cost metering tracks tokens, dollars, and latency on every request in real time. Policy enforcement applies model restrictions, rate limits, and content filtering before requests reach the provider. And observability provides distributed tracing, anomaly detection, and behavior profiling across your entire agent fleet.

The gateway is OpenAI-compatible, which means any tool or SDK that works with the OpenAI API works with the gateway out of the box. You change the base URL and the API key, and everything else stays the same. No custom SDK, no code rewrite, no vendor lock-in. This is a deliberate design choice: the best infrastructure disappears into your existing workflow rather than demanding you adopt a new one.

The Three-Tier Key System

Not all callers are equal, and the gateway's key system reflects this reality. User keys (gk_user_*) are designed for individual developers, each with personal rate limits of 100 requests per minute and 8 possible permission scopes. Service keys (gk_svc_*) are for backend services and production agents, with higher throughput at 500 requests per minute and broader access. Temporary keys (gk_tmp_*) are for CI/CD pipelines, one-off experiments, and ephemeral tasks -- they auto-expire after a configured duration and are limited to 50 requests per minute.

Each key is SHA-256 hashed before storage -- the raw key is never persisted, only shown once at creation time. Keys support IP allowlisting for restricting access to known networks, scope restrictions that limit which API features a key can access, and per-key budget limits that cap spend independently from organization budgets. A single user can hold up to 20 keys per organization, and each key's usage is tracked independently for fine-grained cost attribution down to the individual developer or service.
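To make the tiering concrete, here is a minimal sketch of prefix-based key classification and hash-before-storage. The table and function names are illustrative assumptions, not Dobby's actual schema; only the rate limits and the SHA-256 rule come from the description above.

```python
import hashlib

# Illustrative tier table using the limits described above;
# the real gateway's key store is internal to Dobby.
KEY_TIERS = {
    "gk_user_": {"tier": "user", "rpm_limit": 100},
    "gk_svc_":  {"tier": "service", "rpm_limit": 500},
    "gk_tmp_":  {"tier": "temporary", "rpm_limit": 50},
}

def classify_key(raw_key: str) -> dict:
    """Resolve a key's tier from its prefix."""
    for prefix, meta in KEY_TIERS.items():
        if raw_key.startswith(prefix):
            return meta
    raise ValueError("unknown key prefix")

def hash_for_storage(raw_key: str) -> str:
    """SHA-256 the key; only the digest is ever persisted."""
    return hashlib.sha256(raw_key.encode()).hexdigest()

key = "gk_svc_example123"
print(classify_key(key))   # {'tier': 'service', 'rpm_limit': 500}
print(hash_for_storage(key)[:12], "...")  # digest prefix, never the raw key
```

Because only the digest is stored, a database leak never exposes usable keys, which is why the raw key can only be shown once at creation time.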

How It Works in Practice

Using the gateway requires exactly two changes to your existing code: swap the base URL and use a gateway key instead of a provider key. Here is a complete example using the standard OpenAI Python SDK to make a call through Dobby's Agentic Gateway:

from openai import OpenAI

# Route through Dobby's Agentic Gateway
client = OpenAI(
    api_key="gk_user_your_gateway_key",
    base_url="https://dobby-ai.com/api/v1/gateway"
)

# Use any supported model from any provider
response = client.chat.completions.create(
    model="claude-sonnet-4",   # Anthropic via gateway
    messages=[
        {"role": "system", "content": "You are a code reviewer."},
        {"role": "user", "content": "Review this function for bugs."}
    ],
    stream=True  # Streaming fully supported
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Behind the scenes, the gateway validates your key and checks the associated scopes and permissions. It then evaluates the request against the active policy: is the requested model allowed? Is there budget remaining for this organization and this key? Is the rate limit satisfied? If all checks pass, it routes the request to the correct provider, streams the response back chunk by chunk, and logs the full request with input tokens, output tokens, model, calculated cost in dollars, and response latency -- all in a single pass with minimal overhead.
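The check sequence above can be sketched as a short-circuiting pipeline. The `Policy` fields and return codes here are hypothetical names chosen for illustration; the ordering of checks follows the description.

```python
from dataclasses import dataclass

# Hypothetical policy shape mirroring the checks described above;
# field names are illustrative, not Dobby's actual schema.
@dataclass
class Policy:
    allowed_models: set
    budget_remaining_usd: float
    rpm_limit: int

def check_request(policy: Policy, model: str, rpm_used: int,
                  est_cost: float) -> tuple:
    """Run the gate checks in order; the first failure short-circuits."""
    if model not in policy.allowed_models:
        return (False, "model_not_allowed")
    if est_cost > policy.budget_remaining_usd:
        return (False, "budget_exhausted")
    if rpm_used >= policy.rpm_limit:
        return (False, "rate_limited")
    return (True, "ok")

policy = Policy({"claude-sonnet-4", "gpt-4"}, 12.50, 100)
print(check_request(policy, "claude-sonnet-4", rpm_used=3, est_cost=0.02))
# (True, 'ok')
```

Running every check before the provider sees the request is what keeps a policy violation from ever costing a token.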

Cost Control That Actually Works

The gateway meters every request with precision: input tokens, output tokens, model used, provider, calculated cost in dollars, and response latency. This data flows into real-time dashboards showing cost by model, cost by team, cost by agent, and cost by user. You can drill down from a monthly total to an individual request in seconds.

Budget enforcement is proactive, not reactive. You set daily and monthly limits at the organization, tenant, or agent level. The gateway checks the remaining budget before forwarding each request. Budget alerts fire at 80 percent, 90 percent, and 100 percent thresholds via Slack. When a budget is exhausted, the gateway returns a clear error -- it does not silently drop requests or let costs continue to accumulate.
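The threshold logic is simple enough to sketch: an alert fires only when a spend update *crosses* a threshold, so repeated updates past 80 percent don't re-alert. This is a toy version under that assumption; the real alerting pipeline is Dobby's.

```python
THRESHOLDS = (0.80, 0.90, 1.00)

def alerts_to_fire(prev_spend: float, new_spend: float, limit: float) -> list:
    """Return the budget thresholds crossed by this spend update."""
    return [t for t in THRESHOLDS if prev_spend < limit * t <= new_spend]

print(alerts_to_fire(70.0, 95.0, 100.0))   # [0.8, 0.9]
print(alerts_to_fire(95.0, 101.0, 100.0))  # [1.0]
```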

The kill-switch is the last line of defense. It can instantly halt all LLM traffic for an organization, with three scopes: stop everything, stop only LLM calls (preserving MCP tool access), or block only new key creation. The kill-switch state is cached in Redis with a 5-second TTL, ensuring fast propagation without sacrificing reliability.
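The caching trade-off is worth seeing in miniature: a stale read costs at most 5 seconds, while every hot-path request avoids a database round trip. This sketch uses an in-memory dict as a stand-in for Redis; the scope values and function names are illustrative.

```python
import time

# In-memory stand-in for the Redis cache described above (5-second
# TTL); a real deployment would use redis-py's GET/SETEX instead.
_cache = {}
TTL_SECONDS = 5

def get_kill_switch(org_id: str, fetch_from_db) -> str:
    """Return the org's kill-switch scope, consulting the cache first."""
    entry = _cache.get(org_id)
    now = time.monotonic()
    if entry and now - entry[1] < TTL_SECONDS:
        return entry[0]
    scope = fetch_from_db(org_id)  # e.g. "none", "all", "llm_only"
    _cache[org_id] = (scope, now)
    return scope

calls = []
def db_lookup(org_id):
    calls.append(org_id)
    return "llm_only"

print(get_kill_switch("org_1", db_lookup))  # "llm_only" (DB hit)
print(get_kill_switch("org_1", db_lookup))  # "llm_only" (cache hit)
print(len(calls))  # 1 -- the second call never touched the DB
```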

Streaming and Provider Failover

Production AI applications demand streaming responses. The gateway supports full Server-Sent Events streaming across all 13+ supported providers, translating between each provider's streaming format and the OpenAI-compatible SSE protocol that clients expect. This means you can switch from GPT-4 to Claude to Gemini without changing your client-side streaming code.
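The client-facing side of that translation can be sketched as follows. The payload follows the OpenAI chat-completions streaming schema (`chat.completion.chunk` objects, terminated by a `data: [DONE]` sentinel); how the gateway maps each provider's native format into it is internal.

```python
import json

# Sketch of emitting one OpenAI-compatible SSE frame per content
# chunk, regardless of which upstream provider produced it.
def to_sse(chunk_text: str, model: str, done: bool = False) -> str:
    if done:
        return "data: [DONE]\n\n"
    payload = {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": chunk_text}}],
    }
    return f"data: {json.dumps(payload)}\n\n"

print(to_sse("Hello", "claude-sonnet-4"), end="")
print(to_sse("", "claude-sonnet-4", done=True), end="")
```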

Provider reliability is handled through a circuit breaker pattern. If a provider starts returning errors, the circuit breaker opens after 5 consecutive failures and stops routing traffic to that provider for 30 seconds. After the cooldown, a single test request is sent. If it succeeds, normal traffic resumes. If the organization has configured fallback providers, the gateway automatically routes to the fallback during an outage -- transparent to the calling agent.
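The breaker's state machine is small enough to show in full: closed until 5 consecutive failures, open for a 30-second cooldown, then half-open for a single trial. This is a generic sketch of the pattern with those parameters, not Dobby's implementation; the injectable clock exists only to make the example testable.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures, stays open for `cooldown` seconds, then allows one
    half-open trial request."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            return True  # half-open: let one trial request through
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

fake_time = [0.0]
cb = CircuitBreaker(clock=lambda: fake_time[0])
for _ in range(5):
    cb.record_failure()
print(cb.allow_request())   # False -- circuit is open
fake_time[0] += 30.0
print(cb.allow_request())   # True -- cooldown elapsed, half-open trial
```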

The best infrastructure is invisible. Your agents should not know or care that their LLM provider had a 30-second outage -- the gateway handles it.

Security and DLP

Every request through the gateway passes through configurable Data Loss Prevention filters. The DLP engine scans for 9 categories of PII -- credit card numbers, social security numbers, email addresses, phone numbers, API keys, and more -- and can redact or block requests that contain sensitive data before they reach the LLM provider. This is critical for organizations handling customer data, financial information, or healthcare records.
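A redaction pass over two of those categories looks roughly like this. The regexes are deliberately simplified illustrations; production DLP rules need to handle far more formats (and the gateway's nine categories cover more than shown here).

```python
import re

# Two of the nine PII categories mentioned above, as simplified
# illustrative patterns -- not production-grade DLP rules.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str):
    """Replace matched PII with [REDACTED:<category>] tags."""
    found = []
    for category, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(category)
            text = pattern.sub(f"[REDACTED:{category}]", text)
    return text, found

clean, hits = redact("Contact jane@example.com, SSN 123-45-6789")
print(clean)  # Contact [REDACTED:email], SSN [REDACTED:ssn]
print(hits)   # ['ssn', 'email']
```

Running this *before* the request leaves your network is the whole point: the provider never sees the sensitive tokens at all.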

The gateway also supports anomaly detection. A behavior profiling subsystem learns the normal patterns for each key and each agent -- typical request volume, average token count, model usage distribution -- and flags deviations. If a developer key that normally makes 20 requests per day suddenly makes 2,000, an alert fires. This catches compromised keys, runaway agents, and prompt injection attacks.
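A toy version of the volume check: flag a day whose request count sits far outside the key's historical distribution. The real profiler also tracks token counts and model usage per the description above; the z-score threshold of 3 here is an arbitrary illustrative choice.

```python
import statistics

def is_anomalous(history: list, today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's request count if it deviates strongly from history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0  # guard against zero spread
    z = (today - mean) / stdev
    return z > z_threshold

baseline = [18, 22, 20, 19, 21, 20, 23]   # ~20 requests/day
print(is_anomalous(baseline, 2000))  # True -- likely a compromised key
print(is_anomalous(baseline, 25))    # False -- within normal variation
```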

MCP Server Integration

Beyond LLM proxying, the gateway also serves as a full MCP (Model Context Protocol) server. It exposes over 70 tools organized into functional groups: task management (create, update, list tasks), agent control (pause, resume, check status), knowledge base access (search documents, add memories), scheduling (create and manage recurring jobs), and GitHub integration (PR management, code review). All tools are authenticated through the same key system. This means Claude Desktop, Cursor, VS Code, Gemini CLI, and other MCP clients can connect directly to your organization's agent infrastructure with proper access controls, scopes, and audit logging.

The MCP endpoint accepts JSON-RPC 2.0 messages and supports the complete MCP specification including tools, resources, and prompts. Every MCP tool call is logged with the same fidelity as LLM requests -- who called it, what tool, what parameters, what result, and how long it took. This gives you a unified view of all agent activity regardless of whether it was an LLM call or a tool invocation. For security teams, this means one audit trail covers everything: every model query and every tool execution across the entire organization.
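On the wire, an MCP tool invocation is a JSON-RPC 2.0 request using the spec's `tools/call` method. The tool name and arguments below are hypothetical examples, not a documented Dobby tool signature.

```python
import json

# Illustrative JSON-RPC 2.0 envelope for an MCP tools/call request;
# "create_task" and its arguments are hypothetical examples.
def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

msg = mcp_tool_call(1, "create_task", {"title": "Triage gateway alerts"})
print(msg)
```

Because every such message carries an authenticated key and a tool name, the audit log can attribute each invocation exactly as it does an LLM call.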

Building Your Own vs. Using Dobby

You can build a basic LLM proxy in a weekend. Route requests, add an API key check, log the tokens. But production-grade features take months to get right. Circuit breakers per provider with configurable thresholds. Streaming support that handles backpressure correctly across 13 different provider APIs. Anomaly detection that learns normal behavior and alerts on deviations. Behavior profiling for security. DLP scanning with configurable rules. SOC 2-compliant immutable audit logging with 365-day retention. Budget enforcement that checks limits in real time without adding latency.

Dobby's Agentic Gateway ships with all of this out of the box. It is built on 52 TypeScript modules with dedicated subsystems for anomaly detection, behavior profiling, FinOps analytics, performance monitoring, and quota synchronization. The gateway handles over 40 API routes across core operations, organization-level management, and semantic caching. A two-layer semantic cache (Redis for exact matches under 1 millisecond, BigQuery cosine similarity for similar queries at 20 to 50 milliseconds) reduces costs further by serving cached responses for repeated or near-identical requests. Teams that adopt it save 3 to 6 months of infrastructure work and get enterprise-grade governance from day one.
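The two-layer lookup logic reads naturally as: exact match first, similarity scan second. This toy sketch uses a dict in place of Redis and a linear cosine scan in place of BigQuery; the hand-made 3-vectors and the 0.95 threshold are illustrative assumptions.

```python
import math

# Toy two-layer cache mirroring the design above: an exact-match map
# (standing in for Redis) and a cosine-similarity scan (standing in
# for the BigQuery layer).
exact_cache = {}
semantic_cache = []  # list of (embedding, response) pairs
SIM_THRESHOLD = 0.95

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def lookup(prompt, embedding):
    if prompt in exact_cache:                    # layer 1: exact match
        return exact_cache[prompt]
    for cached_emb, response in semantic_cache:  # layer 2: similarity
        if cosine(embedding, cached_emb) >= SIM_THRESHOLD:
            return response
    return None

def store(prompt, embedding, response):
    exact_cache[prompt] = response
    semantic_cache.append((embedding, response))

store("What is an API gateway?", [1.0, 0.1, 0.0], "An API gateway is...")
print(lookup("What is an API gateway?", [1.0, 0.1, 0.0]))     # exact hit
print(lookup("Explain API gateways", [0.98, 0.12, 0.01]))     # semantic hit
```

The payoff of the second layer is that a rephrased question still hits the cache, which is where most of the repeated-query savings come from.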

Ready to take control of your AI agents?

Start free with Dobby AI — connect, monitor, and govern agents from any framework.

Get Started Free