Building a Control Plane for AI Agents: Architecture Guide
The architecture behind an AI agent control plane. Multi-tenant isolation, policy engines, audit trails, and gateway design.
What you will learn
- Understand the architectural layers of an AI agent control plane
- Design multi-tenant isolation for agent workspaces
- Build a policy engine that scales across thousands of agents
- Architect an audit trail that satisfies SOC 2 and GDPR requirements
What Makes a Control Plane?
A control plane is the management layer that sits above operational workloads. Kubernetes is a control plane for containers — it does not run your code, it orchestrates where and how your code runs. Similarly, an AI agent control plane does not execute agent logic — it connects, observes, governs, and scales agents from any framework.
The key insight: the control plane must be framework-agnostic. It cannot assume all agents use CrewAI, or LangChain, or OpenAI. It must work with any agent that can make an HTTP call.
The Five Architectural Layers
- Gateway Layer — unified proxy for all LLM and tool requests. Authentication, rate limiting, cost tracking.
- Policy Layer — 4-level policy hierarchy (platform → org → tenant → user). Model restrictions, budgets, DLP.
- Observability Layer — immutable audit trail, real-time activity feed, cost analytics, health monitoring.
- Governance Layer — approval workflows, kill-switch, role-based access control, compliance controls.
- Integration Layer — A2A, MCP, REST, webhooks. Connect agents from any framework via any protocol.
Multi-Tenant Isolation
Multi-tenancy is the most critical architectural decision. Every organization gets complete data isolation. Every tenant (workspace) within an organization gets its own dataset, its own policies, and its own agent instances.
// 3-level hierarchy with regional isolation
//
// Platform (global)
// └── Organization (org_xxx)
// ├── Tenant IL (tenant_xxx) → ds_tenant_il, ds_agents_il
// ├── Tenant EU (tenant_yyy) → ds_tenant_eu, ds_agents_eu
// └── Tenant US (tenant_zzz) → ds_tenant_us, ds_agents_us
//
// Data NEVER crosses regional boundaries
// Cross-region visibility only via UNION ALL views (admin only)

The anti-pattern: a shared database with a tenant_id column filter. A single SQL injection or query bug exposes data across tenants. A compliance nightmare.

The approach here: physically separate datasets per region, with each tenant's data in its own partitioned, clustered tables. A bug in one tenant cannot affect another, and the compliance boundary stays clean.
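A minimal sketch of dataset resolution under this model. The `ds_tenant_<region>` and `ds_agents_<region>` names follow the diagram above; the registry structure and function name are illustrative assumptions, not the actual implementation:

```python
# Resolve the region-local datasets for a tenant. Dataset names follow
# the ds_tenant_<region> / ds_agents_<region> convention shown above;
# the registry shape itself is an illustrative assumption.

TENANT_REGISTRY = {
    "tenant_xxx": {"org": "org_xxx", "region": "il"},
    "tenant_yyy": {"org": "org_xxx", "region": "eu"},
    "tenant_zzz": {"org": "org_xxx", "region": "us"},
}

def resolve_datasets(tenant_id: str) -> dict:
    """Map a tenant to its region-local datasets. Raising on unknown
    tenants, rather than falling back to any shared dataset, is what
    keeps data from ever crossing a regional boundary."""
    meta = TENANT_REGISTRY.get(tenant_id)
    if meta is None:
        raise KeyError(f"unknown tenant: {tenant_id}")
    region = meta["region"]
    return {
        "tenant_data": f"ds_tenant_{region}",
        "agent_data": f"ds_agents_{region}",
    }
```

For example, `resolve_datasets("tenant_yyy")` routes every query for that workspace to the EU datasets, and an unregistered tenant fails hard instead of leaking into a default.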
The Policy Engine
Policies cascade through 4 layers. A platform policy sets the floor (minimum security). An organization policy can be stricter but not more permissive. A tenant policy applies within a workspace. A user-level override handles exceptions.
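The cascade can be sketched as a fold over the four levels. The dict shapes and the `resolve_policy` helper are illustrative assumptions, but the rules match the description above: org and tenant may only tighten, and a user-level override wins outright:

```python
# Resolve an effective policy by folding the 4-level cascade.
# Org and tenant levels can only tighten (lower max_tokens,
# narrower model list); a user override applies last as an
# explicit exception.

def resolve_policy(platform, org=None, tenant=None, user_override=None):
    effective = dict(platform)
    for level in (org, tenant):
        if not level:
            continue
        if "max_tokens" in level:
            effective["max_tokens"] = min(effective["max_tokens"],
                                          level["max_tokens"])
        if "models" in level:
            # intersect: a lower level can remove models, never add them
            allowed = set(effective["models"]) & set(level["models"])
            effective["models"] = sorted(allowed)
    if user_override:
        effective.update(user_override)  # explicit exception, e.g. lead engineer
    return effective
```

The platform-level `[all]` wildcard is represented here as an explicit model list for simplicity; handling a true wildcard is left out. Running this with the values from the example below reproduces the same effective policies.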
// Policy cascade example
// Platform: max_tokens = 8192, models = [all]
// Organization: max_tokens = 4096, models = [gpt-4o, claude-sonnet-4-20250514]
// Tenant: max_tokens = 4096, models = [claude-sonnet-4-20250514] (stricter)
// User override: max_tokens = 8192 (exception for lead engineer)
//
// Effective for lead engineer: max_tokens = 8192, models = [claude-sonnet-4-20250514]
// Effective for everyone else: max_tokens = 4096, models = [claude-sonnet-4-20250514]

Audit Trail Architecture
The audit trail uses an append-only pattern in BigQuery. Every event is a new row — never UPDATE, never DELETE. This creates an immutable record that satisfies SOC 2 and GDPR requirements. Each event includes: what happened, who did it, when, which agent, which model, how many tokens, and the full request/response.
- task_timeline — every task event (created, assigned, approved, completed, failed)
- llm_gateway_requests — every LLM call (model, tokens, cost, latency, actor)
- admin_audit_log — every admin action (policy change, key creation, role assignment)
- security_events — authentication, authorization, anomalies (365-day retention)
Dobby stores over 135 tables across 6 BigQuery datasets, with regional replication for IL/EU/US. The append-only pattern with version + is_latest columns ensures data integrity while supporting efficient queries via partitioning (by date) and clustering (by tenant_id, status).
Gateway Design Principles
- OpenAI-compatible API — standard SDK works, no custom client libraries needed
- Circuit breaker — 5 failures → open, 30s → half-open, prevents cascade failures
- Rate limiting — per-key + per-org, sliding window in Redis, configurable by key type
- Semantic cache — two-layer (Redis exact + BigQuery cosine similarity), opt-in per org
- Kill-switch — cached 5s in Redis, propagates across all gateway instances instantly
- DLP — 9 PII patterns redacted before LLM calls (configurable per org policy)
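The circuit-breaker bullet above (5 failures open the circuit, 30 s to half-open) can be sketched as a small state machine. The thresholds come from the list; the class shape and an injectable clock are illustrative assumptions:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures. After `reset_after`
    seconds the circuit goes half-open and lets a single probe through;
    a probe success closes the circuit, a failure re-opens it."""

    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None: closed (or half-open probe in flight)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            # half-open: admit one probe; one more failure re-opens
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False  # open: fail fast, protect the upstream

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Failing fast while the circuit is open is what prevents one struggling LLM provider from tying up every gateway worker in a cascade.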
The Gateway is designed to fail open for non-critical features (semantic cache, analytics) and fail closed for security features (auth, DLP, kill-switch). A Redis outage degrades cache performance but never blocks legitimate requests.
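The fail-open vs. fail-closed split can be expressed as a pair of tiny wrappers. The function names, the logger, and the choice of `PermissionError` are illustrative assumptions; the behavior (degrade non-critical features, block on security-feature failure) follows the paragraph above:

```python
# Fail-open for non-critical features (semantic cache, analytics):
# on error, log and return a fallback so the request continues.
# Fail-closed for security features (auth, DLP, kill-switch):
# on error, block the request rather than bypass the check.
import logging

log = logging.getLogger("gateway")

def fail_open(feature: str, fn, fallback=None):
    try:
        return fn()
    except Exception as exc:  # e.g. a Redis outage
        log.warning("%s degraded: %s", feature, exc)
        return fallback       # request proceeds without the feature

def fail_closed(feature: str, fn):
    try:
        return fn()
    except Exception as exc:
        raise PermissionError(f"{feature} unavailable, blocking request") from exc
```

Under this split, a dead cache lookup wrapped in `fail_open` returns a miss and the request carries on, while a kill-switch check wrapped in `fail_closed` refuses the request if its backing store cannot be reached.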