Building a Control Plane for AI Agents: Architecture Guide
The architecture behind an AI agent control plane. Multi-tenant isolation, policy engines, audit trails, and gateway design.
What you will learn
- Understand the architectural layers of an AI agent control plane
- Design multi-tenant isolation for agent workspaces
- Build a policy engine that scales across thousands of agents
- Architect an audit trail that satisfies SOC 2 and GDPR requirements
What Makes a Control Plane?
A control plane is the management layer that sits above operational workloads. Kubernetes is a control plane for containers — it does not run your code, it orchestrates where and how your code runs. Similarly, an AI agent control plane does not execute agent logic — it connects, observes, governs, and scales agents from any framework.
The key insight: the control plane must be framework-agnostic. It cannot assume all agents use CrewAI, or LangChain, or OpenAI. It must work with any agent that can make an HTTP call.
The Five Architectural Layers
- Gateway Layer — unified proxy for all LLM and tool requests. Authentication, rate limiting, cost tracking.
- Policy Layer — 4-level policy hierarchy (platform → org → tenant → user). Model restrictions, budgets, DLP.
- Observability Layer — immutable audit trail, real-time activity feed, cost analytics, health monitoring.
- Governance Layer — approval workflows, kill-switch, role-based access control, compliance controls.
- Integration Layer — A2A, MCP, REST, webhooks. Connect agents from any framework via any protocol.
Multi-Tenant Isolation
Multi-tenancy is the most critical architectural decision. Every organization gets complete data isolation. Every tenant (workspace) within an organization gets its own dataset, its own policies, and its own agent instances.
// 3-level hierarchy with regional isolation
//
// Platform (global)
// └── Organization (org_xxx)
// ├── Tenant IL (tenant_xxx) → ds_tenant_il, ds_agents_il
// ├── Tenant EU (tenant_yyy) → ds_tenant_eu, ds_agents_eu
// └── Tenant US (tenant_zzz) → ds_tenant_us, ds_agents_us
//
// Data NEVER crosses regional boundaries
// Cross-region visibility only via UNION ALL views (admin only)

The anti-pattern: a shared database with a tenant_id column filter. A single SQL injection or query bug exposes data across tenants. A compliance nightmare.

The approach here: physically separate datasets per region, with each tenant's data in its own partitioned, clustered tables. A bug in one tenant cannot affect another, and the compliance boundary stays clean.
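A minimal sketch of dataset resolution under this model. The `ds_tenant_<region>` and `ds_agents_<region>` names follow the diagram above; the registry structure and function name are illustrative assumptions, not the actual implementation:

```python
# Resolve the region-local datasets for a tenant. Dataset names follow
# the ds_tenant_<region> / ds_agents_<region> convention shown above;
# the registry shape itself is an illustrative assumption.

TENANT_REGISTRY = {
    "tenant_xxx": {"org": "org_xxx", "region": "il"},
    "tenant_yyy": {"org": "org_xxx", "region": "eu"},
    "tenant_zzz": {"org": "org_xxx", "region": "us"},
}

def resolve_datasets(tenant_id: str) -> dict:
    """Map a tenant to its region-local datasets. Raising on unknown
    tenants, rather than falling back to any shared dataset, is what
    keeps data from ever crossing a regional boundary."""
    meta = TENANT_REGISTRY.get(tenant_id)
    if meta is None:
        raise KeyError(f"unknown tenant: {tenant_id}")
    region = meta["region"]
    return {
        "tenant_data": f"ds_tenant_{region}",
        "agent_data": f"ds_agents_{region}",
    }
```

For example, `resolve_datasets("tenant_yyy")` routes every query for that workspace to the EU datasets, and an unregistered tenant fails hard instead of leaking into a default.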
The Policy Engine
Policies cascade through 4 layers. A platform policy sets the floor (minimum security). An organization policy can be stricter but not more permissive. A tenant policy applies within a workspace. A user-level override handles exceptions.
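The cascade can be sketched as a fold over the four levels. The dict shapes and the `resolve_policy` helper are illustrative assumptions, but the rules match the description above: org and tenant may only tighten, and a user-level override wins outright:

```python
# Resolve an effective policy by folding the 4-level cascade.
# Org and tenant levels can only tighten (lower max_tokens,
# narrower model list); a user override applies last as an
# explicit exception.

def resolve_policy(platform, org=None, tenant=None, user_override=None):
    effective = dict(platform)
    for level in (org, tenant):
        if not level:
            continue
        if "max_tokens" in level:
            effective["max_tokens"] = min(effective["max_tokens"],
                                          level["max_tokens"])
        if "models" in level:
            # intersect: a lower level can remove models, never add them
            allowed = set(effective["models"]) & set(level["models"])
            effective["models"] = sorted(allowed)
    if user_override:
        effective.update(user_override)  # explicit exception, e.g. lead engineer
    return effective
```

The platform-level `[all]` wildcard is represented here as an explicit model list for simplicity; handling a true wildcard is left out. Running this with the values from the example below reproduces the same effective policies.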
// Policy cascade example
// Platform: max_tokens = 8192, models = [all]
// Organization: max_tokens = 4096, models = [gpt-4o, claude-sonnet-4-20250514]
// Tenant: max_tokens = 4096, models = [claude-sonnet-4-20250514] (stricter)
// User override: max_tokens = 8192 (exception for lead engineer)
//
// Effective for lead engineer: max_tokens = 8192, models = [claude-sonnet-4-20250514]
// Effective for everyone else: max_tokens = 4096, models = [claude-sonnet-4-20250514]

Audit Trail Architecture
The audit trail uses an append-only pattern in BigQuery. Every event is a new row — never UPDATE, never DELETE. This creates an immutable record that satisfies SOC 2 and GDPR requirements. Each event includes: what happened, who did it, when, which agent, which model, how many tokens, and the full request/response.
- task_timeline — every task event (created, assigned, approved, completed, failed)
- llm_gateway_requests — every LLM call (model, tokens, cost, latency, actor)
- admin_audit_log — every admin action (policy change, key creation, role assignment)
- security_events — authentication, authorization, anomalies (365-day retention)
Dobby stores over 135 tables across 6 BigQuery datasets, with regional replication for IL/EU/US. The append-only pattern with version + is_latest columns ensures data integrity while supporting efficient queries via partitioning (by date) and clustering (by tenant_id, status).
Gateway Design Principles
- OpenAI-compatible API — standard SDK works, no custom client libraries needed
- Circuit breaker — 5 failures → open, 30s → half-open, prevents cascade failures
- Rate limiting — per-key + per-org, sliding window in Redis, configurable by key type
- Semantic cache — two-layer (Redis exact + BigQuery cosine similarity), opt-in per org
- Kill-switch — cached 5s in Redis, propagates across all gateway instances instantly
- DLP — 9 PII patterns redacted before LLM calls (configurable per org policy)
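The circuit-breaker bullet above (5 failures open the circuit, 30 s to half-open) can be sketched as a small state machine. The thresholds come from the list; the class shape and an injectable clock are illustrative assumptions:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures. After `reset_after`
    seconds the circuit goes half-open and lets a single probe through;
    a probe success closes the circuit, a failure re-opens it."""

    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None: closed (or half-open probe in flight)

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            # half-open: admit one probe; one more failure re-opens
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False  # open: fail fast, protect the upstream

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Failing fast while the circuit is open is what prevents one struggling LLM provider from tying up every gateway worker in a cascade.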
The Gateway is designed to fail open for non-critical features (semantic cache, analytics) and fail closed for security features (auth, DLP, kill-switch). A Redis outage degrades cache performance but never blocks legitimate requests.
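The fail-open vs. fail-closed split can be expressed as a pair of tiny wrappers. The function names, the logger, and the choice of `PermissionError` are illustrative assumptions; the behavior (degrade non-critical features, block on security-feature failure) follows the paragraph above:

```python
# Fail-open for non-critical features (semantic cache, analytics):
# on error, log and return a fallback so the request continues.
# Fail-closed for security features (auth, DLP, kill-switch):
# on error, block the request rather than bypass the check.
import logging

log = logging.getLogger("gateway")

def fail_open(feature: str, fn, fallback=None):
    try:
        return fn()
    except Exception as exc:  # e.g. a Redis outage
        log.warning("%s degraded: %s", feature, exc)
        return fallback       # request proceeds without the feature

def fail_closed(feature: str, fn):
    try:
        return fn()
    except Exception as exc:
        raise PermissionError(f"{feature} unavailable, blocking request") from exc
```

Under this split, a dead cache lookup wrapped in `fail_open` returns a miss and the request carries on, while a kill-switch check wrapped in `fail_closed` refuses the request if its backing store cannot be reached.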