Product · webhooks · gateway · async

Webhooks Are Live on Dobby Gateway

Subscribe to signed HTTP events — approvals, kill-switches, policy blocks — from Dobby Gateway. HMAC-SHA256-signed, retried on 5xx, routed to a DLQ on 4xx, and inspectable in the admin dashboard. Here's how it works and why we built it the way we did.

Gil Kal · April 19, 2026 · 6 min read

Today we shipped customer-facing webhooks on Dobby Gateway. If you've been polling our approval status endpoint waiting for a human decision, you don't have to anymore — we'll POST to a URL you own, signed with a secret only you hold, the moment the decision lands. Same thing for kill-switches, policy blocks, and agent registrations.

This is a small surface area — five event types at launch — but it closes a real pain point. A gateway that stops at the HTTP response loses every system that needs to react asynchronously. Your SIEM wants policy violations. Your on-call rotation wants kill-switches. Your provisioning workflow wants 'agent registered' so it can seed dashboards. Polling for each of these from five separate teams is how you end up with a duplicate Kafka pipeline nobody admits to running.

What you get in the box

  • A webhook UI at /dashboard/gateway/webhooks — URL, events, description, test button.
  • An API mirror: POST /api/v1/gateway/webhooks with events + URL, GET to list, DELETE to revoke, and a rotate-secret endpoint.
  • Five event types at v1: approval.pending, approval.resolved, kill_switch.activated, gateway.policy_violation, agent.registered.
  • HMAC-SHA256 signatures with a timestamp, in a header format that matches what Stripe and Slack use.
  • Retries on 5xx and timeouts with exponential backoff (~30s → 8m), five attempts max, then the delivery lands in a DLQ.
  • A per-webhook circuit breaker — five consecutive failures auto-disable the subscription, so a broken endpoint can't blast an inbox once it wakes up.
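The circuit breaker is simple enough to sketch in a few lines. This is an illustrative model, not our actual implementation — in production the failure counter lives alongside the subscription record:

```javascript
// Illustrative circuit breaker: five consecutive failures disable the
// subscription; any successful delivery resets the counter.
const FAILURE_THRESHOLD = 5;

function recordAttempt(sub, success) {
  if (success) return { ...sub, consecutiveFailures: 0 };
  const consecutiveFailures = sub.consecutiveFailures + 1;
  return {
    ...sub,
    consecutiveFailures,
    // Once tripped, the subscription stays disabled until re-enabled explicitly.
    enabled: consecutiveFailures < FAILURE_THRESHOLD && sub.enabled,
  };
}
```

A tripped subscription stays off until someone turns it back on, which is the point: a recovered endpoint should not be greeted by a backlog of queued deliveries.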

Why BullMQ on Valkey, not Cloud Tasks

The interesting decision was the transport. Cloud Tasks is the obvious GCP-native answer, but we chose BullMQ on our existing Valkey cluster, and I want to walk through why because it's a decision we keep making across the platform.

Dobby already runs a shared Valkey (Redis 8) cluster via GCP Memorystore with Private Service Connect. Rate limiting uses it, the cache uses it, session storage uses it. Layering a BullMQ queue on top means zero new infrastructure — one more ioredis connection with different retry options, one new deployment for a Node worker process. Cloud Tasks would add a managed queue we'd need IAM bindings for, OIDC token minting on HTTP invocations, and a separate story for multi-region residency when we get there.
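As a sketch of what that layering looks like — queue name, payload shape, and option values here are illustrative, not our production config — the retry policy from the launch list maps directly onto BullMQ job options:

```javascript
// BullMQ job options implementing "five attempts, exponential backoff from ~30s".
// BullMQ doubles the delay on each retry (30s, 1m, 2m, 4m), so the final
// attempt lands roughly eight minutes after the first failure.
const deliveryJobOptions = {
  attempts: 5,
  backoff: { type: 'exponential', delay: 30_000 },
  removeOnComplete: true, // keep Valkey memory bounded
};

// Producer side (requires a live Valkey/Redis connection):
//   import { Queue } from 'bullmq';
//   const queue = new Queue('webhook-deliveries', { connection });
//   await queue.add('deliver', { webhookId, eventType, body }, deliveryJobOptions);
```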

The shape of our workload also fights Cloud Tasks. Our jobs are short side effects — sign a body, POST, write a BQ audit row. Bounded at seconds, not minutes. Cloud Tasks' sweet spot is long-running push to durable HTTP endpoints, which would pull us back toward the inline-HTTP-processing coupling we're trying to escape. With a worker pulling from a Redis list, we keep the work in-process.

And bull-board is a gift. It's a drop-in Fastify plugin that gives us a queue observability dashboard for free — queue depth, failed jobs, retry counts, a re-run button. Cloud Tasks has console.cloud.google.com, which is fine for us, but we'd need to build our own admin UI anyway because customers will want to see their delivery history. Doing it once, with bull-board for the ops view and a custom page at /admin/webhook-deliveries for the customer-scoped view, beat running two stacks.

Zero new infrastructure is not a vanity metric — it's the difference between shipping the queue next week and shipping the queue after a two-week debate with the platform team.

The signature format we chose

Every webhook POST carries an X-Dobby-Signature header in the form t=1700000000,v1=abc123…. The signed string is literally ${timestamp}.${raw body} hashed with HMAC-SHA256. This is intentionally the same format Stripe pioneered — customers who have already implemented Stripe webhook verification can copy-paste their handler. We gain a version tag (v1) so if we ever rotate the algorithm we don't strand existing integrations.
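For illustration, producing a header in this format takes only a few lines of Node — the secret and body below are placeholders, not real values:

```javascript
import crypto from 'node:crypto';

// Build `t=<unix seconds>,v1=<hex hmac>` over the string `${timestamp}.${rawBody}`.
function signPayload(rawBody, secret, timestamp = Math.floor(Date.now() / 1000)) {
  const mac = crypto
    .createHmac('sha256', secret)
    .update(`${timestamp}.${rawBody}`)
    .digest('hex');
  return `t=${timestamp},v1=${mac}`;
}
```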

Including the timestamp in the signed string blocks a trivial replay attack. If an attacker grabs a signed payload off the wire, resubmitting it five minutes later won't verify, because our docs (and our reference verifier) reject signatures outside a 300-second tolerance. No nonce table, no server-side replay cache — your verifier drops the replay before it reaches your handler.

import crypto from 'node:crypto';

// req.body must be the raw request bytes (e.g. via express.raw()) —
// re-serializing parsed JSON will not reproduce the signed string.
const header = req.header('x-dobby-signature') ?? '';
const parts = Object.fromEntries(header.split(',').map(p => p.split('=')));
const t = parseInt(parts.t, 10);
const v1 = parts.v1 ?? '';

// Reject a missing/garbled timestamp or anything outside the 300s tolerance.
if (!Number.isFinite(t) || Math.abs(Date.now() / 1000 - t) > 300) {
  return res.status(400).end();
}

const expected = crypto
  .createHmac('sha256', process.env.DOBBY_WEBHOOK_SECRET)
  .update(`${t}.${req.body.toString('utf8')}`)
  .digest('hex');

// timingSafeEqual throws if buffer lengths differ, so validate the shape first.
const ok =
  /^[0-9a-f]{64}$/.test(v1) &&
  crypto.timingSafeEqual(Buffer.from(expected, 'hex'), Buffer.from(v1, 'hex'));

The DLQ is a product, not a dumping ground

Every delivery attempt writes a row to ds_platform.webhook_deliveries with status, HTTP code, duration, and a 4 KB peek of your endpoint's response body. Four statuses — pending, delivered, failed, dlq. When your endpoint returns a 4xx we send the delivery straight to dlq, because retrying won't change your endpoint's mind about rejecting the payload. 5xx and timeouts go to failed with a retry scheduled, then to dlq after five attempts.
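The routing rule is mechanical enough to sketch. Status names mirror the ones above, but the function itself is illustrative — here a timeout is modeled as a missing HTTP status:

```javascript
// Map one delivery attempt to a webhook_deliveries status.
function routeDelivery(httpStatus, attempt, maxAttempts = 5) {
  if (httpStatus !== null && httpStatus >= 200 && httpStatus < 300) return 'delivered';
  // 4xx: the endpoint understood the request and rejected it — don't retry.
  if (httpStatus !== null && httpStatus >= 400 && httpStatus < 500) return 'dlq';
  // 5xx or timeout: retry with backoff until attempts are exhausted.
  return attempt >= maxAttempts ? 'dlq' : 'failed';
}
```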

What matters: the DLQ is queryable. At /admin/webhook-deliveries (superadmin only) we list every failed attempt across every org with a one-click replay. The response body is there. The signature we actually sent is there. When a customer opens a support ticket saying 'we didn't receive the approval resolution at 3am' the first stop is the DLQ view, not a log aggregator — and in practice that answers 80% of questions without engineering involvement.

What's next

Five events at launch is a conservative surface. Obvious next additions: agent.deregistered, budget.exceeded, gateway.cost_alert, and event subscriptions via Server-Sent Events for customers who don't want to run an HTTP endpoint at all. If you're subscribed to approval.resolved today and want something more specific, tell us — we'd rather ship the event you need than a generic catch-all.

If you're on the Gateway, webhooks are one flag flip away on your side. The full event catalog, signature verification recipes in Node and Python, and the retry policy are in the docs.
