Event-Driven Architecture: The Quiet Revolution Under the Agent Hype

Polling is dead. MCP added round-trips no one needed. The AI stack is converging, slowly, on the same answer the financial-trading and online-gaming worlds reached 20 years ago: events, not requests.

If you read enterprise AI architecture diagrams from 2024, you saw a lot of arrows. Most of them were request/response. Frontend asks the agent, agent asks the tool, tool asks the database, response walks back up the chain. Repeat for every user action, every five seconds, forever.

That topology was always going to break under multi-agent load. It is now demonstrably breaking, and the answer the industry is converging on isn’t new — it’s just newly relevant.

The pattern is older than AI

Event-driven architecture has been the standard answer to high-throughput, low-latency, loosely-coupled systems since at least the early 2000s. Stock exchanges run on it. Massively-multiplayer games run on it. The entire telecom signalling stack runs on it. Modern logistics, modern banking back-ends, real-time fraud detection — all event-driven, none of them asking each other questions in synchronous round trips.

The reasons are well understood: events scale better than requests, decouple producers from consumers, give you replayability for free, and make backpressure a property of the bus rather than a property of every endpoint.
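Those properties fall out of a surprisingly small core. A minimal in-memory sketch (class and event names are hypothetical; a real deployment would use a broker like NATS or Kafka) shows producers and consumers meeting only at the event type, never at each other:

```python
from collections import defaultdict

class EventBus:
    """Toy in-memory bus: stands in for a real broker."""
    def __init__(self):
        self.handlers = defaultdict(list)  # event type -> consumers
        self.log = []                      # append-only log: replayability

    def subscribe(self, event_type, handler):
        # Consumers attach here; producers never see this registry.
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.log.append((event_type, payload))
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("invoice.created", received.append)

# The producer only knows the event type, not who is listening.
bus.publish("invoice.created", {"id": 42, "amount": 99.0})

print(received)      # [{'id': 42, 'amount': 99.0}]
print(len(bus.log))  # 1; every event is replayable from the log
```

Decoupling is the `defaultdict`: adding a consumer touches the subscription side only, and the log makes replay a property of the bus rather than of any endpoint.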

What changed is that AI workloads — agentic, multi-step, latency-sensitive, with humans waiting — have started to look exactly like the workloads that pushed trading and gaming to event-driven architectures decades ago.

Why the AI request/response default broke

Three things happened at once:

  1. Multi-agent topologies became normal. A single user action now triggers 5–15 LLM calls across multiple agents. Synchronous chains pile latency on latency.
  2. Long-running operations became normal. “Generate a 40-page report” isn’t a request, it’s a process. Holding a connection open for it is wrong.
  3. External triggers became normal. Workflows now react to incoming emails, webhook events, scheduled timers, file drops, IoT signals. None of those are user-initiated requests.
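The latency arithmetic behind point 1 is worth making explicit. A back-of-the-envelope sketch, with purely illustrative numbers (800 ms per LLM call and 5 ms per bus hop are assumptions, not benchmarks), shows why synchronous chains hurt and why independent calls want to fan out:

```python
# Illustrative numbers only; not benchmarks.
per_call_ms = 800   # assumed cost of one LLM call
hop_ms = 5          # assumed bus overhead per event hop
calls = 10          # mid-range of the 5-15 calls above

# Synchronous chain: latencies add, one after another.
chained_ms = per_call_ms * calls

# Event-driven fan-out, when the calls are independent: they overlap,
# so the user waits roughly for the slowest call plus bus overhead.
fanout_ms = per_call_ms + hop_ms * calls

print(chained_ms)  # 8000
print(fanout_ms)   # 850
```

When calls genuinely depend on each other the fan-out case does not apply, but in practice many of the 5-15 calls per user action are independent lookups.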

Once any of these is true for your workload, request/response stops being a sufficient primitive. Once all three are true — and they are, for almost every interesting enterprise AI workload — you’re building an event-driven system whether you admit it or not.

The MCP detour

The Model Context Protocol got a lot of mindshare in 2024–25. We use it where it makes sense (per-tool capability descriptions, rapid development), but as a runtime protocol it has issues: it bakes round-trip request/response into the tool layer, exactly where you don’t want it. For production agent runtimes we’ve mostly moved tool invocation onto the same event bus the rest of the system runs on. That removes 60–80% of the latency overhead and gives us replayability for free.

What an event-driven agent runtime actually looks like

Concretely, our platform looks like this:

  • A central durable event bus (NATS JetStream in our default deployment, Kafka in customer environments that already run it). Every meaningful state transition goes through it.
  • Agents are consumers, not callers. They subscribe to event types they care about, emit new events when they finish, and have no idea who consumes them.
  • Tool calls are events too. An agent emits tool.invoke, the tool service consumes it and emits tool.result. Same bus, same replay semantics, same observability.
  • Workflows are subscribers to event patterns. The workflow engine watches for event sequences that match its topology and advances state accordingly.
  • The UI is an event consumer too. It opens a websocket subscribed to the user’s event stream and renders incrementally. No polling, no “loading...” spinners hiding silent backend work.

// Schematic — not real code, but shape is real
publish('agent.task.started', { agent: 'research', task_id })
  → research-agent consumes
  → publishes 'tool.invoke' { tool: 'web.search', ... }
  → tool-router consumes
  → publishes 'tool.result' { ... }
  → research-agent consumes (resumes)
  → publishes 'agent.task.completed' { result }
  → workflow-engine consumes (advances)
  → UI subscribes to user-stream, renders updates
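The schematic can be exercised end-to-end with a toy dispatcher. This is a sketch, not our runtime: the subjects and payload fields are illustrative, and a real bus would deliver asynchronously from a durable stream rather than via direct function calls:

```python
from collections import defaultdict

handlers = defaultdict(list)
trace = []  # ordered record of every event, like a durable stream

def subscribe(event_type, fn):
    handlers[event_type].append(fn)

def publish(event_type, payload):
    trace.append(event_type)
    for fn in handlers[event_type]:
        fn(payload)

# Research agent: reacts to task start, and resumes on tool results.
subscribe("agent.task.started", lambda e: publish(
    "tool.invoke", {"tool": "web.search", "task_id": e["task_id"]}))
subscribe("tool.result", lambda e: publish(
    "agent.task.completed", {"task_id": e["task_id"], "result": e["data"]}))

# Tool router: turns tool.invoke into tool.result.
subscribe("tool.invoke", lambda e: publish(
    "tool.result", {"task_id": e["task_id"], "data": "3 hits"}))

# Workflow engine / UI: just more subscribers.
subscribe("agent.task.completed", lambda e: trace.append("ui.rendered"))

publish("agent.task.started", {"agent": "research", "task_id": "t1"})
print(trace)
# ['agent.task.started', 'tool.invoke', 'tool.result',
#  'agent.task.completed', 'ui.rendered']
```

The point of the exercise is the trace: every hop is an entry in an ordered log, which is exactly what makes the flow observable and replayable.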

Notice what isn’t in there: synchronous waits, polling loops, request/response chains, timeouts hiding stuck operations. Each step is a transition in a durable log. Anything that goes wrong is replayable from the bus.

The five wins you only get with events

  1. Backpressure for free. If a tool service is slow, events queue. The system degrades gracefully. With request/response, slow services cause timeout cascades.
  2. Replayability. Bug in production? Replay the event log against the fixed code. We do this almost weekly.
  3. Observability that’s actually useful. Every event has a correlation ID. Tracing a user action through 12 agents is grepping a log, not stitching together 12 distributed traces.
  4. Decoupled deployments. Adding a new agent that reacts to an existing event type requires zero changes to existing services. They emit; the new one subscribes.
  5. External triggers are first-class. An incoming email hitting an SMTP listener is just a publisher of email.received. The rest of the workflow doesn’t care that it wasn’t a user click.
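Win 2 is easiest to see in miniature. In this hedged sketch (the event shapes are made up), the log is just data, so yesterday's events can be rerun against today's handler:

```python
# Hypothetical surviving event log; the payloads are made up.
event_log = [
    {"type": "tool.result", "task_id": "t1", "value": "10"},
    {"type": "tool.result", "task_id": "t2", "value": "oops"},
]

def buggy_handler(event):
    return int(event["value"])       # crashed in production on "oops"

def fixed_handler(event):
    try:
        return int(event["value"])
    except ValueError:
        return None                  # fixed: tolerate bad payloads

# Replay the same log against the fixed code; no need to reproduce
# the triggering conditions in production.
results = [fixed_handler(e) for e in event_log]
print(results)  # [10, None]
```

With request/response there is nothing to replay: the failing input lived only in a connection that is long gone.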

Why most agent frameworks haven’t converged here yet

The popular agent frameworks were optimised for the demo. The demo is one user, one prompt, one chain, one output. That is a request/response shape, and request/response APIs are easier to learn and easier to demo. That doesn’t make them right for production.

The frameworks that have converged on event-driven primitives — LangGraph being the most prominent — tend to look more complex on first encounter and more obvious on third. Once you’re running real workloads, the complexity moves to where you actually need it (the bus) and out of where you don’t (every individual call site).

The lesson the rest of distributed systems learned 20 years ago is now hitting AI workloads. The good news: the playbook is well-documented, the tools are mature, the failure modes are known.

Where to start if you’re still on synchronous chains

  • Pick a bus. NATS for greenfield. Kafka if your enterprise already runs it.
  • Move long-running operations off the request path first. “Generate report” is the obvious first candidate. The user gets immediate acknowledgement and a stream of progress events.
  • Make tool calls go through the bus. Even if the tool service still implements the work synchronously internally, putting the bus in front gives you queueing and observability immediately.
  • Treat the UI as just another subscriber. WebSocket or SSE, subscribed to the user’s relevant event types. Stop polling.
  • Resist the urge to also rewrite everything. Strangler-fig migration. The bus and the synchronous code can coexist for years if needed.
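The third bullet, putting the bus in front of a still-synchronous tool, is smaller than it sounds. A sketch with a plain in-process queue standing in for a bus subject (the names are hypothetical):

```python
from queue import Queue

def search_sync(query):
    """Existing synchronous tool code, left unchanged (strangler fig)."""
    return f"results for {query}"

invoke_q = Queue()   # stands in for a tool.invoke subject on the bus
result_q = Queue()   # stands in for tool.result

# Producer side: agents enqueue tool.invoke events and move on.
invoke_q.put({"task_id": "t1", "query": "event buses"})
invoke_q.put({"task_id": "t2", "query": "strangler fig"})

# Consumer side: the tool service drains at its own pace. If it is
# slow, events wait in the queue instead of timing out upstream.
while not invoke_q.empty():
    ev = invoke_q.get()
    result_q.put({"task_id": ev["task_id"],
                  "data": search_sync(ev["query"])})

out = [result_q.get()["data"] for _ in range(2)]
print(out)  # ['results for event buses', 'results for strangler fig']
```

The synchronous work is untouched; the queue in front of it is what buys queueing and observability on day one.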

None of this is new. None of this is exotic. It’s the same architecture the trading floor and the MMO server have used for decades. AI workloads finally need it. The companies that adopt it now are setting up an order-of-magnitude scale advantage over the ones still wiring agents together with synchronous calls.

Want to See an Event-Driven Agent Stack Up Close?

We’ll walk through our reference architecture and show you what your workloads look like on it.