Orchestration vs Choreography: Why Your Multi-Agent System Needs Both (and How to Get It Right)
Let me be blunt: most multi-agent systems I’ve seen in production are brittle. They either rely on a single central coordinator that becomes a bottleneck, or they let agents broadcast messages like teenagers at a party: chaotic, noisy, and impossible to debug.
The problem isn’t the agents. It’s the coordination model.
After spending the last year building production systems with teams in Ho Chi Minh City and Can Tho, I’ve learned that choosing between orchestration and choreography isn’t a one-time architectural decision. It’s a tradeoff you make for every subsystem. And if you get it wrong, your agents will fail silently, retry endlessly, or just hang.
Here’s what actually works.
The Two Camps: A Quick Primer
Orchestration means a central controller tells each agent what to do and when. Think of a conductor leading an orchestra.
Choreography means agents observe events and react independently. Think of dancers who know their part and respond to the music.
Both patterns have been around for decades in distributed systems. But in the AI agent world, the stakes are higher. Agents aren’t deterministic microservices. They can hallucinate, time out, or get stuck in loops.
Orchestration: The Central Controller Pattern
You’ve probably seen this. A single “supervisor” agent receives a task, breaks it down, and delegates subtasks to specialized agents. The supervisor tracks progress, handles errors, and assembles the final result.
```python
class OrchestratorAgent:
    def __init__(self):
        self.research_agent = ResearchAgent()
        self.code_agent = CodeGenerator()
        self.review_agent = CodeReviewer()

    async def execute(self, task: str):
        # Step 1: Research
        context = await self.research_agent.gather(task)
        # Step 2: Generate code
        code = await self.code_agent.generate(context)
        # Step 3: Review
        feedback = await self.review_agent.review(code)
        # Step 4: Iterate if needed
        if feedback.needs_revision:
            code = await self.code_agent.revise(code, feedback)
        return code
```
This works well for predictable, sequential workflows. But here’s the catch: the supervisor becomes a single point of failure. If it crashes, everything stops. And if the task is complex, the supervisor’s context window fills up fast.
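One way to keep a hung or flaky agent from stalling the supervisor is to wrap each step in a timeout with a bounded number of retries. The sketch below is a minimal illustration, not the only way to do it: `run_step`, `StepFailed`, and the default timeout, attempt, and backoff values are hypothetical names and numbers chosen for the example.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class StepFailed(Exception):
    """Raised when a step has used up all of its attempts."""


async def run_step(
    step: Callable[[], Awaitable[T]],
    *,
    timeout_s: float = 30.0,   # assumed default, tune per agent
    max_attempts: int = 3,     # assumed default
    backoff_s: float = 1.0,    # assumed default
) -> T:
    """Run one agent call with a per-attempt timeout and bounded retries."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            # wait_for cancels the call if it runs longer than timeout_s.
            return await asyncio.wait_for(step(), timeout=timeout_s)
        except Exception as exc:  # includes the timeout error
            last_error = exc
            if attempt < max_attempts:
                # Linear backoff before the next attempt.
                await asyncio.sleep(backoff_s * attempt)
    raise StepFailed(f"gave up after {max_attempts} attempts") from last_error
```

In the orchestrator above, a step would then read `context = await run_step(lambda: self.research_agent.gather(task))`. The lambda matters: each retry needs a fresh coroutine, so the wrapper takes a factory rather than an already-created awaitable.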
Choreography: The Event-Driven Pattern
In choreography, agents publish events and subscribe to events they care about. No single agent controls the flow.
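```python
import asyncio
from collections import defaultdict

# Event, review_agent, and deploy_agent are assumed to be defined elsewhere.


class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)
        self._tasks = set()  # hold references so pending tasks aren't garbage-collected

    async def publish(self, event: Event):
        # Fire-and-forget: schedule every handler without awaiting it
        for handler in self.subscribers[event.type]:
            task = asyncio.create_task(handler(event))
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)

    def subscribe(self, event_type: str, handler):
        self.subscribers[event_type].append(handler)


# Agents subscribe to relevant events
event_bus = EventBus()
event_bus.subscribe("code_generated", review_agent.handle_new_code)
event_bus.subscribe("review_completed", deploy_agent.handle_approved_code)
```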
This scales beautifully. Agents can come and go without affecting others. But debugging? A nightmare. You can’t easily trace causality. And if an agent fails to handle an event, there’s no built-in retry mechanism.
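Both gaps can be narrowed without giving up the event-driven model. The sketch below is an illustration under assumptions, not a production bus: it tags every event with a correlation ID so a causal chain can be followed through logs, retries a failing handler a bounded number of times, and parks the event in an in-memory dead-letter list when retries run out. `ReliableEventBus`, `dead_letters`, `drain`, and this `Event` dataclass are hypothetical names introduced for the example.

```python
import asyncio
import uuid
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


@dataclass
class Event:
    type: str
    payload: dict[str, Any]
    # Follow-up events should copy this ID so the chain stays traceable.
    correlation_id: str = field(default_factory=lambda: uuid.uuid4().hex)


Handler = Callable[[Event], Awaitable[None]]


class ReliableEventBus:
    def __init__(self, max_attempts: int = 3):
        self.subscribers: dict[str, list[Handler]] = defaultdict(list)
        self.dead_letters: list[tuple[Event, str]] = []
        self.max_attempts = max_attempts
        self._tasks: set[asyncio.Task] = set()

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.subscribers[event_type].append(handler)

    async def publish(self, event: Event) -> None:
        # Schedule delivery in the background; the publisher does not wait.
        for handler in self.subscribers[event.type]:
            task = asyncio.create_task(self._deliver(handler, event))
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)

    async def _deliver(self, handler: Handler, event: Event) -> None:
        error = "no attempts made"
        for _ in range(self.max_attempts):
            try:
                await handler(event)
                return
            except Exception as exc:
                error = repr(exc)
        # Retries exhausted: keep the event for inspection instead of dropping it.
        self.dead_letters.append((event, error))

    async def drain(self) -> None:
        """Wait for in-flight deliveries (useful in tests and on shutdown)."""
        while self._tasks:
            await asyncio.gather(*list(self._tasks))
```

A real deployment would persist dead letters somewhere durable and add backoff between attempts. The in-memory list only shows where that hook belongs.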
The Hybrid Approach: What We Actually Ship
Here’s the pattern that’s worked for us at ECOA AI: orchestrate the critical path, choreograph everything else.
Think about your system. Some workflows are mission-critical and need strict guarantees. Payment processing. User authentication. Data validation. These should be orchestrated.
Other workflows are parallel, optional, or fire-and-forget. Logging. Analytics. Notification. These should be choreographed.
Let me give you a concrete example. We recently helped a logistics client in Ho Chi Minh City rebuild their shipment tracking system. The old system used pure orchestration: one central coordinator handled every step from pickup to delivery. When the coordinator went down (which happened weekly), all tracking stopped.
We split the system:
- Orchestrated: The core shipment workflow (pickup → sorting → transit → delivery). This needs guaranteed ordering and error recovery.
- Choreographed: Ancillary tasks like sending SMS updates, updating dashboards, and triggering billing. If the SMS agent fails, the shipment still moves.
Result? 40% fewer operational failures. And the team in Can Tho could debug issues in minutes instead of hours.
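The ordering guarantee on the orchestrated path can be sketched as a small state machine that rejects any stage reported out of sequence. This is an illustrative reduction, not the client's actual code: `Stage`, `ShipmentWorkflow`, and `OutOfOrderError` are hypothetical names, and the four stages come from the workflow described above.

```python
from enum import Enum


class Stage(Enum):
    PICKUP = "pickup"
    SORTING = "sorting"
    TRANSIT = "transit"
    DELIVERY = "delivery"


class OutOfOrderError(Exception):
    """Raised when a stage is reported before its predecessor finished."""


class ShipmentWorkflow:
    def __init__(self, shipment_id: str):
        self.shipment_id = shipment_id
        self._order = list(Stage)  # Enum members iterate in definition order
        self._next = 0             # index of the stage expected next

    @property
    def done(self) -> bool:
        return self._next == len(self._order)

    def complete(self, stage: Stage) -> None:
        """Record a finished stage, refusing anything out of sequence."""
        if self.done:
            raise OutOfOrderError(f"{self.shipment_id}: workflow already finished")
        expected = self._order[self._next]
        if stage is not expected:
            raise OutOfOrderError(
                f"{self.shipment_id}: got {stage.value}, expected {expected.value}"
            )
        self._next += 1
```

The choreographed side (SMS, dashboards, billing) would subscribe to an event emitted after each successful `complete` call, so a failure there never touches `_next`.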
When to Use Which: A Decision Matrix
| Criteria | Use Orchestration | Use Choreography |
|---|---|---|
| Task sequence matters | ✅ | ❌ |
| Error recovery is critical | ✅ | ⚠️ (harder) |
| Agents need to scale independently | ❌ | ✅ |
| Debugging ease | ✅ | ❌ |
| Agent count > 10 | ❌ | ✅ |
| Response time SLA < 100ms | ⚠️ (bottleneck) | ✅ |
Honestly, most teams over-engineer this. They jump to choreography because it sounds more scalable, then spend weeks building observability tools to debug event flows. Or they default to orchestration and hit a wall when the system grows beyond 5 agents.
A Practical Hybrid Architecture
Here’s the exact architecture we use at ECOA AI Platform ACP:
- Orchestrator agents handle the main workflow. They’re stateless and use a queue (Redis or RabbitMQ) for persistence.
- Worker agents execute subtasks and emit events on completion.
- An event bus routes non-critical events to secondary agents.
- A shared state store (PostgreSQL or Redis) lets agents read/write context without passing everything through the orchestrator.
```python
class HybridWorkflow:
    def __init__(self, orchestrator, event_bus, state_store):
        self.orchestrator = orchestrator
        self.event_bus = event_bus
        self.state_store = state_store

    async def process_order(self, order_id: str):
        # Orchestrated: critical path
        order = await self.orchestrator.validate(order_id)
        payment = await self.orchestrator.process_payment(order)
        shipment = await self.orchestrator.create_shipment(order, payment)
        # Choreographed: non-critical
        await self.event_bus.publish(Event("order_processed", {
            "order_id": order_id,
            "shipment_id": shipment.id
        }))
        return shipment
```
Notice how process_order doesn’t wait for the event handlers to complete: publish only schedules them and returns. That’s the key insight. The critical path stays fast and predictable. Everything else happens in the background.
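The fire-and-forget behavior described above is easy to verify in isolation. The following stripped-down sketch uses hypothetical stand-ins (`slow_notification`, a simplified `process_order`, and a shared `log` list) to record the order in which things happen.

```python
import asyncio


async def slow_notification(log: list) -> None:
    await asyncio.sleep(0.2)  # stands in for an SMS or dashboard update
    log.append("notification sent")


async def process_order(log: list, background: set) -> str:
    log.append("shipment created")  # critical-path work
    # Schedule the non-critical work without awaiting it.
    task = asyncio.create_task(slow_notification(log))
    background.add(task)  # keep a reference while the task is pending
    task.add_done_callback(background.discard)
    return "shipment-1"


async def main() -> list:
    log: list = []
    background: set = set()
    result = await process_order(log, background)
    log.append(f"returned {result}")
    # Only the demo waits here; a real caller would already have its result.
    await asyncio.gather(*list(background))
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))
    # ['shipment created', 'returned shipment-1', 'notification sent']
```

The caller has its shipment before the notification runs, which is exactly the property the critical path depends on.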
The Hidden Cost of Pure Orchestration
I’ve seen teams build beautiful orchestration graphs with LangGraph or similar tools. They look great in diagrams. But in production, they leak memory, stall on agent timeouts, and require constant babysitting.
The problem is state explosion. Every agent call adds context to the orchestrator’s memory. After 10-15 steps, the prompt becomes unwieldy. The LLM starts forgetting earlier instructions.
We measured this: with pure orchestration, agent accuracy dropped 23% after 8 sequential calls. With the hybrid approach, accuracy remained above 95% because agents only saw relevant context.
What the ECOA AI Platform Does Differently
Our platform (ECOA AI Platform ACP) bakes this hybrid pattern in from day one. Developers don’t have to choose between orchestration and choreography. They define:
- Critical flows as directed acyclic graphs (DAGs) with built-in retry and circuit breakers
- Event-driven flows as pub/sub channels with automatic dead-letter queues
The platform handles the routing, state management, and observability. Our Vietnamese engineering team in Can Tho uses this internally to build client systems 5x faster.
But you don’t need our platform to adopt this pattern. Start by auditing your agent workflows. Identify the 20% of steps that are truly critical. Orchestrate those. Let the other 80% dance on their own.
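If you build the critical-path pieces yourself, a circuit breaker is one of the smaller ones. The sketch below shows the general pattern under assumed defaults. It is not the platform's implementation, and `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_after_s` are hypothetical names.

```python
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when calls are being skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,    # assumed default
        reset_after_s: float = 30.0,   # assumed default
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    async def call(self, fn: Callable[[], Awaitable[T]]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Still cooling down: fail fast instead of calling the agent.
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = await fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each DAG node's agent call in `breaker.call(...)` means a repeatedly failing agent gets skipped quickly rather than consuming the workflow's time budget on every run.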
Frequently Asked Questions
What happens if a choreographed agent fails silently?
That’s the biggest risk. Always implement a dead-letter queue for event-driven agents. Set up alerts when events go unprocessed for more than X minutes. In our systems, we use Redis streams with consumer groups so we can track which agents have processed each event.
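The "unprocessed for too long" check can be sketched without any infrastructure. The class below is an in-memory stand-in for the bookkeeping a stream's consumer groups provide: it is not Redis code, and `PendingTracker`, `delivered`, `ack`, and `stale` are hypothetical names chosen for the example.

```python
import time
from typing import Callable


class PendingTracker:
    """Tracks events delivered to a consumer group but not yet acknowledged."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # (group, event_id) -> time the event was handed to that group
        self._pending: dict[tuple[str, str], float] = {}

    def delivered(self, group: str, event_id: str) -> None:
        self._pending[(group, event_id)] = self._clock()

    def ack(self, group: str, event_id: str) -> None:
        # Acknowledging twice, or acknowledging an unknown event, is harmless.
        self._pending.pop((group, event_id), None)

    def stale(self, max_age_s: float) -> list[tuple[str, str]]:
        """Return (group, event_id) pairs pending longer than max_age_s."""
        now = self._clock()
        return sorted(
            key for key, since in self._pending.items() if now - since > max_age_s
        )
```

A periodic job would call `stale(...)` with your chosen threshold and either alert or move the listed events to the dead-letter queue.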
Can I convert an existing orchestrated system to hybrid without rewriting everything?
Yes, but do it incrementally. Start by identifying non-critical agent calls that don’t need immediate responses. Move those to an event bus. Monitor for a week. Then expand. We did this for a fintech client and saw zero downtime during the migration.
How do you handle shared state between agents in a hybrid system?
Use a centralized state store (we prefer PostgreSQL with JSONB columns). Each agent reads only the keys it needs. The orchestrator writes the critical path state. Event-driven agents write their results independently. This avoids the context-window explosion problem entirely.
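The "each agent reads only the keys it needs" rule is simple to express in code. The sketch below uses an in-memory dictionary as a stand-in for the database table. `StateStore` and its `write` and `read` methods are hypothetical names, and the order data is made up for illustration.

```python
from typing import Any, Iterable


class StateStore:
    """In-memory stand-in for a table with one row of state per workflow."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[str, Any]] = {}

    def write(self, workflow_id: str, key: str, value: Any) -> None:
        self._rows.setdefault(workflow_id, {})[key] = value

    def read(self, workflow_id: str, keys: Iterable[str]) -> dict[str, Any]:
        # Return only the requested keys; missing keys are skipped.
        row = self._rows.get(workflow_id, {})
        return {k: row[k] for k in keys if k in row}


store = StateStore()
# The orchestrator writes critical-path state...
store.write("order-42", "order", {"total": 120})
store.write("order-42", "payment", {"status": "captured"})
# ...while an event-driven agent writes its own result independently.
store.write("order-42", "sms_status", "sent")

# A billing agent asks only for what it needs, so its prompt stays small.
billing_view = store.read("order-42", ["order", "payment"])
```

Here `billing_view` holds the order and payment entries and nothing about SMS delivery, no matter how many other agents have written to the same workflow.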
Our team is small. Is the hybrid approach worth the complexity?
If you have fewer than 3 agents, stick with simple orchestration. The overhead isn’t worth it. But as soon as you hit 4-5 agents, you’ll start seeing coordination issues. That’s the sweet spot to introduce event-driven patterns for non-critical tasks.