Why Your Multi-Agent System Hangs (And How to Fix It with Timeouts, Retries, and Circuit Breakers)
You’ve built your first multi-agent system. It’s beautiful. Each agent has a clear job, they pass messages, and everything works in your local dev environment.
Then you deploy to production. And it hangs.
Not crashes—just freezes. One agent is waiting for a response that never comes. Another is stuck in a loop. Your orchestrator logs show nothing useful. You restart the whole thing, and it works again… for a while.
Sound familiar?
We’ve been there. Our team in Can Tho, Vietnam, spent three weeks debugging a multi-agent pipeline for a US logistics client. The system would run for 12 hours, then deadlock. We tried everything. Eventually, we fixed it with three patterns that should be in every agent engineer’s toolkit: timeouts, retries with backoff, and circuit breakers.
Let me show you exactly how.
The Problem: Why Multi-Agent Systems Deadlock
Multi-agent systems are inherently concurrent. Agents communicate via message queues, HTTP calls, or shared state. Any of those can fail silently. A downstream agent crashes. A network partition splits the cluster. A third-party LLM API takes 60 seconds instead of 2.
Your orchestrator doesn’t know the difference between “still processing” and “never coming back”. So it waits. Forever.
That’s the root cause: no bounded waiting.
Here’s what we saw in production:
- Agent A sends a task to Agent B.
- Agent B calls an external API that times out after 30 seconds.
- Agent B’s HTTP client has no timeout of its own, so it doesn’t raise an error. It just hangs.
- Agent A waits for Agent B’s response indefinitely.
- The whole pipeline stalls.
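Strictly speaking this is a hang from unbounded waiting rather than a circular-wait deadlock, but the outcome is the same: nothing moves. And it’s embarrassingly common.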
Pattern 1: Always Set Timeouts on Every Agent Interaction
This sounds obvious, but you’d be surprised how many agent frameworks default to infinite waits. Never rely on that.
We use Python’s `asyncio.wait_for` for async calls, and `requests` with `timeout` for sync calls. Here’s a concrete example from our production orchestrator:
python
import asyncio

class AgentTimeoutError(Exception):
    """Raised when an agent does not answer within its timeout."""

async def call_agent_with_timeout(
    agent_endpoint: str,
    payload: dict,
    timeout_seconds: float = 10.0,
) -> dict:
    """Call an agent with a hard timeout. Raises AgentTimeoutError if exceeded."""
    try:
        # _make_agent_call is the coroutine that does the actual transport
        # (HTTP, queue, ...); it is not shown here.
        # wait_for cancels it once the timeout elapses.
        return await asyncio.wait_for(
            _make_agent_call(agent_endpoint, payload),
            timeout=timeout_seconds,
        )
    except asyncio.TimeoutError as exc:
        raise AgentTimeoutError(
            f"Agent at {agent_endpoint} did not respond within {timeout_seconds}s"
        ) from exc
We set different timeouts per agent type. A simple data transformation agent gets 5 seconds. An LLM-based reasoning agent gets 30 seconds. A file upload agent gets 60 seconds.
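To make the per-type budgets concrete, here is a minimal, self-contained sketch. The `AGENT_TIMEOUTS` table uses the values from the paragraph above; `fake_agent`, `call_with_type_timeout`, and the `scale` parameter are hypothetical names added only so the demo runs in well under a second.

```python
import asyncio

# Timeout budget per agent type, in seconds (values taken from the text above).
AGENT_TIMEOUTS = {
    "transform": 5.0,
    "llm_reasoning": 30.0,
    "file_upload": 60.0,
}


async def fake_agent(delay: float) -> dict:
    """Hypothetical stand-in for a real agent call; it just sleeps."""
    await asyncio.sleep(delay)
    return {"status": "ok"}


async def call_with_type_timeout(agent_type: str, delay: float, scale: float = 1.0) -> dict:
    """Bound the call by the timeout configured for this agent type.

    `scale` shrinks the budgets so the demo finishes quickly.
    """
    timeout = AGENT_TIMEOUTS[agent_type] * scale
    return await asyncio.wait_for(fake_agent(delay), timeout=timeout)


async def demo_timeouts() -> list[str]:
    outcomes = []
    # With scale=0.01 the budgets become 0.05s, 0.3s and 0.6s.
    for agent_type, delay in [("transform", 0.01), ("llm_reasoning", 0.5)]:
        try:
            await call_with_type_timeout(agent_type, delay, scale=0.01)
            outcomes.append(f"{agent_type}: ok")
        except asyncio.TimeoutError:
            outcomes.append(f"{agent_type}: timed out")
    return outcomes


if __name__ == "__main__":
    print(asyncio.run(demo_timeouts()))
```

The fast transform call finishes inside its budget, while the slow reasoning call is cancelled and reported as a timeout instead of stalling the caller.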
Key rule: each agent timeout must be shorter than the timeout of whatever is waiting on it, all the way up to the orchestrator’s own. Otherwise the caller gives up first and you just shift the deadlock up the chain.
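One way to enforce that rule, sketched here under our own assumptions, is to give the pipeline a single deadline and clamp every step’s timeout to whatever is left of it. `PipelineDeadlineExceeded`, `step_timeout`, and `run_steps` are illustrative names, not part of the orchestrator shown in this post.

```python
import asyncio
import time


class PipelineDeadlineExceeded(Exception):
    """Raised when the orchestrator's overall budget is used up."""


def step_timeout(per_agent_timeout: float, deadline: float, now: float) -> float:
    """Return the timeout to use for the next step.

    It is the smaller of the agent's own timeout and whatever is left of
    the orchestrator's budget, so a step can never outlive the pipeline.
    """
    remaining = deadline - now
    if remaining <= 0:
        raise PipelineDeadlineExceeded("no time left in the pipeline budget")
    return min(per_agent_timeout, remaining)


async def run_steps(steps, pipeline_timeout: float) -> list:
    """Run (coroutine_factory, per_agent_timeout) pairs under one deadline."""
    deadline = time.monotonic() + pipeline_timeout
    results = []
    for make_call, per_agent_timeout in steps:
        timeout = step_timeout(per_agent_timeout, deadline, time.monotonic())
        results.append(await asyncio.wait_for(make_call(), timeout=timeout))
    return results
```

With 10 seconds left in the pipeline budget, a 30-second agent timeout is clamped to 10, so the step fails before the orchestrator itself would.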
Pattern 2: Retry with Exponential Backoff (But Know When to Stop)
Retries are great—until they’re not. We’ve seen systems that retry forever, burning API credits and making the hang worse.
Our retry policy uses exponential backoff with jitter and a max retry count. Here’s the function we use:
python
import asyncio
import random
from functools import wraps

def retry_with_backoff(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
):
    """Retry an async call on AgentTimeoutError, up to max_retries attempts."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except AgentTimeoutError:
                    if attempt == max_retries:
                        raise  # out of attempts: let the caller decide
                    # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
                    delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
                    jitter = random.uniform(0, delay * 0.1)
                    # Use asyncio.sleep, not time.sleep, so other agents keep
                    # running on the event loop while we wait.
                    await asyncio.sleep(delay + jitter)
        return wrapper
    return decorator
We apply this decorator to agent calls. With the defaults that means at most 3 attempts, with an exponentially growing, jittered wait between them. After the last failure we give up and let the orchestrator handle it.
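Retries also multiply how long a single call can take, which matters for the timeout rule from Pattern 1. The sketch below computes the wait schedule and a worst-case duration, assuming the first wait equals `base_delay`, each later wait doubles, and jitter adds at most 10%. `backoff_delays` and `worst_case_seconds` are hypothetical helpers, not part of the decorator.

```python
def backoff_delays(max_retries: int = 3, base_delay: float = 1.0,
                   max_delay: float = 60.0) -> list[float]:
    """Delays (without jitter) slept between attempts.

    There is one fewer delay than attempts, because nothing is slept
    after the final failure.
    """
    return [min(base_delay * (2 ** n), max_delay) for n in range(max_retries - 1)]


def worst_case_seconds(timeout: float, max_retries: int = 3, base_delay: float = 1.0,
                       max_delay: float = 60.0, jitter_fraction: float = 0.1) -> float:
    """Upper bound on wall-clock time for one call where every attempt times out."""
    delays = backoff_delays(max_retries, base_delay, max_delay)
    return max_retries * timeout + sum(d * (1 + jitter_fraction) for d in delays)


if __name__ == "__main__":
    print(backoff_delays())          # waits between 3 attempts
    print(worst_case_seconds(30.0))  # budget a 30s agent can consume
```

For a 30-second agent with 3 attempts, the bound comes out at a little over 93 seconds, so the orchestrator’s own timeout for that step needs to sit above that, not above 30.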
But here’s the critical insight: retries only help when failures are transient. If the downstream agent is permanently down, retrying just wastes time. That’s where circuit breakers come in.
Pattern 3: Circuit Breakers to Stop Cascading Failures
A circuit breaker monitors failures and opens the circuit when a threshold is exceeded. Once open, subsequent calls fail fast without attempting the operation. After a cooldown period, the circuit goes half-open to test if the service recovered.
We implemented a simple circuit breaker for each agent endpoint:
python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0

    async def call(self, func, *args, **kwargs):
        """Await func(*args, **kwargs) unless the circuit is open."""
        if self.state == CircuitState.OPEN:
            # monotonic() is immune to wall-clock adjustments.
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN  # allow one probe call
            else:
                raise CircuitOpenError("Circuit breaker open")
        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed probe re-opens immediately; otherwise open at the threshold.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN
            raise
        # Any success ends the failure streak and closes the circuit.
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        return result
We attach one circuit breaker per agent instance. If an agent fails 5 times in a row, we stop calling it for 30 seconds. That prevents the whole system from hanging while the problematic agent is down.
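The state transitions are easier to follow with a fake clock. The sketch below is a deliberately compact, synchronous breaker written for this illustration only: `TinyBreaker`, its injectable `clock`, and the threshold of 2 are assumptions, not our production class.

```python
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class TinyBreaker:
    """Compact breaker with an injectable clock, for illustration only."""

    def __init__(self, failure_threshold: int, recovery_timeout: float,
                 clock: Callable[[], float]):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_timeout:
            return "half_open"
        return "open"

    def call(self, func: Callable[[], object]) -> object:
        if self.state == "open":
            raise CircuitOpenError("failing fast; agent is cooling down")
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed probe in half-open, or too many failures, (re)opens it.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the streak.
        self.failures = 0
        self.opened_at = None
        return result


def demo_transitions() -> list[str]:
    now = [0.0]  # fake clock so the demo needs no real waiting
    breaker = TinyBreaker(failure_threshold=2, recovery_timeout=30.0,
                          clock=lambda: now[0])

    def broken():
        raise RuntimeError("agent down")

    seen = []
    for _ in range(2):
        try:
            breaker.call(broken)
        except RuntimeError:
            pass
    seen.append(breaker.state)      # threshold reached
    try:
        breaker.call(lambda: "never runs")
    except CircuitOpenError:
        seen.append("rejected")     # failed fast, agent was not called
    now[0] = 31.0                   # cooldown has elapsed
    seen.append(breaker.state)
    breaker.call(lambda: "ok")      # successful probe
    seen.append(breaker.state)
    return seen


if __name__ == "__main__":
    print(demo_transitions())
```

Injecting the clock also makes the breaker straightforward to unit test, since no test has to sleep through a real cooldown.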
Real numbers from our production system: After adding circuit breakers, our pipeline uptime went from 89% to 99.7%. The remaining 0.3% is when a new agent deployment fails—and that’s fine, because we catch it fast.
Putting It All Together: The Orchestrator Loop
Here’s how we combine all three patterns in our main orchestrator:
python
async def run_pipeline(steps: list[dict]):
    # circuit_breakers, agent_timeouts, agent_endpoint and log_error are
    # orchestrator-level helpers that are not shown here.
    for step in steps:
        agent_id = step["agent_id"]
        payload = step["payload"]
        circuit_breaker = circuit_breakers[agent_id]

        @retry_with_backoff(max_retries=3)
        async def safe_call():
            return await call_agent_with_timeout(
                agent_endpoint(agent_id),
                payload,
                timeout_seconds=agent_timeouts[agent_id],
            )

        try:
            # safe_call is a coroutine function, so the breaker call must be awaited.
            result = await circuit_breaker.call(safe_call)
        except Exception as e:
            log_error(f"Step {step['name']} failed: {e}")
            # Decide: abort pipeline or skip step?
            if step.get("critical"):
                raise
            else:
                continue
        # Use result in next step...
Notice we use `retry_with_backoff` inside the circuit breaker. That’s intentional: retries handle transient blips, the circuit breaker handles sustained failures. It also means the breaker counts one failure per exhausted retry cycle, not one per attempt. They’re complementary, not redundant.
Why This Matters for Your Team
If you’re building multi-agent systems—whether with LangGraph, CrewAI, or a custom framework—you will hit these issues. I guarantee it.
We’ve seen teams in Ho Chi Minh City and Can Tho spend weeks chasing hangs that were caused by a single missing timeout. Don’t be that team.
A quick checklist for your next review:
- Does every agent-to-agent call have a timeout? (Yes, even local function calls.)
- Are retries bounded? (Max 3-5 attempts, exponential backoff.)
- Do you have circuit breakers for critical agents? (Especially those calling external APIs.)
- Is your orchestrator logging every timeout and circuit open event? (You’ll need that data to tune thresholds.)
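For the last item, one structured log line per event is usually enough to tune thresholds later. This is a minimal sketch using only the standard library; the logger name, the event names, and the field names are assumptions for illustration.

```python
import json
import logging

logger = logging.getLogger("orchestrator.resilience")


def format_event(event: str, agent_id: str, **fields) -> str:
    """Render one resilience event as a single JSON line."""
    record = {"event": event, "agent_id": agent_id, **fields}
    return json.dumps(record, sort_keys=True)


def log_timeout(agent_id: str, timeout_seconds: float, attempt: int) -> None:
    """Record that one attempt against an agent hit its timeout."""
    logger.warning(format_event("agent_timeout", agent_id,
                                timeout_seconds=timeout_seconds, attempt=attempt))


def log_circuit_open(agent_id: str, failure_count: int, recovery_timeout: float) -> None:
    """Record that a breaker opened, and for how long calls will fail fast."""
    logger.error(format_event("circuit_open", agent_id,
                              failure_count=failure_count,
                              recovery_timeout=recovery_timeout))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_timeout("llm-reasoner", timeout_seconds=30.0, attempt=2)
    log_circuit_open("llm-reasoner", failure_count=5, recovery_timeout=30.0)
```

Counting `agent_timeout` events per agent shows whether a timeout is too tight, and counting `circuit_open` events shows whether the failure threshold trips too often.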
Frequently Asked Questions
Q: Should I use timeouts or circuit breakers first?
Start with timeouts. They’re simpler and prevent most hangs. Add circuit breakers when you see repeated failures from the same agent. Circuit breakers protect your system from wasting resources on a dead service.
Q: What timeout value should I use for LLM-based agents?
It depends on the model and prompt complexity. For GPT-4 with short prompts, 30 seconds is safe. For long context or chain-of-thought, go up to 60 seconds. Measure your P99 latency and set timeout to 2x that. We use 45 seconds for our Claude Sonnet agents.
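The "2x P99" rule can be computed directly from recorded latencies. Here is a small sketch using the standard library’s `statistics.quantiles`; `suggest_timeout` is a hypothetical helper, and the sample values are made up for the demo.

```python
import statistics


def suggest_timeout(latencies_seconds: list[float], multiplier: float = 2.0) -> float:
    """Return multiplier x the 99th-percentile latency of the samples."""
    if len(latencies_seconds) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; the last one is the P99.
    p99 = statistics.quantiles(latencies_seconds, n=100, method="inclusive")[98]
    return multiplier * p99


if __name__ == "__main__":
    # Made-up samples: mostly fast responses with a slow tail.
    samples = [2.0] * 95 + [8.0, 9.0, 12.0, 15.0, 20.0]
    print(suggest_timeout(samples))
```

The estimate is only as good as the sample, so feed it latencies from production traffic rather than from a quiet dev environment.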
Q: How do I handle retries when the agent isn’t idempotent?
Make sure your agent endpoints are idempotent before enabling retries. If a retry could cause duplicate side effects (e.g., charging a credit card), use a unique request ID and deduplicate on the server side. Otherwise, retry only safe operations like reads or stateless transformations.
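Server-side deduplication can be sketched in a few lines. `DedupingAgent` and `new_request_id` are hypothetical names, and the in-memory dict stands in for whatever durable store a real agent would use.

```python
import uuid


class DedupingAgent:
    """Hypothetical agent that applies each request ID at most once."""

    def __init__(self):
        self._results: dict[str, dict] = {}
        self.side_effects = 0  # counts how often the real work ran

    def handle(self, request_id: str, payload: dict) -> dict:
        # A repeated ID means a retry: return the stored result, do no work.
        if request_id in self._results:
            return self._results[request_id]
        self.side_effects += 1
        result = {"charged": payload["amount"], "request_id": request_id}
        self._results[request_id] = result
        return result


def new_request_id() -> str:
    """Generate the ID once per logical request, not once per attempt."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    agent = DedupingAgent()
    request_id = new_request_id()
    agent.handle(request_id, {"amount": 10})
    agent.handle(request_id, {"amount": 10})  # simulated retry
    print(agent.side_effects)
```

The important detail is where the ID is created: outside the retry loop, so every attempt of the same logical request carries the same ID.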
Q: Can I use these patterns with LangGraph or CrewAI?
Yes, in our experience. A LangGraph node is a Python callable, so an async node can wrap its inner call in `asyncio.wait_for`. CrewAI lets you wrap agent tasks with custom functions. In both cases, retry and circuit breaker logic goes in a decorator or a custom executor around the call. We’ve done it. Check each framework’s current documentation for any built-in timeout or retry options before writing your own.