Why Your Multi-Agent System Hangs (And How to Fix It with Timeouts, Retries, and Circuit Breakers)

You’ve built your first multi-agent system. It’s beautiful. Each agent has a clear job, they pass messages, and everything works in your local dev environment.

Then you deploy to production. And it hangs.

Not crashes—just freezes. One agent is waiting for a response that never comes. Another is stuck in a loop. Your orchestrator logs show nothing useful. You restart the whole thing, and it works again… for a while.

Sound familiar?

We’ve been there. Our team in Can Tho, Vietnam, spent three weeks debugging a multi-agent pipeline for a US logistics client. The system would run for 12 hours, then deadlock. We tried everything. Eventually, we fixed it with three patterns that should be in every agent engineer’s toolkit: timeouts, retries with backoff, and circuit breakers.

Let me show you exactly how.

The Problem: Why Multi-Agent Systems Deadlock

Multi-agent systems are inherently concurrent. Agents communicate via message queues, HTTP calls, or shared state. Any of those can fail silently. A downstream agent crashes. A network partition splits the cluster. A third-party LLM API takes 60 seconds instead of 2.

Your orchestrator doesn’t know the difference between “still processing” and “never coming back”. So it waits. Forever.

That’s the root cause: no bounded waiting.

Here’s what we saw in production:

Agent A sends a task to Agent B.
Agent B calls an external API that stops responding; the request should have been abandoned after 30 seconds.
Agent B’s HTTP client has no timeout of its own, so it raises no error and simply hangs.
Agent A waits for Agent B’s response indefinitely.
The whole pipeline stalls.

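The classic hang. Strictly speaking it is an unbounded wait rather than a true deadlock, since no two agents are waiting on each other, but the result is the same: nothing moves. And it’s embarrassingly common.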
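The sequence above can be reproduced in a few lines. This is a minimal sketch, not our production code: `agent_b_never_replies` and `agent_a` are hypothetical stand-ins, and the stalled upstream call is simulated with an event that nothing ever sets.

```python
import asyncio
from typing import Optional


async def agent_b_never_replies() -> dict:
    """Stand-in for Agent B: its upstream call stalls and never returns."""
    await asyncio.Event().wait()  # nothing ever sets this event
    return {"status": "ok"}  # unreachable


async def agent_a(wait_limit: Optional[float]) -> str:
    """Agent A waits for Agent B, optionally with an upper bound in seconds."""
    try:
        # timeout=None means "wait forever", which is the failure mode above.
        await asyncio.wait_for(agent_b_never_replies(), timeout=wait_limit)
        return "completed"
    except asyncio.TimeoutError:
        return "gave up"


def demo() -> str:
    # With wait_limit=None this call would never return.
    return asyncio.run(agent_a(wait_limit=0.05))
```

Calling `agent_a(None)` hangs for good, while any finite `wait_limit` turns the hang into an error the caller can handle.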

Pattern 1: Always Set Timeouts on Every Agent Interaction

This sounds obvious, but you’d be surprised how many agent frameworks default to infinite waits. Never rely on that.

We use Python’s `asyncio.wait_for` for async calls, and `requests` with `timeout` for sync calls. Here’s a concrete example from our production orchestrator:

python
import asyncio

class AgentTimeoutError(Exception):
    """Raised when an agent does not respond within its timeout."""

async def call_agent_with_timeout(
    agent_endpoint: str,
    payload: dict,
    timeout_seconds: float = 10.0
) -> dict:
    """Call an agent with a hard timeout. Raises AgentTimeoutError if exceeded."""
    try:
        # _make_agent_call is the transport coroutine (HTTP, queue, etc.), defined elsewhere.
        # wait_for cancels it when the timeout expires.
        return await asyncio.wait_for(
            _make_agent_call(agent_endpoint, payload),
            timeout=timeout_seconds
        )
    except asyncio.TimeoutError as exc:
        raise AgentTimeoutError(
            f"Agent at {agent_endpoint} did not respond within {timeout_seconds}s"
        ) from exc

We set different timeouts per agent type. A simple data transformation agent gets 5 seconds. An LLM-based reasoning agent gets 30 seconds. A file upload agent gets 60 seconds.

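Key rule: each agent timeout must be shorter than the timeout of whatever is calling it, and the timeouts of sequential steps must add up to less than the orchestrator’s own timeout. Otherwise you just shift the hang up the chain.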
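One way to keep that rule honest is to check it at startup. The sketch below uses the per-agent timeouts quoted above; the agent names, the `validate_budget` helper, and the 120-second orchestrator timeout are assumptions for illustration only.

```python
# Per-agent timeouts in seconds, taken from the values quoted above.
AGENT_TIMEOUTS = {"transform": 5.0, "llm_reasoning": 30.0, "file_upload": 60.0}


def worst_case_pipeline_seconds(step_agents: list[str], timeouts: dict[str, float]) -> float:
    """For sequential steps, the worst case is every step using its full timeout."""
    return sum(timeouts[agent] for agent in step_agents)


def validate_budget(
    step_agents: list[str],
    timeouts: dict[str, float],
    orchestrator_timeout: float,
) -> None:
    """Raise ValueError if the steps could outlast the orchestrator's own timeout."""
    worst = worst_case_pipeline_seconds(step_agents, timeouts)
    if worst >= orchestrator_timeout:
        raise ValueError(
            f"Steps can take {worst}s, but the orchestrator gives up after "
            f"{orchestrator_timeout}s"
        )


# Hypothetical pipeline: 5 + 30 + 60 = 95 seconds in the worst case.
PIPELINE = ["transform", "llm_reasoning", "file_upload"]
validate_budget(PIPELINE, AGENT_TIMEOUTS, orchestrator_timeout=120.0)
```

This ignores retries, which multiply the worst case; the check only covers a single attempt per step.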

Pattern 2: Retry with Exponential Backoff (But Know When to Stop)

Retries are great—until they’re not. We’ve seen systems that retry forever, burning API credits and making the hang worse.

Our retry policy uses exponential backoff with jitter and a max retry count. Here’s the function we use:

python
import asyncio
import random
from functools import wraps

def retry_with_backoff(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0
):
    """Retry an async call on AgentTimeoutError: one initial attempt plus up to max_retries retries."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except AgentTimeoutError:
                    if attempt == max_retries:
                        raise  # retries exhausted; let the caller decide
                    # Delays of 1s, 2s, 4s, ... capped at max_delay, plus up to 10% jitter.
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    jitter = random.uniform(0, delay * 0.1)
                    # asyncio.sleep yields to the event loop; time.sleep would block every agent.
                    await asyncio.sleep(delay + jitter)
        return wrapper
    return decorator

We apply this decorator to agent calls. After 3 retries with exponential backoff (1s, 2s, 4s), we give up and let the orchestrator handle the failure.
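To see what that schedule looks like in numbers, here is a small sketch that computes the delays without sleeping. The `backoff_delays` helper and its `jitter_fraction` and `rng` parameters are illustrative names, not part of our decorator; it assumes the first retry waits `base_delay` and each later retry doubles it.

```python
import random
from typing import Optional


def backoff_delays(
    retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    jitter_fraction: float = 0.1,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return the sleep before each retry: base * 2**i, capped, plus random jitter."""
    rng = rng or random.Random()
    delays = []
    for i in range(retries):
        delay = min(base_delay * (2 ** i), max_delay)
        # Jitter spreads out retries so many callers do not hit the agent in lockstep.
        jitter = rng.uniform(0, delay * jitter_fraction)
        delays.append(delay + jitter)
    return delays


# Without jitter the schedule is exactly 1s, 2s, 4s.
print(backoff_delays(jitter_fraction=0.0))
```

The cap matters for larger retry counts: with the defaults, the seventh retry would wait 64 seconds uncapped, so `max_delay` holds it at 60.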

But here’s the critical insight: retries only help when failures are transient. If the downstream agent is permanently down, retrying just wastes time. That’s where circuit breakers come in.

Pattern 3: Circuit Breakers to Stop Cascading Failures

A circuit breaker monitors failures and opens the circuit when a threshold is exceeded. Once open, subsequent calls fail fast without attempting the operation. After a cooldown period, the circuit goes half-open to test if the service recovered.
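Those transitions are easier to reason about as a table. The sketch below is a simplified model with hypothetical event names (`threshold_reached`, `cooldown_elapsed`); it tracks only the state, not the failure counter or the clock.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # calls pass through
    OPEN = "open"            # calls fail fast
    HALF_OPEN = "half_open"  # one probe call is allowed


# (current state, event) -> next state. Unlisted pairs leave the state unchanged.
TRANSITIONS = {
    (State.CLOSED, "success"): State.CLOSED,
    (State.CLOSED, "failure"): State.CLOSED,          # counted, still below threshold
    (State.CLOSED, "threshold_reached"): State.OPEN,
    (State.OPEN, "cooldown_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "success"): State.CLOSED,       # probe worked, resume traffic
    (State.HALF_OPEN, "failure"): State.OPEN,         # probe failed, back off again
}


def next_state(state: State, event: str) -> State:
    return TRANSITIONS.get((state, event), state)


def replay(events: list[str], start: State = State.CLOSED) -> State:
    """Apply a sequence of events and return the final state."""
    state = start
    for event in events:
        state = next_state(state, event)
    return state
```

For example, `replay(["failure", "threshold_reached", "cooldown_elapsed", "success"])` ends in `State.CLOSED`, while a failed probe in the half-open state sends the breaker straight back to open.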

We implemented a simple circuit breaker for each agent endpoint:

python
import time
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0

    async def call(self, func, *args, **kwargs):
        """Await func through the breaker; fail fast while the circuit is open."""
        if self.state == CircuitState.OPEN:
            # time.monotonic() is unaffected by system clock changes.
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitOpenError("Circuit breaker open")

        try:
            result = await func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed half-open probe reopens the circuit immediately.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN
            raise
        # Any success closes the circuit and resets the count,
        # so only consecutive failures trip the breaker.
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        return result

We attach one circuit breaker per agent instance. If an agent fails 5 times in a row, we stop calling it for 30 seconds. That prevents the whole system from hanging while the problematic agent is down.

Real numbers from our production system: After adding circuit breakers, our pipeline uptime went from 89% to 99.7%. The remaining 0.3% is when a new agent deployment fails—and that’s fine, because we catch it fast.

Putting It All Together: The Orchestrator Loop

Here’s how we combine all three patterns in our main orchestrator:

python
# circuit_breakers, agent_timeouts, agent_endpoint, and log_error
# are defined elsewhere in the orchestrator.
async def run_pipeline(steps: list[dict]):
    for step in steps:
        agent_id = step["agent_id"]
        payload = step["payload"]

        circuit_breaker = circuit_breakers[agent_id]

        @retry_with_backoff(max_retries=3)
        async def safe_call():
            return await call_agent_with_timeout(
                agent_endpoint(agent_id),
                payload,
                timeout_seconds=agent_timeouts[agent_id]
            )

        try:
            # Await the breaker call so a failure surfaces here, inside the try block.
            result = await circuit_breaker.call(safe_call)
        except Exception as e:
            log_error(f"Step {step['name']} failed: {e}")
            # Critical steps abort the pipeline; non-critical steps are skipped.
            if step.get("critical"):
                raise
            continue

        # Use result in next step...

Notice we use `retry_with_backoff` inside the circuit breaker. That’s intentional: retries handle transient blips, circuit breaker handles sustained failures. They’re complementary, not redundant.
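One consequence of this nesting is worth working out: the breaker only counts a failure after a call has used up all its retries. The sketch below estimates how long a dead agent can absorb before the circuit opens. It assumes each guarded call makes one initial attempt plus `retries` retries, that every attempt runs to its full timeout, and it ignores jitter; the helper names are hypothetical.

```python
def seconds_per_failed_call(
    timeout: float,
    retries: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> float:
    """Worst case for one breaker-guarded call: all attempts time out, all sleeps run."""
    attempts = retries + 1
    sleeps = sum(min(base_delay * (2 ** i), max_delay) for i in range(retries))
    return attempts * timeout + sleeps


def seconds_until_open(timeout: float, retries: int, failure_threshold: int) -> float:
    """Time spent on a dead agent before the breaker sees enough failures to open."""
    return failure_threshold * seconds_per_failed_call(timeout, retries)


# A 30-second agent with 3 retries and a threshold of 5:
# each failed call costs 4 * 30 + (1 + 2 + 4) = 127 seconds.
print(seconds_until_open(timeout=30.0, retries=3, failure_threshold=5))
```

Under those assumptions a slow agent can burn roughly ten minutes before the breaker reacts, which is an argument for a lower threshold or fewer retries on agents with long timeouts.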

Why This Matters for Your Team

If you’re building multi-agent systems—whether with LangGraph, CrewAI, or a custom framework—you will hit these issues. I guarantee it.

We’ve seen teams in Ho Chi Minh City and Can Tho spend weeks chasing hangs that were caused by a single missing timeout. Don’t be that team.

A quick checklist for your next review:

Does every agent-to-agent call have a timeout? (Yes, even local function calls.)
Are retries bounded? (Max 3-5 attempts, exponential backoff.)
Do you have circuit breakers for critical agents? (Especially those calling external APIs.)
Is your orchestrator logging every timeout and circuit open event? (You’ll need that data to tune thresholds.)
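For the last item, a structured log line per event makes threshold tuning much easier than free-form messages. This is a minimal sketch using the standard `logging` and `json` modules; the logger name, event names, and field names are assumptions, not a fixed schema.

```python
import json
import logging

logger = logging.getLogger("orchestrator")


def format_event(event: str, agent_id: str, **fields) -> str:
    """Serialize one event as a JSON object so it can be filtered and aggregated later."""
    record = {"event": event, "agent_id": agent_id, **fields}
    return json.dumps(record, sort_keys=True)


def log_timeout(agent_id: str, timeout_seconds: float, attempt: int) -> None:
    logger.warning(
        format_event("agent_timeout", agent_id,
                     timeout_seconds=timeout_seconds, attempt=attempt)
    )


def log_circuit_open(agent_id: str, failure_count: int, recovery_timeout: float) -> None:
    logger.error(
        format_event("circuit_open", agent_id,
                     failure_count=failure_count, recovery_timeout=recovery_timeout)
    )
```

Counting `agent_timeout` events per agent shows which timeouts are too tight, and `circuit_open` events show which agents trip most often.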

Frequently Asked Questions

Q: Should I use timeouts or circuit breakers first?

Start with timeouts. They’re simpler and prevent most hangs. Add circuit breakers when you see repeated failures from the same agent. Circuit breakers protect your system from wasting resources on a dead service.

Q: What timeout value should I use for LLM-based agents?

It depends on the model and prompt complexity. For GPT-4 with short prompts, 30 seconds is safe. For long context or chain-of-thought, go up to 60 seconds. Measure your P99 latency and set timeout to 2x that. We use 45 seconds for our Claude Sonnet agents.
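The "2x P99" rule can be computed directly from recorded latencies. A minimal sketch, assuming you already collect per-call latencies in seconds; it uses the nearest-rank definition of the percentile, and `suggested_timeout` is a hypothetical helper name.

```python
def p99(latencies: list[float]) -> float:
    """Nearest-rank 99th percentile: the smallest value with at least 99% of samples at or below it."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def suggested_timeout(latencies: list[float], multiplier: float = 2.0) -> float:
    """Timeout as a multiple of the observed P99 latency."""
    return multiplier * p99(latencies)


# Illustrative data only: 100 samples from 1s to 100s.
samples = [float(s) for s in range(1, 101)]
print(p99(samples), suggested_timeout(samples))
```

The estimate is only as good as the sample: a few dozen calls will not capture real tail latency, so recompute it as traffic grows.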

Q: How do I handle retries when the agent isn’t idempotent?

Make sure your agent endpoints are idempotent before enabling retries. If a retry could cause duplicate side effects (e.g., charging a credit card), use a unique request ID and deduplicate on the server side. Otherwise, retry only safe operations like reads or stateless transformations.
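Server-side deduplication can be sketched in a few lines. This example keeps results in an in-memory dict, which is an assumption made for brevity; a real deployment would more likely need a shared store with expiry. `DedupingHandler` and `new_request_id` are hypothetical names.

```python
import uuid
from typing import Any, Callable


def new_request_id() -> str:
    """Client side: generate the ID once and reuse it on every retry of the same request."""
    return str(uuid.uuid4())


class DedupingHandler:
    """Server side: remember results by request ID so a retry replays the stored result."""

    def __init__(self, handler: Callable[[dict], Any]):
        self._handler = handler
        self._results: dict[str, Any] = {}

    def handle(self, request_id: str, payload: dict) -> Any:
        if request_id in self._results:
            # Seen before: return the earlier result without repeating the side effect.
            return self._results[request_id]
        result = self._handler(payload)
        self._results[request_id] = result
        return result


# Usage: the charge runs once even though the request arrives twice.
charges: list[int] = []


def charge_card(payload: dict) -> dict:
    charges.append(payload["amount"])
    return {"charged": payload["amount"]}


server = DedupingHandler(charge_card)
request_id = new_request_id()
first = server.handle(request_id, {"amount": 25})
retry = server.handle(request_id, {"amount": 25})
```

Note that this sketch does not handle two copies of the same request arriving concurrently; that case needs a lock or an atomic insert in the store.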

Q: Can I use these patterns with LangGraph or CrewAI?

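Yes. Both frameworks ultimately run ordinary Python callables, so you can wrap an async LangGraph node function in `asyncio.wait_for`, or wrap the function behind a CrewAI tool or task, before handing it to the framework. The same goes for retry and circuit breaker logic: apply it as a decorator or a thin wrapper around the call. We’ve done it this way. Check each framework’s documentation for built-in timeout or retry settings before layering your own on top.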
