Your Multi-Agent Orchestrator Is a Black Box: How We Built Real Observability with OpenTelemetry
I’ve seen it happen too many times.
A team deploys a shiny multi-agent system. Agents talk to each other, call LLMs, hit databases. Everything works in staging. Then production hits, and suddenly one agent starts hallucinating responses. Another silently times out. The orchestrator retries the wrong thing. Nobody knows why.
You’re flying blind. And that’s terrifying when your agents handle customer data, financial transactions, or API orchestration.
Here’s the hard truth: most multi-agent systems have zero observability. They log a few lines to stdout and call it a day. But when you have 5, 10, or 50 agents coordinating complex workflows, you need more than `console.log("Agent X started")`.
We learned this the hard way at ECOA AI. Our platform ACP orchestrates hundreds of agents daily for clients. When things broke, we had no idea which agent caused the failure. Was it the data extraction agent? The summarization agent? The LLM call itself?
So we built a proper observability layer using OpenTelemetry. It wasn’t easy. But it transformed how we debug, optimize, and trust our multi-agent systems.
This post walks through exactly what we built, why it works, and how you can replicate it.
Why Most Multi-Agent Observability Fails
Let’s be direct: logging isn’t observability.
Most teams slap a logger on each agent and call it done. But logs are flat. They don’t tell you the *relationship* between events. When Agent A calls Agent B, which then calls an LLM, and the LLM returns garbage — your logs show three separate events. Nothing connects them.
You’re left guessing.
The real problem is distributed tracing. Each agent is like a microservice. They communicate asynchronously. They share state. They call external APIs. Without tracing, you can’t answer simple questions like:
- How long did the *entire* workflow take?
- Which agent consumed the most tokens?
- Where did the 3-second latency spike come from?
We needed a unified view. Enter OpenTelemetry.
Our Architecture: OpenTelemetry + Custom Spans
OpenTelemetry (OTel) is the industry standard for observability. It’s vendor-neutral, supports traces, metrics, and logs, and has SDKs for every major language.
But here’s the catch: OTel was designed for microservices, not AI agents. You need to adapt it.
We built a custom instrumentation layer that wraps every agent call in a span. Each span captures:
- Agent name and version
- Input payload (sanitized)
- Output payload (sanitized)
- LLM model used (e.g., GPT-4o, Claude 3.5)
- Token count (input + output)
- Latency breakdown (agent logic vs. LLM call vs. external API)
- Error status and error message
Here’s the core pattern in Python:
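```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)


class ObservableAgent:
    def __init__(self, name, model="gpt-4o"):
        self.name = name
        self.model = model

    async def _call_llm(self, input_data):
        # Placeholder: replace with your actual LLM client call.
        raise NotImplementedError

    async def run(self, input_data):
        # One span per invocation. The context manager's automatic exception
        # handling is turned off because the except block below records the
        # error itself; leaving both on would record each exception twice.
        with tracer.start_as_current_span(
            f"agent.{self.name}",
            record_exception=False,
            set_status_on_exception=False,
        ) as span:
            span.set_attribute("agent.name", self.name)
            span.set_attribute("agent.model", self.model)
            span.set_attribute("input.size", len(str(input_data)))
            try:
                result = await self._call_llm(input_data)
                span.set_attribute("output.size", len(str(result)))
                span.set_status(Status(StatusCode.OK))
                return result
            except Exception as e:
                # Attach the exception, mark the span as failed, then re-raise.
                span.record_exception(e)
                span.set_status(Status(StatusCode.ERROR, str(e)))
                raise
```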
This is the foundation. Every agent gets wrapped. Every call gets traced. But we didn’t stop there.
The Three Layers of Our Observability Stack
We organized our observability into three layers. Each serves a different purpose.
Layer 1: Distributed Tracing (Spans)
This is the core. Every agent invocation creates a span. Parent spans represent the orchestrator workflow. Child spans represent individual agent calls.
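We use a custom `context_propagator` to pass trace context between agents. This is critical: within a single process the active span is tracked automatically, but once a call crosses a process, queue, or HTTP boundary, the context has to travel with the message. Without it, you get disconnected spans.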
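```python
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)


async def orchestrate_workflow(input_data):
    # Root span for the whole workflow; agent spans started inside it become children.
    with tracer.start_as_current_span("workflow.main"):
        # Serialize the current trace context into headers for downstream agents.
        headers = {}
        inject(headers)
        # agent_a and agent_b are agent instances whose run() accepts the headers.
        # An agent running in another process calls extract(headers) to continue the trace.
        result_a = await agent_a.run(input_data, headers=headers)
        result_b = await agent_b.run(result_a, headers=headers)
        return result_b
```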
Why this matters: When debugging a failure, you can see the entire chain. Agent A → Agent B → LLM call. You can pinpoint exactly where the latency spike occurred or which agent returned an error.
Layer 2: Metrics (Counters and Histograms)
Traces give you individual request detail. Metrics give you aggregate health.
We expose:
- Agent invocation count (by agent name and status)
- LLM token usage (input vs. output, by model)
- Agent latency (p50, p95, p99)
- Error rate (by agent and error type)
These feed into Grafana dashboards. We can see at a glance which agents are struggling.
Real example: Last month, our `data_extraction` agent showed a sudden p95 latency spike from 200ms to 1.2s. The trace view showed the LLM call was fine, but the pre-processing logic was blocking on a slow Redis query. We fixed it in 15 minutes.
Layer 3: Structured Logs with Trace Context
Logs alone are useless. But logs *with trace IDs* are gold.
We use the Python `structlog` library to attach trace IDs to every log entry. When an agent logs an error, we can immediately jump to the full trace.
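```python
import structlog
from opentelemetry.trace import get_current_span

logger = structlog.get_logger()


# Method on the agent class
async def run(self, input_data):
    ctx = get_current_span().get_span_context()
    # trace_id is an int; format it as the 32-char hex string trace backends display.
    # With no active span the context is invalid, so log None instead of all zeros.
    trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else None
    logger.info("agent.invocation", agent=self.name, trace_id=trace_id)
```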
This is the killer feature. When a client reports a bug, we find the trace ID, pull up the full waterfall, and see exactly what happened. No more “it works on my machine.”
What We Learned in Production
We’ve been running this for 6 months across 12 client projects. Here’s what surprised us.
1. Token costs are invisible without tracing.
Before observability, we had no idea which agent consumed the most tokens. Turns out, our `context_enricher` agent was calling the LLM with a massive prompt every time. It was burning $200/month on unnecessary context. We optimized it and cut costs by 40%.
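Once token counts are on the spans, ranking agents by spend is a small aggregation. The sketch below uses plain dicts in place of exported span attributes. The prices and token numbers are invented for illustration only, so substitute your provider’s real rates.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices; NOT real pricing.
PRICE_PER_1K = {"gpt-4o": {"input": 0.005, "output": 0.015}}

# Each dict mimics the attributes recorded on one LLM span (made-up values).
span_records = [
    {"agent": "context_enricher", "model": "gpt-4o", "input_tokens": 8000, "output_tokens": 400},
    {"agent": "context_enricher", "model": "gpt-4o", "input_tokens": 7600, "output_tokens": 350},
    {"agent": "summarizer", "model": "gpt-4o", "input_tokens": 900, "output_tokens": 300},
]


def cost_by_agent(records):
    """Sum estimated cost per agent and return (agent, cost) pairs, most expensive first."""
    totals = defaultdict(float)
    for r in records:
        price = PRICE_PER_1K[r["model"]]
        totals[r["agent"]] += (
            r["input_tokens"] / 1000 * price["input"]
            + r["output_tokens"] / 1000 * price["output"]
        )
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranking = cost_by_agent(span_records)
for agent, cost in ranking:
    print(f"{agent}: ${cost:.5f}")
```

The same grouping can be done as a query in the metrics backend; the point is that it only becomes possible once each LLM call carries its agent name and token counts.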
2. Latency spikes are rarely where you think.
We assumed the LLM call was always the bottleneck. Wrong. In one case, a Redis cache lookup was taking 800ms because of a missing index. In another, an agent was waiting on a slow external API that had no timeout. Traces exposed both.
3. Error rates vary wildly by agent.
Some agents are naturally more error-prone (e.g., those parsing unstructured data). Without metrics, you might think all agents are equally reliable. They’re not. We now set different retry policies per agent based on their error rate.
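As a rough illustration of per-agent retry policies, the mapping can be as simple as a threshold function over the observed error rate. The thresholds, attempt counts, backoff values, and the `pdf_parser` agent below are assumptions made for the sketch, not the values we run in production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    backoff_seconds: float


def policy_for(error_rate: float) -> RetryPolicy:
    """Map an observed error rate (0.0 to 1.0) to a retry policy.

    Thresholds are illustrative; tune them against your own metrics.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    if error_rate < 0.01:
        # Reliable agent: a failure is probably real, so fail fast.
        return RetryPolicy(max_attempts=1, backoff_seconds=0.0)
    if error_rate < 0.10:
        return RetryPolicy(max_attempts=3, backoff_seconds=0.5)
    # Error-prone agent (e.g. parsing unstructured data): retry more, back off longer.
    return RetryPolicy(max_attempts=5, backoff_seconds=2.0)


# Hypothetical error rates as they might be read from the metrics backend.
observed = {"summarizer": 0.004, "data_extraction": 0.06, "pdf_parser": 0.18}
policies = {agent: policy_for(rate) for agent, rate in observed.items()}
```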
How to Set This Up Yourself
You don’t need our exact setup. But you do need these components:
- OpenTelemetry SDK for your language (we use Python)
- An OTel collector (we use the OpenTelemetry Collector with an OTLP exporter)
- A backend (we use Grafana Tempo for traces, Prometheus for metrics, Loki for logs)
- Custom instrumentation for your agents (the pattern above)
Here’s a minimal docker-compose to get started:
```yaml
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200" # tempo
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
```
The key is the instrumentation. Don’t just install the SDK. Write a wrapper that creates spans for every agent call. Pass trace context between agents. Attach meaningful attributes (agent name, model, token count).
The Bottom Line
Your multi-agent orchestrator is a black box. You don’t know what’s happening inside. And that’s a liability.
Observability isn’t optional. It’s the difference between debugging in hours vs. days. Between optimizing blindly vs. with data. Between trusting your agents vs. fearing them.
We built this for our platform ACP and it’s now a core feature. Our clients get real-time visibility into their agent workflows. They see exactly where time and money go.
If you’re building a multi-agent system, stop treating observability as an afterthought. Instrument from day one. Your future self will thank you.
—
Frequently Asked Questions
How much overhead does OpenTelemetry tracing add to agent performance?
Minimal. Our production data shows less than 2ms added per span. The SDK is designed for high throughput. The bigger concern is storage—traces can accumulate quickly. We sample at 10% for high-volume agents and 100% for critical workflows.
Can I use this with existing agent frameworks like LangChain or CrewAI?
Yes. Both have OpenTelemetry instrumentation packages. But we found they’re too generic. We built custom spans to capture agent-specific attributes (model, token count, etc.). The built-in instrumentation won’t give you that level of detail.
What’s the most common mistake teams make when setting this up?
Not propagating trace context between agents. If you’re using async queues (Redis, RabbitMQ, Kafka), you need to manually inject and extract trace context from message headers. Without it, you get disconnected spans that are useless for debugging.
How do you handle sensitive data in traces?
We sanitize all input and output payloads before attaching them to spans. We also run a custom processor that redacts PII patterns (emails, phone numbers, API keys) from the values passed to OpenTelemetry’s `set_attribute`. Never log raw data.
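To make the redaction step concrete, here is a simplified sketch that scrubs values at the moment they are set on a span, rather than in a separate processor. The regular expressions are deliberately rough assumptions (the API key pattern only covers `sk-`/`pk-` style keys) and would need hardening for real data. The helper names are hypothetical.

```python
import re

# Order matters: API keys are replaced before phone numbers so that digit runs
# inside a key are not mistaken for a phone number.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{16,}\b"), "[API_KEY]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]


def redact(text: str) -> str:
    """Replace anything matching a PII pattern with a placeholder."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def set_redacted_attribute(span, key: str, value) -> None:
    """Stringify and redact a value, then attach it to the span."""
    span.set_attribute(key, redact(str(value)))


sample = "Contact jane.doe@example.com or +1 (555) 123-4567, key sk-abcdef1234567890ABCD"
clean = redact(sample)
```

Pattern-based redaction will miss things, so treat it as a backstop behind payload sanitization, not a replacement for it.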