Your Multi-Agent System Has No Tests: How We Built an Agent Evaluation Pipeline That Caught 92% of Regressions Before Production

You wouldn’t deploy a microservice without unit tests. But I bet your multi-agent system ships with zero validation.

Don’t feel bad. Most teams do the same thing.

We did too. For six months, we tuned prompts, swapped models, and added new agents to our logistics orchestration system. Every change was a gamble. Did that new GPT-4o prompt actually improve routing accuracy? Or did we just break three edge cases we forgot to check?

The answer: we broke stuff. Constantly.

Here’s the hard truth: your multi-agent system is flying blind without an evaluation pipeline. You can’t measure regressions, you can’t compare agent versions, and you can’t confidently ship changes. Eventually, something breaks in production and your on-call rotation pays the price.

This is the story of how we fixed that. I’ll show you the exact architecture, the metrics that matter, and a reusable Python template we now use across every multi-agent deployment at ECOA AI.

Why Agent Testing Is Different from Code Testing

Let’s get one thing straight: you can’t unit test an LLM call the same way you test a function.

A function either returns the right value or it doesn’t. An agent call produces a string, and “right” is subjective. Did the agent extract the correct shipping date? Maybe. Does it match the format the downstream agent expects? Who knows.

The problem compounds in multi-agent systems. Agent A passes data to Agent B, which transforms it for Agent C. A subtle drift in A’s output ripples through the chain. By the time it hits the orchestrator, the result is garbage wrapped in perfectly fluent English.

Here’s what we learned the hard way:

Testing Approach	Catches Syntax Errors	Catches Logic Errors	Catches Drift	Tracks Regressions
Manual review	✅	✅	❌	❌
Unit tests on code	✅	❌	❌	❌
Integration tests	✅	✅	❌	❌
Agent eval pipeline	✅	✅	✅	✅

Code tests validate *structure*. Agent evals validate *behavior*. They’re complementary, not interchangeable.

The Anatomy of an Agent Eval Pipeline

We built this with a team of three senior developers in Can Tho, Vietnam. Total implementation time: three weeks. Here’s the architecture in plain terms.

Core Components

Test Case Repository — A structured dataset of inputs, expected outputs, and evaluation criteria
Agent Runner — Invokes the multi-agent workflow with the test input
Evaluator — Compares actual outputs against expected outputs using multiple metrics
Regression Tracker — Stores historical results and flags performance drops
Report Generator — Produces a human-readable diff report for the team

The pipeline runs on every PR that touches agent prompts, model configurations, or routing logic. It also runs on a cron schedule to catch drift from upstream model changes.
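To make the Test Case Repository concrete, here is a minimal sketch of a single test case plus a shape check. The keys (`id`, `input`, `expected.output`, `expected.required_fields`, `expected.schema`) are the ones the eval runner later in this post reads. The sample values and the `validate_test_case` helper are illustrative assumptions, not our production format.

```python
import json
from typing import Any, Dict, List

# Illustrative test case. The values are made up for this example.
sample_case: Dict[str, Any] = {
    "id": "ship-001",
    "input": {"order_id": "1001", "destination_zip": "94107"},
    "expected": {
        # The expected output is stored as a string, here a JSON string.
        "output": json.dumps({
            "ship_date": "2025-01-15",
            "carrier": "ground",
            "zone": "4",
            "cost_code": "STD",
        }),
        "required_fields": ["ship_date", "carrier", "zone", "cost_code"],
        "schema": None,  # optional; None means "skip the schema check"
    },
}


def validate_test_case(tc: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the case is well-formed."""
    problems = []
    for key in ("id", "input", "expected"):
        if key not in tc:
            problems.append(f"missing top-level key: {key}")
    expected = tc.get("expected", {})
    for key in ("output", "required_fields"):
        if key not in expected:
            problems.append(f"missing expected.{key}")
    return problems
```

Running a check like this when the repository loads keeps a malformed test case from showing up later as a confusing `KeyError` in the middle of an eval run.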

What We Test

We defined five evaluation categories. Each maps to a concrete metric.

1. Accuracy (45% weight) — Does the agent produce the correct answer? We use exact match for structured outputs and semantic similarity for free-text. Our threshold: ≥ 0.85 cosine similarity using `text-embedding-3-small`.

2. Completeness (20% weight) — Did the agent include all required fields? For example, a shipping validation agent must return `ship_date`, `carrier`, `zone`, and `cost_code`. Missing fields are automatic failures.

3. Format Compliance (15% weight) — Does the output match the expected schema? JSON outputs get validated against a JSON Schema. Markdown outputs get a structure check.

4. Latency (10% weight) — Did the agent complete within the SLA? We track p50 and p95. Any test that exceeds 2x the baseline gets flagged.

5. Cost Efficiency (10% weight) — How many tokens did the agent consume? We compare against historical baselines and flag outliers.

Honestly, the weight distribution shifts over time. Start with accuracy and completeness, then add the others once you have baseline data.
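One way to turn the five weights into a single number is a plain weighted average. The sketch below assumes each category has already been normalized to a score between 0 and 1. The `budget_score` mapping for latency and tokens (full marks at or below baseline, zero at 2x baseline, linear in between) is an assumption made for illustration, as are the function names.

```python
from typing import Dict

# Weights from the five categories above; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "accuracy": 0.45,
    "completeness": 0.20,
    "format": 0.15,
    "latency": 0.10,
    "cost": 0.10,
}


def budget_score(actual: float, baseline: float, limit_factor: float = 2.0) -> float:
    """Map latency or token usage to [0, 1] relative to a baseline.

    Assumed mapping: 1.0 at or below baseline, 0.0 at limit_factor x baseline
    or above, linear in between.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = actual / baseline
    if ratio <= 1.0:
        return 1.0
    if ratio >= limit_factor:
        return 0.0
    return (limit_factor - ratio) / (limit_factor - 1.0)


def composite_score(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-category scores, each expected in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total_weight
```

Dividing by the total weight means the score stays in range even while you are adjusting weights and they temporarily don't sum to exactly 1.0.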

The Python Template We Use

Here’s the simplified version of our eval runner. We run this in CI/CD via GitHub Actions.

python
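import json
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Optional

from openai import OpenAI


class AgentEvalPipeline:
    def __init__(self, test_cases_path: str, agent_runner: Callable[[Any], Dict[str, Any]]):
        # Test cases: a JSON list of {"id", "input", "expected"} objects.
        with open(test_cases_path) as f:
            self.test_cases = json.load(f)
        # agent_runner(input) must return {"output", "latency_ms", "total_tokens"}.
        self.agent_runner = agent_runner
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.results: List[Dict[str, Any]] = []

    def run(self) -> Dict[str, Any]:
        self.results = []  # reset so a second run() doesn't double-count
        for tc in self.test_cases:
            self.results.append(self._evaluate_single(tc))
        return self._aggregate()

    def _evaluate_single(self, tc: Dict[str, Any]) -> Dict[str, Any]:
        actual = self.agent_runner(tc["input"])
        expected = tc["expected"]

        accuracy = self._semantic_similarity(actual["output"], expected["output"])
        completeness = self._check_fields(actual["output"], expected["required_fields"])
        format_ok = self._validate_schema(actual["output"], expected.get("schema"))

        # Hard gate: similarity threshold, every required field, valid format.
        passed = accuracy >= 0.85 and completeness >= 1.0 and format_ok

        return {
            "id": tc["id"],
            "passed": passed,
            "accuracy": accuracy,
            "completeness": completeness,
            "format_ok": format_ok,
            "latency_ms": actual["latency_ms"],
            "tokens": actual["total_tokens"],
            "actual_output": actual["output"],
        }

    def _semantic_similarity(self, a: str, b: str) -> float:
        # One request embeds both strings; results come back in input order.
        resp = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=[a, b],
        )
        return self._cosine_similarity(resp.data[0].embedding, resp.data[1].embedding)

    @staticmethod
    def _cosine_similarity(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = sum(x * x for x in a) ** 0.5 * sum(y * y for y in b) ** 0.5
        return dot / norm if norm else 0.0

    # The three helpers below are simplified stand-ins. They assume the agent
    # output is a string and that structured outputs arrive as a JSON object.
    @staticmethod
    def _parse(output: Any) -> Optional[Dict[str, Any]]:
        try:
            parsed = json.loads(output)
        except (TypeError, ValueError):
            return None
        return parsed if isinstance(parsed, dict) else None

    def _check_fields(self, output: Any, required_fields: List[str]) -> float:
        # Fraction of required fields present; 1.0 means nothing is missing.
        if not required_fields:
            return 1.0
        data = self._parse(output) or {}
        return sum(1 for f in required_fields if f in data) / len(required_fields)

    def _validate_schema(self, output: Any, schema: Optional[Dict[str, Any]]) -> bool:
        # Checks only the schema's top-level "required" keys; swap in a full
        # JSON Schema validator for real use.
        if schema is None:
            return True
        data = self._parse(output)
        return data is not None and all(k in data for k in schema.get("required", []))

    def _aggregate(self) -> Dict[str, Any]:
        total = len(self.results)
        if total == 0:
            raise ValueError("no test cases were evaluated")
        passed = sum(1 for r in self.results if r["passed"])
        return {
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated).
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "total_tests": total,
            "passed": passed,
            "failed": total - passed,
            "pass_rate": round(passed / total * 100, 2),
            "avg_accuracy": round(
                sum(r["accuracy"] for r in self.results) / total, 4
            ),
            "avg_latency_ms": round(
                sum(r["latency_ms"] for r in self.results) / total, 1
            ),
            # Rounded to an int so it fits an INTEGER column.
            "avg_tokens": round(sum(r["tokens"] for r in self.results) / total),
            "results": self.results,
        }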

That’s the core of it. It’s not production-ready yet (you still need caching for embeddings, parallel execution, and persistent storage), but it’s the skeleton that got us started.
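Of those three gaps, embedding caching is the cheapest to close, because expected outputs rarely change between runs. Below is a minimal in-memory sketch. The `EmbeddingCache` class and the injected `embed_fn` callable are hypothetical names for this example; in practice `embed_fn` would wrap your embeddings call, and the dictionary would be replaced by something persistent.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """In-memory embedding cache keyed by model name plus SHA-256 of the text."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model: str):
        self._embed_fn = embed_fn  # hypothetical: wraps the real embeddings call
        self._model = model
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times embed_fn was actually called

    def _key(self, text: str) -> str:
        # Including the model name avoids mixing vectors from different models.
        return hashlib.sha256(f"{self._model}\n{text}".encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Keying on a hash rather than the raw text keeps keys a fixed size, which matters once the cache moves into a database or key-value store.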

The Regression Tracker

We store every eval run in PostgreSQL. A simple table:

sql
CREATE TABLE agent_eval_runs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    branch TEXT NOT NULL,
    commit_sha TEXT NOT NULL,
    config_hash TEXT NOT NULL,
    pass_rate NUMERIC(5,2),
    avg_accuracy NUMERIC(6,4),
    avg_latency_ms NUMERIC(8,1),
    avg_tokens INTEGER,
    run_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_eval_runs_config ON agent_eval_runs(config_hash, run_at DESC);

Every time a change touches an agent prompt or model config, we compute a hash of the full configuration and store it. When the pass rate drops by more than 5% compared to the last run with the same config hash, the pipeline fails the PR.

That’s the key. You can’t just report regressions — you have to block them.
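Here is a minimal sketch of the two pieces behind that gate: a stable hash of the configuration and the pass-rate comparison. It assumes the config is a JSON-serializable dict and reads "more than 5%" as five percentage points of pass rate. How the baseline is selected (for example, the last stored run for a given hash) is left to the caller. The function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def config_hash(config: Dict[str, Any]) -> str:
    """Stable SHA-256 of a JSON-serializable config.

    sort_keys makes the hash independent of dict key order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_regression(
    current_pass_rate: float,
    baseline_pass_rate: Optional[float],
    max_drop: float = 5.0,
) -> bool:
    """True when the pass rate fell by more than max_drop percentage points.

    With no baseline (a first run) there is nothing to compare against,
    so the run is not treated as a regression.
    """
    if baseline_pass_rate is None:
        return False
    return (baseline_pass_rate - current_pass_rate) > max_drop
```

In CI, the job would exit with a non-zero status when `is_regression` returns `True`, which is what actually blocks the merge.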

What We Caught (and What We Missed)

In the first four weeks of running this pipeline on a live logistics multi-agent system, here’s what happened:

Total test cases: 247

Regressions caught: 19

False positives: 3

False negatives: 2

Effective catch rate: 92.3%

The 19 regressions included:

A prompt change that caused the address validation agent to reject valid PO boxes (6 cases)
A model update that made the carrier selection agent consistently choose the most expensive option (4 cases)
A routing logic change that dropped the `zone` field from the shipping order output (5 cases)
A context window configuration that truncated the delivery instructions mid-sentence (4 cases)

The two false negatives? Both involved edge cases where the input data was so unusual that even our human evaluators disagreed on the correct output. We added those to a separate “needs human review” category.

More importantly, the pipeline cut our prompt-tuning iteration cycle by 70%. Before evals, we’d tweak a prompt, manually test 5-10 cases, then deploy and hope. After evals, we’d run 247 tests in 90 seconds and get a clear pass/fail signal. That’s a game changer.

But here’s what really matters: we stopped shipping regressions to production. Zero agent-related incidents in the eight weeks after deployment. Before that? At least one per sprint.

Why You Should Do This Today

I’ve seen teams spend weeks tuning prompts, swapping models, and arguing about which agent framework is best. Meanwhile, they have zero tests. Zero.

Look, your multi-agent system is already complex enough. The agents drift. The models change. The prompts degrade. Without an evaluation pipeline, you’re operating on vibes and hope.

A few practical recommendations:

Start with 50 test cases. Don’t over-engineer. Pick the most common user journeys and the three weirdest edge cases you’ve seen in production. You can always add more.

Use semantic similarity for free-text outputs, not exact match. Agents rephrase. That’s fine. Measure meaning, not syntax. Keep exact match for structured fields where the value must be identical.

Fail the build on regressions. If a change drops your pass rate below your threshold, block the merge. Your future self will thank you.

Run evals on a schedule. Model providers update their models silently. Your agents might break overnight. A daily cron eval catches that before users do.

Track everything. Store every eval run, every config hash, every metric. When something breaks, you’ll have the data to trace it back.

The ECOA AI Edge

We built this evaluation pipeline with a team of three senior Vietnamese developers from our Can Tho hub. They’re all English-fluent, product-aware engineers who’ve shipped multi-agent systems for logistics, fintech, and e-commerce clients.

The team cost us $9,000/month total. That’s three senior devs for the price of one in San Francisco.

And because they use the ECOA AI Platform ACP for orchestration, they achieve roughly 5x efficiency on eval pipeline development — the platform handles agent routing, state management, and observability out of the box. We don’t rebuild infrastructure. We ship value.

If you’re running a multi-agent system in production without an eval pipeline, you’re gambling. Maybe you’ve been lucky so far. But luck isn’t a deployment strategy.

—

Frequently Asked Questions

How many test cases do I need to start an agent evaluation pipeline?

Start with 30-50 test cases covering your most common user flows and 5-10 edge cases from production incidents. Add 10-20 new cases per sprint as you discover failure modes. A small but representative test set is far better than no tests at all.

Should I use LLM-as-a-judge for evaluations, or deterministic checks?

Both. Use deterministic checks (JSON schema validation, field presence, latency thresholds) for what you can measure objectively. Use model-based scoring (embedding similarity, rubric-based LLM-as-a-judge) for subjective quality. Our setup uses about 60% deterministic checks and 40% model-based evaluation. This balances cost with coverage.

How do I handle flaky test cases where the agent passes sometimes and fails other times?

Set a minimum run count per test case — we use 3 runs per case in CI. Mark the test as “passing” if ≥ 2 of 3 runs pass. Store the individual run results in your database so you can detect nondeterministic patterns. If a test is flaky more than 20% of the time, split it into simpler sub-cases or adjust the evaluation threshold.
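The 2-of-3 rule is simple to express in code. This is a sketch under two assumptions: `run_once` is a hypothetical zero-argument callable that executes one test case and returns pass or fail, and "flaky" means the runs within one attempt disagreed with each other.

```python
from typing import Callable, List, Tuple


def run_with_retries(
    run_once: Callable[[], bool],
    runs: int = 3,
    min_passes: int = 2,
) -> Tuple[bool, List[bool]]:
    """Run a test case several times; pass when at least min_passes runs pass.

    Returns the verdict plus the individual outcomes so they can be stored.
    """
    outcomes = [bool(run_once()) for _ in range(runs)]
    return sum(outcomes) >= min_passes, outcomes


def flake_rate(history: List[List[bool]]) -> float:
    """Share of past attempts whose individual runs disagreed with each other."""
    if not history:
        return 0.0
    mixed = sum(1 for outcomes in history if len(set(outcomes)) > 1)
    return mixed / len(history)
```

A test case whose `flake_rate` climbs above 0.2 is a candidate for splitting into simpler sub-cases or for a threshold adjustment.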

Can this evaluation pipeline work with any agent orchestration framework?

Yes. The pipeline is framework-agnostic. It calls your agent runner as a black box function — pass in input, get back output. We’ve used it with LangGraph, CrewAI, AutoGen, and our own ECOA AI Platform ACP. The eval pipeline cares about inputs and outputs, not how the agent internally routes work.
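As a sketch of that black-box contract, the adapter below wraps any callable into the shape the eval runner expects: a dict with `output`, `latency_ms`, and `total_tokens`. The `make_agent_runner` name and the `invoke` and `count_tokens` parameters are assumptions for illustration. No framework-specific API is used here; `invoke` stands in for whatever entry point your framework exposes.

```python
import time
from typing import Any, Callable, Dict


def make_agent_runner(
    invoke: Callable[[Any], Any],
    count_tokens: Callable[[Any, Any], int] = lambda _input, _output: 0,
) -> Callable[[Any], Dict[str, Any]]:
    """Wrap a framework entry point as an eval-pipeline agent runner.

    invoke:       hypothetical callable that runs the multi-agent workflow.
    count_tokens: hypothetical callable returning tokens used for one call;
                  defaults to 0 when usage data is not available.
    """

    def runner(test_input: Any) -> Dict[str, Any]:
        start = time.perf_counter()
        output = invoke(test_input)
        latency_ms = (time.perf_counter() - start) * 1000.0
        return {
            "output": output,
            "latency_ms": latency_ms,
            "total_tokens": count_tokens(test_input, output),
        }

    return runner
```

Because the adapter is the only place that touches the framework, switching orchestration layers means rewriting one small function instead of the test suite.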
