We Didn’t Rewrite the Code. We Orchestrated the Data: A Multi-Agent Case Study with a Vietnamese Team

You don’t have a data problem. You have an orchestration problem.

That’s the conclusion we reached after our first week with DataPulse, a mid-market SaaS company running a real-time analytics platform. They came to us with a familiar complaint: their dashboards were inconsistent, reports took hours to generate, and their engineering team was burning out trying to maintain 12 different microservices, each with its own database and a slightly different interpretation of what “active user” meant.

The standard playbook would have been a six-month refactor to consolidate the microservices. Or a complete data warehouse migration. Both expensive, both risky, both doomed to take longer than anyone expected.

We didn’t do that.

Instead, we built a multi-agent orchestration layer using the ECOA AI Platform ACP. Three specialized agents, one Vietnamese team based in Ho Chi Minh City, and three weeks from kickoff to production.

Here’s exactly how we did it.

The Problem: 12 Sources of Truth

DataPulse had grown fast. Really fast. They acquired two smaller competitors in 18 months, and the engineering team did what any sensible team would do: they kept shipping. New features went into new services. New data went into new databases. Nobody had time to refactor.

The result? A data pipeline that looked like a plate of spaghetti thrown at a wall.

User events went into PostgreSQL
Billing data lived in a separate MySQL instance
Marketing attribution was in BigQuery
Support tickets were in a third-party API with webhook callbacks

Each service reported its own version of the truth. The sales team saw 12,000 active users. The product team saw 8,500. The CEO saw 10,200. Everyone was right, technically. But nobody trusted the numbers.

How often have you seen that?

The Diagnosis: Don’t Move the Data, Orchestrate the Access

We could have built a massive ETL pipeline. We could have dumped everything into Snowflake and run dbt transformations. But that would have taken months and introduced a single point of failure. More importantly, it would have required convincing DataPulse’s CTO to let us touch production databases. That wasn’t going to happen.

Instead, we asked a different question: *What if we didn’t move the data at all? What if we just asked each service the right question, at the right time, and assembled the answer on the fly?*

That’s where multi-agent orchestration shines.

The Agent Architecture

We designed three specialized agents on the ECOA ACP. Each had a single responsibility, a clearly defined context window, and access only to the inputs it needed.

Agent 1: The Collector

Responsible for polling all 12 microservice APIs. It wasn’t smart. It just needed to be reliable. It used a simple polling loop with exponential backoff, a Redis-based state store to track what it had already pulled, and a circuit breaker to prevent cascading failures when one service was slow.
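To make the Collector pattern concrete, here is a minimal sketch of a polling call wrapped in exponential backoff and a circuit breaker. It is an illustration, not the production code: the names `CircuitBreaker`, `backoff_delay`, and `poll_once`, the thresholds, and the full-jitter backoff variant are all assumptions, and the Redis state store is left out.

```python
import random
import time


class CircuitBreaker:
    """Stops calling a failing service until a cool-down has elapsed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has passed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: a random delay up to min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def poll_once(fetch, breaker, max_attempts=4, sleep=time.sleep):
    """Call fetch() with retries; return None if the circuit is open or retries run out."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            return None
        try:
            result = fetch()
        except Exception:
            breaker.record_failure()
            sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
    return None
```

One breaker per downstream service keeps a slow service from consuming retries that the other eleven could use.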

Agent 2: The Aggregator

This one had the hard job. It received raw events from the Collector, normalized them into a common schema, and identified conflicts. When two services reported different “user counts,” the Aggregator had to decide—based on a configurable confidence score—which one to trust. We baked in a simple rule engine: billing data always wins for revenue metrics, event logs win for engagement.
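A rule engine like this can be very small. The sketch below shows one way to express "billing wins for revenue, event logs win for engagement" with a confidence floor. The `Reading` type, the `SOURCE_PRECEDENCE` table, the 0.5 threshold, and the highest-confidence fallback are hypothetical choices for illustration, not the Aggregator's actual schema.

```python
from dataclasses import dataclass

# Hypothetical precedence table: which source wins for each metric category.
SOURCE_PRECEDENCE = {
    "revenue": ["billing", "events"],
    "engagement": ["events", "billing"],
}


@dataclass(frozen=True)
class Reading:
    source: str        # e.g. "billing" or "events"
    value: float
    confidence: float  # 0.0 to 1.0, assumed to be assigned upstream


def resolve(category, readings, min_confidence=0.5):
    """Pick one reading for a metric and return (reading, reason)."""
    usable = [r for r in readings if r.confidence >= min_confidence]
    if not usable:
        raise ValueError(f"no reading for {category!r} meets min_confidence")
    # First, apply the precedence rule for this category.
    for source in SOURCE_PRECEDENCE.get(category, []):
        for r in usable:
            if r.source == source:
                return r, f"precedence:{category}:{source}"
    # No rule matched: fall back to the most confident reading.
    best = max(usable, key=lambda r: r.confidence)
    return best, "fallback:highest_confidence"
```

Returning the reason string alongside the winning reading means each decision can later be traced to the rule that produced it.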

Agent 3: The Anomaly Detector

The watchdog. It compared the Aggregator’s output against historical baselines. If the numbers looked wrong (a sudden 10x drop in daily active users, for example), it raised a flag and paused the pipeline instead of silently passing bad data through.
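As a rough illustration of the "flag and pause" behavior, the sketch below gates each value on a z-score check against recent history. The z-score test, the threshold of 3, and the function names are assumptions; the source only says the agent compares against historical baselines.

```python
from statistics import mean, stdev


def is_anomalous(value, history, z_threshold=3.0):
    """True when value sits more than z_threshold standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # flat history: any change is a deviation
    return abs(value - mu) / sigma > z_threshold


def check_and_publish(value, history, publish, pause, z_threshold=3.0):
    """Publish value if it looks normal; otherwise call pause() and hold it back."""
    if is_anomalous(value, history, z_threshold):
        pause(f"value {value} deviates from baseline mean {mean(history):.1f}")
        return False
    publish(value)
    return True
```

With a baseline of roughly 10,000 daily active users, a reading of 1,000 trips the check and never reaches `publish`.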

Here’s the key: none of these agents were running inside DataPulse’s infrastructure. They ran on our orchestration layer. The only thing we deployed inside their VPC was a lightweight read-only replica for the Collector agent.

The Fix: Three Weeks, Zero Production Changes

We kicked off the project with a team of three Vietnamese developers from our Ho Chi Minh City hub: two mid-level backend engineers and one senior DevOps engineer. All vetted, all fluent in English, all experienced with Python and async workflows.

Week 1: The team mapped out every API endpoint, every data schema, and every edge case. They found six undocumented fields and two deprecated endpoints that were still returning data. We documented everything in a shared Notion doc that DataPulse’s engineering team could review.

Week 2: The agents were built and tested against staging environments. The Collector agent handled 500,000 events per day during load testing with a p99 latency of 300ms. The Aggregator resolved conflicts with 97% accuracy during manual validation (DataPulse’s team spot-checked 1,000 records).

Week 3: Production rollout. We used a canary strategy: the agents ran in parallel with the existing pipeline for 48 hours, comparing outputs. When the discrepancy rate dropped below 0.5%, we flipped the switch.
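The canary comparison boils down to a discrepancy rate between two sets of metric outputs. Here is one hedged way to compute it. The per-metric 1% relative tolerance and the function names are assumptions. Only the 0.5% cut-over threshold comes from the rollout described above.

```python
import math


def discrepancy_rate(old, new, rel_tol=0.01):
    """Fraction of metric keys where the two pipelines disagree.

    A key missing on either side counts as a disagreement.
    """
    keys = set(old) | set(new)
    if not keys:
        return 0.0
    mismatches = sum(
        1
        for k in keys
        if k not in old
        or k not in new
        or not math.isclose(old[k], new[k], rel_tol=rel_tol)
    )
    return mismatches / len(keys)


def ready_to_cut_over(old, new, threshold=0.005):
    """True once fewer than 0.5% of metrics disagree."""
    return discrepancy_rate(old, new) < threshold
```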

The results?

Dashboard consistency went from “somewhat trusted” to 99.2% agreement across all metrics
Report generation dropped from 4 hours to 47 seconds
The engineering team stopped holding weekly “data reconciliation” meetings

Measuring the Impact: What Actually Changed

Let’s be honest. The real win wasn’t just the speed. It was the maintenance reduction. DataPulse’s team had been spending roughly 30% of their sprint capacity on data pipeline bugs. After the agent deployment, that dropped to under 5%.

Metric	Before	After
Dashboard consistency	~70%	99.2%
Report generation time	4 hours	47 seconds
Data pipeline bugs per sprint	8-12	0-1
SRE incidents (monthly)	6	1

The agents also gave DataPulse something they didn’t expect: a clear audit trail. Every decision the Aggregator made was logged. When the sales team asked “why did this number change?”, we could trace it back to the exact conflict resolution rule that fired.
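An audit trail of this kind can be as simple as one JSON line per decision. The sketch below is illustrative only: the field names and the `audit_entry` and `explain` helpers are hypothetical, not the ECOA ACP log format.

```python
import json
from datetime import datetime, timezone


def audit_entry(metric, winner, losers, rule):
    """Serialize one conflict-resolution decision as a JSON line."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "metric": metric,
        "winner": winner,  # {"source": ..., "value": ...}
        "losers": losers,  # readings that were overridden
        "rule": rule,      # identifier of the rule that fired
    })


def explain(log_lines, metric):
    """Return every logged decision for a metric, in log order."""
    entries = (json.loads(line) for line in log_lines)
    return [e for e in entries if e["metric"] == metric]
```

Answering "why did this number change?" then becomes a filter over the log rather than an investigation.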

Lessons Learned (The Hard Way)

We made mistakes. Here are the ones that matter.

The Collector nearly DDoSed a production API.

We set the polling interval too aggressively during load testing. The agent was hitting an endpoint every 200ms. That API wasn’t designed for that throughput. We added rate limiting and a dynamic backoff mechanism inside the agent workflow. Now the Collector adjusts its polling frequency based on the response times of the downstream service. Fast service? Poll faster. Slow service? Back off.
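One simple way to get "fast service, poll faster; slow service, back off" is to adjust the interval after every response and clamp it to a floor and a ceiling. The sketch below is an assumption-laden illustration, not the mechanism used in the agent workflow: the 250ms target, the multipliers, and the bounds are all hypothetical values.

```python
def next_interval(current, response_time, target=0.25,
                  min_interval=1.0, max_interval=60.0):
    """Return the next polling interval in seconds.

    Slow response: double the interval (back off quickly).
    Fast response: shrink it by 10% (speed up cautiously).
    The floor acts as a hard rate limit on the downstream API.
    """
    if response_time > target:
        proposed = current * 2.0
    else:
        proposed = current * 0.9
    return max(min_interval, min(max_interval, proposed))
```

Backing off faster than speeding up is deliberate: the cost of polling a struggling API too often is much higher than the cost of slightly stale data.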

The Anomaly Detector was too sensitive at first.

It flagged a 15% drop in new user signups as an anomaly. Turned out it was just a weekend. We added a day-of-week normalization layer to the baseline model. The fix took one developer about four hours.
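Day-of-week normalization means comparing a Saturday against past Saturdays instead of against the overall average. A minimal sketch, assuming history arrives as `(date, value)` pairs and that a per-weekday mean is an adequate baseline (both are assumptions, as are the function names):

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekday_baselines(history):
    """history: iterable of (date, value). Returns {weekday: mean value}, Monday=0."""
    buckets = defaultdict(list)
    for day, value in history:
        buckets[day.weekday()].append(value)
    return {wd: mean(vals) for wd, vals in buckets.items()}


def relative_deviation(day, value, baselines):
    """Fractional deviation of value from the baseline for the same weekday."""
    baseline = baselines[day.weekday()]
    if baseline == 0:
        return 0.0 if value == 0 else float("inf")
    return (value - baseline) / baseline
```

If weekdays average 100 signups and weekends average 85, a Saturday with 85 signups deviates by 0% from its own baseline, while the same 85 on a Monday is a 15% drop worth flagging.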

Don’t underestimate the documentation.

DataPulse’s engineering team wanted to understand how the agents worked, not just trust a black box. We spent a full day writing internal documentation and doing a walkthrough. That investment paid for itself within a week when the CTO had to explain the architecture to their board.

Why the Vietnamese Team Made the Difference

Could we have hired local developers for this? Sure. But the cost differential was significant. Our three-person team cost DataPulse roughly $6,000/month in total. A comparable team in the US would have cost $35,000-$45,000/month.

But it wasn’t just cost. The developers we assigned had deep experience with async Python, Redis, and API integration patterns. They’d built similar orchestration layers before. One of them had spent two years building data pipelines for a logistics startup in Can Tho. That experience—solving real problems under real budget constraints—shaped how they approached this project. They weren’t theoretical. They were practical.

The time difference with US Eastern Time was roughly 11 hours. Even so, we had about 6 hours of workable overlap each day. It was enough. Communication was handled through a dedicated Slack channel, daily standups at 9 PM EST, and detailed async documentation for everything else.

Frequently Asked Questions

Q: When would you recommend multi-agent orchestration over a traditional ETL pipeline?

A: When you can’t or shouldn’t centralize your data. If you’re dealing with production databases that you can’t touch, legacy systems that are too risky to migrate, or a fast-growing startup where schema changes happen weekly, orchestration is the safer bet. It’s also faster to deploy—weeks vs months. The tradeoff is that orchestration introduces latency (300ms in our case) and requires robust error handling. For real-time dashboards, that’s usually fine. For large batch analytics, a traditional ETL pipeline still wins.

Q: How do you prevent the agents from introducing security vulnerabilities?

A: The agents run outside the client’s infrastructure. They connect via read-only replicas or API endpoints with scoped API keys. We also have a security audit layer built into the ECOA ACP that logs every API call and every data transformation. If something looks suspicious—like an agent suddenly querying a table it shouldn’t—the system flags it immediately. In three years of production deployments, we’ve never had a security incident.

Q: What’s the minimum team size required to maintain this kind of multi-agent system in production?

A: One senior backend engineer, part-time. The agents are stateless—they crash, they restart, they pick up where they left off because the state lives in Redis. We’ve had deployments run for 6+ months with zero maintenance. When something does break, it’s usually a downstream API change. The fix is typically a config update, not a code change. If you’re building a complex system with 15+ agents, you’ll want a dedicated DevOps person for the first month. After that, it’s fire-and-forget.
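To illustrate "they crash, they restart, they pick up where they left off", here is a minimal checkpointing sketch. It uses a dict-backed `InMemoryStore` as a stand-in for Redis so the example runs on its own. The key name `collector:cursor`, the integer event IDs, and the assumption that events arrive in ascending ID order are hypothetical.

```python
class InMemoryStore:
    """Stand-in for an external key-value store such as Redis."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def process_batch(events, store, handle, key="collector:cursor"):
    """Process events newer than the stored cursor, checkpointing after each one.

    The worker keeps no state of its own, so a restarted worker resumes
    from the last checkpoint held in the store.
    """
    cursor = int(store.get(key) or 0)
    for event in events:
        if event["id"] <= cursor:
            continue  # already processed before the restart
        handle(event)
        cursor = event["id"]
        store.set(key, str(cursor))
```

Because the checkpoint is written after `handle` returns, a crash between the two steps replays that one event on restart. Handlers therefore need to be idempotent, the usual trade-off of at-least-once processing.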