We Cut a SaaS Company’s Cloud Bill by $1.2M/Year Using Multi-Agent Orchestration — A Vietnam Offshore Case Study
I’ve seen a lot of cloud cost optimization tools. Most of them are glorified dashboards that tell you what’s already broken.
They don’t *act*.
Earlier this year, a mid-stage SaaS company came to us with a problem you’d expect from a Series C startup burning cash: $3.2M/month on AWS, with no clear visibility into where 40% of that spend was going. They had 3 cloud engineers manually reviewing Cost Explorer reports. They were drowning.
Here’s what we actually built, how the Vietnamese team pulled it off in 8 weeks, and why multi-agent orchestration was the only sane approach.
The Problem: Cloud Sprawl at Scale
This wasn’t a small deployment. The client ran:
- 320+ EC2 instances across 4 environments (prod, staging, QA, dev)
- 17 RDS databases, most of them over-provisioned
- 14TB of EBS snapshots with no retention policy
- 6 Kubernetes clusters with chaotic resource requests
- 40+ Lambda functions with memory settings that hadn’t been tuned in 18 months
Their monthly burn: $3.2M. Their estimated waste: at least 35%.
They’d tried third-party tools. CloudHealth, CloudCheckr, the usual suspects. But the recommendations were generic. “Right-size this instance.” No context. No automation. No one had time to implement them.
They needed something that could analyze, recommend, and execute — without a human in the loop for every damn decision.
Why a Multi-Agent System Made Sense
A single monolithic script to handle this would be a nightmare. Too many data sources, too many decision paths, too many conflicting optimization strategies (reserved instances vs. spot instances vs. right-sizing).
We needed specialized agents that could each own a slice of the problem.
Here’s the architecture we landed on:
Agent Roles
| Agent | Responsibility | Data Source |
|---|---|---|
| Usage Analyzer | Profile instance utilization, identify over/under-provisioned resources | CloudWatch metrics, Compute Optimizer |
| Rightsizing Optimizer | Generate instance type recommendations with risk scores | Analyzer output, pricing API |
| Reserved/Spot Planner | Evaluate RI/SP purchases vs. On-Demand costs | Pricing API, usage history |
| Anomaly Detector | Flag unusual spend spikes or configuration drift | Cost Explorer, CloudTrail |
| EBS Snapshot Janitor | Identify orphaned/obsolete snapshots | EC2 API, snapshot metadata |
| Orchestrator (ECOA ACP) | Schedule agent runs, merge recommendations, apply actions | All of the above |
Each agent was a Python async service, containerized, and orchestrated via ECOA AI Platform ACP. The Orchestrator controlled the workflow: analyze → evaluate → recommend → approve → execute.
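To make that staged flow concrete, here is a minimal sketch in plain `asyncio`. It does not use the ECOA ACP SDK. The agent functions, the `Recommendation` fields, the canned data, and the approval rule are all hypothetical stand-ins chosen for illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Recommendation:
    resource_id: str
    action: str
    risk: str  # "low", "medium", or "high"


async def usage_analyzer() -> list[dict]:
    # Analyze stage (stand-in): returns canned utilization data.
    return [
        {"id": "i-dev-1", "env": "dev", "avg_cpu": 4.0},
        {"id": "i-prod-1", "env": "prod", "avg_cpu": 9.0},
        {"id": "i-prod-2", "env": "prod", "avg_cpu": 71.0},
    ]


async def rightsizing(instances: list[dict]) -> list[Recommendation]:
    # Evaluate + recommend (stand-in): flag instances averaging under 10% CPU.
    recs = []
    for inst in instances:
        if inst["avg_cpu"] < 10.0:
            risk = "high" if inst["env"] == "prod" else "medium"
            recs.append(Recommendation(inst["id"], "downsize", risk))
    return recs


async def snapshot_janitor() -> list[Recommendation]:
    # Stand-in: pretend one orphaned snapshot was found.
    return [Recommendation("snap-old-1", "delete", "low")]


def auto_approved(rec: Recommendation) -> bool:
    # Hypothetical approval rule: only low-risk actions skip the human.
    return rec.risk == "low"


async def run_pipeline() -> dict:
    instances = await usage_analyzer()
    # Independent agents run concurrently; the orchestrator merges their output.
    sized, cleaned = await asyncio.gather(rightsizing(instances), snapshot_janitor())
    merged = sized + cleaned
    return {
        "executed": [r for r in merged if auto_approved(r)],
        "pending": [r for r in merged if not auto_approved(r)],
    }


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

The point of the shape is that each stage only sees the previous stage's output, so an agent can be swapped without touching the others.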
How the Vietnamese Team Built It
We staffed this with 4 developers from our Can Tho hub — two mid-level backend engineers, one senior DevOps specialist, and one junior Python developer. All of them were ECOA AI Platform ACP certified.
The kicker? We estimated this at 16-20 weeks with a traditional team. We delivered in 8 weeks.
Here’s what made the difference:
Week 1-2: Agent Scaffolding with ACP
The ECOA ACP orchestration SDK has a concept called Agent Blueprints — pre-built templates for common agent patterns. The Usage Analyzer and Snapshot Janitor are practically commodities. The Rightsizing Optimizer needed custom logic, but the team scaffolded it from a Blueprint in 4 hours.
```python
# Simplified Rightsizing Agent blueprint usage
from ecoa.agents import AgentBlueprint, agent_registry


@agent_registry.register("rightsizing")
class RightsizingOptimizer(AgentBlueprint):
    def __init__(self):
        super().__init__(
            agent_id="rightsizing-v1",
            input_schema={
                "instances": {"type": "list", "required": True},
                "pricing_data": {"type": "dict", "required": False},
            },
            output_schema={
                "recommendations": {"type": "list"},
                "savings_estimate": {"type": "float"},
            },
        )

    async def run(self, payload):
        # Custom right-sizing logic goes here (elided in this simplified example).
        recs = []            # one entry per instance worth resizing
        total_savings = 0.0  # estimated savings summed across recs
        # ...
        return {"recommendations": recs, "savings_estimate": total_savings}
```
The Orchestrator in ACP handled inter-agent communication, state persistence, and error recovery. We didn’t write a single line of message queue code.
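For a sense of what the elided logic in `run` might involve, here is a rough sketch of a recommendation with a risk score. It is not the client's actual logic: the size ladder, the CPU thresholds, and the "halving capacity doubles utilization" heuristic are all assumptions made for illustration.

```python
from statistics import mean
from typing import Optional

# Hypothetical size ladder within one instance family, smallest to largest.
SIZE_LADDER = ["large", "xlarge", "2xlarge", "4xlarge"]


def recommend(instance: dict, cpu_samples: list[float]) -> Optional[dict]:
    """Suggest one size down when both average and peak CPU are low."""
    family, _, size = instance["type"].partition(".")
    if size not in SIZE_LADDER:
        return None  # unknown size: make no recommendation
    idx = SIZE_LADDER.index(size)
    avg_cpu = mean(cpu_samples)
    peak_cpu = max(cpu_samples)
    # Thresholds are illustrative assumptions, not tuned values.
    if idx == 0 or avg_cpu >= 20.0 or peak_cpu >= 60.0:
        return None
    # Assume one step down halves capacity, so utilization roughly doubles.
    projected_peak = peak_cpu * 2
    return {
        "instance_id": instance["id"],
        "current_type": instance["type"],
        "recommended_type": f"{family}.{SIZE_LADDER[idx - 1]}",
        # Risk grows with the projected peak, capped at 1.0.
        "risk_score": round(min(projected_peak / 100.0, 1.0), 2),
    }
```

A score like this is what lets the approval gates later in the workflow treat a sleepy dev box differently from a production instance that occasionally spikes.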
Week 3-4: Data Pipeline and Anomaly Detection
The Anomaly Detector needed real-time ingestion from CloudTrail. The senior DevOps engineer built a Kinesis → Lambda → S3 → Athena pipeline in 5 days. It processes ~200 GB of CloudTrail logs per day and surfaces anomalies within 2 minutes.
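The detection method itself isn't described above, so treat the following as one plausible approach rather than what the team shipped: a rolling z-score over an hourly spend series. The function name, window size, and threshold are assumptions.

```python
from statistics import mean, stdev


def find_spikes(values: list[float], window: int = 24, threshold: float = 3.0) -> list[int]:
    """Return indexes whose value sits more than `threshold` standard
    deviations above the mean of the preceding `window` values."""
    spikes = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any increase at all is a deviation.
            if values[i] > mu:
                spikes.append(i)
            continue
        if (values[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Synthetic hourly spend: a steady day, then one expensive hour.
    hourly = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0] * 4 + [100.0, 180.0, 101.0]
    print(find_spikes(hourly))
```

One known weakness of this approach: once a spike enters the window it inflates the baseline for the next day, so a production version would probably exclude flagged points from the history.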
I’m not exaggerating when I say this: the junior developer built the Snapshot Janitor agent in 3 days. It was that straightforward with the Blueprint system. She’d never worked with AWS EC2 APIs before. ACP’s built-in AWS connectors handled the auth and pagination.
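The core of a janitor like this is a filter over snapshot metadata. The sketch below assumes records shaped like the entries EC2's `describe_snapshots` returns (`SnapshotId`, `VolumeId`, `StartTime`, `VolumeSize`). The definition of "orphaned" used here (source volume gone, no AMI referencing the snapshot, older than 30 days) is my assumption, not necessarily the rule the agent used.

```python
from datetime import datetime, timedelta, timezone


def find_orphaned_snapshots(snapshots, live_volume_ids, ami_snapshot_ids, now, min_age_days=30):
    """Return snapshots whose source volume is gone, that back no AMI,
    and that are at least `min_age_days` old."""
    cutoff = now - timedelta(days=min_age_days)
    orphaned = []
    for snap in snapshots:
        if snap["VolumeId"] in live_volume_ids:
            continue  # source volume still exists
        if snap["SnapshotId"] in ami_snapshot_ids:
            continue  # still referenced by a registered AMI
        if snap["StartTime"] > cutoff:
            continue  # too recent to judge
        orphaned.append(snap)
    return orphaned


if __name__ == "__main__":
    now = datetime(2025, 1, 31, tzinfo=timezone.utc)
    snaps = [
        {"SnapshotId": "snap-1", "VolumeId": "vol-gone", "StartTime": now - timedelta(days=90), "VolumeSize": 8},
        {"SnapshotId": "snap-2", "VolumeId": "vol-live", "StartTime": now - timedelta(days=90), "VolumeSize": 8},
    ]
    print(find_orphaned_snapshots(snaps, {"vol-live"}, set(), now))
```

Keeping the filter a pure function over already-fetched metadata makes it easy to unit-test without touching AWS at all.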
Week 5-6: Orchestration and Approval Workflows
The Orchestrator in ECOA ACP supports human-in-the-loop gates. We configured it so that:
- Low-risk recommendations (e.g., deleting orphaned snapshots under 10GB) execute automatically.
- Medium-risk actions (right-sizing non-production instances) require a Slack approval.
- High-risk changes (modifying production RDS instances) need a ticket in Jira.
This was configured in ACP’s workflow editor — a YAML file, not boilerplate code:
```yaml
workflow: cloud_cost_optimizer
agents:
  - usage_analyzer
  - rightsizing
  - reserved_planner
  - anomaly_detector
  - snapshot_janitor
gates:
  - name: snapshot_auto_clean
    condition: "snapshot.age_days > 30 AND snapshot.size_gb < 10"
    action: auto_approve
  - name: production_rightsize
    condition: "recommendation.environment == 'prod'"
    action: jira_ticket
    jira_project: CLOUDOPS
```
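If it helps to see the three risk tiers as plain logic, here is a small Python sketch of the routing. This is not how ACP evaluates gates internally. The `kind` field and the `slack_approval` action name are hypothetical, and sending oversized or recent snapshots to Slack is my assumption, since the tiers above don't cover that case.

```python
def route(rec: dict) -> str:
    """Map a recommendation to an approval path."""
    if rec["kind"] == "snapshot_delete":
        # Mirrors the snapshot_auto_clean condition in the YAML.
        if rec["age_days"] > 30 and rec["size_gb"] < 10:
            return "auto_approve"
        return "slack_approval"  # assumption: anything else gets a human look
    if rec["environment"] == "prod":
        return "jira_ticket"  # high risk: production changes need a ticket
    return "slack_approval"  # medium risk: non-production right-sizing


if __name__ == "__main__":
    print(route({"kind": "snapshot_delete", "age_days": 45, "size_gb": 4}))
    print(route({"kind": "rightsize", "environment": "staging"}))
    print(route({"kind": "rightsize", "environment": "prod"}))
```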
Week 7-8: Testing and Rollout
The team ran the system in shadow mode for 2 weeks — all recommendations logged, nothing executed. They compared the system's output against the client's manual audits.
Accuracy: 94% agreement with the manual audits. The remaining 6% were false positives, mostly edge cases (instances in Auto Scaling groups that were about to scale down). We tuned a filter for that and moved to partial execution by week 8.
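A shadow-mode comparison and the Auto Scaling filter can both be sketched in a few lines. I'm treating the accuracy figure as the share of system recommendations the manual audit agreed with, which is an assumption about how it was measured. The filter relies on the `aws:autoscaling:groupName` tag that EC2 Auto Scaling attaches to instances it launches; the shape of the recommendation dicts is hypothetical.

```python
def shadow_report(recommended: set[str], audited: set[str]) -> dict:
    """Compare instance IDs the system flagged against those the manual audit flagged."""
    confirmed = recommended & audited
    false_positives = recommended - audited
    agreement = len(confirmed) / len(recommended) if recommended else 0.0
    return {"agreement": agreement, "false_positives": sorted(false_positives)}


def drop_asg_members(recommendations: list[dict]) -> list[dict]:
    """Skip instances managed by an Auto Scaling group; the group resizes itself."""
    return [
        rec for rec in recommendations
        if "aws:autoscaling:groupName" not in rec.get("tags", {})
    ]


if __name__ == "__main__":
    print(shadow_report({"i-a", "i-b", "i-c", "i-d"}, {"i-a", "i-b", "i-c"}))
```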
The Results: $1.2M/Year Saved
After 3 months of full production:
| Optimization | Monthly Savings | Confidence |
|---|---|---|
| Right-sizing 87 EC2 instances | $48,000 | High |
| Reserved/Spot instance coverage | $62,000 | High |
| Orphaned EBS snapshot cleanup | $4,200 | High |
| Lambda memory tuning | $1,800 | High |
| RDS instance rightsizing | $12,000 | Medium |
Total: $128,000/month. $1.536M/year. After accounting for the medium-confidence recommendations (some RDS changes were rolled back), the client realized $1.2M in annual savings.
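For anyone who wants to check the math, the figures work out as follows. The numbers come straight from the table and the paragraph above; only the variable names are mine.

```python
# Monthly savings per optimization, in USD, from the results table.
monthly_savings = {
    "ec2_rightsizing": 48_000,
    "reserved_spot_coverage": 62_000,
    "ebs_snapshot_cleanup": 4_200,
    "lambda_memory_tuning": 1_800,
    "rds_rightsizing": 12_000,
}

total_monthly = sum(monthly_savings.values())  # projected monthly savings
annualized = total_monthly * 12                # projected annual savings
realized_annual = 1_200_000                    # what the client actually realized
realized_share = realized_annual / annualized  # fraction of the projection realized

if __name__ == "__main__":
    print(f"${total_monthly:,}/month, ${annualized:,}/year, {realized_share:.1%} realized")
```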
More importantly: the client's cloud team went from 3 engineers drowning in Cost Explorer to 1 engineer monitoring the agent outputs. The other two got reassigned to product work.
Why This Worked (and It's Not Just the Agents)
Let's be honest. Multi-agent orchestration is trendy. But execution matters.
The real win here was speed of delivery. We could have built this with in-house US engineers for $200K/month. Instead, the Vietnamese team delivered at $12K/month total (4 developers at our senior/mid rates). And they shipped 2x faster than our initial estimate.
That's not a cost arbitrage story — it's a competence plus platform story. The ECOA ACP eliminated boilerplate. The Vietnamese engineers brought the engineering rigor. Together, they turned a 5-month project into a 2-month sprint.
I've worked with offshore teams from India, the Philippines, and Eastern Europe. The difference with Vietnam, specifically our Can Tho hub, is agency. These developers don't wait for specs. They propose solutions, challenge assumptions, and ship code that works.
When the Anomaly Detector started flagging false positives from spot instance interruptions, the team's senior DevOps engineer rewrote the filtering logic overnight. He didn't ask permission. He just fixed it.
That's the kind of developer you want on cost optimization. They treat your AWS bill like it's their own money.
Frequently Asked Questions
How does multi-agent orchestration differ from a simple script or Lambda function for cloud cost optimization?
A script is static. It runs, it reports, it's done. Multi-agent orchestration with ECOA ACP allows specialized agents to communicate, share state, and make decisions based on evolving conditions. The Anomaly Detector can trigger a fresh analysis from the Rightsizing Optimizer the moment it sees a spend spike, which a fixed cron schedule can't do without extra glue code. You get reactive, not just scheduled, optimization.
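Here is the reactive pattern in miniature, using an `asyncio.Queue` as a stand-in for whatever message passing the platform does internally. Everything in it (function names, the spend limit, the sample data) is hypothetical.

```python
import asyncio


async def anomaly_detector(events: asyncio.Queue, samples: list[tuple[str, float]], limit: float) -> None:
    # Publish an event for every resource whose hourly spend exceeds the limit.
    for resource_id, hourly_spend in samples:
        if hourly_spend > limit:
            await events.put(resource_id)
    await events.put(None)  # sentinel: no more events


async def rightsizing_optimizer(events: asyncio.Queue) -> list[str]:
    # React to events as they arrive instead of waiting for a schedule.
    reanalyzed = []
    while True:
        resource_id = await events.get()
        if resource_id is None:
            break
        reanalyzed.append(resource_id)  # a real agent would re-run its analysis here
    return reanalyzed


async def main() -> list[str]:
    events: asyncio.Queue = asyncio.Queue()
    samples = [("i-a", 1.2), ("i-b", 9.8), ("i-c", 0.7)]
    _, reanalyzed = await asyncio.gather(
        anomaly_detector(events, samples, limit=5.0),
        rightsizing_optimizer(events),
    )
    return reanalyzed


if __name__ == "__main__":
    print(asyncio.run(main()))
```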
What's the minimum cloud spend required for this approach to make financial sense?
If your monthly AWS bill is under $30K, manual Cost Explorer reviews are probably fine. Above $50K/month, the waste becomes significant enough to justify a multi-agent system. We've seen clients at $80K/month save $25K-$30K/month with this approach. The breakeven point is around $50K/month when you factor in team costs and implementation time.
Do you need to grant the agents full write access to AWS?
Absolutely not. We never give agents direct write access to production resources. Agents can only read data and generate recommendations. Every destructive action (instance modification, snapshot deletion) is applied by the orchestrator, and only after it clears an approval gate configured in ECOA ACP: a policy-based auto-approval for low-risk actions, a human sign-off for medium- and high-risk changes. This is baked into the platform, not bolted on.
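As an illustration of what read-only can look like on the AWS side, here is a sketch of an IAM policy document for an agent role, plus a quick check that nothing in it can write. The policy is an example I put together, not the client's actual policy, and the prefix check is a rough heuristic rather than a substitute for a proper IAM review.

```python
import json

# Example read-only policy for an agent role (illustrative, not the client's).
AGENT_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AgentReadOnly",
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*",
                "rds:DescribeDBInstances",
                "cloudwatch:GetMetricData",
                "cloudwatch:ListMetrics",
                "ce:GetCostAndUsage",
                "cloudtrail:LookupEvents",
                "compute-optimizer:GetEC2InstanceRecommendations",
            ],
            "Resource": "*",
        }
    ],
}

READ_PREFIXES = ("Describe", "Get", "List", "Lookup")


def is_read_only(policy: dict) -> bool:
    """Heuristic: every allowed action must start with a read-style verb."""
    for statement in policy["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            _, _, name = action.partition(":")
            if not name.startswith(READ_PREFIXES):
                return False  # covers "*", "ec2:*", and verbs like Delete/Modify
    return True


if __name__ == "__main__":
    print(json.dumps(AGENT_POLICY, indent=2))
    print(is_read_only(AGENT_POLICY))
```

A check like this fits naturally in CI, so that a well-meaning change adding `ec2:ModifyInstanceAttribute` to an agent role fails the build instead of reaching production.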
How long does it typically take to set up the data pipeline for a new client?
The data pipeline (CloudWatch, CloudTrail, Cost Explorer → storage → agent feeds) takes about 1 week for most clients. The majority of that time is setting up IAM roles and permissions, not writing code. ECOA ACP includes pre-built connectors for all major AWS services, so the integration work is minimal. If you already have AWS Organizations and consolidated billing enabled, we can start agent training by day 5.