We Cut a SaaS Company’s Cloud Bill by $1.2M/Year Using Multi-Agent Orchestration — A Vietnam Offshore Case Study

How a 4-person Vietnamese team using ECOA AI Platform ACP built a multi-agent cloud cost optimization system that cut a SaaS company's AWS bill by $1.2M/year, going from kickoff to production in 8 weeks.

I’ve seen a lot of cloud cost optimization tools. Most of them are glorified dashboards that tell you what’s already broken.

They don’t *act*.

Earlier this year, a mid-stage SaaS company came to us with a problem you’d expect from a Series C startup burning cash: $3.2M/month on AWS, with no clear visibility into where 40% of that spend was going. They had 3 cloud engineers manually reviewing Cost Explorer reports. They were drowning.

Here’s what we actually built, how the Vietnamese team pulled it off in 8 weeks, and why multi-agent orchestration was the only sane approach.

The Problem: Cloud Sprawl at Scale

This wasn’t a small deployment. The client ran:

  • 320+ EC2 instances across 4 environments (prod, staging, QA, dev)
  • 17 RDS databases, most of them over-provisioned
  • 14TB of EBS snapshots with no retention policy
  • 6 Kubernetes clusters with chaotic resource requests
  • 40+ Lambda functions with memory settings that hadn’t been tuned in 18 months

Their monthly burn: $3.2M. Their estimated waste: at least 35%.

They’d tried third-party tools. CloudHealth, CloudCheckr, the usual suspects. But the recommendations were generic. “Right-size this instance.” No context. No automation. No one had time to implement them.

They needed something that could analyze, recommend, and execute — without a human in the loop for every damn decision.

Why a Multi-Agent System Made Sense

A single monolithic script to handle this would be a nightmare. Too many data sources, too many decision paths, too many conflicting optimization strategies (reserved instances vs. spot instances vs. right-sizing).

We needed specialized agents that could each own a slice of the problem.

Here’s the architecture we landed on:

Agent Roles

Agent | Responsibility | Data Source
Usage Analyzer | Profile instance utilization, identify over/under-provisioned resources | CloudWatch metrics, Compute Optimizer
Rightsizing Optimizer | Generate instance type recommendations with risk scores | Analyzer output, pricing API
Reserved/Spot Planner | Evaluate RI/SP purchases vs. On-Demand costs | Pricing API, usage history
Anomaly Detector | Flag unusual spend spikes or configuration drift | Cost Explorer, CloudTrail
EBS Snapshot Janitor | Identify orphaned/obsolete snapshots | EC2 API, snapshot metadata
Orchestrator (ECOA ACP) | Schedule agent runs, merge recommendations, apply actions | All of the above

Each agent was a Python async service, containerized, and orchestrated via ECOA AI Platform ACP. The Orchestrator controlled the workflow: analyze → evaluate → recommend → approve → execute.
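To make that workflow concrete, here is a minimal sketch in plain asyncio. It is not the ACP SDK: the agent functions, the canned data, the 10% CPU threshold, and the approval callback are all hypothetical stand-ins chosen for illustration.

```python
import asyncio

# Hypothetical stand-ins for the real agents; each returns canned data.
async def usage_analyzer():
    return [
        {"instance_id": "i-001", "avg_cpu": 7.5},
        {"instance_id": "i-002", "avg_cpu": 68.0},
    ]

async def snapshot_janitor():
    return [{"snapshot_id": "snap-001", "size_gb": 4}]

async def rightsizing(usage):
    # Evaluate: flag instances below an assumed 10% average CPU threshold.
    return [
        {"target": u["instance_id"], "action": "downsize"}
        for u in usage
        if u["avg_cpu"] < 10.0
    ]

async def run_workflow(approve):
    # Analyze: independent agents run concurrently.
    usage, snapshots = await asyncio.gather(usage_analyzer(), snapshot_janitor())
    # Evaluate and recommend: merge agent outputs into one list.
    recs = await rightsizing(usage)
    recs += [{"target": s["snapshot_id"], "action": "delete"} for s in snapshots]
    # Approve: the gate decides which recommendations may proceed.
    approved = [r for r in recs if approve(r)]
    # Execute: this sketch only returns what would be applied.
    return approved

# Example gate: only snapshot deletions pass without a human.
result = asyncio.run(run_workflow(lambda rec: rec["action"] == "delete"))
print(result)
```

The point of the shape is that the analysis agents never talk to each other directly. They hand results to the coordinator, and nothing reaches the execute step without passing the approval callback.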

How the Vietnamese Team Built It

We staffed this with 4 developers from our Can Tho hub — two mid-level backend engineers, one senior DevOps specialist, and one junior Python developer. All of them were ECOA AI Platform ACP certified.

The kicker? We estimated this at 16-20 weeks with a traditional team. We delivered in 8 weeks.

Here’s what made the difference:

Week 1-2: Agent Scaffolding with ACP

The ECOA ACP orchestration SDK has a concept called Agent Blueprints — pre-built templates for common agent patterns. The Usage Analyzer and Snapshot Janitor are practically commodities. The Rightsizing Optimizer needed custom logic, but the team scaffolded it from a Blueprint in 4 hours.

python
# Simplified Rightsizing Agent blueprint usage
from ecoa.agents import AgentBlueprint, agent_registry

@agent_registry.register("rightsizing")
class RightsizingOptimizer(AgentBlueprint):
    def __init__(self):
        super().__init__(
            agent_id="rightsizing-v1",
            input_schema={
                "instances": {"type": "list", "required": True},
                "pricing_data": {"type": "dict", "required": False}
            },
            output_schema={
                "recommendations": {"type": "list"},
                "savings_estimate": {"type": "float"}
            }
        )
    
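    async def run(self, payload):
        # Custom right-sizing logic goes here (elided in this simplified example).
        # It reads payload["instances"] and fills in the two values below.
        recs = []            # placeholder: list of recommendation dicts
        total_savings = 0.0  # placeholder: estimated savings
        # ...
        return {"recommendations": recs, "savings_estimate": total_savings}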

The Orchestrator in ACP handled inter-agent communication, state persistence, and error recovery. We didn’t write a single line of message queue code.

Week 3-4: Data Pipeline and Anomaly Detection

The Anomaly Detector needed real-time ingestion from CloudTrail. The senior DevOps engineer built a Kinesis → Lambda → S3 → Athena pipeline in 5 days. It processes ~200 GB of CloudTrail logs per day and surfaces anomalies within 2 minutes.
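For a feel of the Lambda stage, here is a rough sketch of a handler that decodes Kinesis records and keeps the CloudTrail events worth a second look. It assumes each record carries one uncompressed JSON CloudTrail event, and the watch list of event names is an illustrative guess, not the detector's real rule set.

```python
import base64
import json

# Assumption: event names treated as worth flagging. The real rules are not
# described in this article.
WATCHED_EVENTS = {"RunInstances", "ModifyInstanceAttribute", "CreateDBInstance"}

def handler(event, context=None):
    """Decode Kinesis records and return CloudTrail events on the watch list."""
    flagged = []
    for record in event.get("Records", []):
        # Kinesis delivers each payload base64-encoded under kinesis.data.
        raw = base64.b64decode(record["kinesis"]["data"])
        trail_event = json.loads(raw)
        if trail_event.get("eventName") in WATCHED_EVENTS:
            flagged.append({
                "eventName": trail_event["eventName"],
                "eventTime": trail_event.get("eventTime"),
                "awsRegion": trail_event.get("awsRegion"),
            })
    return flagged

def _encode(obj):
    # Helper for building a sample event locally.
    return base64.b64encode(json.dumps(obj).encode()).decode()

sample = {"Records": [
    {"kinesis": {"data": _encode({
        "eventName": "RunInstances",
        "eventTime": "2024-01-01T00:00:00Z",
        "awsRegion": "us-east-1",
    })}},
    {"kinesis": {"data": _encode({"eventName": "DescribeInstances"})}},
]}
flagged = handler(sample)
print(flagged)
```

In a pipeline like the one described, the flagged events would then be written to S3 for Athena to query. That step is left out here.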

I’m not exaggerating when I say this: the junior developer built the Snapshot Janitor agent in 3 days. It was that straightforward with the Blueprint system. She’d never worked with AWS EC2 APIs before. ACP’s built-in AWS connectors handled the auth and pagination.
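The core of a janitor like this is small. The sketch below is a plain function, not the agent itself: it assumes "orphaned" means the source volume no longer exists and the snapshot is older than 30 days, and it works on dicts shaped like the EC2 snapshot descriptions (SnapshotId, VolumeId, StartTime). The function name and sample data are made up.

```python
from datetime import datetime, timedelta, timezone

def find_orphaned_snapshots(snapshots, live_volume_ids, now, min_age_days=30):
    """Return IDs of snapshots whose source volume is gone and that are old enough."""
    cutoff = now - timedelta(days=min_age_days)
    return [
        s["SnapshotId"]
        for s in snapshots
        if s["VolumeId"] not in live_volume_ids and s["StartTime"] < cutoff
    ]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
snapshots = [
    # Old snapshot, volume deleted: orphaned.
    {"SnapshotId": "snap-aaa", "VolumeId": "vol-gone", "StartTime": now - timedelta(days=90)},
    # Old snapshot, volume still exists: keep.
    {"SnapshotId": "snap-bbb", "VolumeId": "vol-live", "StartTime": now - timedelta(days=90)},
    # Volume deleted but snapshot is recent: keep for now.
    {"SnapshotId": "snap-ccc", "VolumeId": "vol-gone", "StartTime": now - timedelta(days=5)},
]
orphans = find_orphaned_snapshots(snapshots, {"vol-live"}, now)
print(orphans)
```

Passing `now` in explicitly keeps the function deterministic, which makes it easy to test against a fixed inventory.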

Week 5-6: Orchestration and Approval Workflows

The Orchestrator in ECOA ACP supports human-in-the-loop gates. We configured it so that:

  • Low-risk recommendations (e.g., deleting orphaned snapshots under 10GB) execute automatically.
  • Medium-risk actions (right-sizing non-production instances) require a Slack approval.
  • High-risk changes (modifying production RDS instances) need a ticket in Jira.

This was configured in ACP’s workflow editor — a YAML file, not boilerplate code:

yaml
workflow: cloud_cost_optimizer
agents:
  - usage_analyzer
  - rightsizing
  - reserved_planner
  - anomaly_detector
  - snapshot_janitor
gates:
  - name: snapshot_auto_clean
    condition: "snapshot.age_days > 30 AND snapshot.size_gb < 10"
    action: auto_approve
  - name: production_rightsize
    condition: "recommendation.environment == 'prod'"
    action: jira_ticket
    jira_project: CLOUDOPS
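The routing logic behind those gates can be pictured as a small function. This is a sketch of the idea in plain Python, not how ACP evaluates the YAML: the field names on the recommendation dict and the `slack_approval` label are assumptions, and the gates are checked in the same order they appear in the file.

```python
def route(rec):
    """Return the approval path for one recommendation dict (hypothetical fields)."""
    # Gate 1: small, old snapshots are cleaned up automatically.
    if rec.get("kind") == "snapshot_delete" and rec["age_days"] > 30 and rec["size_gb"] < 10:
        return "auto_approve"
    # Gate 2: anything touching production needs a Jira ticket.
    if rec.get("environment") == "prod":
        return "jira_ticket"
    # Fallback: everything else waits for a Slack approval.
    return "slack_approval"

examples = [
    {"kind": "snapshot_delete", "age_days": 45, "size_gb": 4, "environment": "dev"},
    {"kind": "rightsize", "environment": "prod"},
    {"kind": "rightsize", "environment": "staging"},
]
decisions = [route(r) for r in examples]
print(decisions)
```

Defaulting to a human approval when no gate matches is the conservative choice: a recommendation type nobody anticipated should never fall through to automatic execution.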

Week 7-8: Testing and Rollout

The team ran the system in shadow mode for 2 weeks — all recommendations logged, nothing executed. They compared the system's output against the client's manual audits.

Accuracy: 94%. The remaining 6% were false positives, mostly edge cases such as instances in Auto Scaling groups that were about to scale down. We tuned a filter for that and moved to partial execution by week 8.
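A shadow-mode comparison of this kind boils down to set arithmetic. The sketch below assumes "accuracy" means the share of system recommendations that the manual audit also confirmed, and it uses a blunt filter that drops every Auto Scaling group member rather than only those about to scale down. Both are simplifications, and the instance IDs are invented.

```python
def agreement_rate(recommended, confirmed):
    """Share of system recommendations that the manual audit also confirmed."""
    if not recommended:
        return 0.0
    return len(recommended & confirmed) / len(recommended)

def drop_asg_members(recommended, asg_instance_ids):
    """Remove instances managed by an Auto Scaling group from the recommendation set."""
    return recommended - asg_instance_ids

recommended = {"i-01", "i-02", "i-03", "i-04", "i-05"}  # system output in shadow mode
confirmed = {"i-01", "i-02", "i-03", "i-04"}            # manual audit findings
asg_members = {"i-05"}                                  # instances inside an ASG

before = agreement_rate(recommended, confirmed)
after = agreement_rate(drop_asg_members(recommended, asg_members), confirmed)
print(before, after)
```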

The Results: $1.2M/Year Saved

After 3 months of full production:

Optimization | Monthly Savings | Confidence
Right-sizing 87 EC2 instances | $48,000 | High
Reserved/Spot instance coverage | $62,000 | High
Orphaned EBS snapshot cleanup | $4,200 | High
Lambda memory tuning | $1,800 | High
RDS instance rightsizing | $12,000 | Medium

Total: $128,000/month. $1.536M/year. After accounting for the medium-confidence recommendations (some RDS changes were rolled back), the client realized $1.2M in annual savings.
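The arithmetic is easy to check. This snippet sums the table's monthly figures, annualizes them, and compares the projection with the $1.2M the client realized. The variable names are ours. The numbers come straight from the table.

```python
monthly_savings = {
    "Right-sizing 87 EC2 instances": 48_000,
    "Reserved/Spot instance coverage": 62_000,
    "Orphaned EBS snapshot cleanup": 4_200,
    "Lambda memory tuning": 1_800,
    "RDS instance rightsizing": 12_000,
}

monthly_total = sum(monthly_savings.values())
annual_projected = monthly_total * 12
annual_realized = 1_200_000  # figure reported by the client
realized_share = annual_realized / annual_projected

print(f"Monthly total:    ${monthly_total:,}")
print(f"Annual projected: ${annual_projected:,}")
print(f"Realized share:   {realized_share:.1%}")
```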

More importantly: the client's cloud team went from 3 engineers drowning in Cost Explorer to 1 engineer monitoring the agent outputs. The other two got reassigned to product work.

Why This Worked (and It's Not Just the Agents)

Let's be honest. Multi-agent orchestration is trendy. But execution matters.

The real win here was speed of delivery. We could have built this with in-house US engineers for $200K/month. Instead, the Vietnamese team delivered at $12K/month total (4 developers at our senior/mid rates). And they shipped 2x faster than our initial estimate.

That's not a cost arbitrage story — it's a competence plus platform story. The ECOA ACP eliminated boilerplate. The Vietnamese engineers brought the engineering rigor. Together, they turned a 5-month project into a 2-month sprint.

I've worked with offshore teams from India, the Philippines, and Eastern Europe. The difference with Vietnam, specifically our Can Tho hub, is agency. These developers don't wait for specs. They propose solutions, challenge assumptions, and ship code that works.

When the Anomaly Detector started flagging false positives from spot instance interruptions, the team's senior DevOps engineer rewrote the filtering logic overnight. He didn't ask permission. He just fixed it.

That's the kind of developer you want on cost optimization. They treat your AWS bill like it's their own money.

Frequently Asked Questions

How does multi-agent orchestration differ from a simple script or Lambda function for cloud cost optimization?

A script is static. It runs, it reports, it's done. Multi-agent orchestration with ECOA ACP allows specialized agents to communicate, share state, and make decisions based on evolving conditions. The Anomaly Detector can trigger a fresh analysis from the Rightsizing Optimizer when it sees a spend spike, which is awkward to replicate with a cron job on a fixed schedule. You get reactive, not just scheduled, optimization.
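Here is a toy version of that reactive pattern using an asyncio queue. It is only an illustration, not the ACP messaging layer: the spike rule (spend above 1.5x the running average), the daily spend series, and the function names are all assumptions.

```python
import asyncio

async def anomaly_detector(queue, daily_spend, threshold=1.5):
    """Emit an event when a day's spend exceeds threshold x the running average."""
    history = []
    for day, spend in enumerate(daily_spend):
        if history and spend > threshold * (sum(history) / len(history)):
            await queue.put({"day": day, "spend": spend})
        history.append(spend)
    await queue.put(None)  # sentinel: no more events

async def rightsizing_on_demand(queue):
    """Re-run analysis for each spike event instead of waiting for a schedule."""
    reruns = []
    while (event := await queue.get()) is not None:
        reruns.append(event["day"])  # a real agent would start a fresh analysis here
    return reruns

async def main():
    queue = asyncio.Queue()
    _, reruns = await asyncio.gather(
        anomaly_detector(queue, [100, 105, 98, 240, 102]),
        rightsizing_on_demand(queue),
    )
    return reruns

reruns = asyncio.run(main())
print(reruns)
```

The consumer does work only when an event arrives, so the cost of checking is paid on spikes rather than on every scheduled tick.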

What's the minimum cloud spend required for this approach to make financial sense?

If your monthly AWS bill is under $30K, manual Cost Explorer reviews are probably fine. Above $50K/month, the waste becomes significant enough to justify a multi-agent system. We've seen clients at $80K/month save $25K-$30K/month with this approach. The breakeven point is around $50K/month when you factor in team costs and implementation time.

Do you need to grant the agents full write access to AWS?

Absolutely not. We never give agents direct write access to production resources. All destructive actions (instance modification, snapshot deletion) go through the human-in-the-loop gates configured in ECOA ACP. Agents can only read data and generate recommendations. Execution requires explicit approval — either automated (for low-risk actions) or manual (for high-risk changes). This is baked into the platform, not bolted on.
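As an illustration of what read-only looks like in practice, here is a sketch of an IAM policy document built as a Python dict, plus a small check that every action is a read verb. The action list is an example we picked to match the data sources in this article, not the client's actual policy, and the `Sid` and helper name are made up.

```python
import json

# Example read-only policy for the analysis agents (illustrative action list).
READ_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "CostAgentsReadOnly",
        "Effect": "Allow",
        "Action": [
            "ec2:DescribeInstances",
            "ec2:DescribeSnapshots",
            "ec2:DescribeVolumes",
            "rds:DescribeDBInstances",
            "cloudwatch:GetMetricData",
            "ce:GetCostAndUsage",
            "cloudtrail:LookupEvents",
        ],
        "Resource": "*",
    }],
}

def is_read_only(policy):
    """Return True if every allowed action starts with a read-style verb."""
    allowed_prefixes = ("Describe", "Get", "List", "Lookup")
    for statement in policy["Statement"]:
        for action in statement["Action"]:
            verb = action.split(":", 1)[1]
            if not verb.startswith(allowed_prefixes):
                return False
    return True

print(json.dumps(READ_ONLY_POLICY, indent=2))
print(is_read_only(READ_ONLY_POLICY))
```

A check like this can run in CI so that a write action such as `ec2:DeleteSnapshot` never slips into the agents' role unnoticed.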

How long does it typically take to set up the data pipeline for a new client?

The data pipeline (CloudWatch, CloudTrail, Cost Explorer → storage → agent feeds) takes about 1 week for most clients. The majority of that time is setting up IAM roles and permissions, not writing code. ECOA ACP includes pre-built connectors for all major AWS services, so the integration work is minimal. If you already have AWS Organizations and consolidated billing enabled, we can start agent training by day 5.
