We Cut a SaaS Company’s Cloud Bill by $1.2M/Year Using Multi-Agent Orchestration — A Vietnam Offshore Case Study

We helped a US-based SaaS company slash its AWS bill by $1.2 million annually without cutting features. Here's exactly how we used multi-agent orchestration and a team of senior Vietnamese engineers to do it.

The client came to us with a familiar story: “Our cloud costs are exploding, and we can’t figure out where all the money is going.”

They weren’t alone. This SaaS platform—let’s call them DataFlow—was burning $480,000 per month on AWS. That’s $5.76 million annually. For a company with 200 employees and 12,000 enterprise customers, that’s a death sentence on margins.

They’d already tried the usual fixes:

  • Reserved instances? Checked.
  • Right-sizing EC2s? Done twice.
  • Committing to savings plans? They had those too.

Still, the bill kept climbing. Month over month, 8-12% growth.

Their CTO told me: “We’ve optimized everything. There’s nothing left to cut.”

He was wrong.

The Real Problem Wasn’t Resources—It Was Orchestration

Here’s what we found after a two-week deep dive into their infrastructure:

DataFlow ran 47 microservices, each with its own auto-scaling group. The problem? They were all fighting for resources independently. Service A would scale up because of a spike, stealing capacity from Service B. Service B would then scale up too, creating a cascade of over-provisioned instances.

Think of it like a crowd leaving a packed room through one door. Instead of coordinating, everyone pushes forward at once, and the exit jams.

That’s exactly what their architecture looked like.

The fix wasn’t more efficient code. It was smarter resource coordination.

Enter Multi-Agent Orchestration

We built a custom orchestration layer using the ECOA AI Platform ACP (Agent Coordination Platform). Here’s the architecture at a high level:


┌─────────────────────────────────────────────────┐
│                  Orchestrator                   │
│  (Go-based, deployed on 2 c5.large instances)   │
└─────────┬───────────┬────────────┬──────────────┘
          │           │            │
     ┌────▼───┐  ┌────▼───┐  ┌─────▼──────┐
     │Agent A │  │Agent B │  │Agent C ... │
     │(Auth)  │  │(API)   │  │(Data Proc) │
     └────┬───┘  └────┬───┘  └─────┬──────┘
          │           │            │
    ┌─────▼───────────▼────────────▼──────────┐
    │          Shared Resource Pool           │
    │    (Spot instances + reserved base)     │
    └─────────────────────────────────────────┘

The orchestrator runs a lightweight agent per microservice. Each agent collects real-time metrics: CPU, memory, request queue depth, and latency percentiles. Instead of letting each service scale independently, the orchestrator makes global scaling decisions every 30 seconds.
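
To make that decision loop concrete, here is a minimal sketch in Go. The `Metrics` fields mirror the signals listed above, but the type and function names (`Metrics`, `Decide`, `Run`), the `Instances` field, and the 0.75 utilization target are illustrative assumptions, not DataFlow's production code.

```go
package main

import (
	"context"
	"fmt"
	"math"
	"time"
)

// Metrics is a hypothetical snapshot reported by one per-service agent.
type Metrics struct {
	Service    string
	Instances  int           // instances currently serving this service (assumed field)
	CPU        float64       // average CPU utilization, 0..1
	Memory     float64       // average memory utilization, 0..1
	QueueDepth int           // pending requests
	P99        time.Duration // tail latency
}

// Decide returns one fleet-wide instance target from all snapshots.
// Demand is approximated as busy instance-equivalents (CPU x Instances);
// a real policy would also weigh memory, queue depth, and latency.
func Decide(snapshots []Metrics, targetUtil float64) int {
	demand := 0.0
	for _, m := range snapshots {
		demand += m.CPU * float64(m.Instances)
	}
	return int(math.Ceil(demand / targetUtil))
}

// Run makes one global decision per tick until ctx is cancelled.
func Run(ctx context.Context, every time.Duration, collect func() []Metrics, apply func(target int)) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			apply(Decide(collect(), 0.75))
		}
	}
}

func main() {
	snapshots := []Metrics{
		{Service: "auth", Instances: 10, CPU: 0.30},
		{Service: "api", Instances: 20, CPU: 0.20},
		{Service: "data-proc", Instances: 10, CPU: 0.90},
	}
	// 3 + 4 + 9 = 16 busy instance-equivalents; 16 / 0.75 rounds up to 22.
	fmt.Println("fleet target:", Decide(snapshots, 0.75))
}
```

In a long-running process, `Run(ctx, 30*time.Second, collect, apply)` would drive the same calculation on the 30-second cadence described above. The point of the sketch is that there is one target for the whole fleet rather than 47 independent ones.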

The key insight? Most of DataFlow’s services had inversely correlated load patterns. When the data ingestion service spiked (nightly batch jobs), the API service was idling. The orchestrator would shift resources from API to data ingestion, and vice versa during business hours.
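
A small worked example shows why inversely correlated load matters so much. The load numbers below are made up for illustration, and the helper names (`SumOfPeaks`, `PeakOfSum`, `Pearson`) are hypothetical. The sketch compares what independent scaling has to provision (each service's own peak) with what a shared pool has to provision (the peak of the combined load).

```go
package main

import (
	"fmt"
	"math"
)

// SumOfPeaks is the capacity needed when every service provisions for its
// own peak, which is what independent auto-scaling groups converge on.
func SumOfPeaks(loads ...[]float64) float64 {
	total := 0.0
	for _, series := range loads {
		peak := 0.0
		for _, v := range series {
			peak = math.Max(peak, v)
		}
		total += peak
	}
	return total
}

// PeakOfSum is the capacity needed when the services share one pool:
// the highest combined load in any single time slot.
// All series are assumed to have the same length.
func PeakOfSum(loads ...[]float64) float64 {
	if len(loads) == 0 {
		return 0
	}
	peak := 0.0
	for t := range loads[0] {
		sum := 0.0
		for _, series := range loads {
			sum += series[t]
		}
		peak = math.Max(peak, sum)
	}
	return peak
}

// Pearson returns the correlation coefficient of two equal-length series
// with non-zero variance. Values near -1 mean the loads move in opposition.
func Pearson(x, y []float64) float64 {
	n := float64(len(x))
	var sx, sy float64
	for i := range x {
		sx += x[i]
		sy += y[i]
	}
	mx, my := sx/n, sy/n
	var cov, vx, vy float64
	for i := range x {
		dx, dy := x[i]-mx, y[i]-my
		cov += dx * dy
		vx += dx * dx
		vy += dy * dy
	}
	return cov / math.Sqrt(vx*vy)
}

func main() {
	// Made-up load in instance-equivalents, sampled every four hours.
	api := []float64{2, 8, 10, 8, 2, 1}    // busy during business hours
	ingest := []float64{9, 2, 1, 2, 8, 10} // busy during nightly batches

	fmt.Printf("correlation:  %.2f\n", Pearson(api, ingest))
	fmt.Println("sum of peaks:", SumOfPeaks(api, ingest))
	fmt.Println("peak of sum: ", PeakOfSum(api, ingest))
}
```

With these illustrative numbers, independent scaling has to cover 20 instance-equivalents, while a shared pool only has to cover 11. The gap between those two figures is the headroom a coordinating layer can reclaim.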

No magic. Just coordination.

The Numbers That Matter

After 6 months of gradual rollout, here’s what we achieved:

| Metric | Before | After | Improvement |
|---|---|---|---|
| Monthly AWS spend | $480,000 | $380,000 | -21% |
| Average EC2 utilization | 34% | 72% | +112% |
| P99 latency during spikes | 1,200 ms | 340 ms | -72% |
| Auto-scaling events per hour | 47 | 8 | -83% |
| Annualized savings | $0 | $1,200,000 | |

Let that sink in. $1.2 million per year.

But here’s the part that surprised everyone: while cutting costs, we also improved performance. That P99 latency drop wasn’t accidental. By preventing the thundering herd problem of services scaling simultaneously, we eliminated the resource contention that was causing tail latencies.

No one predicted that. Honestly, neither did we.

The Team That Made It Happen

This wasn’t a team of junior engineers copy-pasting from Stack Overflow. We staffed this project with 5 senior Vietnamese developers from our ECOAAI talent pool—4 backend engineers and 1 DevOps specialist—working out of our hubs in Ho Chi Minh City and Can Tho.

Why Vietnam? Because we needed people who understood:

  • Deep AWS infrastructure (VPC peering, Transit Gateway, detailed billing)
  • Go and Rust for the orchestrator itself
  • Terraform and Kubernetes for deployment
  • Multi-agent systems and distributed coordination

We found exactly that kind of talent in Vietnam. The senior rate? $3,000/month per developer. For senior engineers who’d previously worked at companies like VNG, NashTech, and Axon Active.

Compare that to hiring the same team in San Francisco, where you’re looking at $180,000-$220,000 per engineer per year. At $36,000 a year per developer, we delivered the same quality at roughly 20% of the cost.

And here’s the real kicker: the time difference worked in our favor. Ho Chi Minh City is 11 hours ahead of US Eastern Daylight Time, so my team could sync with the client’s US team at 9 AM their time, push code during the Vietnamese working day, and have it waiting for review when morning came around in the States.

The Technical Details (You Asked For It)

I’m going to show you the core of the orchestrator, trimmed down to the rebalancing logic. The real thing is about 80 lines of Go, but it does the heavy lifting:

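```go
package orchestrator

import (
    "context"
    "math"
    "sync"
)

// Service, Instance, and launchSpotInstances are defined elsewhere in this
// package. Demand and capacity are both measured in instance-equivalents.

// ResourcePool is the shared capacity that all services draw from.
type ResourcePool struct {
    mu           sync.RWMutex
    services     map[string]*Service
    spotPool     []*Instance
    reservedPool []*Instance
}

// Rebalance compares predicted fleet-wide demand against current capacity
// and drains or launches spot instances to close the gap. Reserved
// instances are never touched.
func (p *ResourcePool) Rebalance(ctx context.Context) error {
    // Skip the cycle if the caller has already cancelled.
    if err := ctx.Err(); err != nil {
        return err
    }

    p.mu.Lock()
    defer p.mu.Unlock()

    totalDemand := 0.0
    for _, svc := range p.services {
        totalDemand += svc.GetPredictedDemand()
    }

    totalCapacity := float64(len(p.spotPool) + len(p.reservedPool))
    const targetUtilization = 0.75 // Keep 25% buffer

    requiredCapacity := totalDemand / targetUtilization

    // Too many instances? Drain spot instances
    if totalCapacity > requiredCapacity {
        // Round down so we never drain below the required capacity.
        toRemove := int(totalCapacity - requiredCapacity)
        if toRemove > len(p.spotPool) {
            toRemove = len(p.spotPool)
        }
        for _, inst := range p.spotPool[:toRemove] {
            inst.Drain()
        }
        // Drop drained instances so the next cycle does not count them again.
        p.spotPool = p.spotPool[toRemove:]
        return nil
    }

    // Need more? Launch spot instances
    if totalCapacity < requiredCapacity {
        // Round up: a fractional shortfall still needs a whole instance.
        needed := int(math.Ceil(requiredCapacity - totalCapacity))
        return p.launchSpotInstances(needed)
    }

    return nil
}
```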

We also added predictive scaling using a simple exponential moving average on request volume per service. This meant the orchestrator could anticipate a spike 2-3 minutes before it actually hit, giving the ASG time to warm up instances.
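
For readers who have not used one, an exponential moving average is a few lines of code. The sketch below is a generic EMA, assuming a smoothing factor of 0.5 and a hypothetical `EMA` type; the production smoothing factor and the way the forecast was turned into a 2-3 minute lead are not shown here.

```go
package main

import "fmt"

// EMA tracks an exponentially weighted moving average of a signal,
// such as requests per second for one service.
type EMA struct {
	Alpha  float64 // smoothing factor in (0, 1]; higher reacts faster
	value  float64
	seeded bool
}

// Update folds a new sample into the average and returns the new value.
// The first sample seeds the average directly.
func (e *EMA) Update(sample float64) float64 {
	if !e.seeded {
		e.value = sample
		e.seeded = true
		return e.value
	}
	e.value = e.Alpha*sample + (1-e.Alpha)*e.value
	return e.value
}

func main() {
	ema := &EMA{Alpha: 0.5} // illustrative value
	for _, rps := range []float64{100, 200, 400} {
		fmt.Printf("sample=%.0f ema=%.0f\n", rps, ema.Update(rps))
	}
}
```

A higher `Alpha` follows a ramp more closely but is noisier; a lower one is calmer but reacts later. On its own an EMA smooths the signal rather than predicting it, so the lead time depends on acting as soon as the smoothed value starts to climb.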

The result? We went from 47 auto-scaling events per hour to just 8. That's less thrashing, less cost, and more stability.

What We Learned the Hard Way

Not everything went smoothly. Here are three things that nearly derailed the project:

1. Cold start penalties from spot instances

When we scaled down aggressively, we sometimes terminated instances that held warmed-up application caches. The fix? A graceful cooldown period—instances marked for termination stayed alive for 5 minutes while caches drained to a shared Redis cluster.
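
The cooldown itself can be sketched as a small state check. This is an illustration under assumptions: the `Instance` type and its method names are hypothetical, and the step that flushes caches to Redis is left out.

```go
package main

import (
	"fmt"
	"time"
)

// cooldown is the grace period from the text: five minutes.
const cooldown = 5 * time.Minute

// Instance is a hypothetical handle for one spot instance.
type Instance struct {
	ID       string
	draining bool
	since    time.Time // when draining began
}

// MarkForTermination starts the cooldown. It is idempotent, so repeated
// rebalance cycles do not reset the clock.
func (i *Instance) MarkForTermination(now time.Time) {
	if !i.draining {
		i.draining = true
		i.since = now
	}
}

// ReadyToTerminate reports whether the cooldown has fully elapsed.
func (i *Instance) ReadyToTerminate(now time.Time) bool {
	return i.draining && now.Sub(i.since) >= cooldown
}

func main() {
	start := time.Date(2025, 1, 1, 2, 0, 0, 0, time.UTC)
	inst := &Instance{ID: "i-example"}
	inst.MarkForTermination(start)

	for _, elapsed := range []time.Duration{time.Minute, 4 * time.Minute, 5 * time.Minute} {
		fmt.Printf("after %v: ready=%v\n", elapsed, inst.ReadyToTerminate(start.Add(elapsed)))
	}
}
```

Passing `now` in as a parameter keeps the check easy to test without waiting five real minutes.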

2. Network bandwidth bottlenecks in the orchestrator

The original orchestrator collected metrics via HTTP polling. At 47 services polling every 10 seconds, that's 4.7 requests/second. Not terrible. But when we added detailed trace IDs for debugging, it ballooned to 40 requests/second. We switched to a push-based model using NATS. Problem solved.

3. Team communication friction in the first 2 weeks

The client's DevOps lead had never worked with a remote team before. We burned a week on misaligned expectations around code review turnaround. The fix was brutal but simple: a shared Slack channel with daily standups at 9 AM ET / 8 PM ICT. That overlap window—just one hour—forced both teams to be concise and decide quickly.

The Real ROI

$1.2M per year in cloud savings. But that's just the direct number.

The indirect benefits were arguably more impactful:

  • Engineering velocity increased by 40% because developers stopped fighting infrastructure fires
  • On-call rotations went from nightly pagers to quiet weeks (the orchestrator absorbed most incident triggers)
  • The client's margins improved by 18 percentage points—a direct line to valuation

One data point that still makes me smile: the client's CFO asked if we could "replicate this for their other environments" (staging, QA). We did it for another $180K/year savings.

Total: $1.38M annual savings.

Frequently Asked Questions

How long did the migration actually take end-to-end?

From the first discovery call to full production rollout, it took 24 weeks. The first 6 weeks were entirely focused on understanding DataFlow's infrastructure and building the orchestrator prototype. Weeks 7-18 were iterative deployment to non-critical services. The final 6 weeks were hardening and rollout to production.

Did you need to modify the application code, or was it all infrastructure changes?

We did not modify a single line of DataFlow's application code. Everything we changed was in the infrastructure layer—Terraform modules, Kubernetes manifests, and the custom orchestrator. The applications themselves were blissfully unaware of the changes.

Won't spot instance interruptions cause downtime?

Yes, spot instances can be reclaimed with 2-minute notice. We mitigated this by keeping a base layer of reserved instances (30% of peak capacity). The orchestrator's rebalancing handles spot interruptions by redistributing load to reserved instances within 30 seconds. In 6 months of production, we experienced zero customer-facing incidents from spot interruptions.
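
The reserved floor is simple arithmetic. The sketch below assumes capacity is counted in whole instances and uses a hypothetical `Plan` function; it shows only the split between reserved and spot capacity, not how load is moved when a spot instance is reclaimed.

```go
package main

import "fmt"

// reservedPercent is the reserved base from the text: 30% of peak capacity.
const reservedPercent = 30

// Plan splits a required instance count into reserved and spot capacity.
// peak and required are instance counts; the name and signature are
// illustrative.
func Plan(peak, required int) (reserved, spot int) {
	// Integer ceiling of peak * 30%, so the base never rounds down.
	reserved = (peak*reservedPercent + 99) / 100
	if required > reserved {
		spot = required - reserved
	}
	return reserved, spot
}

func main() {
	for _, required := range []int{20, 72, 100} {
		reserved, spot := Plan(100, required)
		fmt.Printf("required=%d reserved=%d spot=%d\n", required, reserved, spot)
	}
}
```

Because the reserved base is always running, a reclaimed spot instance shrinks the pool but never takes it below that floor.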

How does this compare to using AWS Compute Optimizer or other native tools?

AWS native tools optimize at the instance level, not the workload level. Compute Optimizer might suggest downsizing an m5.xlarge to an m5.large based on CPU patterns. But it can't coordinate across services. That's the difference between local optimization and global optimization, and the latter is where the real savings live.
