We Slashed a SaaS’s AI Token Costs by 67% with Selective Context Injection — A Vietnam Offshore Case Study

Context is king. But in a multi-agent system, context is also the most expensive thing you can ship. We learned this the hard way.

A B2B SaaS client in the US came to us last quarter. Their product — an AI-assisted compliance workflow tool — was bleeding money on API calls. Their monthly OpenAI bill had hit $14,000, and it was growing 12% month over month. The CTO told me, “If this keeps up, we have to deprioritize the AI features entirely.”

That’s when we brought in our team from Ho Chi Minh City and the ECOA AI Platform ACP.

The Problem: The Hidden Context Tax

Here’s what was happening. Their multi-agent system had five specialized agents: an intent parser, a regulation lookup agent, a clause extractor, a risk assessor, and a summary generator.

On paper, that’s a clean architecture. But here’s the dirty secret no one talks about.

Every single agent was receiving the full conversation history plus the entire regulation database context for every call. Even when the intent parser only needed the last user message. Even when the summary generator only needed the risk assessor’s output.

We ran an audit. 72% of all tokens consumed were wasted on irrelevant context. The intent parser alone was burning 11,000 tokens per call just to parse a 50-word user query.

If every agent gets the same firehose of data, you’re not orchestrating. You’re DDoSing your own wallet.

The Architecture: Selective Context Injection

Our team in Vietnam — specifically three mid-level engineers and one senior architect — built a context router on top of ECOA AI Platform ACP. The core idea is dead simple: each agent declares what context it actually needs, and the router injects only that.

Here’s the architecture in a nutshell:

python
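class ContextRouter:
    """Builds a per-agent context from the keys each agent declares."""

    def __init__(self, agent_registry):
        # Maps agent name -> schema with 'context_keys' and 'type'
        self.agents = agent_registry

    def route(self, agent_name, user_input, global_memory):
        # user_input is accepted but not used in this simplified version
        agent_schema = self.agents[agent_name]
        # Only pull keys the agent actually declared; skip keys not yet in memory
        relevant_keys = agent_schema['context_keys']
        context = {k: global_memory[k] for k in relevant_keys if k in global_memory}

        # Inject only the last 2 turns for conversational agents
        if agent_schema.get('type') == 'conversational':
            context['history'] = global_memory.get('history', [])[-2:]

        return context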

No magic. No AI-hype. Just a clear contract between agents and the runtime.
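
To make the contract concrete, here is a minimal sketch of what a registry for the five agents could look like. The key names, agent types, and sample memory are all hypothetical, not the client's actual schema, and `build_context` only follows the same idea as the `route` method above.

```python
import json

# Hypothetical registry: each agent declares the memory keys it needs.
AGENT_REGISTRY = {
    "intent_parser": {"type": "conversational", "context_keys": ["last_user_message"]},
    "regulation_lookup": {"type": "task", "context_keys": ["intent"]},
    "clause_extractor": {"type": "task", "context_keys": ["intent", "regulation_hits"]},
    "risk_assessor": {"type": "task", "context_keys": ["clauses", "regulation_hits"]},
    "summary_generator": {"type": "task", "context_keys": ["risk_report"]},
}


def build_context(agent_name, global_memory, registry=AGENT_REGISTRY):
    """Return only the declared keys, plus the last 2 turns for conversational agents."""
    schema = registry[agent_name]
    context = {k: global_memory[k] for k in schema["context_keys"] if k in global_memory}
    if schema["type"] == "conversational":
        context["history"] = global_memory.get("history", [])[-2:]
    return context


# Made-up shared memory standing in for a real session.
memory = {
    "history": [f"turn {i}" for i in range(40)],
    "last_user_message": "Does clause 7 conflict with our retention policy?",
    "intent": "clause_conflict_check",
    "regulation_hits": ["REG-A section 3", "REG-B section 12"],
    "clauses": ["Clause 7: data is retained for 10 years."],
    "risk_report": "Medium risk: retention period exceeds the stated policy.",
}

summary_ctx = build_context("summary_generator", memory)
parser_ctx = build_context("intent_parser", memory)

# Serialized size is a rough proxy for prompt size, not a token count.
full_size = len(json.dumps(memory))
summary_size = len(json.dumps(summary_ctx))
print(f"full memory: {full_size} chars, summary generator context: {summary_size} chars")
```

The summary generator ends up with a single key instead of the whole session, which is the entire point of the pattern.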

We also built a static context analyzer that runs during CI. It flags any agent that requests more context than it used in the last 100 test runs. This catches bloat before it hits production.
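
A rough sketch of that check, assuming the test harness records which context keys each agent actually read during a run. The log format and the function name `find_unused_keys` are illustrative assumptions, not the analyzer we shipped.

```python
from collections import defaultdict


def find_unused_keys(declared, run_log, window=100):
    """Return {agent: sorted unused keys} over each agent's last `window` runs.

    declared: {agent_name: list of declared context keys}
    run_log:  list of (agent_name, set of keys read during that run), oldest first
    """
    runs_by_agent = defaultdict(list)
    for agent, keys_read in run_log:
        runs_by_agent[agent].append(set(keys_read))

    report = {}
    for agent, keys in declared.items():
        recent = runs_by_agent[agent][-window:]
        if not recent:
            continue  # no data yet; do not flag an agent we never observed
        used = set().union(*recent)
        unused = sorted(set(keys) - used)
        if unused:
            report[agent] = unused
    return report


# Hypothetical declarations and a synthetic log of 100 runs per agent.
declared = {
    "intent_parser": ["last_user_message", "regulation_db"],
    "summary_generator": ["risk_report"],
}
run_log = [("intent_parser", {"last_user_message"})] * 100 + [
    ("summary_generator", {"risk_report"})
] * 100

bloat = find_unused_keys(declared, run_log)
print(bloat)
```

In CI, a non-empty report is what gets flagged for the agent's owner to trim.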

The Numbers: What Actually Changed

Let’s talk hard data. Before and after deployment:

Metric	Before	After	Reduction
Avg tokens per agent call	12,400	3,900	68.5%
Monthly OpenAI spend	$14,200	$4,686	67%
Avg response latency	4.2s	2.5s	40.5%
Agent error rate	3.1%	1.8%	41.9%

But honestly, the latency improvement surprised me more. By cutting out the noise, the models started returning cleaner output on the first attempt. We went from an average of 1.4 retries per agent to 0.3 retries.

That’s not a small gain. Fewer retries mean fewer API calls. It compounds.
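
The arithmetic behind that, as a quick sketch. It assumes every retry is one extra full API call; how this interacts with the per-call token averages in the table depends on how those averages were measured, so the snippet only looks at call counts.

```python
def calls_per_task(avg_retries):
    """Each agent step costs one initial call plus its average retries."""
    return 1 + avg_retries


before = calls_per_task(1.4)  # 2.4 calls per agent step
after = calls_per_task(0.3)   # 1.3 calls per agent step
call_reduction = 1 - after / before

print(f"calls per step: {before:.1f} -> {after:.1f} ({call_reduction:.1%} fewer)")
```

Under that assumption, the drop in retries alone removes a bit under half of the calls per agent step, before any per-call token savings are counted.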

Why the Vietnamese Team Mattered

You could argue that selective context injection is a well-known pattern. True. So why did it take two months of internal struggle before they called us?

*Communication.* The existing team kept overcomplicating the solution. They wanted to build a custom ML model to “dynamically predict context relevance.” Our senior engineer in Can Tho looked at the data, did a quick grep of the conversation logs, and said:

“Why predict what we already know? They declare their needs. Inject accordingly. Ship in two weeks.”

We shipped in 10 days.

This is the ECOAAI edge. You get engineers who are technical, pragmatic, and fluent in English. They don’t need to be managed on every detail. They read the requirements, push back when there’s a simpler path, and execute without the theater.

The Orchestration Layer: ECOA AI Platform ACP

The selective context router runs as a dedicated proxy agent within the ECOA AI Platform ACP. The platform handles the heavy lifting — retry logic with exponential backoff, distributed tracing via OpenTelemetry, and a priority queue for high-stakes compliance checks.

We used ACP’s built-in state snapshotting to cache the most common regulation lookups. This cut 80% of the lookup calls entirely. The CTO told me, “I honestly thought caching regulation data would be harder. ACP just… handled it.”
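
ACP's snapshotting is platform-specific, so the sketch below swaps in a plain in-process cache from the Python standard library to show the general effect. `fetch_regulation` is a placeholder for the expensive lookup, and the request list is made up.

```python
from functools import lru_cache

lookup_calls = 0  # counts how often the expensive path actually runs


def fetch_regulation(regulation_id):
    """Placeholder for the expensive lookup (an LLM or database call in practice)."""
    global lookup_calls
    lookup_calls += 1
    return f"text of {regulation_id}"


@lru_cache(maxsize=1024)
def cached_regulation(regulation_id):
    # Identical IDs are served from memory after the first call.
    return fetch_regulation(regulation_id)


# A skewed workload: a few regulations account for most requests.
requests = ["REG-1", "REG-2", "REG-1", "REG-1", "REG-3",
            "REG-2", "REG-1", "REG-1", "REG-2", "REG-1"]
results = [cached_regulation(r) for r in requests]

info = cached_regulation.cache_info()
print(f"{len(requests)} requests, {lookup_calls} real lookups, {info.hits} cache hits")
```

The more skewed the lookups, the bigger the win. A real deployment would also need some way to invalidate entries when a regulation changes, which this sketch leaves out.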

To be fair, nothing is perfect. We hit one issue where the agent registry schema wasn’t versioned properly, causing a brief window of mismatched contexts. That was a one-hour fix — we added a `schema_version` field and a validation webhook in the CI pipeline.
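
A minimal sketch of that kind of check, assuming each registry entry carries a `schema_version` field. The expected version number and the shape of the error messages are hypothetical.

```python
SUPPORTED_SCHEMA_VERSION = 2  # hypothetical version the router understands


def validate_registry(registry, expected_version=SUPPORTED_SCHEMA_VERSION):
    """Return a list of problems; an empty list means the registry is valid."""
    problems = []
    for name, schema in registry.items():
        version = schema.get("schema_version")
        if version != expected_version:
            problems.append(f"{name}: schema_version {version!r}, expected {expected_version}")
        if not isinstance(schema.get("context_keys"), list):
            problems.append(f"{name}: context_keys must be a list")
    return problems


registry = {
    "risk_assessor": {"schema_version": 2, "context_keys": ["clauses"]},
    "summary_generator": {"schema_version": 1, "context_keys": ["risk_report"]},
}

problems = validate_registry(registry)
for p in problems:
    print(p)
```

Failing the build whenever `problems` is non-empty keeps a stale agent schema from ever being paired with a newer router.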

What I’d Do Differently

If I had to redo this project, I’d start with a context budget for each agent from day one. Every agent gets a hard limit on the number of tokens it can request. If a design requires more tokens, you either refactor the agent or justify it in a formal review.
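
Here is roughly what a hard budget could look like. The limits are made up, and `estimate_tokens` uses a crude four-characters-per-token guess as a stand-in; a real setup would plug in the provider's tokenizer.

```python
import json


class ContextBudgetExceeded(Exception):
    """Raised when an agent's context is larger than its budget."""


def estimate_tokens(text):
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def enforce_budget(agent_name, context, budgets):
    """Return the estimated token count, or raise if it exceeds the agent's budget."""
    used = estimate_tokens(json.dumps(context))
    limit = budgets[agent_name]
    if used > limit:
        raise ContextBudgetExceeded(f"{agent_name}: {used} tokens requested, budget is {limit}")
    return used


budgets = {"intent_parser": 500, "risk_assessor": 4000}  # hypothetical limits

small = {"last_user_message": "Does clause 7 conflict with our retention policy?"}
used = enforce_budget("intent_parser", small, budgets)

oversized = {"regulation_db": "x" * 10_000}
try:
    enforce_budget("intent_parser", oversized, budgets)
    blocked = False
except ContextBudgetExceeded as exc:
    blocked = True
    print(exc)
```

Raising instead of silently truncating is deliberate in this sketch: an over-budget agent should trigger the refactor-or-justify conversation, not quietly lose context.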

Second, I’d instrument every agent call with a context utilization metric. How many of the injected tokens actually influenced the output? We built this after the fact, but having it from the start would have saved us a week of analysis.

The Real Takeaway

Multi-agent systems are not just about routing. They’re about putting your agents on a data diet. The cheaper your agents can think, the more you can scale without burning cash.

We’ve now deployed this pattern for three other clients. One fintech startup saw a 58% cost reduction. A health-tech platform cut theirs by 71%.

This is what happens when you pair pragmatic Vietnamese engineering with a platform that eliminates the coordination tax. You don’t just outsource code. You outsource smart decisions.

—

Frequently Asked Questions

Q: Does selective context injection work with any LLM provider?

Yes. The pattern is provider-agnostic. We built the initial pipeline using OpenAI’s GPT-4o, but the context router just constructs a smaller prompt. We’ve since tested it with Claude 3.5 Sonnet, Gemini 1.5 Pro, and open-source models like Llama 3.1. The cost savings are similar: between 50% and 70%.

Q: What happens if an agent needs unexpected context in production?

That’s exactly what the static context analyzer catches in CI. If a test run shows an agent accessing keys it didn’t declare, the pipeline fails the build. For genuine production edge cases, we added a `fallback_expand` mechanism: the router can dynamically include extra context on request, but it logs every such fallback to a separate monitoring channel for review.
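
A simplified sketch of how such a fallback could behave. The function signature and logger name are assumptions for illustration, and a standard-library logger stands in for the monitoring channel.

```python
import logging

# Stand-in for the separate monitoring channel.
fallback_log = logging.getLogger("context_router.fallback")


def route_with_fallback(schema, global_memory, fallback_expand=()):
    """Inject declared keys, plus any explicitly requested extras, logging each extra."""
    context = {k: global_memory[k] for k in schema["context_keys"] if k in global_memory}
    for key in fallback_expand:
        if key in context or key not in global_memory:
            continue  # already injected, or nothing to add
        context[key] = global_memory[key]
        fallback_log.warning("fallback_expand: injected undeclared key %r", key)
    return context


memory = {"risk_report": "Medium risk.", "clauses": ["Clause 7"], "history": []}
schema = {"context_keys": ["risk_report"]}

normal = route_with_fallback(schema, memory)
expanded = route_with_fallback(schema, memory, fallback_expand=["clauses"])
print(sorted(normal), sorted(expanded))
```

Every logged fallback is a hint that an agent's declaration is out of date, so the review step matters as much as the escape hatch itself.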

Q: How long did the integration take from contract to production?

Roughly 7 weeks. Three weeks for initial architecture review and schema design. Two weeks for implementation and integration testing. Two more weeks for load testing and tuning the context budgets. The Vietnamese team completed the bulk of the implementation in 10 days, but the client’s internal compliance review added some buffer time.