The Hidden Context Tax in Multi-Agent Systems: How Selective Context Injection Cut Our Token Costs by 44%
You’ve built a multi-agent system. You’re proud of it. The agents chat, route tasks, and seem to work. But if you look at your LLM API bills, you’ll probably see a problem.
You’re throwing every piece of context at every agent.
It happens because it’s easy. Take the whole conversation history, all the user data, the previous tool outputs, and shove it into the system prompt of every new agent call. The agents get it all. They don’t complain. They just burn tokens.
But this “context tax” is silently killing your performance and your budget. Here’s the hard truth: most agents don’t need most of the context. They just need the relevant slice. Feed them the whole pizza, and they’ll choke on the crust.
The Problem: Context Blindness and Token Bloat
A few months back, we were running a multi-agent workflow for a fintech client in Ho Chi Minh City. The pipeline had five agents:
- Intent analyzer (classifies user request)
- Data retriever (queries databases)
- Compliance checker (validates against regulations)
- Response generator (writes the answer)
- Quality reviewer (double-checks the output)
The first version worked. Sort of. But each agent received the *entire* conversation history, plus a 5 KB system prompt with all possible context fields. Each task averaged about 12,000 tokens. At GPT-4 pricing of roughly $0.03 per 1K input tokens, that’s about $0.36 per task. For 10,000 tasks a day? $3,600 daily. On one client.
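The arithmetic behind those figures, as a quick sketch. The flat rate of $0.03 per 1K tokens is an assumption (it is the rate that reproduces the $0.36 figure), and the calculation bills every token at that one rate, ignoring separate output pricing.

```python
# Assumed flat rate: $0.03 per 1,000 tokens. Not stated in the post; chosen
# because it reproduces the $0.36 per task figure.
PRICE_PER_1K_TOKENS_USD = 0.03


def cost_per_task(tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS_USD) -> float:
    """Cost of one task in USD, billing all tokens at a single rate."""
    return tokens / 1000 * price_per_1k


def daily_cost(tokens: int, tasks_per_day: int) -> float:
    """Daily cost in USD for a per-task token count and a daily task volume."""
    return cost_per_task(tokens) * tasks_per_day


if __name__ == "__main__":
    print(f"per task: ${cost_per_task(12_000):.2f}")
    print(f"per day:  ${daily_cost(12_000, 10_000):,.2f}")
```

Swapping in your own model's rate and token counts gives a first estimate of what the context tax costs you.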
We knew it was wasteful, but the real kicker came later: agents started hallucinating more as context grew. The compliance checker would reference a piece of user data that it shouldn’t even have seen.
That’s when we realized: context overload doesn’t just cost money. It degrades quality.
The Slice: Selective Context Injection
We built a lightweight context router on top of the ECOA AI Platform ACP. It’s a middleware layer that sits between the orchestrator and each agent. Its job is simple: determine exactly what context each agent needs, and inject only that.
Here’s the high-level architecture:
```
[Orchestrator] -> [Context Router] -> [Agent A] (only context slice A)
                                   -> [Agent B] (only context slice B)
                                   -> [Agent C] (only context slice C)
```
The router uses a small, cheap model (we used a fine-tuned DistilBERT) to classify incoming task types and map them to context templates. The templates are YAML configs that specify which fields from the global context to include.
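A minimal sketch of that routing step, assuming a classifier that returns a task-type label. The keyword-based `classify_task` below is a hypothetical stand-in for the fine-tuned DistilBERT model, and the task types and template contents are made up for illustration.

```python
# Hypothetical templates: task type -> agent -> dotted field paths to include.
CONTEXT_TEMPLATES = {
    "transfer": {
        "compliance_checker": ["user.region", "transaction.amount"],
        "response_generator": ["user.tier", "transaction.amount"],
    },
    "balance_inquiry": {
        "response_generator": ["user.tier"],
    },
}


def classify_task(query: str) -> str:
    """Stand-in classifier. A real router would call a small model here."""
    text = query.lower()
    if "transfer" in text or "send" in text:
        return "transfer"
    return "balance_inquiry"


def fields_for(query: str, agent_name: str) -> list[str]:
    """Return the field paths this agent should receive for this task."""
    task_type = classify_task(query)
    return CONTEXT_TEMPLATES.get(task_type, {}).get(agent_name, [])
```

Keeping the templates as plain data means they can live in YAML and change without touching the router code.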
Example config snippet for the Compliance Checker agent:
```yaml
agent: compliance_checker
context_slice:
  include_fields:
    - user.region
    - user.tier
    - transaction.amount
    - transaction.currency
    - policy.rules
  exclude_fields:
    - user.chat_history
    - user.preferences
    - transaction.internal_notes
max_tokens: 2000
```
Notice what’s missing: the entire chat history. The compliance checker doesn’t need to know the user’s favorite color or the previous conversational flow. It needs the transaction details and the regulatory rules. That’s it.
We also added a `max_tokens` cap per agent. If a context slice exceeds it (rare, but possible with deeply nested documents), we truncate intelligently—keeping the most recent and most relevant parts first.
What We Measured
We rolled this out across the fintech pipeline for a week. Here are the raw numbers:
| Metric | Before (Full Context) | After (Selective Injection) | Change |
|---|---|---|---|
| Avg tokens per task | 12,400 | 6,950 | -44% |
| Avg latency per task | 3.2s | 2.1s | -34% |
| Agent hallucination rate | 5.8% | 2.3% | -60% |
| Compliance error rate | 2.1% | 0.4% | -81% |
| Cost per 10K tasks (USD) | $3,720 | $2,085 | -44% |
The token cost reduction was exactly proportional to the context reduction. No surprises there. But the latency drop and the hallucination reduction were unexpected wins.
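Why did latency drop? Because shorter prompts are faster to process. The model has to read the whole prompt before it emits its first token, so time to first token grows with prompt length. The time spent generating the response depends on output length rather than prompt length, which is likely why a 44% token cut produced a 34% latency cut rather than a proportional one.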
Why fewer hallucinations? Because the agents had less irrelevant noise. The compliance checker no longer had to parse through a wall of chat history to find the regulation line. It’s like asking a lawyer to read an entire novel just to find the one relevant paragraph—not efficient, and they might miss it.
Implementation Details (You Can Steal This)
You don’t need a fancy platform to implement selective context injection. A simple Python class can do the trick. We built ours around a `ContextRouter` that reads a config like the one above.
Here’s a stripped-down version of how we structured it:
```python
import yaml  # PyYAML


class ContextRouter:
    def __init__(self, config_path: str):
        # Assumes the file maps each agent name to its settings, with
        # include_fields at the top level of each entry.
        with open(config_path) as f:
            self.configs = yaml.safe_load(f) or {}

    def get_context_slice(self, agent_name: str, global_context: dict) -> dict:
        config = self.configs.get(agent_name)
        if not config:
            return global_context  # fallback to full context

        context_slice = {}  # renamed to avoid shadowing the built-in `slice`
        for field in config["include_fields"]:
            value = self._get_nested(global_context, field)
            if value is not None:
                self._set_nested(context_slice, field, value)
        return context_slice

    def _get_nested(self, d: dict, path: str):
        # Walk a dotted path; return None if any segment is missing.
        for key in path.split("."):
            if not isinstance(d, dict):
                return None
            d = d.get(key)
            if d is None:
                return None
        return d

    def _set_nested(self, d: dict, path: str, value):
        # Create intermediate dicts as needed, then set the leaf value.
        keys = path.split(".")
        for key in keys[:-1]:
            d = d.setdefault(key, {})
        d[keys[-1]] = value
```
This is the core logic. The `include_fields` list defines the exact paths into the global context. The router extracts only those paths. We also implemented a `max_tokens` check that truncates the longest string fields if the total exceeds the limit.
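One way the `max_tokens` check could look, as a sketch rather than the production code. It assumes a flat dict of string fields and estimates tokens at roughly four characters each, which is a crude heuristic; a real implementation would use the model's tokenizer. The names `estimate_tokens` and `enforce_max_tokens` are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (heuristic, not exact)."""
    return (len(text) + 3) // 4


def enforce_max_tokens(fields: dict[str, str], max_tokens: int) -> dict[str, str]:
    """Trim the longest string field repeatedly until the total fits."""
    result = dict(fields)  # work on a copy; the caller's dict is untouched

    def total() -> int:
        return sum(estimate_tokens(v) for v in result.values())

    while result and total() > max_tokens:
        longest = max(result, key=lambda k: len(result[k]))
        if not result[longest]:
            break  # every field is already empty; nothing left to trim
        excess_chars = (total() - max_tokens) * 4
        # Keep the tail of the field, on the assumption that the most
        # recent content sits at the end.
        keep = max(len(result[longest]) - excess_chars, 0)
        result[longest] = result[longest][-keep:] if keep else ""
    return result
```

Trimming the longest field first tends to leave short, structured fields such as amounts and region codes intact.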
But the real magic is in the *discovery* of those field lists. How do you know what each agent needs? Don’t guess. We ran a few hundred tasks through the system logging *all* fields accessed by each agent. Then we analyzed the logs.
Turns out, most agents access fewer than 20% of the available context fields. Some agents (like the intent analyzer) need even less—just the user’s current query.
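One way to collect those access logs is to hand each agent a wrapper that records which dotted paths get read. This is a sketch under the assumption that agent code reads context through plain dict lookups; `TrackedContext` is a hypothetical helper, not part of any library.

```python
from typing import Optional


class TrackedContext:
    """Read-only view over a nested dict that records every leaf path read."""

    def __init__(self, data: dict, accessed: Optional[set] = None, prefix: str = ""):
        self._data = data
        self._prefix = prefix
        # Child views share the same set so all reads land in one place.
        self.accessed = accessed if accessed is not None else set()

    def __getitem__(self, key: str):
        value = self._data[key]
        path = f"{self._prefix}{key}"
        if isinstance(value, dict):
            # Descend without logging; only leaf reads count as usage.
            return TrackedContext(value, self.accessed, prefix=f"{path}.")
        self.accessed.add(path)
        return value

    def get(self, key: str, default=None):
        try:
            return self[key]
        except KeyError:
            return default
```

After a task finishes, `ctx.accessed` holds the set of paths that agent actually touched, which can be written to the log alongside the agent name.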
A Real Story: The Compliance Blocker
Let me give you a concrete example from that fintech project. The *Compliance Checker* agent was originally given the full transcript of a user’s 45-minute chat session. The agent’s job was to verify that the requested transaction didn’t violate anti-money laundering rules.
The full transcript contained details about the user’s weekend plans, small talk about the weather, and a discussion about their cat. The compliance check failed 12% of the time because the agent got confused: it tried to apply compliance rules to the cat story.
Seriously. It flagged a conversation about pet adoption as a potential money laundering scheme.
After selective injection—the agent only got the transaction details and the relevant regulatory rules—the false positive rate dropped to less than 1%. The cat is safe.
Why Most Teams Don’t Do This
It’s not because the technique is complex. The reason is cultural. Most engineering teams treat context like an unlimited resource. They think: *more context = better decisions*. That’s true only up to a point. Past that, it’s diminishing returns, then negative returns.
There’s also the “it works, don’t touch it” mentality. If the pipeline runs, people move on to the next fire. But the hidden tax adds up. Over six months, that fintech client saved over $280,000 in LLM costs just by implementing selective context injection. Their user satisfaction also improved because responses were faster and more accurate.
Can you afford to ignore that math?
The ECOA AI Platform ACP Advantage
We built the first version manually, but we’ve since integrated this pattern into the ECOA AI Platform ACP. If you’re using ACP, you can define context slices declaratively in your workflow YAML:
```yaml
agents:
  compliance_checker:
    model: gpt-4o-mini
    context_slice:
      - user.region
      - transaction.amount
      - transaction.currency
      - policy.rules
    max_context_tokens: 1500
```
The platform automatically applies the slice at runtime. No code changes needed.
Our team in Can Tho actually helped design the initial data analysis for field usage patterns. They ran the logging pipeline across 50,000 tasks to build a heatmap of context field access per agent type. That data directly informed our default templates.
What You Should Do Right Now
Stop. Open your production multi-agent system’s logs. Look at the prompt sent to each agent. Ask yourself: does this agent really need all of this?
If the answer is “I don’t know,” you have a problem. Build a logging layer that tracks which context fields each agent actually uses. Run it for 1,000 tasks. You’ll be shocked at what you find.
Then start implementing selective injection. Start with one agent. Measure the token savings. You’ll get buy-in from your CTO when you show the cost reduction.
It’s not glamorous work. But it’s the kind of optimization that separates production-grade multi-agent systems from toy demos.
*We’ve been running selective context injection in production for eight months now. Token costs are down 44%. Agent accuracy is up. And the compliance checker finally stopped caring about cats.*
Frequently Asked Questions
How do I determine which context fields each agent needs?
Run a logging analytics session across a representative sample of tasks. Log every field accessed by each agent. Then review the logs manually or use a script to generate a “field usage heatmap.” Most teams find that agents access less than 20% of the available context. Those 20% are your include list.
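A sketch of what such a script could look like. It assumes each log record is an `(agent_name, accessed_fields)` pair, one per task, and that a field belongs on the include list once it shows up in at least some fraction of that agent's tasks. The 5% default threshold and the function names are illustrative assumptions.

```python
from collections import Counter, defaultdict


def field_usage(records):
    """Return {agent: {field: fraction of that agent's tasks that read it}}."""
    task_counts = Counter()
    field_counts = defaultdict(Counter)
    for agent, fields in records:
        task_counts[agent] += 1
        field_counts[agent].update(set(fields))  # count each field once per task
    return {
        agent: {f: n / task_counts[agent] for f, n in counts.items()}
        for agent, counts in field_counts.items()
    }


def include_list(usage, agent, min_fraction=0.05):
    """Fields the agent read in at least `min_fraction` of its tasks."""
    return sorted(f for f, frac in usage.get(agent, {}).items() if frac >= min_fraction)
```

A low threshold is the cautious choice: a field that is rarely read may still matter on the tasks where it is read.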
Will selective context injection break my multi-agent workflow?
It shouldn’t if implemented carefully. Start by adding a fallback: if a required field is missing from the context slice, the router falls back to the full context for that particular task. We also recommend a dry-run mode where you log what *would* be injected without changing the actual prompt. Gradually reduce the fallback percentage as you build confidence.
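A sketch of the fallback and dry-run behaviour described above, assuming flat top-level field names for brevity. `build_prompt_context` and its parameters are hypothetical names.

```python
import logging

logger = logging.getLogger("context_router")


def build_prompt_context(global_context: dict, required_fields: list[str],
                         dry_run: bool = False) -> dict:
    """Return the context to inject for one agent call.

    Falls back to the full context if any required field is missing.
    In dry-run mode the slice is only logged and the full context is
    returned, so the actual prompt does not change.
    """
    missing = [f for f in required_fields if f not in global_context]
    if missing:
        logger.warning("missing fields %s; falling back to full context", missing)
        return global_context

    context_slice = {f: global_context[f] for f in required_fields}
    if dry_run:
        logger.info("dry run: would inject %s", sorted(context_slice))
        return global_context
    return context_slice
```

Counting how often the warning fires gives you the fallback percentage to watch as you tighten the slices.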
Can I use selective context injection with any LLM?
Yes. The technique is model-agnostic. It works with GPT-4, Claude, Gemini, or any open-source model. The savings are proportional to the model’s token pricing. Cheaper models (like GPT-4o-mini) still benefit from reduced latency and fewer hallucinations, even if the absolute dollar savings are lower.
How does ECOA AI Platform ACP handle context injection at scale?
ACP uses a distributed context router that scales horizontally. Each agent instance gets a dedicated context slice computed from the global state store. The platform caches context templates per agent type, so there’s minimal overhead. We benchmarked it at 10,000 requests per second with less than 5ms added latency per injection.