We Cut a Fintech Startup’s AI Token Costs by 67% with a Multi-Model Routing Strategy — An ECOA AI Case Study
Let me start with a number that hurts: $12,430.
That’s how much a single fintech client was spending per month on OpenAI’s GPT-4 API. They were a seed-stage startup building a real-time document analysis tool for loan underwriting. Every PDF, every balance sheet, every tax return got thrown at the most expensive model available.
Why? Because they didn’t have a choice.
Or so they thought.
We came in with a different idea. Actually, it’s an idea that’s obvious in retrospect: not every task needs a PhD-level model. Why run a simple regex-like extraction through GPT-4 when a fine-tuned mini-LM or even GPT-4o-mini can do the job at 5% of the cost?
Here’s how we built it, why it worked, and the exact architecture we deployed. This is a true story out of our ECOA AI team in Ho Chi Minh City.
The Problem: One Model for Every Job
The startup’s pipeline looked like this:
- User uploads a financial document (PDF, scanned image, CSV).
- Document goes through OCR.
- *Everything* gets chunked and sent to GPT-4 with a long system prompt.
- GPT-4 extracts structured data: income, expenses, risk flags.
- A downstream rules engine makes the final underwriting decision.
The problem wasn’t accuracy—GPT-4 is great. The problem was cost per document. A typical loan application might generate 8,000–15,000 input tokens and 2,000–4,000 output tokens. At GPT-4 pricing ($30/1M input tokens, $60/1M output tokens), that’s roughly $0.36 to $0.69 per document. Do that for 500 documents a day, and you’re at $180–$345 daily. Over a month? Brutal.
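The per-document figures above are easy to reproduce. Here is a minimal sketch using only the prices and token ranges quoted in this section; the function and variable names are ours, for illustration.

```python
# Prices in USD per 1M tokens, as quoted above for GPT-4.
INPUT_PRICE_PER_M = 30.0
OUTPUT_PRICE_PER_M = 60.0


def cost_per_document(input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD for one document."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


low = cost_per_document(8_000, 2_000)    # lightest typical application
high = cost_per_document(15_000, 4_000)  # heaviest typical application

docs_per_day = 500
print(f"per document: ${low:.2f} to ${high:.2f}")
print(f"per day:      ${low * docs_per_day:.0f} to ${high * docs_per_day:.0f}")
```

Swapping in another model's prices gives a quick estimate of what each slice of traffic would cost elsewhere.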
But here’s the thing—does every request need that horsepower?
We analyzed 10,000 documents from their production logs. The findings were stark:
| Task Type | % of Requests | Average Complexity | Best Model Fit |
|---|---|---|---|
| Simple key-value extraction (dates, numbers) | 34% | Very low | GPT-4o-mini |
| Standard table parsing | 28% | Low | GPT-4o-mini or fine-tuned BERT |
| Multi-field reconciliation | 22% | Medium | GPT-4o |
| Ambiguous or flagged documents | 16% | High | GPT-4 |
That’s right: 62% of their volume (the first two rows) could run on GPT-4o-mini, and only 16% actually needed GPT-4. Most requests were going to a model that was massive overkill.
The Solution: A Cost-Aware Multi-Model Router
We built a lightweight routing layer. It sits between the document ingestion pipeline and the LLM calls. Here’s the core concept—inspired by how load balancers work, but for intelligence cost.
The router makes its decision in three steps:
- Task Type Classification: A fast, cheap classifier (we used a tiny DistilBERT model running on CPU) predicts what kind of extraction this chunk needs.
- Complexity Estimation: Based on the task type and chunk metadata (size, number of entities, ambiguity flags), the router assigns a complexity score.
- Model Selection: A simple rule-based YAML config maps complexity thresholds to models.
We deployed this using the ECOA AI Platform ACP to orchestrate the agent calls. The ACP handled the routing logic, retries, and fallbacks across models from OpenAI, Anthropic, and one local Llama 3.1 8B instance for ultra-cheap batch jobs.
The Routing Configuration (Simplified)
Here’s a simplified version of the YAML we used in the ACP orchestrator definition:
```yaml
routes:
  - task_type: "key_value"
    max_complexity: 0.3
    model: "openai/gpt-4o-mini"
    fallback_model: "anthropic/claude-3-haiku"
    max_retries: 2
  - task_type: "table_parse"
    max_complexity: 0.4
    model: "openai/gpt-4o-mini"
    fallback_model: "openai/gpt-4o"
  - task_type: "reconciliation"
    max_complexity: 0.7
    model: "openai/gpt-4o"
    fallback_model: "anthropic/claude-3-sonnet"
  - task_type: "ambiguous"
    max_complexity: 1.0
    model: "openai/gpt-4"
    fallback_model: "anthropic/claude-3-opus"
```
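To make the lookup concrete, here is a rough Python sketch of how a router might resolve a chunk against routes like these. It is not the ACP implementation. The `Route` class, `select_route`, and the escalation rule (when the score exceeds the ceiling for the predicted task type, take the lowest-ceiling route that still fits) are our assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    task_type: str
    max_complexity: float
    model: str
    fallback_model: str


# Mirrors the YAML routes; a real router would load them from the config file.
ROUTES = [
    Route("key_value", 0.3, "openai/gpt-4o-mini", "anthropic/claude-3-haiku"),
    Route("table_parse", 0.4, "openai/gpt-4o-mini", "openai/gpt-4o"),
    Route("reconciliation", 0.7, "openai/gpt-4o", "anthropic/claude-3-sonnet"),
    Route("ambiguous", 1.0, "openai/gpt-4", "anthropic/claude-3-opus"),
]


def select_route(task_type: str, complexity: float) -> Route:
    """Pick a route for a chunk.

    Assumed semantics: use the route for the predicted task type if the
    complexity score fits under its ceiling; otherwise escalate to the
    lowest-ceiling route that does fit.
    """
    by_type = {r.task_type: r for r in ROUTES}
    route = by_type.get(task_type)
    if route is not None and complexity <= route.max_complexity:
        return route
    for candidate in sorted(ROUTES, key=lambda r: r.max_complexity):
        if complexity <= candidate.max_complexity:
            return candidate
    return ROUTES[-1]  # scores above 1.0 go to the heaviest route


print(select_route("key_value", 0.2).model)
print(select_route("key_value", 0.6).model)
```

With this rule, a key-value chunk that scores 0.6 skips GPT-4o-mini and lands on the reconciliation route's model.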
The classification step runs a small ONNX model we trained on 2,000 labeled chunks from the client’s own data. It’s not rocket science. It’s a DistilBERT sentence embedding with a logistic regression head (a single linear layer) on top. Inference time: ~8ms per chunk on a single core. Cost: negligible.
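For readers who have not built one, a logistic regression head is just a linear map followed by a softmax. The sketch below shows that inference step in plain Python. The 3-dimensional embedding and the weights are toy values we made up to keep it readable, and the label names are borrowed from the routing config. A real head takes the full embedding vector and learned weights.

```python
import math

LABELS = ["key_value", "table_parse", "reconciliation", "ambiguous"]

# Toy parameters for a 3-dimensional embedding. A real head has one row per
# label and one column per embedding dimension, learned from labeled chunks.
WEIGHTS = [
    [2.0, -1.0, 0.0],
    [0.0, 2.0, -1.0],
    [-1.0, 0.0, 2.0],
    [-1.0, -1.0, -1.0],
]
BIASES = [0.0, 0.0, 0.0, 0.5]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding):
    """Return (label, probability) for one chunk embedding."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(WEIGHTS, BIASES)
    ]
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, prob = classify([1.0, 0.1, 0.0])
print(label, round(prob, 3))
```

The probability of the winning label can double as a cheap signal for the complexity estimate, though that pairing is our suggestion rather than something the production system is known to do.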
What We Tracked
We instrumented every call with OpenTelemetry. The ECOA ACP has built-in tracing, so we could see exactly which model handled each request and its cost impact.
Key metrics after deploying to production:
- Total monthly spend dropped from $12,430 to $4,102. That’s a 67% reduction.
- Average latency per document improved by 300ms. Why? GPT-4o-mini runs faster than GPT-4.
- Accuracy on simple tasks actually went up slightly. The smaller model was fine-tuned on their specific extraction formats. It hallucinated less on edge cases.
- Complex document accuracy stayed flat. The router correctly passed the hard stuff to the heavy models.
Honestly, the latency improvement was a surprise benefit. We’d optimized for cost, but the simpler models are just quicker.
A Key Insight: The Router Must Be Fast and Cheap
The biggest risk with a routing layer is that the router itself becomes a bottleneck. If the classification step takes 500ms, you’ve eaten up your savings.
We solved this by:
- Running the classifier locally on CPU using ONNX Runtime. No API calls, no cold starts.
- Keeping the classification model tiny: DistilBERT with a single linear layer. Around 67MB total.
- Caching routing decisions for identical document types. If you see the same bank statement template 50 times, classify it once.
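The caching idea in the last bullet can be sketched in a few lines. This is an illustration under assumptions, not the deployed code. `RoutingCache`, the `layout_signature` string, and the example bank template are hypothetical, and how you fingerprint a template depends on what your OCR pipeline exposes.

```python
import hashlib
from typing import Callable, Dict


class RoutingCache:
    """Memoize routing decisions per document template."""

    def __init__(self, classify: Callable[[str], str]) -> None:
        self._classify = classify
        self._decisions: Dict[str, str] = {}
        self.classifier_calls = 0

    @staticmethod
    def template_key(layout_signature: str) -> str:
        # Hypothetical: the signature is whatever identifies a template,
        # e.g. issuer name plus the ordered list of field labels.
        return hashlib.sha256(layout_signature.encode("utf-8")).hexdigest()

    def route(self, layout_signature: str, chunk_text: str) -> str:
        key = self.template_key(layout_signature)
        if key not in self._decisions:
            # Cache miss: run the classifier once and remember the answer.
            self.classifier_calls += 1
            self._decisions[key] = self._classify(chunk_text)
        return self._decisions[key]


cache = RoutingCache(classify=lambda text: "table_parse")  # stand-in classifier
for _ in range(50):
    cache.route("acme-bank|date,description,amount,balance", "01/02 ...")
print(cache.classifier_calls)
```

Fifty statements on the same template trigger a single classifier call. The trade-off is that one wrong decision gets reused, so a cache like this needs an invalidation path when misroutes are detected.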
The average routing decision cost? 0.00004 cents in compute. You read that right.
What If Your Orchestrator Could Become Cost-Aware?
This is the question I want every dev reading this to ask themselves.
Most teams treat LLM calls as a fixed cost. They pick one model and throw everything at it. But your orchestrator doesn’t have to be dumb. You can make it cost-aware, latency-aware, and accuracy-aware with a few hundred lines of configuration.
We’re doing this now with every new client on the ECOA AI Platform. The router is a first-class concept in ACP. You define the models, the conditions, and the fallbacks. The platform handles the rest.
The Vietnamese team in Can Tho that built this integration with us was fantastic. They caught a subtle bug in the fallback logic during testing: if GPT-4o-mini returned a partial result on a reconciliation task, the router would forward it to GPT-4o anyway, wasting money. They introduced a partial result threshold—if the confidence score is below 0.85, skip the fallback and re-route to the correct model directly. Saved another 8% on that task type alone.
The Hard Part: Model Context Awareness
There’s a gotcha. The router needs to know what the chunk contains, not just its length. You can’t route purely on token count. A 50-token chunk of a messy handwritten form is harder than a 500-token chunk of a clean digital PDF.
So we added a simple metadata field: `is_structured_hint`. The OCR pipeline outputs whether the source document has structured layout (tables, labeled fields) or unstructured (free text, handwriting). Structured chunks went to the simpler models. Unstructured ones got escalated.
This one metadata flag improved routing accuracy by 12% and cut misrouted expensive calls by a third.
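One way to wire a flag like this into model selection is shown below. Treat it as a sketch: the text above only says unstructured chunks got escalated, so the one-tier bump, the `TIERS` ordering, and the `Chunk` class are our assumptions.

```python
from dataclasses import dataclass

# Model tiers from cheapest to most expensive, as named in the routing config.
TIERS = ["openai/gpt-4o-mini", "openai/gpt-4o", "openai/gpt-4"]


@dataclass
class Chunk:
    text: str
    is_structured_hint: bool  # set by the OCR pipeline


def apply_structure_hint(chunk: Chunk, proposed_model: str) -> str:
    """Escalate unstructured chunks one tier; leave structured ones alone."""
    if chunk.is_structured_hint:
        return proposed_model
    position = TIERS.index(proposed_model)
    # Already at the top tier? Stay there.
    return TIERS[min(position + 1, len(TIERS) - 1)]


clean = Chunk("Net income: 4,200", is_structured_hint=True)
messy = Chunk("net incme ~4200??", is_structured_hint=False)
print(apply_structure_hint(clean, "openai/gpt-4o-mini"))
print(apply_structure_hint(messy, "openai/gpt-4o-mini"))
```

Keeping the hint as a post-processing step means the classifier and the thresholds stay untouched when the OCR signal changes.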
Results Summary
Here’s the bottom line:
| Metric | Before | After | Improvement |
|---|---|---|---|
| Monthly API cost | $12,430 | $4,102 | -67% |
| Average document latency | 1.2s | 0.9s | -25% (300ms faster) |
| Simple task accuracy | 94.1% | 95.8% | +1.7 points |
| Complex task accuracy | 97.3% | 97.1% | -0.2 points (negligible) |
| Misrouted expensive calls | N/A | 4.3% | Acceptable |
The client’s CTO told us later: “I was afraid we’d lose accuracy on the complex stuff. We didn’t. And the savings? That’s a full senior dev salary we can reinvest.”
He’s not wrong. At our rates, that’s the cost of a senior Vietnamese developer for over a month.
Frequently Asked Questions
Q: Does the routing layer add significant latency to simple requests?
No. The DistilBERT classifier runs in ~8ms on CPU. The routing logic itself is a config lookup—under 1ms. The total overhead is negligible compared to any LLM call. For simple tasks routed to GPT-4o-mini, the end-to-end time actually *decreases* because the cheaper model is faster.
Q: How do you handle cases where the classifier misidentifies a task type?
Our config has a `fallback_model` field. If the chosen model returns a response with low confidence (below 0.8 on a simple confidence calibration), the ECOA ACP re-routes to the fallback. We also log all misclassifications weekly to retrain the classifier. It’s a closed feedback loop.
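In code, that re-route looks roughly like the sketch below. It is a simplified illustration rather than the ACP internals. `extract_with_fallback` and `fake_call` are hypothetical names, and we assume each model call returns a result together with a calibrated confidence between 0 and 1.

```python
CONFIDENCE_THRESHOLD = 0.8  # threshold quoted in the answer above


def extract_with_fallback(chunk, model, fallback_model, call_model):
    """Try the primary model; re-route to the fallback on low confidence.

    call_model(model_name, chunk) must return (result, confidence).
    Returns (result, model_used).
    """
    result, confidence = call_model(model, chunk)
    if confidence >= CONFIDENCE_THRESHOLD:
        return result, model
    result, _ = call_model(fallback_model, chunk)
    return result, fallback_model


def fake_call(model, chunk):
    # Stand-in for a real API call: the small model is unsure, the large one is not.
    confidence = 0.6 if model.endswith("mini") else 0.95
    return {"model": model, "chars": len(chunk)}, confidence


result, used = extract_with_fallback(
    "Total income: 4,200", "openai/gpt-4o-mini", "openai/gpt-4o", fake_call
)
print(used)
```

Logging `model_used` alongside the predicted task type is what produces the weekly misclassification report for retraining.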
Q: Can this work with open-source models too?
Absolutely. We initially tested a Llama 3.1 8B hosted on a single A10G for the cheap tasks. It worked, but the latency was higher than GPT-4o-mini due to GPU contention. For batch jobs where latency isn’t critical, it’s a viable option. The routing config supports any OpenAI-compatible endpoint, including vLLM or Ollama.
Q: What’s the minimum traffic volume for this to make sense?
We’ve seen it pay off at around 200 documents per day. Below that, the complexity of maintaining the routing layer and classifier doesn’t justify the savings. But for any team running 500+ LLM calls daily, the ROI is immediate and obvious.