We Cut a Legacy Fintech’s Batch Processing from 4 Hours to 12 Minutes — Here’s the Exact Architecture We Used

Let me be blunt: 4-hour batch jobs are a ticking time bomb.

I’ve seen it happen too many times. A fintech company grows fast. Their nightly batch processing window shrinks. One day, a job that used to finish by 3 AM starts bleeding into market open. The ops team panics. The CTO gets a 2 AM phone call.

That’s exactly the situation we walked into with a US-based payments startup. They processed 1.2 million transactions per day through a legacy PostgreSQL monolith. Every night, a cron job ran a series of massive SQL aggregations, reconciliation steps, and report-generation jobs. It took 4 hours and 13 minutes on average.

Some nights it failed. Some nights it took 6 hours. Every night, it cost them.

Here’s how we fixed it with a team of 4 senior Vietnamese engineers and the ECOA AI Platform, and cut that time down to 12 minutes and 40 seconds.

The Real Problem Wasn’t the Queries

When we first looked at the system, everyone blamed the SQL. “We need better indexing.” “We need faster hardware.”

Honestly? The queries weren’t even that bad. The real issue was architecture.

The system was doing everything in a single sequential pipeline. Step A had to finish before Step B could start. If Step C failed, the entire job rolled back and restarted from scratch. It was fragile, it was slow, and it was expensive.

Here’s a simplified view of what the original pipeline looked like:


1. Load raw transactions from main DB (45 min)
2. Aggregate by merchant (60 min)
3. Calculate fees and commissions (55 min)
4. Generate settlement files (40 min)
5. Send to partner banks (30 min)
6. Update ledger balances (23 min)

Each step was a single-threaded PostgreSQL function. Total: 4h 13m.
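
As a quick sanity check, the six step durations do add up to the quoted total, because the steps ran strictly one after another. The step labels below are shortened for illustration.

```python
# Step durations in minutes, taken from the pipeline list above.
STEP_MINUTES = {
    "load_raw_transactions": 45,
    "aggregate_by_merchant": 60,
    "calculate_fees": 55,
    "generate_settlement_files": 40,
    "send_to_partner_banks": 30,
    "update_ledger_balances": 23,
}

# Sequential execution means wall-clock time is the plain sum of the steps.
total = sum(STEP_MINUTES.values())
hours, minutes = divmod(total, 60)
print(f"{total} min = {hours}h {minutes}m")  # 253 min = 4h 13m
```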

The Fix: Event Streaming + Parallel Agent Orchestration

We didn’t rewrite the entire system. That’s a trap I’ve seen teams fall into — they spend 6 months rebuilding and the business dies of old age.

Instead, we took a strangler fig approach. We extracted the core processing logic into independent, parallelizable units.

Step 1: Decouple the Reads from the Writes

First, we moved transaction ingestion into Apache Kafka. Instead of querying the main DB for 45 minutes, we streamed new transactions in real time. This was a 2-week change with our Vietnamese team.

The result? Zero nightly load time. The data was already there.

Step 2: Parallelize with Stateful Agents

This is where it got interesting. We used the ECOA AI Platform’s agent orchestration to create specialized processing agents. Each agent handled one domain:

Merchant Aggregation Agent — grouped transactions by merchant ID
Fee Calculation Agent — applied pricing rules
Settlement Agent — generated bank-compatible files
Ledger Agent — updated balance records

These agents ran in parallel, not sequentially. Each one consumed from the same Kafka topic but processed independently.
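
In Kafka terms, reading one topic independently usually means each agent consumes in its own consumer group, so every agent sees every record and tracks its own offset. The sketch below imitates that with an in-memory list instead of a real broker. `InMemoryTopic`, the group names, and the record fields are made up for illustration; this is not the ECOA platform’s API.

```python
from collections import defaultdict

class InMemoryTopic:
    """Toy stand-in for one Kafka partition: an append-only log with per-group offsets."""

    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, record):
        self.log.append(record)

    def poll(self, group, max_records=100):
        """Return unread records for `group` and advance only that group's offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

topic = InMemoryTopic()
for i in range(3):
    topic.produce({"txn_id": f"t{i}", "merchant_id": "m1", "amount_cents": 1000 * (i + 1)})

# Each agent polls under its own group, so neither steals records from the other.
aggregation_batch = topic.poll("merchant-aggregation")
fee_batch = topic.poll("fee-calculation")
print(len(aggregation_batch), len(fee_batch))  # 3 3
```

If two agents shared a group name, they would split the records between them instead, which is the behavior you want for scaling one agent but not for fanning out to several.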

Here’s the key: they shared state through a Redis-backed state store. If the Fee Calculation Agent needed merchant metadata, it didn’t query the main DB — it read from a pre-warmed cache.
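
A minimal sketch of the pre-warmed lookup, assuming merchant metadata is loaded once before processing starts. A plain dict stands in for Redis here, and `MerchantCache`, `load_all_merchants`, and the field names are hypothetical.

```python
class MerchantCache:
    """Dict-backed stand-in for a Redis hash of merchant metadata."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    def warm(self, load_all_merchants):
        # One bulk read up front instead of a DB query per transaction.
        for merchant in load_all_merchants():
            self._store[merchant["merchant_id"]] = merchant

    def get(self, merchant_id):
        merchant = self._store.get(merchant_id)
        if merchant is None:
            self.misses += 1  # a miss points to a stale or incomplete warm-up
        return merchant

def load_all_merchants():
    # Hypothetical loader; in practice this would read from the main DB once.
    return [
        {"merchant_id": "m1", "fee_bps": 290},
        {"merchant_id": "m2", "fee_bps": 250},
    ]

cache = MerchantCache()
cache.warm(load_all_merchants)
print(cache.get("m1")["fee_bps"])  # 290
```

Counting misses is a cheap way to notice when the warm-up no longer covers every merchant the agents ask about.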

Step 3: Add Circuit Breakers and Retry Logic

We learned this the hard way. In the first week, one agent silently failed and we lost 2,000 transactions. Not good for a fintech.

We added:

Circuit breakers — if an agent fails 3 times in 5 minutes, it stops and alerts
Dead letter queues — failed records go to a separate topic for manual review
Idempotency keys — every transaction has a unique ID, so replaying is safe
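
The dead letter queue and idempotency key ideas can be sketched together. This is an illustration under assumptions: the in-memory set and list stand in for a durable key store and a separate Kafka topic, and `handle`, `apply_fee`, and the record fields are invented names.

```python
processed_ids = set()   # stand-in for a durable store of seen idempotency keys
dead_letters = []       # stand-in for a separate dead letter topic

def handle(txn, process_fn):
    """Process a transaction at most once; park failures for manual review."""
    if txn["txn_id"] in processed_ids:
        return "skipped"              # replayed record: safe to ignore
    try:
        process_fn(txn)
    except Exception as exc:
        dead_letters.append({"txn": txn, "error": str(exc)})
        return "dead-lettered"
    processed_ids.add(txn["txn_id"])  # mark done only after success
    return "processed"

def apply_fee(txn):
    # Hypothetical processing step that rejects obviously bad input.
    if txn["amount_cents"] < 0:
        raise ValueError("negative amount")

results = [
    handle({"txn_id": "t1", "amount_cents": 500}, apply_fee),
    handle({"txn_id": "t1", "amount_cents": 500}, apply_fee),  # replay of t1
    handle({"txn_id": "t2", "amount_cents": -1}, apply_fee),
]
print(results)  # ['processed', 'skipped', 'dead-lettered']
```

Because a failed record is never added to `processed_ids`, it can be replayed from the dead letter store after a fix without being skipped.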

python
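# Simplified agent orchestration pattern.
# CircuitBreaker is not defined in this snippet: it stands for any async context
# manager that opens after `failure_threshold` failures and stays open for
# `reset_timeout` seconds (300 s = 5 minutes).
class TransactionProcessorAgent:
    def __init__(self, name, process_fn):
        self.name = name              # agent label
        self.process_fn = process_fn  # async callable that handles one transaction
        self.circuit_breaker = CircuitBreaker(failure_threshold=3, reset_timeout=300)

    async def process(self, transaction):
        # Every call goes through the breaker, so repeated failures trip it
        # instead of passing unnoticed.
        async with self.circuit_breaker:
            return await self.process_fn(transaction)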
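
The snippet above relies on a `CircuitBreaker` class it doesn’t show. Here is one way it could look, as a minimal sketch: an async context manager that counts consecutive failures and rejects calls while open. This is an assumption about its behavior, not the platform’s implementation, and it counts consecutive failures rather than failures inside a rolling 5-minute window. `CircuitOpenError` and `demo` are illustrative names.

```python
import asyncio
import time

class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=300, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    async def __aenter__(self):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, call rejected")
            # Timeout elapsed: allow one trial call; a failure re-opens at once.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        return self

    async def __aexit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
        return False  # never swallow the original exception

async def demo():
    breaker = CircuitBreaker(failure_threshold=3, reset_timeout=300)
    outcomes = []
    for _ in range(4):
        try:
            async with breaker:
                raise ValueError("downstream failure")
        except CircuitOpenError:
            outcomes.append("rejected")
        except ValueError:
            outcomes.append("failed")
    return outcomes

print(asyncio.run(demo()))  # ['failed', 'failed', 'failed', 'rejected']
```

The alerting side is left out here; in practice the point where `opened_at` is set is where a page or a log event would go.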

The Numbers That Matter

After the migration, here’s what we measured over a 30-day period:

Metric	Before	After	Improvement
Total batch time	4h 13m	12m 40s	95% less time
Cloud compute cost (nightly)	$847	$254	70% reduction
Failed jobs per month	7	0	No failed jobs
Data loss incidents	3	0	Eliminated

But the biggest win? The ops team stopped getting paged at 3 AM. That’s hard to quantify, but it’s real.
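
The percentages in the table follow directly from the before and after figures. A short check, using only the numbers quoted above:

```python
before_seconds = 4 * 3600 + 13 * 60   # 4h 13m
after_seconds = 12 * 60 + 40          # 12m 40s
time_reduction = 1 - after_seconds / before_seconds

before_cost, after_cost = 847, 254    # nightly compute cost in USD
cost_reduction = 1 - after_cost / before_cost

print(f"time: {time_reduction:.1%}, cost: {cost_reduction:.1%}, "
      f"speedup: {before_seconds / after_seconds:.1f}x")
# time: 95.0%, cost: 70.0%, speedup: 20.0x
```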

Why a Vietnamese Team Made This Work

We could have done this with any offshore team. But the speed and quality we got from our ECOA AI developers in Ho Chi Minh City were exceptional.

Here’s what I noticed:

They asked the right questions. On day one, the lead engineer asked: “What’s the rollback strategy if Kafka goes down?” Most teams just start coding.
They owned the architecture. I didn’t micromanage. They proposed the agent-based design; I just approved it.
They used the ECOA platform to prototype fast. The AI agent orchestration let them spin up a working pipeline in 3 days. We spent the remaining time hardening it.

To be fair, we also had a senior engineer in Can Tho who had worked with Kafka for 6 years. That experience matters more than hourly rates.

The One Thing I’d Do Differently

If I could go back, I’d start with a smaller scope. We tried to migrate the entire pipeline in one sprint. It worked, but it was stressful.

Better approach: pick the slowest step (the merchant aggregation at 60 minutes), convert it to an agent, prove the pattern works, then expand.

Don’t boil the ocean. Boil the kettle.

Frequently Asked Questions

How long did the actual migration take?

The core migration took 8 weeks with a team of 4 Vietnamese engineers. The first 2 weeks were Kafka setup and data validation. Weeks 3-6 were agent development and testing. Weeks 7-8 were production cutover and monitoring.

Did you have any data loss during the cutover?

No. We ran the old and new systems in parallel for 5 days. We compared outputs transaction-by-transaction. Only when we had 100% match for 3 consecutive nights did we turn off the old pipeline.
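
A transaction-by-transaction comparison like this can be sketched as a keyed diff plus a cutover gate. The function names and the `txn_id` and `fee_cents` fields are assumptions for illustration, not the actual tooling we ran.

```python
def diff_outputs(old_rows, new_rows, key="txn_id"):
    """Return keys missing from either side and keys whose rows differ."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    return {
        "missing_in_new": sorted(old_by_key.keys() - new_by_key.keys()),
        "missing_in_old": sorted(new_by_key.keys() - old_by_key.keys()),
        "mismatched": sorted(
            k for k in old_by_key.keys() & new_by_key.keys()
            if old_by_key[k] != new_by_key[k]
        ),
    }

def ready_to_cut_over(nightly_diffs, required_clean_nights=3):
    """True when the most recent N nights all produced an empty diff."""
    recent = nightly_diffs[-required_clean_nights:]
    return len(recent) == required_clean_nights and all(
        not any(d.values()) for d in recent
    )

old = [{"txn_id": "t1", "fee_cents": 29}, {"txn_id": "t2", "fee_cents": 58}]
new = [{"txn_id": "t1", "fee_cents": 29}, {"txn_id": "t2", "fee_cents": 59}]
night_1 = diff_outputs(old, new)
clean = diff_outputs(old, old)
print(night_1["mismatched"])                              # ['t2']
print(ready_to_cut_over([night_1, clean, clean, clean]))  # True
```

Any mismatch resets the count, which is what makes "3 consecutive nights" a meaningful bar rather than 3 good nights out of 5.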

What was the biggest technical challenge?

Handling late-arriving transactions. Some partner banks send transaction data up to 24 hours after the actual event. Our original batch job just ignored them. The new system had to handle out-of-order events gracefully. We processed records by their Kafka event timestamps rather than by arrival order, with a 12-hour grace window for late arrivals.
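
To make the grace-window idea concrete, here is a rough sketch that assumes daily UTC batch windows. It is plain Python rather than any Kafka API, and `window_end` and `route` are hypothetical names. Note that in this sketch a record arriving a full 24 hours after its event misses a 12-hour grace window and needs a separate path.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=12)  # grace window quoted in the answer above

def window_end(event_time):
    """End of the daily (UTC) batch window that contains event_time (assumed layout)."""
    day_start = event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return day_start + timedelta(days=1)

def route(event_time, arrival_time, grace=GRACE):
    """Decide whether a record can still join its original daily window."""
    if arrival_time <= window_end(event_time) + grace:
        return "original-window"
    return "late"  # needs separate handling, e.g. an adjustment in a later run

utc = timezone.utc
event = datetime(2024, 1, 1, 22, 0, tzinfo=utc)
print(route(event, datetime(2024, 1, 2, 9, 0, tzinfo=utc)))   # original-window
print(route(event, datetime(2024, 1, 2, 22, 0, tzinfo=utc)))  # late
```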

Could this work for non-fintech systems?

Absolutely. The same pattern applies to any system with heavy nightly batch processing — logistics, healthcare claims, e-commerce order reconciliation. The key is identifying which steps can run in parallel and which need sequential ordering.
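
One way to find out which steps can run in parallel is to write the dependencies down and let a topological sort group them into stages. The job graph below is a made-up example, not our pipeline; `graphlib` is in the Python standard library from 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each step maps to the steps it depends on.
DEPENDS_ON = {
    "load_orders": set(),
    "load_payments": set(),
    "match_payments": {"load_orders", "load_payments"},
    "build_report": {"match_payments"},
    "update_balances": {"match_payments"},
}

def parallel_stages(depends_on):
    """Group steps into stages; steps in the same stage do not depend on each other."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages

for stage in parallel_stages(DEPENDS_ON):
    print(stage)
# ['load_orders', 'load_payments']
# ['match_payments']
# ['build_report', 'update_balances']
```

Each printed stage is a set of steps that can run side by side; the stage boundaries are where sequential ordering is still required.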
