We Cut a Legacy Fintech’s Batch Processing from 4 Hours to 12 Minutes — Here’s the Exact Architecture We Used
I hate batch jobs.
Not because they’re technically hard. But because they’re a confession. A big, blinking sign that says: *”We couldn’t figure out real-time, so we just throw everything in a cron job at 2 AM and pray.”*
How to Build Reliable AI Agent Pipelines (Without Losing Your Mind)
TL;DR: Building reliable AI agent pipelines means more than just chaining LLM calls. You need proper error handling,… ...
Recently, a fintech client in the US came to us with exactly that problem. Their nightly batch processing pipeline — handling transaction reconciliation, fraud scoring, and ledger updates — was taking 4 hours and 17 minutes on average. Some nights it hit 6 hours. Their ops team was basically on fire watch every single night.
The worst part? The batch was processing less than 2 million transactions per run. That’s not a big data problem. That’s an architecture problem.
Why Top CTOs Hire Vietnamese Developers: A Cost-Effective Tech Talent Strategy
TL;DR: Vietnam is rapidly becoming a top destination for offshore software development. Developers here combine strong technical skills… ...
We rebuilt the entire pipeline with a team of 4 senior developers based in Can Tho, Vietnam, using the ECOA AI Platform ACP for agent orchestration. The result: 12 minutes and 32 seconds end-to-end. No data loss. No missed transactions.
Here’s exactly how we did it. No fluff. Just the architecture.
The Original Architecture (The Problem)
The legacy system was a textbook example of “it worked in 2015, so we never touched it.”
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ PostgreSQL │────▶│ Monolith │────▶│ CSV Export │
│ (Source) │ │ (Ruby on │ │ (S3 Bucket) │
│ │ │ Rails) │ │ │
└─────────────┘ └──────────────┘ └──────────────┘
│
▼
┌──────────────┐
│ Batch Job │
│ (Sidekiq + │
│ Cron) │
└──────────────┘
│
▼
┌──────────────┐
│ PostgreSQL │
│ (Target) │
└──────────────┘
The batch job was a single Sidekiq worker that:
- Pulled all unprocessed transactions from the source DB
- Ran fraud scoring (a Ruby gem that made HTTP calls to a Python ML service)
- Updated the ledger
- Exported a reconciliation report
The bottleneck: The fraud scoring step was synchronous and sequential. Each transaction took ~200ms to score. For 2 million transactions, that’s 400,000 seconds of wall time. The math doesn’t lie.
The New Architecture (The Fix)
We didn’t just “optimize the query.” We ripped out the entire pipeline and replaced it with an event-driven, multi-agent system.
Here’s the architecture we landed on:
┌─────────────┐ ┌──────────────┐ ┌──────────────────┐
│ PostgreSQL │────▶│ Debezium │────▶│ Apache Kafka │
│ (Source) │ │ (CDC) │ │ (Event Stream) │
└─────────────┘ └──────────────┘ └──────────────────┘
│
▼
┌──────────────────┐
│ ECOA AI Agent │
│ Orchestrator │
└──────────────────┘
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ Fraud │ │ Ledger │ │ Recon │
│ Agent │ │ Agent │ │ Agent │
└────────┘ └────────┘ └────────┘
│ │ │
▼ ▼ ▼
┌──────────────────┐
│ PostgreSQL │
│ (Target) │
└──────────────────┘
Key Components
1. Change Data Capture (CDC) with Debezium
Instead of polling the database every 5 minutes, we used Debezium to stream every new transaction as it hit the source DB. This alone cut the data ingestion latency from minutes to milliseconds.
yaml
# debezium-connector.yaml
{
"name": "transactions-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "source-db.internal",
"database.port": "5432",
"database.user": "debezium",
"database.password": "***",
"database.dbname": "fintech_prod",
"database.server.name": "fintech_source",
"table.include.list": "public.transactions",
"plugin.name": "pgoutput",
"slot.name": "debezium_slot",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false"
}
}
2. Apache Kafka as the Event Backbone
We used a 3-node Kafka cluster (each node with 8 vCPUs, 32GB RAM). The key config change that made a difference: we tuned the producer batch size and linger time.
java
// Producer config for throughput
props.put("batch.size", 262144); // 256KB batches
props.put("linger.ms", 5); // Wait up to 5ms for batching
props.put("compression.type", "snappy"); // 40% smaller on the wire
props.put("acks", "1"); // Leader ack only (we could tolerate rare loss)
This gave us ~8,500 messages/second throughput on the transaction topic. More than enough for their 2 million transactions per night.
3. ECOA AI Agent Orchestration
Here’s where the real magic happened. Instead of a monolithic Sidekiq worker, we decomposed the processing into 3 specialized AI agents orchestrated by the ECOA AI Platform ACP:
- Fraud Scoring Agent: Receives raw transaction events, enriches them with historical data, calls the ML model endpoint, and emits scored events.
- Ledger Update Agent: Consumes scored transactions, applies double-entry accounting rules, and writes to the ledger table.
- Reconciliation Agent: Runs at the end of the batch window, aggregates all processed transactions, and generates the reconciliation report.
Each agent ran as an independent Python service with a dedicated event consumer. The orchestrator managed the DAG of dependencies — the reconciliation agent couldn’t start until the ledger agent had finished all transactions.
The throughput gain: Because the fraud scoring and ledger update agents ran in parallel (within the constraints of per-transaction ordering), we effectively removed the sequential bottleneck. Instead of 200ms per transaction * 2 million, we had 200ms per transaction * (2 million / 8 parallel consumers) = ~50 seconds of wall time for the slowest step.
The Performance Numbers
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total batch time | 4h 17m | 12m 32s | 95.1% faster |
| Data ingestion latency | 5 min (poll) | <100ms (CDC) | 99.97% faster |
| Max throughput | 130 tx/sec | 2,650 tx/sec | 20.4x |
| CPU utilization | 35% (spiky) | 78% (sustained) | 2.2x better |
| Memory usage | 64GB (monolith) | 24GB (distributed) | 62.5% less |
| Error rate | 2.3% | 0.04% | 98.3% fewer errors |
The error rate drop surprised even us. Turns out, when you’re not running a 4-hour job that’s a single point of failure, partial failures don’t cascade into total failures.
The Difficult Lessons
Let me be honest: this wasn’t a smooth ride. We hit three major issues:
1. Event Ordering Hell
Transactions have dependencies. A refund can’t be processed before the original charge. We initially used Kafka’s default partitioning (hash of transaction ID), but refunds and charges with the same ID ended up in different partitions.
Fix: We implemented a custom partitioner that routed all transactions for the same payment method to the same partition. This ensured ordering within a payment method while still allowing parallelism across methods.
java
public class PaymentMethodPartitioner implements Partitioner {
@Override
public int partition(String topic, Object key, byte[] keyBytes,
Object value, byte[] valueBytes, Cluster cluster) {
String transactionJson = new String(valueBytes);
String paymentMethod = extractPaymentMethod(transactionJson);
int partitionCount = cluster.partitionCountForTopic(topic);
return Math.abs(paymentMethod.hashCode()) % partitionCount;
}
}
2. The ECOA AI Agent Cold Start Problem
When we scaled up the fraud scoring agent from 4 to 8 consumers, the new consumers took ~45 seconds to initialize their ML model caches. During that time, they’d reject events.
Fix: We added a pre-warming phase in the ECOA orchestrator. Before scaling, the orchestrator would send a batch of historical transactions to the new consumers to warm their caches. Then it would start feeding them live data.
3. The Reconciliation Race Condition
The reconciliation agent needed to know when *all* transactions from a given batch window had been processed. We initially used a simple “wait for Kafka lag to reach 0” approach. But that was fragile — if a consumer was slow, the reconciliation would start prematurely.
Fix: We implemented a watermark-based approach. Each transaction event included a timestamp. The reconciliation agent tracked the maximum timestamp it had seen from each partition. When all partitions had processed events beyond the batch window boundary, it started reconciliation.
python
# Watermark tracker in the reconciliation agent
class WatermarkTracker:
def __init__(self, partitions):
self.watermarks = {p: 0 for p in partitions}
self.batch_deadline = None
def update_watermark(self, partition, timestamp):
self.watermarks[partition] = max(self.watermarks[partition], timestamp)
if self.batch_deadline and all(w >= self.batch_deadline for w in self.watermarks.values()):
self.trigger_reconciliation()
def set_batch_deadline(self, deadline):
self.batch_deadline = deadline
The Team Behind the Architecture
This wasn’t a solo effort. We had 4 senior engineers from our Can Tho hub working on this:
- 2 backend engineers (Go and Python) — built the Kafka consumers and agents
- 1 data engineer — tuned the CDC pipeline and PostgreSQL performance
- 1 DevOps engineer — handled the Kubernetes deployment and monitoring
The total cost for this team? $12,000/month (4 seniors at $3,000/month each). The client was paying $25,000/month for a single contractor in the US who couldn’t even get the batch job under 3 hours.
More importantly, the Vietnamese team didn’t just implement our spec. They pushed back on the architecture. The lead engineer, Huy, argued against our initial plan of using Redis for event buffering. “Kafka is the right tool here,” he said. “Redis will lose data if we have a crash.” He was right. We changed the design.
That’s the kind of pushback you want from a team. Not “yes, sir” — but “here’s why you’re wrong.”
What We’d Do Differently
If I had to do this again, I’d make two changes:
- Use Kafka Streams instead of raw consumers. We ended up writing a lot of boilerplate for state management and exactly-once semantics. Kafka Streams would have handled that for us.
- Start with the ECOA AI Platform from day one. We initially tried to orchestrate the agents manually with a Python script. It worked for about 3 days, then we hit a deadlock situation that took 6 hours to debug. The ECOA orchestrator’s built-in deadlock detection would have caught it immediately.
The Bottom Line
Batch processing doesn’t have to be slow. It doesn’t have to be a nightly nightmare. If your batch job takes more than 30 minutes, you have an architecture problem — not a data problem.
The fix isn’t harder. It’s different. Event-driven. Distributed. Orchestrated.
And you don’t need a team in San Francisco to build it. You need engineers who understand distributed systems, who aren’t afraid to challenge your assumptions, and who cost a fraction of what you’re paying now.
We found that team in Vietnam. Specifically, in Can Tho. And honestly? I’d do it again in a heartbeat.
—
Frequently Asked Questions
Can this architecture handle real-time processing instead of batch?
Yes. The same event-driven pipeline can process transactions in real-time with minimal changes. You’d just remove the batch window and reconciliation agent, and let the fraud and ledger agents process events as they arrive. We’ve since migrated this client to a hybrid model — real-time fraud scoring with nightly reconciliation.
What’s the minimum transaction volume where this makes sense?
Based on our benchmarks, the event-driven architecture starts beating a monolithic batch job at around 500,000 transactions per run. Below that, the overhead of Kafka and CDC might not be worth it. But if you’re processing millions, the math works.
How did the ECOA AI Platform ACP handle agent failures?
The orchestrator maintained a heartbeat for each agent. If an agent failed, it would restart it and replay the unprocessed events from the Kafka offset. We also configured a circuit breaker — if an agent failed more than 3 times in 5 minutes, the orchestrator would route the events to a dead letter queue for manual inspection.
What’s the cost of running this architecture vs the old monolith?
The new architecture costs about $1,200/month in infrastructure (Kafka cluster, Kubernetes nodes, PostgreSQL). The old monolith cost $800/month. But the team cost dropped from $25,000/month (US contractor) to $12,000/month (Vietnamese team). Net savings: $11,600/month — plus no more 4 AM fire drills.
Related reading: Vietnam Outsourcing: Why Smart CTOs Are Ditching India for Southeast Asia’s Tech Hub
Related reading: Outsourcing Software Development: The Real Playbook for CTOs in 2025