How We Built a Real-Time Data Enrichment Pipeline for a Martech Startup in 3 Weeks — A Vietnam Offshore Case Study
Let me be blunt: most “real-time” pipelines are lies.
They’re batch jobs running on a cron schedule, pretending to be real-time. I’ve seen it a hundred times.
But this time was different. A Martech startup in San Francisco came to us with a real problem. They had 50 million customer records sitting in a mix of PostgreSQL, MongoDB, and S3 buckets. Every time a user visited their site, they needed to enrich that profile with firmographic data, social signals, and behavioral scores.
The old way? A nightly batch job that took 6 hours to complete.
By the time the data was fresh, it was already stale.
Here’s the exact architecture we built with a team of 4 senior Vietnamese engineers from our Ho Chi Minh City hub. We cut that 6-hour window to 18 minutes. No exaggeration.
The Problem: Stale Data Was Costing Them Revenue
The startup was losing deals because their sales team was looking at 6-hour-old data.
A prospect would visit their pricing page. The enrichment engine would fire. But the data would arrive 6 hours later, long after the prospect had moved on.
The numbers were brutal:
- 34% of enrichment jobs failed silently due to API rate limits
- 60% of their S3 data was orphaned (no one knew it existed)
- Average enrichment latency: 6 hours
They needed real-time. Not “near real-time.” Real-time.
Why We Chose a Vietnamese Team for This
Look, I could have hired locally in SF for $200k/year per engineer. But that’s not how we work.
We hired from Can Tho and Ho Chi Minh City. Why?
Because Vietnamese engineers don’t just write code. They *optimize*. They’ve been through the wringer with legacy systems. They know what production looks like.
More importantly, they’re cost-effective. Our senior developers cost $3,000/month. That’s a fraction of what you’d pay in the US. And they’re vetted through the ECOA AI Platform ACP, which gives them a 5x efficiency boost.
But enough about the team. Let’s talk about the pipeline.
The Architecture: Kafka + Flink + Redis + PostgreSQL
Here’s what we built:
User Event -> Kafka -> Flink Stream Processor -> Redis Cache -> PostgreSQL (enriched) -> API Response
Wait, that’s too simple. Let me break it down.
We used:
- Kafka as the event backbone (handles 10,000 events/second easily)
- Flink for stream processing (it is stateful and supports exactly-once processing via checkpoints)
- Redis for caching (we cached firmographic lookups to avoid hitting APIs)
- PostgreSQL as the source of truth
The key insight? We didn’t enrich everything.
We enriched *what mattered*.
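The rule for deciding what mattered isn't spelled out here, so the following is only an illustrative sketch. It assumes a hypothetical `page` field on each event and a hypothetical set of high-intent pages (the pricing page is the one example mentioned earlier). None of these names come from the actual pipeline.

```python
# Hypothetical rule: only events from high-intent pages are worth enriching.
# The page names and the 'page' field are assumptions for illustration.
HIGH_INTENT_PAGES = {"/pricing", "/demo", "/contact-sales"}


def should_enrich(event):
    """Return True if the event is worth sending to the enrichment step."""
    # Events without a domain cannot be matched to firmographic data at all.
    if not event.get("domain"):
        return False
    return event.get("page") in HIGH_INTENT_PAGES


events = [
    {"domain": "acme.com", "page": "/pricing"},
    {"domain": "acme.com", "page": "/blog/post-1"},
    {"domain": "", "page": "/pricing"},
]
to_enrich = [e for e in events if should_enrich(e)]
print(len(to_enrich), "of", len(events), "events selected for enrichment")
```

A filter like this would sit in front of the enrichment call, so low-value events never consume API quota or cache space.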
The Enrichment Logic
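```python
import json
import time

import redis

# Connection details are placeholders; adjust for your environment.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


def enrich_event(event):
    """Return firmographic data for the event's domain, checking Redis first."""
    # Check cache first
    cache_key = f"enrich:{event['domain']}"
    cached = cache.get(cache_key)
    if cached:
        # Values are stored as JSON, so decode to match the uncached return type
        return json.loads(cached)

    # If not cached, hit the API (fetch_firmographic_data is defined elsewhere)
    data = fetch_firmographic_data(event['domain'])
    if data.get('status') == 'rate_limited':
        # Back off and retry once
        time.sleep(2)
        data = fetch_firmographic_data(event['domain'])

    # Cache for 1 hour, but never cache a rate-limited response
    if data.get('status') != 'rate_limited':
        cache.setex(cache_key, 3600, json.dumps(data))
    return data
```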
Simple, right? But it works.
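The snippet above sleeps for a fixed 2 seconds and retries once. A common alternative under heavier rate limiting is exponential backoff with jitter. The sketch below is a swapped-in technique, not what the pipeline shipped with: `fetch_with_backoff`, `max_attempts`, and `base_delay` are hypothetical names, and `fetch` stands in for `fetch_firmographic_data`.

```python
import random
import time


def fetch_with_backoff(fetch, domain, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(domain), retrying with exponential backoff while rate limited.

    `fetch` stands in for fetch_firmographic_data; all parameters are illustrative.
    """
    data = fetch(domain)
    for attempt in range(1, max_attempts):
        if data.get("status") != "rate_limited":
            return data
        # Delay doubles each attempt: base, 2*base, 4*base, ... plus random jitter
        delay = base_delay * (2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay / 2))
        data = fetch(domain)
    # May still be rate limited; the caller decides whether to cache or drop it.
    return data


# Usage with a fake API that is rate limited twice, then succeeds.
responses = iter([
    {"status": "rate_limited"},
    {"status": "rate_limited"},
    {"status": "ok", "employees": 120},
])
result = fetch_with_backoff(lambda d: next(responses), "acme.com", sleep=lambda s: None)
print(result)
```

Passing `sleep` as a parameter keeps the function testable without real waiting. The jitter spreads out retries so that many workers hitting the same limit do not all retry at the same instant.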
The cache hit rate was 78%. That means 78% of our enrichment requests never hit the external API. We saved thousands of dollars in API costs.
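A hit rate like this can be measured by counting hits and misses around the cache lookup. The sketch below uses an in-memory dictionary with a TTL as a stand-in for Redis. The `CountingCache` class and its method names are hypothetical and only show the bookkeeping.

```python
import time


class CountingCache:
    """In-memory TTL cache that tracks hits and misses (a stand-in for Redis)."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        entry = self.store.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        # Missing or expired: load the value and cache it with a fresh TTL
        self.misses += 1
        value = loader(key)
        self.store[key] = (self.clock() + self.ttl, value)
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = CountingCache()
for domain in ["acme.com", "acme.com", "globex.com", "acme.com"]:
    cache.get_or_load(domain, lambda d: {"domain": d})
print(f"hit rate: {cache.hit_rate():.0%}")
```

In production the counters would more likely be exported as metrics than kept on the object, but the ratio is computed the same way.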
The Results: 18 Minutes vs 6 Hours
Here’s what we tracked:
| Metric | Before | After |
|---|---|---|
| Processing time | 6 hours | 18 minutes |
| API failure rate | 34% | 2% |
| Cache hit rate | 0% | 78% |
| Cost per month | $12,000 | $3,500 |
We cut costs by 70%. That’s not a typo.
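The percentages follow directly from the table. A quick check of the arithmetic, using only the figures above:

```python
def percent_reduction(before, after):
    """Percentage drop from before to after."""
    return (before - after) / before * 100


cost_cut = percent_reduction(12_000, 3_500)  # monthly cost in USD
time_cut = percent_reduction(6 * 60, 18)     # processing time in minutes
speedup = (6 * 60) / 18                      # how many times faster

print(f"cost reduction: {cost_cut:.1f}%")
print(f"processing time reduction: {time_cut:.1f}%")
print(f"speedup: {speedup:.0f}x")
```

The cost figure comes out just under 71%, and going from 6 hours to 18 minutes is a 20x speedup.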
The Hard Part: Exactly-Once Semantics
The hardest part was ensuring exactly-once processing. Flink handles this natively, but you need to configure it right.
Here’s the config that saved us:
```yaml
pipeline:
  checkpointing:
    interval: 60000  # 1 minute
    mode: EXACTLY_ONCE
  state:
    backend: rocksdb
    ttl: 3600000  # 1 hour
```
Without this, we’d have duplicate events. And duplicate events mean bad data.
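To see why checkpointing prevents duplicates, here is a toy simulation in plain Python. It is not Flink code, and every name in it is hypothetical. The idea it illustrates is that the operator state and the source read position are snapshotted together, so after a failure both roll back to the same point and the replayed events are counted exactly once.

```python
import copy


def run_with_checkpoints(events, checkpoint_every, crash_at=None):
    """Count events per domain, snapshotting (offset, state) together.

    If crash_at is set, processing fails once just before that offset and
    resumes from the last checkpoint, replaying the events after it.
    """
    state = {}            # running count per domain
    checkpoint = (0, {})  # (next offset to read, copy of state)
    offset = 0
    crashed = False
    while offset < len(events):
        if crash_at is not None and offset == crash_at and not crashed:
            # Simulated failure: discard in-flight work and roll back both
            # the state and the read position to the last checkpoint.
            crashed = True
            offset, saved = checkpoint
            state = copy.deepcopy(saved)
            continue
        domain = events[offset]
        state[domain] = state.get(domain, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(state))
    return state


events = ["acme.com", "globex.com", "acme.com", "initech.com", "acme.com"]
clean = run_with_checkpoints(events, checkpoint_every=2)
recovered = run_with_checkpoints(events, checkpoint_every=2, crash_at=3)
print(clean == recovered)
```

If the read position were rolled back but the state were not, the replayed event would be counted twice. This guarantee covers state inside the stream processor. Writes to external systems additionally need a transactional or idempotent sink.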
What We Learned
Three things:
- Cache everything you can. We cached firmographic data, social signals, and behavioral scores. The cache hit rate saved us.
- Don’t over-engineer. We started with a simple Kafka -> Flink pipeline. It worked. We added complexity only when needed.
- Vietnamese engineers are fast. Our team in Ho Chi Minh City built this in 3 weeks. We estimate a US-based team would have taken 6 weeks.
The Real Secret: ECOA AI Platform ACP
Our developers use the ECOA AI Platform ACP. It’s an AI agent orchestration platform that automates 80% of the boilerplate.
Here’s how it works:
- The ACP generates the Kafka producer code
- It handles the Flink stream processing logic
- It even writes the Redis caching layer
Our developers just review the code. They don’t write it from scratch.
That’s how we achieved 5x efficiency.
Frequently Asked Questions
Q: How do you handle data consistency in a real-time pipeline?
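A: Use Flink’s exactly-once semantics with checkpointing. We configured checkpointing every 60 seconds. If a node fails, Flink restores its state from the last checkpoint and replays events from that point, so nothing is lost or double-counted in Flink’s state. For end-to-end exactly-once, the sink also has to be transactional or idempotent.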
Q: What’s the biggest mistake companies make with real-time enrichment?
A: Not caching. Most companies hit the API for every single event. That’s expensive and slow. Cache firmographic data for 1 hour. At our 78% hit rate, that kept 78% of lookups away from the external API.
Q: How do you scale this pipeline?
A: Add more Kafka partitions. Flink scales horizontally. We started with 3 partitions. Now we have 12. Each partition handles 1,000 events/second.
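As a rough illustration of that answer, the sketch below maps a key to a partition with a stable hash and computes the throughput ceiling from the per-partition figure quoted above. The function names are hypothetical, and CRC32 is only a stand-in: Kafka's default partitioner uses a different hash (murmur2).

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def total_capacity(num_partitions, events_per_partition_per_sec=1_000):
    """Rough ceiling on throughput if every partition is evenly loaded."""
    return num_partitions * events_per_partition_per_sec


for n in (3, 12):
    print(n, "partitions ->", total_capacity(n), "events/second")

# The same domain always lands on the same partition for a fixed partition count
print(partition_for("acme.com", 12) == partition_for("acme.com", 12))
```

One caveat with any modulo-based scheme: when the partition count changes, a given key can move to a different partition, so per-key ordering only holds within a fixed partition count.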
Q: Why hire Vietnamese developers for this?
A: Cost. A senior Flink developer in the US costs $200k/year. In Vietnam, it’s $36k/year. And they’re just as good. We’ve been doing this for 5 years. No quality issues.