How We Built a Real-Time Data Enrichment Pipeline for a Martech Startup in 3 Weeks — A Vietnam Offshore Case Study
Let me be blunt: most “real-time” pipelines are lies.
They’re batch jobs running on a cron schedule, pretending to be real-time. I’ve seen it a hundred times.
But this time was different. A Martech startup in San Francisco came to us with a real problem. They had 50 million customer records sitting in a mix of PostgreSQL, MongoDB, and S3 buckets. Every time a user visited their site, they needed to enrich that profile with firmographic data, social signals, and behavioral scores.
The old way? A nightly batch job that took 6 hours to complete.
By the time the data was fresh, it was already stale.
Here’s the exact architecture we built with a team of 4 senior Vietnamese engineers from our Ho Chi Minh City hub. We cut that 6-hour window to 18 minutes. No exaggeration.
The Problem: Stale Data Was Costing Them Revenue
The startup was losing deals because their sales team was looking at 6-hour-old data.
A prospect would visit their pricing page. The enrichment engine would fire. But the data would arrive 6 hours later, long after the prospect had moved on.
The numbers were brutal:
- 34% of enrichment jobs failed silently due to API rate limits
- 60% of their S3 data was orphaned (no one knew it existed)
- Average enrichment latency: 6 hours
They needed real-time. Not “near real-time.” Real-time.
Why We Chose a Vietnamese Team for This
Look, I could have hired locally in SF for $200k/year per engineer. But that’s not how we work.
We hired from Can Tho and Ho Chi Minh City. Why?
Because Vietnamese engineers don’t just write code. They *optimize*. They’ve been through the wringer with legacy systems. They know what production looks like.
More importantly, they’re cost-effective. Our senior developers cost $3,000/month. That’s a fraction of what you’d pay in the US. And they’re vetted through the ECOA AI Platform ACP, which gives them a 5x efficiency boost.
But enough about the team. Let’s talk about the pipeline.
The Architecture: Kafka + Flink + Redis + PostgreSQL
Here’s what we built:
User Event -> Kafka -> Flink Stream Processor -> Redis Cache -> PostgreSQL (enriched) -> API Response
Wait, that’s too simple. Let me break it down.
We used:
- Kafka as the event backbone (handles 10,000 events/second easily)
- Flink for stream processing (because it’s stateful and handles exactly-once semantics)
- Redis for caching (we cached firmographic lookups to avoid hitting APIs)
- PostgreSQL as the source of truth
The key insight? We didn’t enrich everything.
We enriched *what mattered*.
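The post doesn’t spell out which events counted as “what mattered,” so here’s one possible reading as a minimal sketch. It assumes events carry a `domain` and a `path` field, and that only high-intent pages such as the pricing page are worth an enrichment call. The `HIGH_INTENT_PATHS` set and the `should_enrich` helper are hypothetical names for illustration, not the filter we shipped.

```python
from typing import Any

# Hypothetical set of high-intent pages; the article only mentions the pricing page.
HIGH_INTENT_PATHS = {"/pricing"}


def should_enrich(event: dict[str, Any]) -> bool:
    """Return True only for events worth spending an enrichment call on."""
    domain = event.get("domain")
    if not domain:
        return False  # nothing to look up without a company domain
    return event.get("path") in HIGH_INTENT_PATHS


# Illustrative events only; the field names are assumptions.
events = [
    {"domain": "example.com", "path": "/pricing"},
    {"domain": "example.com", "path": "/blog/post-1"},
    {"domain": None, "path": "/pricing"},
]
to_enrich = [e for e in events if should_enrich(e)]
```

Filtering before the cache lookup keeps low-value traffic out of both Redis and the external API.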
The Enrichment Logic
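```python
import json
import time

import redis

# Assumes a reachable Redis instance; fetch_firmographic_data is defined elsewhere.
redis_client = redis.Redis(decode_responses=True)


def enrich_event(event):
    # Check cache first
    cache_key = f"enrich:{event['domain']}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)  # decode so both paths return a dict

    # If not cached, hit the API
    data = fetch_firmographic_data(event['domain'])
    if data.get('status') == 'rate_limited':
        # Back off and retry once
        time.sleep(2)
        data = fetch_firmographic_data(event['domain'])

    # Cache for 1 hour, but never cache a rate-limited response
    if data.get('status') != 'rate_limited':
        redis_client.setex(cache_key, 3600, json.dumps(data))
    return data
```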
Simple, right? But it works.
The cache hit rate was 78%. That means 78% of our enrichment requests never hit the external API. We saved thousands of dollars in API costs.
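The enrichment function retries once after a fixed two-second sleep. If you want something sturdier for the requests that do reach the API, here’s a minimal sketch of exponential backoff with jitter. The `fetch_with_backoff` name, its parameters, and the injectable `sleep` argument are assumptions for illustration; it expects the same `{'status': 'rate_limited'}` response shape used above.

```python
import random
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[str], dict],
    domain: str,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(domain), retrying with exponential backoff while rate limited."""
    data = fetch(domain)
    for attempt in range(1, max_attempts):
        if data.get("status") != "rate_limited":
            return data
        # Delay doubles each retry; jitter spreads out callers that retry together.
        delay = base_delay * (2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay))
        data = fetch(domain)
    return data  # may still be rate limited; the caller decides what to do


# Usage with a stand-in fetcher that is rate limited twice, then succeeds.
responses = iter([
    {"status": "rate_limited"},
    {"status": "rate_limited"},
    {"status": "ok", "employees": 120},
])
delays = []
result = fetch_with_backoff(lambda d: next(responses), "example.com", sleep=delays.append)
```

Passing `sleep` in as a parameter keeps the function testable without actually waiting.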
The Results: 18 Minutes vs 6 Hours
Here’s what we tracked:
| Metric | Before | After |
|---|---|---|
| Processing time | 6 hours | 18 minutes |
| API failure rate | 34% | 2% |
| Cache hit rate | 0% | 78% |
| Cost per month | $12,000 | $3,500 |
We cut costs by 70%. That’s not a typo.
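If you want to check the math, here’s the arithmetic behind the table as a quick sketch. The `percent_reduction` helper is just an illustrative name; the inputs are the table’s own numbers.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from before to after."""
    return (before - after) / before * 100


time_cut = percent_reduction(6 * 60, 18)     # processing time, in minutes
cost_cut = percent_reduction(12_000, 3_500)  # cost, in dollars per month
speedup = (6 * 60) / 18                      # how many times faster
```

That works out to a 95% cut in processing time (a 20x speedup) and a cost reduction of about 70.8%, which rounds to the 70% quoted above.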
The Hard Part: Exactly-Once Semantics
The hardest part was ensuring exactly-once processing. Flink handles this natively, but you need to configure it right.
Here’s the config that saved us:
```yaml
pipeline:
  checkpointing:
    interval: 60000 # 1 minute
    mode: EXACTLY_ONCE
  state:
    backend: rocksdb
    ttl: 3600000 # 1 hour
```
Without this, we’d have duplicate events. And duplicate events mean bad data.
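To see why checkpointing prevents double counting, here’s a toy model in plain Python. It is not Flink code and it skips everything that makes Flink’s distributed snapshots hard; it only shows the core idea, assuming the state and the input offset are saved together and rolled back together. The `run` function and its parameters are made up for this illustration.

```python
import copy


def run(events, checkpoint_every, crash_at=None):
    """Count events per domain, snapshotting (offset, state) together.

    Simplified model: on a crash, state and input offset are both rolled back
    to the last snapshot, so replayed events are not counted twice.
    """
    state = {}
    checkpoint = (0, {})  # (next offset to read, copy of state)
    offset = 0
    crashed = False
    while offset < len(events):
        if crash_at is not None and offset == crash_at and not crashed:
            # Simulated failure: restore both pieces from the last checkpoint.
            crashed = True
            offset, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue
        domain = events[offset]
        state[domain] = state.get(domain, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(state))
    return state


events = ["a.com", "b.com", "a.com", "c.com", "a.com"]
clean = run(events, checkpoint_every=2)
recovered = run(events, checkpoint_every=2, crash_at=3)
```

Both runs end with the same counts, because the event processed between the last checkpoint and the crash is discarded from state and then replayed exactly once.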
What We Learned
Three things:
- Cache everything you can. We cached firmographic data, social signals, and behavioral scores. The cache hit rate saved us.
- Don’t over-engineer. We started with a simple Kafka -> Flink pipeline. It worked. We added complexity only when needed.
- Vietnamese engineers are fast. Our team in Ho Chi Minh City built this in 3 weeks. A US-based team would have taken 6 weeks.
The Real Secret: ECOA AI Platform ACP
Our developers use the ECOA AI Platform ACP. It’s an AI agent orchestration platform that automates 80% of the boilerplate.
Here’s how it works:
- The ACP generates the Kafka producer code
- It handles the Flink stream processing logic
- It even writes the Redis caching layer
Our developers just review the code. They don’t write it from scratch.
That’s how we achieved 5x efficiency.
Frequently Asked Questions
Q: How do you handle data consistency in a real-time pipeline?
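A: Use Flink’s exactly-once semantics with checkpointing. We configured checkpointing every 60 seconds. If a node fails, Flink restores its state from the last checkpoint and replays the Kafka events that came after it, so nothing is lost and nothing is counted twice in Flink’s state. Writes to external systems still need to be idempotent or transactional to stay duplicate-free.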
Q: What’s the biggest mistake companies make with real-time enrichment?
A: Not caching. Most companies hit the API for every single event. That’s expensive and slow. Cache firmographic data for 1 hour. In our case, a 78% cache hit rate meant 78% of enrichment requests never reached the external API.
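Here’s a minimal sketch of a one-hour cache that also tracks its own hit rate, using an in-memory dict as a stand-in for Redis. The `TTLCache` class and its method names are assumptions for illustration, not part of our pipeline.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """In-memory stand-in for the Redis cache that also counts hits and misses."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store: dict[str, tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[Any]:
        entry = self.store.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        self.misses += 1  # absent or expired
        return None

    def set(self, key: str, value: Any) -> None:
        self.store[key] = (self.clock() + self.ttl, value)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = TTLCache(ttl_seconds=3600)
for domain in ["a.com", "b.com", "a.com", "a.com", "b.com"]:
    if cache.get(domain) is None:
        cache.set(domain, {"domain": domain})  # pretend this came from the API
```

In this toy run, two lookups miss and three hit, so `cache.hit_rate` is 0.6. Tracking the same two counters in production is how you find out whether your TTL is doing its job.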
Q: How do you scale this pipeline?
A: Add more Kafka partitions. Flink scales horizontally. We started with 3 partitions. Now we have 12. Each partition handles 1,000 events/second, so that’s roughly 12,000 events/second in total.
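As a rough sketch of what partition scaling means in practice, the snippet below maps a key to a partition and works out the capacity numbers from the answer above. It uses CRC32 as a stand-in hash; Kafka’s default partitioner for keyed records uses murmur2, so real partition numbers will differ. The `partition_for` helper is a hypothetical name.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 here, as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


PER_PARTITION_EPS = 1_000  # events/second per partition, from the text
capacity_before = 3 * PER_PARTITION_EPS
capacity_after = 12 * PER_PARTITION_EPS

# The same key always lands on the same partition for a given partition count.
p = partition_for("example.com", 12)
```

One thing to keep in mind: changing the partition count changes which partition a key maps to, so per-key ordering only holds within the period a given count was in use.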
Q: Why hire Vietnamese developers for this?
A: Cost. A senior Flink developer in the US costs $200k/year. In Vietnam, it’s $36k/year. And they’re just as good. We’ve been doing this for 5 years. No quality issues.