How We Built a Real-Time Data Enrichment Pipeline for a Martech Startup in 3 Weeks — A Vietnam Offshore Case Study

A Martech startup needed to enrich 50 million customer records in real time. We built a pipeline using Kafka, Flink, and a Vietnamese team that slashed processing time from 6 hours to 18 minutes.

Let me be blunt: most “real-time” pipelines are lies.

They’re batch jobs running on a cron schedule, pretending to be real-time. I’ve seen it a hundred times.

But this time was different. A Martech startup in San Francisco came to us with a real problem. They had 50 million customer records sitting in a mix of PostgreSQL, MongoDB, and S3 buckets. Every time a user visited their site, they needed to enrich that profile with firmographic data, social signals, and behavioral scores.

The old way? A nightly batch job that took 6 hours to complete.

By the time the data was fresh, it was already stale.

Here’s the exact architecture we built with a team of 4 senior Vietnamese engineers from our Ho Chi Minh City hub. We cut that 6-hour window to 18 minutes. No exaggeration.

The Problem: Stale Data Was Costing Them Revenue

The startup was losing deals because their sales team was looking at 6-hour-old data.

A prospect would visit their pricing page. The enrichment engine would fire. But the data would arrive 6 hours later, long after the prospect had moved on.

The numbers were brutal:

  • 34% of enrichment jobs failed silently due to API rate limits
  • 60% of their S3 data was orphaned (no one knew it existed)
  • Average enrichment latency: 6 hours

They needed real-time. Not “near real-time.” Real-time.

Why We Chose a Vietnamese Team for This

Look, I could have hired locally in SF for $200k/year per engineer. But that’s not how we work.

We hired from Can Tho and Ho Chi Minh City. Why?

Because Vietnamese engineers don’t just write code. They *optimize*. They’ve been through the wringer with legacy systems. They know what production looks like.

More importantly, they’re cost-effective. Our senior developers cost $3,000/month. That’s a fraction of what you’d pay in the US. And they’re vetted through the ECOA AI Platform ACP, which gives them a 5x efficiency boost.

But enough about the team. Let’s talk about the pipeline.

The Architecture: Kafka + Flink + Redis + PostgreSQL

Here’s what we built:


User Event -> Kafka -> Flink Stream Processor -> Redis Cache -> PostgreSQL (enriched) -> API Response

Wait, that’s too simple. Let me break it down.

We used:

  • Kafka as the event backbone (handles 10,000 events/second easily)
  • Flink for stream processing (because it’s stateful and supports exactly-once state consistency through checkpointing)
  • Redis for caching (we cached firmographic lookups to avoid hitting APIs)
  • PostgreSQL as the source of truth

The key insight? We didn’t enrich everything.

We enriched *what mattered*.
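
To make that concrete, here is a minimal sketch of what such a filter could look like. The event fields (`domain`, `page`) and the list of high-intent pages are hypothetical, picked only to illustrate skipping low-value events before they reach the enrichment step. They are not the production rules.

```python
# Hypothetical filter: only enrich events that signal buying intent.
# Field names and the page list are illustrative assumptions.
HIGH_INTENT_PAGES = {"/pricing", "/demo", "/contact-sales"}


def should_enrich(event: dict) -> bool:
    """Return True if the event is worth an enrichment lookup."""
    if not event.get("domain"):
        return False  # nothing to look up without a company domain
    return event.get("page") in HIGH_INTENT_PAGES


events = [
    {"domain": "acme.com", "page": "/pricing"},
    {"domain": "acme.com", "page": "/blog/post-1"},
    {"domain": None, "page": "/pricing"},
]

# Only the first event passes: it has a domain and a high-intent page.
to_enrich = [e for e in events if should_enrich(e)]
```

Filtering this early keeps low-value traffic away from both the cache and the external API.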

The Enrichment Logic

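python
import json
import time

import redis

# Assumes a Redis instance on localhost; adjust host/port for your setup.
redis_client = redis.Redis(decode_responses=True)

def enrich_event(event):
    # Check cache first; decode so both code paths return a dict
    cache_key = f"enrich:{event['domain']}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # If not cached, hit the API (fetch_firmographic_data is defined elsewhere)
    data = fetch_firmographic_data(event['domain'])
    if data.get('status') == 'rate_limited':
        # Back off and retry once
        time.sleep(2)
        data = fetch_firmographic_data(event['domain'])

    # Cache for 1 hour, but never cache a rate-limit response
    if data.get('status') != 'rate_limited':
        redis_client.setex(cache_key, 3600, json.dumps(data))

    return data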

Simple, right? But it works.

The cache hit rate was 78%. That means 78% of our enrichment requests never hit the external API. We saved thousands of dollars in API costs.
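
The enrichment function retries once after a fixed two-second sleep. One common alternative, not necessarily what ran in production here, is exponential backoff with jitter, which spreads retries out when many workers hit the rate limit at once. In this sketch `fetch` is a hypothetical callable standing in for `fetch_firmographic_data`, and the attempt count and delays are illustrative defaults.

```python
import random
import time


def fetch_with_backoff(fetch, domain, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call fetch(domain), retrying with exponential backoff plus jitter
    while the response reports status == 'rate_limited'."""
    for attempt in range(max_attempts):
        data = fetch(domain)
        if data.get("status") != "rate_limited":
            return data
        if attempt < max_attempts - 1:
            # Delay ceiling doubles each attempt (0.5s, 1s, 2s, ...), capped.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # "Full jitter": sleep a random amount up to the ceiling.
            sleep(random.uniform(0, ceiling))
    return data  # still rate limited; the caller decides what to do next
```

The `sleep` parameter exists so the function can be tested without actually waiting. In a streaming job you would likely want a non-blocking variant, since a sleeping operator stalls every event behind it.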

The Results: 18 Minutes vs 6 Hours

Here’s what we tracked:

Metric             Before     After
Processing time    6 hours    18 minutes
API failure rate   34%        2%
Cache hit rate     0%         78%
Cost per month     $12,000    $3,500

We cut costs by 70%. That’s not a typo.
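
For anyone who wants to check the math, here is a quick calculation using only the numbers reported in the results above:

```python
# Figures taken from the results reported in this article.
cost_before = 12_000      # USD per month
cost_after = 3_500        # USD per month
minutes_before = 6 * 60   # 6 hours expressed in minutes
minutes_after = 18

savings = cost_before - cost_after
reduction_pct = savings / cost_before * 100
speedup = minutes_before / minutes_after

print(f"Monthly savings: ${savings:,}")         # Monthly savings: $8,500
print(f"Cost reduction: {reduction_pct:.1f}%")  # Cost reduction: 70.8%
print(f"Speedup: {speedup:.0f}x")               # Speedup: 20x
```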

The Hard Part: Exactly-Once Semantics

The hardest part was ensuring exactly-once processing. Flink handles this natively, but you need to configure it right.

Here’s the config that saved us:

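yaml
# Flink configuration file (flink-conf.yaml style keys)
execution.checkpointing.interval: 60s   # 1 minute
execution.checkpointing.mode: EXACTLY_ONCE
state.backend.type: rocksdb             # 'state.backend' on older Flink releases
# The 1-hour state TTL is set per state descriptor in the job code
# (StateTtlConfig), not in this file.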

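Without this, a failure and restart could apply the same events twice. And duplicate events mean bad data. Keep in mind that checkpointing protects Flink’s own state; writes to an external store such as PostgreSQL also need an idempotent or transactional sink to stay duplicate-free.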
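
When Flink recovers from a checkpoint, events processed after that checkpoint get replayed, so the database write has to tolerate seeing the same event twice. One common way to handle that is an idempotent write keyed on an event ID. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL. The table layout and the `event_id` field are assumptions for illustration, not the schema from this project.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL store
conn.execute(
    "CREATE TABLE enriched (event_id TEXT PRIMARY KEY, payload TEXT NOT NULL)"
)


def write_enriched(conn, event_id, payload):
    """Insert once per event_id; replays of the same event are ignored.
    Returns True if a new row was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO enriched (event_id, payload) VALUES (?, ?)",
        (event_id, json.dumps(payload)),
    )
    conn.commit()
    return cur.rowcount == 1


first = write_enriched(conn, "evt-1", {"domain": "acme.com"})
replay = write_enriched(conn, "evt-1", {"domain": "acme.com"})  # replayed event
row_count = conn.execute("SELECT COUNT(*) FROM enriched").fetchone()[0]
```

In PostgreSQL the equivalent statement is `INSERT ... ON CONFLICT (event_id) DO NOTHING`.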

What We Learned

Three things:

  1. Cache everything you can. We cached firmographic data, social signals, and behavioral scores. The cache hit rate saved us.
  2. Don’t over-engineer. We started with a simple Kafka -> Flink pipeline. It worked. We added complexity only when needed.
  3. Vietnamese engineers are fast. Our team in Ho Chi Minh City built this in 3 weeks. A US-based team would have taken 6.

The Real Secret: ECOA AI Platform ACP

Our developers use the ECOA AI Platform ACP. It’s an AI agent orchestration platform that automates 80% of the boilerplate.

Here’s how it works:

  • The ACP generates the Kafka producer code
  • It handles the Flink stream processing logic
  • It even writes the Redis caching layer

Our developers just review the code. They don’t write it from scratch.

That’s how we achieved 5x efficiency.

Frequently Asked Questions

Q: How do you handle data consistency in a real-time pipeline?

A: Use Flink’s exactly-once semantics with checkpointing. We configured checkpointing every 60 seconds. If a node fails, Flink restarts from the last checkpoint. No data loss.

Q: What’s the biggest mistake companies make with real-time enrichment?

A: Not caching. Most companies hit the API for every single event. That’s expensive and slow. Cache firmographic data for 1 hour. You’ll save 70% of API costs.
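
To see how a hit rate turns into avoided API calls, here is a small sketch. The 78% hit rate is the figure from this article; the monthly lookup volume is a hypothetical number chosen for illustration.

```python
def api_calls_needed(lookups: int, hit_rate: float) -> int:
    """Lookups that miss the cache and therefore reach the external API."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return round(lookups * (1.0 - hit_rate))


lookups = 1_000_000  # hypothetical monthly lookup volume
hit_rate = 0.78      # cache hit rate reported in this article

calls = api_calls_needed(lookups, hit_rate)
print(f"{calls:,} of {lookups:,} lookups reach the API")
```

How much money that saves depends on how the API vendor prices calls, so treat the call count as the input to your own cost math.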

Q: How do you scale this pipeline?

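A: Add more Kafka partitions, then raise the Flink job’s parallelism to match, because source tasks beyond the partition count sit idle. Flink scales horizontally. We started with 3 partitions. Now we have 12. Each partition handles 1,000 events/second.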
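
Kafka routes keyed records to a partition by hashing the key, so all events for one domain land on the same partition and keep their order. The sketch below shows the idea with `zlib.crc32` as a stand-in hash (Kafka’s default partitioner uses murmur2), plus the capacity arithmetic from the numbers in this answer. Keying by domain is an assumption for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.
    crc32 is a stand-in here; Kafka's default partitioner uses murmur2."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 12         # partition count from the answer above
PER_PARTITION_RATE = 1_000  # events/second per partition, as stated above

total_capacity = NUM_PARTITIONS * PER_PARTITION_RATE  # 12,000 events/second

# The same key always maps to the same partition.
p1 = partition_for("acme.com", NUM_PARTITIONS)
p2 = partition_for("acme.com", NUM_PARTITIONS)
```

One caveat: changing the partition count changes where keys map, so events for a given key can land on a different partition after a resize.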

Q: Why hire Vietnamese developers for this?

A: Cost. A senior Flink developer in the US costs $200k/year. In Vietnam, it’s $36k/year. And they’re just as good. We’ve been doing this for 5 years. No quality issues.
