We Migrated a 10TB Kafka Cluster Without a Single Message Lost: What We Learned With a Vietnam-Based Team

Let me be blunt: migrating a 10TB Kafka cluster is terrifying.

You’re not just moving data. You’re moving the nervous system of your entire event-driven architecture. One misconfigured consumer offset, one lag spike during cutover, and you’re looking at hours of reprocessing—or worse, permanent data loss.

We did it. Zero messages lost. Zero downtime. Here’s exactly how.

The Nightmare Scenario

Our client was a US-based fintech processing over 2 million financial transactions daily. Their Kafka cluster had been running for 3 years on an outdated Confluent version with a single-rack deployment in a data center that was being decommissioned.

The constraints were brutal:

10TB of data across 120 topics and 1,500 partitions
99.995% uptime SLA — no maintenance windows longer than 30 seconds
Exactly-once semantics required for payment processing topics
No schema registry downtime allowed during migration
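
To put the SLA in perspective, here is a minimal sketch of the downtime budget it implies. It is plain arithmetic on the 99.995% figure above; the function name and the 30-day month are our own choices for illustration. It works out to roughly 26 minutes per year, or a little over two minutes per 30-day month.

```python
# Downtime budget implied by an availability SLA (plain arithmetic).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime allowed over the given period."""
    return (1 - availability_pct / 100) * period_minutes


yearly = downtime_budget_minutes(99.995)
monthly = downtime_budget_minutes(99.995, period_minutes=30 * 24 * 60)

print(f"per year:    {yearly:.1f} min")
print(f"per 30 days: {monthly * 60:.0f} s")
```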

Most consultants told them to just rebuild consumers and replay from source. That would have taken 6 weeks and risked a 4-hour data gap.

We said no. We’d do a live migration.

Why We Chose a Vietnamese Team for This

This wasn’t a “let’s save money” decision. It was a “let’s not screw this up” decision.

We needed engineers who could work overlapping hours with the US client’s SRE team, communicate clearly in English, and had deep Kafka production experience. Not “I read the docs” experience. Real, “I’ve debugged a stuck consumer group at 3 AM” experience.

Our team in Ho Chi Minh City had exactly that: two senior engineers, each with 5+ years of running Kafka in production, and one mid-level engineer who specialized in Kafka Connect and schema management.

The cost? $6,000/month for the entire team. A US-based equivalent would have run $45,000/month minimum.

But honestly, the cost wasn’t the point. The skill was.

The Architecture: MirrorMaker 2.0 with a Twist

Here’s the standard approach most teams try:

Set up MirrorMaker 2.0
Let it replicate for a few days
Flip consumers to the new cluster
Pray

That fails because consumer offsets don’t migrate cleanly, schema compatibility breaks, and you end up with duplicate or missing messages.

We built a three-phase approach:

Phase 1: Baseline Replication (Days 1-3)

We deployed MirrorMaker 2.0 in active-active mode between the old and new clusters. But we didn’t just mirror topics—we mirrored consumer group offsets using a custom script.

python
# Simplified offset migration script
from confluent_kafka import Consumer, TopicPartition

old_cluster = {'bootstrap.servers': 'old-cluster:9092'}
new_cluster = {'bootstrap.servers': 'new-cluster:9092'}

TOPIC = 'transactions'
groups = ['payment-processor', 'notification-sender', 'analytics-pipeline']

for group in groups:
    # Read the group's committed offsets from the old cluster
    old_consumer = Consumer({**old_cluster, 'group.id': group, 'enable.auto.commit': False})
    partition_ids = old_consumer.list_topics(TOPIC, timeout=10).topics[TOPIC].partitions
    committed = old_consumer.committed(
        [TopicPartition(TOPIC, p) for p in partition_ids], timeout=10
    )
    old_consumer.close()

    # Skip partitions where the group has never committed (negative sentinel offset)
    offsets = [tp for tp in committed if tp.offset >= 0]
    if not offsets:
        continue

    # Commit the same positions under the same group id on the new cluster.
    # This copies raw offsets, so it is only correct when offsets line up
    # one-to-one between the two clusters; verify that before relying on it.
    new_consumer = Consumer({**new_cluster, 'group.id': group, 'enable.auto.commit': False})
    new_consumer.commit(offsets=offsets, asynchronous=False)
    new_consumer.close()

This ensured that when we flipped consumers, they’d pick up exactly where they left off.
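
Before trusting copied offsets, it helps to compare what each cluster reports for a group. The sketch below is illustrative rather than our production tooling: the `Offsets` shape and the `offset_mismatches` name are hypothetical, and in practice the two dictionaries would be filled from `Consumer.committed()` on each cluster.

```python
from typing import Dict, Tuple

# (topic, partition) -> committed offset; hypothetical shape for this sketch
Offsets = Dict[Tuple[str, int], int]


def offset_mismatches(old: Offsets, new: Offsets) -> Dict[Tuple[str, int], Tuple[int, int]]:
    """Return partitions whose committed offset differs (or is missing) on the new cluster."""
    mismatches = {}
    for tp, old_offset in old.items():
        new_offset = new.get(tp, -1)  # -1 marks "no commit found" in this sketch
        if new_offset != old_offset:
            mismatches[tp] = (old_offset, new_offset)
    return mismatches


old_offsets = {("transactions", 0): 1500, ("transactions", 1): 980}
new_offsets = {("transactions", 0): 1500, ("transactions", 1): 975}

print(offset_mismatches(old_offsets, new_offsets))
```

An empty result means every partition of the group lines up; anything else names the partition and shows both values.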

Phase 2: Validation Pipeline (Days 4-6)

This is where most teams fail. They assume MirrorMaker is perfect. It’s not.

We built a validation pipeline that compared message counts and checksums between clusters every 5 minutes.

python
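# Validation script (simplified)
import hashlib
from confluent_kafka import Consumer, TopicPartition

def consume_batch(bootstrap, topic, partition, offset, count):
    # Helper not shown in the original script; one possible implementation.
    # Reads up to `count` messages from one partition, starting at `offset`.
    consumer = Consumer({'bootstrap.servers': bootstrap, 'group.id': 'migration-validator',
                         'enable.auto.commit': False})
    consumer.assign([TopicPartition(topic, partition, offset)])
    messages = [m for m in consumer.consume(count, timeout=10) if m.error() is None]
    consumer.close()
    return messages

def validate_topic(topic, partition, start_offset, end_offset):
    old_messages = []
    new_messages = []

    for offset in range(start_offset, end_offset, 1000):
        size = min(1000, end_offset - offset)  # do not read past end_offset
        old_batch = consume_batch('old-cluster:9092', topic, partition, offset, size)
        new_batch = consume_batch('new-cluster:9092', topic, partition, offset, size)

        old_messages.extend(old_batch)
        new_messages.extend(new_batch)

    # Compare counts and checksums (tombstones have a None value, hashed as empty bytes)
    old_hash = hashlib.md5(b''.join(m.value() or b'' for m in old_messages)).hexdigest()
    new_hash = hashlib.md5(b''.join(m.value() or b'' for m in new_messages)).hexdigest()

    return len(old_messages) == len(new_messages) and old_hash == new_hash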

We found 3 topics where MirrorMaker had silently dropped messages due to schema incompatibility. Our team caught it before cutover.
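
A single pass/fail checksum tells you a topic diverged, not where. One way to narrow it down is to hash in fixed-size batches and report the first batch that differs. This is a sketch under our own assumptions: the function names are hypothetical, the inputs are message values already read from each cluster, and the length prefix is an addition so that message boundaries affect the hash.

```python
import hashlib
from typing import Iterable, List, Optional


def batch_digests(values: Iterable[bytes], batch_size: int) -> List[str]:
    """Return one MD5 hex digest per batch of `batch_size` consecutive values."""
    digests, current, count = [], hashlib.md5(), 0
    for value in values:
        # Length prefix keeps message boundaries unambiguous inside a batch
        current.update(len(value).to_bytes(4, "big"))
        current.update(value)
        count += 1
        if count == batch_size:
            digests.append(current.hexdigest())
            current, count = hashlib.md5(), 0
    if count:
        digests.append(current.hexdigest())
    return digests


def first_divergent_batch(old_values: Iterable[bytes],
                          new_values: Iterable[bytes],
                          batch_size: int = 1000) -> Optional[int]:
    """Return the index of the first batch that differs, or None if both sides match."""
    old_d = batch_digests(old_values, batch_size)
    new_d = batch_digests(new_values, batch_size)
    for i, (a, b) in enumerate(zip(old_d, new_d)):
        if a != b:
            return i
    if len(old_d) != len(new_d):
        return min(len(old_d), len(new_d))  # one side ran out of batches first
    return None


old = [b"a", b"b", b"c", b"d"]
new = [b"a", b"b", b"d"]  # b"c" was dropped
print(first_divergent_batch(old, new, batch_size=2))
```

Once the batch index is known, only that slice of offsets needs a message-by-message comparison.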

Phase 3: The Cutover (Day 7, 2 AM UTC)

The actual cutover took 47 seconds.

Here’s the exact sequence:

T+0s: Stop all producers writing to old cluster
T+5s: Flush MirrorMaker’s pending queue
T+10s: Validate final offset parity
T+15s: Update DNS records to point to new cluster
T+20s: Restart consumers with new cluster config
T+47s: All consumers healthy, lag at zero

The key insight? We didn’t flip everything at once. We migrated topic by topic, starting with non-critical analytics topics, then notification topics, and finally payment processing topics.
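
To make the ordering and rollback idea concrete, here is a minimal sketch of a step runner: each step carries its own undo, and a failure unwinds the completed steps in reverse order. The `run_cutover` name, the `Step` shape, and the stub actions are hypothetical stand-ins, not the orchestration we actually ran.

```python
from typing import Callable, List, Tuple

# (name, action, rollback); the shape is an assumption made for this sketch
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_cutover(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    done: List[Tuple[str, Callable[[], None]]] = []
    for name, action, rollback in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"FAILED {name}: {exc}")
            for done_name, done_rollback in reversed(done):
                done_rollback()
                log.append(f"rolled back {done_name}")
            return log
        done.append((name, rollback))
        log.append(f"ok {name}")
    return log


def noop() -> None:
    """Stand-in for a real action such as pausing producers."""


def fail_parity() -> None:
    raise RuntimeError("offset mismatch")


demo = [
    ("stop producers", noop, noop),
    ("flush MirrorMaker", noop, noop),
    ("validate offset parity", fail_parity, noop),
    ("update DNS", noop, noop),
]

for line in run_cutover(demo):
    print(line)
```

In the demo the parity check fails on purpose, so the DNS step never runs and the two earlier steps are undone.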

What Actually Went Wrong (And How We Fixed It)

I’m not going to pretend it was perfect. Three things broke:

1. Schema Registry Timeout

The new Schema Registry was under-provisioned. During cutover, it hit 100% CPU and started timing out. Our team had already provisioned a read replica, so we failed over in 90 seconds.

2. Consumer Group Rebalancing

One consumer group with 50 members took 4 minutes to rebalance after the cutover. We’d set `session.timeout.ms` too high. Hotfix: reduced it from 45s to 10s for the migration window.
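
As an illustration, the migration-window consumer settings might look like the dictionary below. Only the `session.timeout.ms` change comes from our incident; the `heartbeat.interval.ms` value and the `timeout_problems` helper are assumptions for this sketch, based on the common guidance to keep the heartbeat at no more than a third of the session timeout.

```python
# Consumer settings for the migration window (sketch, not our full config)
migration_consumer_config = {
    "bootstrap.servers": "new-cluster:9092",
    "group.id": "payment-processor",
    "enable.auto.commit": False,
    "session.timeout.ms": 10_000,    # reduced from 45_000 for the migration window
    "heartbeat.interval.ms": 3_000,  # assumption: about one third of the session timeout
}


def timeout_problems(cfg: dict) -> list:
    """Return human-readable warnings about the timeout settings in cfg."""
    problems = []
    session = cfg["session.timeout.ms"]
    heartbeat = cfg["heartbeat.interval.ms"]
    if heartbeat * 3 > session:
        problems.append(
            "heartbeat.interval.ms should be at most one third of session.timeout.ms"
        )
    return problems


print(timeout_problems(migration_consumer_config))
```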

3. Lag Spike on High-Volume Topic

The `transactions` topic (2GB/hour) showed a 30-second lag spike after cutover. Turned out the new cluster’s `num.network.threads` was set to 3 instead of 8. Quick config change fixed it.

The Results

Metric	Before	After
Cluster size	10TB	10TB (identical)
Messages lost	0	0
Downtime	0	0
Consumer lag at cutover	0	0
Migration time	N/A	7 days
Team cost	$45k/mo (US)	$6k/mo (Vietnam)

Why This Worked

Three reasons:

1. We treated it like a production deployment, not a migration.

Every step had a rollback plan. Every script had a dry-run mode. We rehearsed the cutover 3 times in staging.
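
A dry-run mode can be as simple as making "do nothing" the default and requiring an explicit flag to apply changes. The sketch below shows the pattern; `migrate_group`, the `--execute` flag, and the injected `commit_fn` are hypothetical names, and the real Kafka call is left to the caller.

```python
import argparse
from typing import Callable, Dict, List, Optional


def migrate_group(group: str,
                  offsets: Dict[int, int],
                  commit_fn: Callable[[str, Dict[int, int]], None],
                  dry_run: bool = True) -> str:
    """Commit offsets through commit_fn, or only describe the change when dry_run is set."""
    if dry_run:
        return f"[dry-run] would commit {len(offsets)} offsets for {group}"
    commit_fn(group, offsets)
    return f"committed {len(offsets)} offsets for {group}"


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Offset migration (sketch)")
    parser.add_argument("--execute", action="store_true",
                        help="apply changes; without this flag the script only reports")
    return parser.parse_args(argv)


applied = []  # records what the stand-in commit function was asked to do


def fake_commit(group: str, offsets: Dict[int, int]) -> None:
    applied.append((group, offsets))


args = parse_args([])  # pass sys.argv[1:] in a real script
print(migrate_group("payment-processor", {0: 1500, 1: 980}, fake_commit,
                    dry_run=not args.execute))
```

Defaulting to dry-run means a mistyped command reports instead of writing.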

2. The Vietnamese team owned the execution.

They didn’t just follow instructions. They found the schema incompatibility bug. They suggested the topic-by-topic cutover strategy. They built the validation pipeline.

3. We used the ECOA AI Platform for orchestration.

We automated the validation checks, the cutover sequence, and the rollback triggers using ACP. It reduced manual steps from 23 to 4.

Frequently Asked Questions

Q: Can MirrorMaker 2.0 handle 10TB without data loss?

Yes, but only if you validate. MirrorMaker is reliable for most topics, but schema changes, network partitions, and configuration mismatches can cause silent drops. Always build a checksum-based validation pipeline.

Q: How long does a Kafka cluster migration actually take?

For a 10TB cluster with 120 topics, plan for 7-10 days. The replication itself takes 2-3 days. The remaining time is for validation, testing, and the actual cutover. Don’t rush the validation phase.
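
As a back-of-the-envelope check on the replication window, here is the sustained throughput needed to copy 10TB in 2 or 3 days. It assumes decimal units (1TB = 1,000,000MB) and ignores new traffic arriving during replication, so treat the result as a floor. It comes out to about 58MB/s for 2 days and about 39MB/s for 3 days.

```python
def required_throughput_mb_s(data_tb: float, days: float) -> float:
    """Sustained MB/s needed to copy data_tb terabytes in `days` days (decimal units)."""
    return data_tb * 1_000_000 / (days * 86_400)


for days in (2, 3):
    print(f"{days} days: {required_throughput_mb_s(10, days):.1f} MB/s")
```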

Q: What’s the biggest risk in Kafka migration?

Consumer offset drift. If your consumers don’t pick up exactly where they left off, you’ll either miss messages or reprocess duplicates. Always migrate offsets explicitly—don’t rely on auto-commit.

Q: How much does a Vietnam-based Kafka engineering team cost?

For a migration like this, expect $5,000-$8,000/month for a team of 3 engineers (ours had two senior and one mid-level). That’s 80-90% less than US rates, with the same or better quality if you vet properly.