How We Fixed an Event Pipeline That Lost 30% of Messages (And Why Most Devs Blame Kafka)

The client was a Series B fintech company in Singapore. They processed around 12 million events per day—transactions, fraud alerts, user activity pings. Their platform was built on Kafka, PostgreSQL, and a bunch of Python microservices.

Then the alerts started.

Why Vietnam Outsourcing Beats Other Offshore Destinations in 2025 | ECOA AI

TL;DR: Vietnam outsourcing delivers 30–50% cost savings compared to Western rates, with English proficiency rising fast, a 13-hour… ...

“We’re losing events,” the CTO told me on a Thursday call. “Our reconciliation reports show a gap. Almost 30% of our webhook receipts never make it to the data lake.”

Sound familiar? Your first instinct is almost always “Kafka issue.” Lagging consumers? Rebalancing partitions? Corrupted offsets?

Your Open Source PRs Are Getting Rejected: Here’s the Exact Data on Why (And How to Fix It)

Your Open Source PRs Are Getting Rejected: Here’s the Exact Data on Why (And How to Fix It)… ...

Wrong.

We bet three weeks and an ECOA AI-augmented Vietnamese team in Can Tho on a different theory. That bet paid off in 8 days.

Here’s exactly how it went down.

The False Start: Everyone Looked at Kafka First

The client had already spent two weeks with their in-house SRE team. They’d:

Increased partition count from 6 to 12
Bumped retention.ms from 7 days to 14 days
Scaled consumer instances from 3 to 8
Enabled idempotent producers

Message loss? Still 30%.

The frustration was palpable. One engineer even posted in their team Slack: “Kafka is losing our data. We need to migrate to Pulsar.”

But here’s the thing—Kafka doesn’t silently drop 30% of messages. It’s durable by design. If you’re losing that volume, the problem is almost never the broker.

It’s the consumer.

What the Vietnamese Team Actually Found

We assembled a small squad: two middle-level Python engineers from our Can Tho hub, one senior dev lead based in Ho Chi Minh City, and the client’s lead data architect. We used the ECOA AI Platform ACP to orchestrate the debugging workflow—spinning up parallel analysis agents for consumer logs, producer traces, and schema validation.

Day 1: The senior dev ran a simple offset lag query and found nothing abnormal.

Day 2: One of the middle engineers noticed something odd. The consumer group was committing offsets correctly, but the logs showed `DeserializationError` exceptions being swallowed.

She traced it to a single line in the consumer codebase:

python
try:
    message = json.loads(raw_value.value())
    process_event(message)
    consumer.commit()
except Exception:
    pass

Yes. A blanket `except Exception: pass`.

But that wasn’t the full story. That catch would log a warning, but here’s what we found next.

The Real Culprit: Schema Evolution + Stale Cache

The client had recently updated their Avro schema. They’d added two optional fields to the event payload. The serialization library on the producer side (Confluent’s `avro.Serializer`) encoded the new schema version correctly.

The consumer, however, had a cached schema registry client that expired after 1 hour. New events with the updated schema arrived within minutes of the rollout. The consumer tried to deserialize using the old schema, failed silently, skipped the message, and committed the offset anyway.

That’s why Kafka looked fine. The broker did its job. The consumer just decided to pretend the message never existed.

We lost 30% of events for 8 hours before anyone noticed.

Honestly, I’ve seen worse. A blanket `except: pass` in a data pipeline is like leaving your front door unlocked and blaming the neighbors.

The Fix: 47 Lines, Zero Cluster Changes

We didn’t touch a single Kafka configuration. Here’s what we deployed:

Schema registry client with forced re-validation on deserialization failure. No more stale cache trust.

Dead-letter queue for failed messages. Sent them to a separate `events_dead_letter` topic with the raw bytes and the error trace.

Alert on any deserialization error. We added a simple Prometheus counter. If it increments, PagerDuty fires.

The core consumer change looked like this:

python
from confluent_kafka import DeserializationError

def safe_process_message(raw_value, topic):
    try:
        message = avro_deserializer(raw_value, reader_schema_schema_id=LATEST)
        process_event(message)
        consumer.commit()
    except DeserializationError as e:
        dead_letter_producer.produce(
            topic="events_dead_letter",
            value=raw_value,
            headers={"error": str(e), "original_topic": topic}
        )
        consumer.commit()
        metrics["deserialization_errors"].inc()

That’s the entire structural change. 47 lines across two files.

But wait—we also found a second issue.

The Silent Bottleneck That Was Masquerading as Latency

While debugging the consumer, the ECOA AI orchestration agent flagged something else. The producer side had a batch configuration that was too aggressive.

python
producer_config = {
    "bootstrap.servers": "kafka-cluster:9092",
    "acks": "all",
    "batch.size": 1048576,    # 1MB
    "linger.ms": 5000         # 5 seconds!
}

A `linger.ms` of 5000 meant the producer held events in memory for up to 5 seconds before sending them. The client assumed this improved throughput. In reality, it introduced variable latency spikes that made downstream systems time out and re-send events.

The retry storm exacerbated the deserialization issue. Basically, the system was fighting itself.

We dropped linger.ms to 100. Batch size stayed at 1MB.

Result: Latency dropped from an average of 1.8 seconds to 680ms. P99 went from 4.2 seconds down to 1.1 seconds.

The Numbers After 30 Days of Production

Metric	Before	After	Improvement
Message loss rate	30.2%	0.004%	99.98% reliability
P99 event processing latency	4.2s	1.1s	-73%
Dead-letter events handled	0	~120/day	Proper error recovery
Engineer hours spent on debugging/week	18h	2h	-89%

The 0.004% loss we still see is from real Kafka broker failures (network partitions, disk full). Those are acceptable. We trap them in the DLQ and replay on recovery.

Why This Matters for Your Own Pipeline

Most teams obsess over infrastructure. They scale brokers, add partitions, tune `num.network.threads`. But the data loss happens at the edges—in the serialization layer, the consumer logic, the error handling that says “it’s fine” when it’s not.

Here are the three lessons we take to every pipeline audit now:

Never blanket-except in a consumer. Ever. Catch specific errors, log everything, and publish to a DLQ.

Schema registry caching is a ticking bomb. Force re-fetch on deserialization failure or implement a proactive schema sync.

Watch your producer sidecar configs. `linger.ms` and `batch.size` are not “set and forget” values. They need to align with your actual event velocity and consumer throughput.

The ECOA AI orchestration platform made this debugging loop 3x faster. Instead of manually grepping logs across 12 services, we defined a workflow: “trace consumer errors -> check schema registry -> flag producer config anomalies.” The AI agents executed that in parallel and surfaced the schema mismatch and the linger.ms issue within hours, not days.

Frequently Asked Questions

Q: Is Kafka actually reliable for high-volume event pipelines?

Yes, Kafka is extremely reliable. The broker infrastructure handles replication, persistence, and ordering well. Most message loss problems are in the consumer or producer libraries, not Kafka itself. Always audit your client-side code and schema handling before blaming the cluster.

Q: Should I always use a dead-letter queue?

Absolutely. A DLQ is not optional for production pipelines. Without it, a single bad message can block your entire consumer group or, worse, get silently dropped. Route deserialization failures, poison pills, and timeout errors to a DLQ topic. Replay them after fixing the root cause.

Q: How did the ECOA AI platform specifically help here?

The platform orchestrated parallel analysis workflows. One agent checked Kafka broker metrics, another parsed consumer logs for exception patterns, a third analyzed producer configurations. Instead of one engineer debugging sequentially for days, we got correlated findings in under 4 hours. The schema mismatch was flagged by the log analysis agent before we even looked at the consumer code.

Q: What’s the cost breakdown for a fix like this with a Vietnamese team?

We deployed one senior lead ($3,000/month) and two middle engineers ($2,000/month each) from our Can Tho hub. Total engineering cost for the 8-day engagement was roughly $1,700. Compare that to the client’s internal SRE team spending two weeks (over $12,000 in salaries) chasing Kafka configurations. The ROI is obvious once you stop guessing and start orchestrating.