Multi-Agent Systems Love Duplicate Work: Why Missing Idempotency Is the Silent Killer of Production Workflows (And How to Fix It)
I’ll say it bluntly: your multi-agent system is running duplicate tasks right now. You just don’t know it yet.
We discovered this the hard way. We were running a financial data pipeline with five specialized agents – one for ingestion, one for normalization, one for enrichment, one for validation, and one for storage. The orchestrator dispatched tasks, agents completed them, life was good.
Then we checked the database. 17% of records were duplicated. Some were processed three times.
The orchestrator wasn’t buggy. The agents weren’t broken. The problem was simpler and uglier: retries. A network hiccup here, a timeout there, a DB connection drop. Each failure triggered a retry. Each retry re-ran the entire agent workflow on the same data.
Multi-agent systems amplify this problem because they multiply the points of failure. More agents = more retries = more duplicates.
Here’s how we fixed it. You’ll want to steal this.
Why Idempotency Matters More in Multi-Agent Than Anywhere Else
Single-agent systems have it easy. One request, one response, one retry. Worst case, you reprocess one thing.
Multi-agent systems are a different beast. An orchestrator dispatches a task to Agent A, which calls Agent B, which calls Agent C. Suppose Agent C finishes its work but its response to Agent B is lost to a timeout. Agent B retries, so Agent C receives the same payload a second time, processes it again, and emits a second output. Agent B consumes both and forwards both. The pipeline doubles its output silently.
This isn’t theoretical. It happens constantly.
The fix isn’t “don’t retry.” Retries are essential for reliability. The fix is idempotency at every agent boundary.
The Core Pattern: A Dedup Layer at the Orchestrator Level
We built a lightweight idempotency layer that sits between the orchestrator and every agent. Here’s the architecture:
Orchestrator → Idempotency Layer → Agent A
                       ↓
              Redis + PostgreSQL
Every task that enters the system gets a unique idempotency key – a deterministic hash of the input payload plus a namespace. Before an agent executes, the layer checks if this key has already been processed.
Simple, right? The devil is in the expiration, the isolation, and the failure modes.
Step 1: Generate Deterministic Idempotency Keys
Don’t generate a fresh random UUID for each attempt. If every retry gets a different key, your dedup layer becomes useless.
Use a hash of the actual input data:
```python
import hashlib
import json


def generate_idempotency_key(namespace: str, payload: dict) -> str:
    """
    Generate a deterministic key from the payload.
    Namespace prevents collisions between different workflow types.
    """
    raw = f"{namespace}:{json.dumps(payload, sort_keys=True)}"
    return hashlib.sha256(raw.encode()).hexdigest()
```
Sort keys in the JSON. Otherwise, `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` generate different hashes for the same logical data.
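A quick check of the key-ordering claim, as a minimal sketch. It repeats `generate_idempotency_key` so the snippet runs on its own; the namespaces and payloads are made-up examples. One caveat worth flagging: `sort_keys=True` only canonicalizes key order, so values that are logically equal but serialize differently (for instance `1` versus `1.0`) still produce different keys.

```python
import hashlib
import json


def generate_idempotency_key(namespace: str, payload: dict) -> str:
    raw = f"{namespace}:{json.dumps(payload, sort_keys=True)}"
    return hashlib.sha256(raw.encode()).hexdigest()


# Hypothetical payloads: same logical data, different key order.
first = generate_idempotency_key("ingest", {"a": 1, "b": 2})
second = generate_idempotency_key("ingest", {"b": 2, "a": 1})

# Same payload under a different namespace, and an int/float mismatch.
other_ns = generate_idempotency_key("enrich", {"a": 1, "b": 2})
as_float = generate_idempotency_key("ingest", {"a": 1.0, "b": 2})

print(first == second)    # True
print(first == other_ns)  # False
print(first == as_float)  # False
```

If upstream producers disagree on number or date formatting, normalizing the payload before hashing is probably worth the extra step.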
Step 2: Redis as the First Line of Defense
Redis is fast. It’s perfect for the hot path. But it’s volatile. Don’t rely on it alone.
```python
import redis.asyncio as aioredis


class IdempotencyGuard:
    def __init__(self, redis_client: aioredis.Redis, ttl_seconds: int = 3600):
        self.redis = redis_client
        self.ttl = ttl_seconds

    async def is_already_processed(self, key: str) -> bool:
        """Check Redis. Returns True if this key was already seen."""
        # EXISTS returns a count of matching keys, so convert it to a bool.
        return bool(await self.redis.exists(f"idempotency:{key}"))

    async def mark_as_processing(self, key: str) -> bool:
        """
        Atomically set the key with NX (not exists).
        Returns True if this call acquired the lock (first to process).
        Returns False if another agent already claimed it.
        """
        result = await self.redis.set(
            f"idempotency:{key}",
            "processing",
            nx=True,  # Only set if key doesn't exist
            ex=self.ttl,
        )
        # SET ... NX returns None when the key already exists.
        return bool(result)

    async def mark_as_completed(self, key: str, result: str):
        """Mark the key as completed with the actual result."""
        await self.redis.set(
            f"idempotency:{key}",
            f"completed:{result}",
            ex=self.ttl,
        )
```
The `NX` flag is critical. It’s the atomic check-and-set operation that prevents race conditions when two retries arrive at the same millisecond.
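To make the claim-then-execute flow concrete, here is a minimal sketch of how a worker might wrap an agent call. `FakeRedis` is a hypothetical in-memory stand-in that mimics only the `set(..., nx=True)` and `delete` behavior used here, so the example runs without a Redis server; `run_once` and `normalize` are illustrative names, not part of our codebase.

```python
import asyncio


class FakeRedis:
    """Hypothetical in-memory stand-in for the two Redis calls used below."""

    def __init__(self):
        self.store = {}

    async def set(self, name, value, nx=False, ex=None):
        # Mimics SET ... NX: refuse to overwrite an existing key.
        if nx and name in self.store:
            return None
        self.store[name] = value
        return True

    async def delete(self, name):
        return 1 if self.store.pop(name, None) is not None else 0


async def run_once(redis, key, agent_fn, payload, ttl=3600):
    """Run agent_fn(payload) only if this call wins the claim for key."""
    claimed = await redis.set(f"idempotency:{key}", "processing", nx=True, ex=ttl)
    if not claimed:
        return None  # duplicate: another attempt has (or had) this task
    try:
        result = await agent_fn(payload)
    except Exception:
        # Release the claim so a later retry can run the task again.
        await redis.delete(f"idempotency:{key}")
        raise
    await redis.set(f"idempotency:{key}", f"completed:{result}", ex=ttl)
    return result


async def demo():
    redis = FakeRedis()
    calls = []

    async def normalize(payload):
        calls.append(payload)
        await asyncio.sleep(0)  # yield so the two attempts interleave
        return payload["amount"] * 100

    # Two "retries" of the same task arrive concurrently.
    results = await asyncio.gather(
        run_once(redis, "abc123", normalize, {"amount": 5}),
        run_once(redis, "abc123", normalize, {"amount": 5}),
    )
    return results, len(calls)


results, call_count = asyncio.run(demo())
print(results, call_count)  # [500, None] 1
```

Deleting the key on failure is a design choice in this sketch: it lets a retry proceed immediately instead of waiting for the TTL, at the cost of assuming the failed attempt left no partial side effects.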
Step 3: PostgreSQL as the Source of Truth
Redis can lose data on restart. For workflows that matter (financial transactions, order processing, document generation), you need durable storage.
```sql
CREATE TABLE idempotency_keys (
    key_hash       VARCHAR(64) PRIMARY KEY,
    namespace      VARCHAR(128) NOT NULL,
    status         VARCHAR(20) NOT NULL DEFAULT 'pending',
    input_payload  JSONB NOT NULL,
    output_payload JSONB,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at   TIMESTAMPTZ,
    retry_count    INT NOT NULL DEFAULT 0
);

CREATE INDEX idx_idempotency_namespace_status
    ON idempotency_keys(namespace, status);
```
Use `INSERT … ON CONFLICT DO NOTHING` for the atomic claim:
```python
import json


# Assumes an asyncpg-style pool ($1 placeholders, status-string return value).
async def claim_task_in_postgres(pool, key_hash: str, namespace: str, payload: dict) -> bool:
    async with pool.acquire() as conn:
        result = await conn.execute(
            """
            INSERT INTO idempotency_keys (key_hash, namespace, status, input_payload)
            VALUES ($1, $2, 'processing', $3::jsonb)
            ON CONFLICT (key_hash) DO NOTHING
            """,
            key_hash,
            namespace,
            json.dumps(payload),  # jsonb is passed as JSON text
        )
        # The command tag is "INSERT 0 1" only if we inserted (first to claim)
        return result == "INSERT 0 1"
```
Step 4: Combine Redis and PostgreSQL in a Two-Phase Check
Don’t query Postgres on every task. That’s slow. Use Redis as a hot cache, Postgres as cold storage.
```python
# Method on IdempotencyGuard. pg_was_processed and pg_claim are helper
# methods not shown here; pg_claim wraps the atomic INSERT from Step 3.
async def guard_task(self, key: str, namespace: str, payload: dict) -> bool:
    """
    Two-phase idempotency check.
    Returns True if this task should execute (first of its kind).
    """
    # Phase 1: Fast Redis check
    if await self.redis.exists(f"idempotency:{key}"):
        return False

    # Phase 2: Check Postgres (handles Redis eviction/restart)
    if await self.pg_was_processed(key):
        # Backfill Redis for future fast checks
        await self.redis.set(f"idempotency:{key}", "completed", ex=self.ttl)
        return False

    # Phase 3: Claim in Postgres (atomic)
    claimed = await self.pg_claim(key, namespace, payload)
    if not claimed:
        return False

    # Phase 4: Mark in Redis
    await self.redis.set(f"idempotency:{key}", "processing", ex=self.ttl)
    return True
```
Honestly, this pattern eliminated 100% of our duplicate processing issues. We went from 17% duplication to zero. Measurable. Repeatable.
Handling the Stuck-Processing Edge Case
Here’s a problem you’ll hit: an agent claims a task, starts processing, then crashes. The idempotency key stays in “processing” state forever. Subsequent retries see “processing” and refuse to touch it. The task is dead.
You need a timeout-based release mechanism.
```python
from datetime import timedelta


async def release_stuck_tasks(self, timeout_seconds: int = 300):
    """Release tasks that have been 'processing' longer than the timeout."""
    async with self.pool.acquire() as conn:
        await conn.execute(
            """
            UPDATE idempotency_keys
            SET status = 'pending', retry_count = retry_count + 1
            WHERE status = 'processing'
              AND created_at < NOW() - $1::interval
              AND retry_count < 3
            """,
            # Assuming asyncpg: interval parameters are passed as timedelta, not str.
            timedelta(seconds=timeout_seconds),
        )
```
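Set the Redis TTL equal to your processing timeout. (The snippets above default to 3600 and 300 seconds respectively, so align them before deploying.) If the Redis key is evicted or expires while the task is still processing, the next retry falls through to Postgres, sees "processing", and backs off. The background job above releases the row after the timeout.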
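One gap worth sketching: if `pg_claim` uses the same `ON CONFLICT DO NOTHING` statement as Step 3, a row that the release job has reset to 'pending' would still be refused on the next retry, because the row already exists. A conditional upsert is one way to let a retry re-claim it. The example below is a sketch under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in so it runs without a database server (SQLite 3.24+ accepts the same `ON CONFLICT ... DO UPDATE ... WHERE` form), and the table is trimmed to the columns that matter. In PostgreSQL the equivalent statement would use `$1` placeholders and report `INSERT 0 1` when it inserts or updates.

```python
import sqlite3

# Insert a new claim, or take over a row only if it was released to 'pending'.
CLAIM_SQL = (
    "INSERT INTO idempotency_keys (key_hash, status) VALUES (?, 'processing') "
    "ON CONFLICT (key_hash) DO UPDATE SET status = 'processing' "
    "WHERE idempotency_keys.status = 'pending'"
)


def claim(conn, key_hash):
    """Return True if this call inserted the row or re-claimed a released one."""
    cur = conn.execute(CLAIM_SQL, (key_hash,))
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys ("
    "key_hash TEXT PRIMARY KEY, "
    "status TEXT NOT NULL, "
    "retry_count INTEGER NOT NULL DEFAULT 0)"
)

first = claim(conn, "k1")      # new key: inserted
duplicate = claim(conn, "k1")  # still 'processing': refused

# What the release job does to a stuck row.
conn.execute(
    "UPDATE idempotency_keys "
    "SET status = 'pending', retry_count = retry_count + 1 "
    "WHERE key_hash = 'k1' AND status = 'processing'"
)
conn.commit()

reclaimed = claim(conn, "k1")  # released row: claimed again
print(first, duplicate, reclaimed)  # True False True
```

In a real deployment the re-claim would probably also refresh a timestamp, since the release job compares against `created_at` and would otherwise release the same row again on its next pass.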
What About Agents That Call Other Agents?
This is where most idempotency implementations fail. They only protect the outermost orchestrator call. But in a multi-agent system, Agent A calls Agent B, which calls Agent C.
Every agent boundary needs its own idempotency check.
We solved this by passing the parent idempotency key down the call chain as a `trace_id`. Each agent generates child keys by hashing `parent_key + agent_name + input_payload`:
```python
def child_key(parent_key: str, agent_name: str, payload: dict) -> str:
    raw = f"{parent_key}:{agent_name}:{json.dumps(payload, sort_keys=True)}"
    return hashlib.sha256(raw.encode()).hexdigest()
```
This gives you a full lineage of every agent execution. When something goes wrong, you can trace the exact chain of calls that produced a duplicate.
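As a rough illustration of that lineage, the sketch below chains `child_key` across three hops. The `lineage` helper, the agent names, and the payloads are hypothetical; the point is only that a retry of the whole chain reproduces the same keys, while a different root produces an entirely different chain.

```python
import hashlib
import json


def child_key(parent_key: str, agent_name: str, payload: dict) -> str:
    raw = f"{parent_key}:{agent_name}:{json.dumps(payload, sort_keys=True)}"
    return hashlib.sha256(raw.encode()).hexdigest()


def lineage(root_key: str, hops: list) -> list:
    """Derive one key per (agent_name, payload) hop, each chained to the previous key."""
    keys = [root_key]
    for agent_name, payload in hops:
        keys.append(child_key(keys[-1], agent_name, payload))
    return keys


# Hypothetical three-agent chain for one document.
hops = [
    ("agent_a", {"doc": 1}),
    ("agent_b", {"doc": 1, "normalized": True}),
    ("agent_c", {"doc": 1, "normalized": True, "enriched": True}),
]

first_attempt = lineage("root-key", hops)
retry = lineage("root-key", hops)
other_root = lineage("another-root", hops)

print(first_attempt == retry)                # True
print(first_attempt[1:] == other_root[1:])   # False
```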
The Performance Impact (Real Numbers)
People worry about overhead. Let me share our production metrics:
- Redis check: ~0.3ms per task
- Postgres check (only on cache miss, ~2% of requests): ~3ms
- Postgres insert (first time only): ~5ms
- Total overhead per task: ~0.4ms average
For a pipeline processing 1,000 tasks per second, that's roughly 400ms of cumulative overhead per second of traffic, spread across workers. Worth every microsecond.
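One way to read these numbers (an assumption, since only the rounded figures are listed): the average is the Redis check plus the Postgres check weighted by the cache-miss rate. The one-time insert is left out because it is paid once per new key rather than on every check.

```python
# Figures taken from the list above.
redis_check_ms = 0.3
pg_check_ms = 3.0
pg_miss_rate = 0.02  # ~2% of requests fall through to Postgres

# Assumed formula: every task pays the Redis check, a fraction pay the Postgres check.
avg_overhead_ms = redis_check_ms + pg_miss_rate * pg_check_ms
print(round(avg_overhead_ms, 2))  # 0.36

# Cumulative overhead for one second of traffic at 1,000 tasks per second.
tasks_per_second = 1_000
cumulative_ms = avg_overhead_ms * tasks_per_second
print(round(cumulative_ms))  # 360
```

That lands at about 0.36ms per task, which rounds to the ~0.4ms quoted above.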
We run this in production across 10 agent types handling 50,000+ tasks daily. Zero duplicates since deployment.
A Note on Our Vietnam Engineering Team
We built this idempotency layer with our team based in Can Tho, Vietnam. Honestly, the clarity they brought to the design was exceptional. The senior engineer who led this recognized the duplicate problem after two days of observing our staging environment — something our on-site team in the US had normalized as "occasional data drift."
That's the kind of engineering depth you get when you're not just looking for cheap rates. You're looking for people who have debugged distributed systems before. Our team at ECOA AI leverages the ACP orchestration platform to build these patterns rapidly. A junior developer on the platform handles the Redis integration. A middle developer architects the Postgres schema and two-phase check. A senior developer oversees the distributed tracing and edge cases.
At $1,000/month for junior, $2,000/month for middle, and $3,000/month for senior developers, the idempotency layer cost us roughly $4,000 in engineering time. A US-based agency quoted $18,000 for the same scope.
The Hard Truth
Idempotency isn't glamorous. It's plumbing. But in multi-agent systems, bad plumbing floods the basement with duplicate data.
Don't assume your orchestrator handles this. Test it. Intentionally trigger retries and measure duplication. I promise you'll find double processing.
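A toy harness for that kind of test, as a sketch rather than our actual test suite. Every name here is hypothetical. The agent writes to a list standing in for the database, then sometimes raises after the write has landed, which simulates a lost acknowledgment. The orchestrator retries blindly, and the in-process `processed` set stands in for the idempotency layer.

```python
import random


class LostAck(Exception):
    """The write succeeded but the caller never heard back."""


def make_flaky_agent(db, rng, failure_rate, processed=None):
    def agent(task_id):
        # Idempotency guard, active only when a `processed` set is supplied.
        if processed is not None:
            if task_id in processed:
                return
            processed.add(task_id)
        db.append(task_id)  # the side effect we want exactly once
        if rng.random() < failure_rate:
            raise LostAck(task_id)  # timeout after the write already landed

    return agent


def run_with_retries(agent, task_id, max_attempts=10):
    for _ in range(max_attempts):
        try:
            agent(task_id)
            return
        except LostAck:
            continue  # the orchestrator retries without knowing the write landed


def duplicate_rate(guarded, n_tasks=1000, failure_rate=0.2, seed=7):
    rng = random.Random(seed)
    db = []
    agent = make_flaky_agent(db, rng, failure_rate, set() if guarded else None)
    for task_id in range(n_tasks):
        run_with_retries(agent, task_id)
    # Fraction of stored rows that are repeats of an earlier row.
    return (len(db) - len(set(db))) / len(db)


print(duplicate_rate(guarded=False) > 0)  # True
print(duplicate_rate(guarded=True))       # 0.0
```

Against a real pipeline, the same idea applies with fault injection at the network or database layer and a `COUNT(*)` versus `COUNT(DISTINCT ...)` comparison on the output table.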
Build the idempotency layer before you need it. Your future self — and your production database — will thank you.
---
Frequently Asked Questions
What's the difference between idempotency and deduplication in multi-agent systems?
Idempotency guarantees that processing the same input multiple times has the same effect as processing it once. Deduplication removes duplicate entries after the fact. Idempotency is preventive: you stop duplicates before they happen. Deduplication is reactive: you clean them up afterward. In multi-agent systems, always prefer idempotency. Deduplication is harder when data has already triggered downstream side effects.
Should I use Redis alone for idempotency, or is PostgreSQL required?
Redis alone is fine for non-critical workflows where data loss on restart is acceptable, such as logging or analytics aggregation. For critical workflows (payments, document generation, database writes), you must back Redis with PostgreSQL or another durable store. Redis is a cache, not a source of truth. Treat it as a cache.
How do I handle idempotency keys across different versions of the same agent?
Include a version number in the namespace or hash payload. If Agent A v2 produces different output than Agent A v1 for the same input, they should have different idempotency keys. We use `namespace:agent_name:v2:payload_hash`. This lets you deploy new agent versions without conflicting with old execution traces.
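A minimal sketch of that key format. The function name and the example values are illustrative. Note that a key in this shape is longer than 64 characters, so it would not fit the `VARCHAR(64)` `key_hash` column from the schema above as-is; hashing the full string again, or widening the column, is one way around that.

```python
import hashlib
import json


def versioned_key(namespace: str, agent_name: str, version: str, payload: dict) -> str:
    """Build namespace:agent_name:version:payload_hash for one agent version."""
    payload_hash = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return f"{namespace}:{agent_name}:{version}:{payload_hash}"


# Hypothetical example: the same input handled by two versions of one agent.
payload = {"invoice_id": 42}
key_v1 = versioned_key("finance", "enrichment", "v1", payload)
key_v2 = versioned_key("finance", "enrichment", "v2", payload)

print(key_v1 == key_v2)                            # False
print(key_v2.startswith("finance:enrichment:v2:"))  # True
```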
What happens if the idempotency key expires before the agent finishes processing?
Set your TTL to 2x the maximum expected processing time for the longest-running agent. If an agent can take 5 minutes, set TTL to 10 minutes. If the key still expires, your release mechanism (the background job that resets stuck "processing" tasks) handles it. Monitor expired keys as an alert — they indicate agents that are running too slow or crashing.