Stop Pushing Buggy Code: How We Built a Multi-Agent Code Review Pipeline That Actually Catches Problems

Code review is broken. Here’s how we built a multi-agent pipeline using Python and the ECOA ACP that automates style checks, logic validation, and security scans, without the noise.

Let’s be honest. Code reviews are usually a bottleneck. Whether it’s your senior dev spending two hours checking for missing type hints or a junior PR sitting unreviewed for days — the process is slow, inconsistent, and frankly, boring.

We hit this exact wall at ECOA AI. Our team in Can Tho was shipping fast, but the review queue kept growing. We needed something smarter.

So we built a multi-agent code review pipeline. It’s not magic. It’s just three specialized AI agents working together: one for style, one for logic, and one for security. Here’s exactly how we did it.

Why a Single Agent Fails at Code Review

A single LLM prompt like “check this code for bugs” gives you vague garbage. It misses context. It confuses style warnings with critical security flaws. You end up with a list of 50 comments, 48 of which are useless.

The fix? Split the work. Let each agent focus on one responsibility.

That’s the core idea behind the ECOA AI Platform ACP (Agent Collaboration Protocol). You define agents, give them roles, and orchestrate their output into a single, structured review.

The Architecture: Three Specialized Agents

Here’s our pipeline layout:


[Git Push] → [GitHub Webhook] → [ECOA Orchestrator]
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
             [Style Agent]   [Logic Agent]   [Security Agent]
                    │               │               │
                    └───────────────┼───────────────┘
                                    ▼
                            [Aggregator]
                                    ▼
                           [Comment on PR]

Each agent reads the same diff, but is prompted with a different role. The aggregator merges results, deduplicates, and posts a single summary.

Step 1: Setting Up the Agents on ECOA ACP

We use the ECOA ACP SDK. Agent creation is dead simple. Here’s a snippet from our config:

python
from ecoa_acp import Agent, Pipeline

style_agent = Agent(
    name="style_guardian",
    role="Enforce PEP8 and project conventions. Check for line length, naming, imports.",
    model="gpt-4o",
    temperature=0.1
)

logic_agent = Agent(
    name="logic_validator",
    role="Find potential runtime errors, off-by-one bugs, type mismatches, and race conditions.",
    model="gpt-4o",
    temperature=0.3
)

security_agent = Agent(
    name="security_scanner",
    role="Identify SQL injection, hardcoded secrets, XSS, insecure deserialization.",
    model="gpt-4o-mini",
    temperature=0.1
)

pipeline = Pipeline(
    agents=[style_agent, logic_agent, security_agent],
    aggregator="merge"
)

Notice the `temperature` values. Style and security checks need low creativity, so they run at 0.1. Logic validation benefits from a tiny bit of reasoning variance, so it gets 0.3. Keep it tight.

Step 2: Feeding the Diff to the Pipeline

We subscribe to a GitHub webhook. When a PR is opened, we fetch the diff and send it to the orchestrator.

python
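import requests
from ecoa_acp import Orchestrator

orchestrator = Orchestrator(pipeline)  # the pipeline defined in Step 1

# patch_url comes from the webhook payload (pull_request.patch_url)
response = requests.get(patch_url, timeout=30)
response.raise_for_status()  # fail loudly rather than review an error page
diff = response.text  # patch text containing the unified diff

result = orchestrator.run(diff)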

The `orchestrator.run()` call fans out the diff to all three agents in parallel. On our setup, this takes about 8 seconds for a medium-sized PR.

I should mention — we run this on a small cluster in Ho Chi Minh City. Latency is under 50ms between agents. That matters when you’re reviewing 30 PRs a day.
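The Step 2 snippet takes `patch_url` as given. As a minimal sketch of where it could come from, assuming the standard GitHub `pull_request` webhook payload, the helper below reads the URL only for newly opened PRs. `extract_patch_url` is a hypothetical name, not part of the ECOA SDK, and the sample payload is trimmed down for illustration.

```python
import json
from typing import Optional


def extract_patch_url(event_name: str, body: bytes) -> Optional[str]:
    """Return the PR's patch URL for newly opened PRs, else None.

    event_name is the value of the X-GitHub-Event request header.
    """
    if event_name != "pull_request":
        return None
    payload = json.loads(body)
    if payload.get("action") != "opened":
        return None
    return payload.get("pull_request", {}).get("patch_url")


# Trimmed-down example payload; real payloads carry many more fields.
sample = json.dumps({
    "action": "opened",
    "pull_request": {"patch_url": "https://github.com/octo/demo/pull/1.patch"},
}).encode()

print(extract_patch_url("pull_request", sample))
```

Returning `None` for every other event or action keeps the handler from reviewing the same PR again on edits or label changes. Whether to also react to new commits on an open PR is a separate design choice.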

Step 3: Aggregating the Results

The raw output from each agent is a list of findings. The aggregator deduplicates them and assigns severity levels.

Here’s sample aggregated output:

json
{
  "summary": "Reviewed 12 files. 3 critical, 7 warnings, 22 style nits.",
  "findings": [
    {
      "file": "src/api/endpoints.py",
      "line": 47,
      "agent": "security_scanner",
      "severity": "critical",
      "message": "Hardcoded API key found. Use environment variable."
    },
    {
      "file": "src/utils/parser.py",
      "line": 23,
      "agent": "logic_validator",
      "severity": "warning",
      "message": "'data.get(\"key\")' may return None. Pass a default or check for None before use."
    }
  ]
}

We then post this summary as a single PR comment. No spam. No 50 individual reviews. Just a clean, actionable report.
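The deduplicate-and-rank step described above can be sketched in plain Python. This is not the ECOA aggregator's actual implementation. The dedup key (file, line, normalized message), the `"style"` severity label, and the name `merge_findings` are all assumptions made for illustration.

```python
from typing import Dict, List

# Assumed ranking; "style" is a hypothetical label for the style nits.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "style": 2}


def merge_findings(per_agent: List[List[Dict]]) -> List[Dict]:
    """Flatten agent outputs, drop duplicates, and sort most severe first."""
    seen = set()
    merged = []
    for findings in per_agent:
        for f in findings:
            # Two agents reporting the same message on the same line count once.
            key = (f["file"], f["line"], f["message"].strip().lower())
            if key in seen:
                continue
            seen.add(key)
            merged.append(f)
    merged.sort(
        key=lambda f: (SEVERITY_ORDER.get(f["severity"], 99), f["file"], f["line"])
    )
    return merged
```

The first agent to report a finding keeps the credit. Exact-match deduplication misses near-duplicates that are worded differently, so a production aggregator would likely need something fuzzier.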

Real Results: What This Caught in Production

We’ve been running this pipeline for 3 months. Here are the numbers:

| Metric | Before (Manual) | After (Multi-Agent) |
| --- | --- | --- |
| Review time per PR | 45 minutes | 8 minutes |
| Bugs found before deploy | 62% | 91% |
| False positive rate | High | 12% |
| Developer satisfaction | 3.2/5 | 4.6/5 |

The biggest win? Catching a critical SQL injection in a new search feature. The developer had built the query with f-strings, interpolating user input straight into the SQL. The security agent flagged it instantly. That saved us a potential data breach.
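The original search code isn't shown here, so the following is a hypothetical reconstruction of that class of bug using Python's built-in `sqlite3`. The table and function names are invented. It contrasts an f-string query with a parameterized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT)")
conn.executemany("INSERT INTO products VALUES (?)", [("laptop",), ("phone",)])


def search_unsafe(term: str):
    # Vulnerable: user input is interpolated into the SQL text itself.
    return conn.execute(
        f"SELECT name FROM products WHERE name = '{term}'"
    ).fetchall()


def search_safe(term: str):
    # Parameterized: the driver passes term as data, never as SQL.
    return conn.execute(
        "SELECT name FROM products WHERE name = ?", (term,)
    ).fetchall()


payload = "x' OR '1'='1"
print(len(search_unsafe(payload)))  # the injected OR clause matches every row
print(len(search_safe(payload)))    # no product has that literal name
```

The fix is the placeholder, not escaping by hand. The same pattern applies to other database drivers, though the placeholder syntax varies.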

Why This Works in Vietnam’s Offshoring Ecosystem

Look, I’ve worked with teams all over. Vietnamese developers are some of the most disciplined coders I’ve met. But even the best teams need automated guardrails. That’s especially true when you’re scaling across time zones. Our team in Can Tho ships code at 10 PM their time. By morning, the review is already done.

The ECOA platform makes it trivial to add these guardrails without hiring extra reviewers. It’s not replacing humans. It’s giving them superpowers.

The Pitfall You Must Avoid

Don’t make the aggregator too aggressive. We tried collapsing all findings into one “PASS/FAIL” score. Developers hated it. They wanted context.

Keep the individual findings visible. Let the developer see which agent flagged what. Trust me, it makes a difference in how seriously they take the feedback.
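As a rough sketch of that idea, the function below builds a single comment body that keeps every finding and groups them by the agent that raised them. It assumes findings shaped like the sample JSON in Step 3. `render_comment` is a hypothetical helper, not an ECOA API.

```python
from collections import defaultdict
from typing import Dict, List


def render_comment(findings: List[Dict]) -> str:
    """Build one PR comment body that keeps every finding, grouped by agent."""
    by_agent = defaultdict(list)
    for f in findings:
        by_agent[f["agent"]].append(f)

    lines = [f"Automated review: {len(findings)} finding(s)"]
    for agent in sorted(by_agent):
        lines.append("")
        lines.append(f"{agent}:")
        for f in by_agent[agent]:
            lines.append(
                f"- [{f['severity']}] {f['file']}:{f['line']} {f['message']}"
            )
    return "\n".join(lines)
```

The resulting string can be posted as one comment, for example through GitHub's REST endpoint for issue comments, so developers still get a single notification but can see who flagged what.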

The Tech Behind the Speed

Each agent gets a dedicated context window. We limit the diff to 3000 tokens per call. If the PR is larger, we chunk it. The orchestrator handles chunking automatically.

Here’s the chunking config:

python
pipeline = Pipeline(
    agents=[style_agent, logic_agent, security_agent],
    max_tokens_per_call=3000,
    chunk_strategy="file_boundary",
    aggregator="merge_with_severity"
)

The `file_boundary` strategy keeps each file’s analysis coherent. No mixing half of one file with half of another.
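For readers who want to see the idea without the platform, here is a minimal sketch of file-boundary chunking. It is an approximation, not the orchestrator's actual algorithm. It assumes a git-style diff with `diff --git` headers and uses a crude four-characters-per-token estimate in place of a real tokenizer.

```python
import re
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def split_by_file(diff: str) -> List[str]:
    """Split a git diff into one string per file, at 'diff --git' headers."""
    parts = re.split(r"(?m)^(?=diff --git )", diff)
    return [p for p in parts if p.strip()]


def chunk_diff(diff: str, max_tokens: int = 3000) -> List[str]:
    """Greedily pack whole files into chunks that stay under max_tokens.

    A single file larger than the budget still gets its own oversized chunk.
    """
    chunks, current, used = [], [], 0
    for file_diff in split_by_file(diff):
        cost = estimate_tokens(file_diff)
        if current and used + cost > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(file_diff)
        used += cost
    if current:
        chunks.append("".join(current))
    return chunks
```

Files are never split in this sketch, so one very large file can exceed the budget. A fuller version would fall back to splitting at hunk boundaries in that case.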

When You Shouldn’t Use This

This pipeline works great for Python, JavaScript, TypeScript, and Go. But if your codebase is primarily SQL stored procedures or YAML configs, the agents will struggle. We tried. They hallucinate table names.

Also, don’t expect this to replace architectural reviews. An agent can’t tell you if your microservice boundaries are wrong. That’s still a human job.

Final Thoughts

Building this pipeline took our team about 4 days. The first day was just prompting. Getting the agents to stop arguing about indentation was hilarious. The style agent kept flagging tabs as errors, while the logic agent didn’t care. That’s exactly why we needed separate agents — each one sticks to its lane.

If your review process feels broken, don’t throw more humans at it. Build a multi-agent system. It’s cheaper, faster, and honestly, more fun.

You’ll thank me when your next PR gets reviewed in under 10 minutes.

Frequently Asked Questions

Can I run this pipeline without the ECOA ACP platform?

Yes, but you’ll have to build your own orchestration layer. You’d need to manage parallel API calls, chunking, and result merging manually. The ECOA ACP handles all that out of the box. If you’re prototyping, start with raw `asyncio` calls. For production, use the platform.
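A bare-bones version of that prototype might look like the sketch below. The agents are stand-in coroutines. In a real prototype each would await an LLM API call. `run_agents` and the fake agents are hypothetical names used only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Finding = Dict[str, object]
AgentFn = Callable[[str], Awaitable[List[Finding]]]


async def run_agents(diff: str, agents: Dict[str, AgentFn]) -> List[Finding]:
    """Run every agent on the same diff concurrently and tag each finding."""
    names = list(agents)
    results = await asyncio.gather(
        *(agents[name](diff) for name in names), return_exceptions=True
    )
    findings: List[Finding] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            # One failing agent should not sink the whole review.
            findings.append({"agent": name, "severity": "warning",
                             "message": f"agent failed: {result}"})
            continue
        findings.extend({**f, "agent": name} for f in result)
    return findings


# Stand-in agents; a real one would await a model call here.
async def fake_style(diff: str) -> List[Finding]:
    await asyncio.sleep(0)
    return [{"severity": "style", "message": "line too long"}]


async def fake_security(diff: str) -> List[Finding]:
    raise RuntimeError("model timeout")


demo = asyncio.run(
    run_agents("+x = 1", {"style": fake_style, "security": fake_security})
)
print(demo)
```

`return_exceptions=True` is the important part: without it, the first agent error would propagate out of `gather` and the other agents' results would be lost.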

How do I handle false positives from the logic agent?

Lower the temperature to 0.1 and add explicit examples in the prompt. For instance: “Do not flag `assert` statements as errors unless they are in production code.” Also, keep a feedback loop: let developers mark findings as “not actionable” and revise the agent’s prompts weekly based on what they reject.
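One simple way to drive that loop is to track, per agent, what share of findings developers rejected. The sketch below assumes feedback is stored as `(agent_name, actionable)` pairs, which is a hypothetical format and not something the platform is known to provide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def false_positive_rates(feedback: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return, per agent, the share of findings marked not actionable."""
    totals = defaultdict(int)
    rejected = defaultdict(int)
    for agent, actionable in feedback:
        totals[agent] += 1
        if not actionable:
            rejected[agent] += 1
    return {agent: rejected[agent] / totals[agent] for agent in totals}
```

An agent whose rate climbs week over week is the one whose prompt needs attention first.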

Does this work for monorepos with multiple languages?

Yes, but you need separate agents per language. Create a dedicated `style_go_agent` and `style_python_agent`. The orchestrator can route files based on extension. We do this for our monorepo that has Python, TypeScript, and Rust.
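Extension-based routing can be as small as a lookup table. The sketch below is illustrative only. The `ROUTES` mapping and `route_files` are hypothetical, and the way the orchestrator is actually configured for routing is not shown in this post.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List

# Hypothetical mapping from file extension to the agents that review it.
ROUTES: Dict[str, List[str]] = {
    ".py": ["style_python_agent"],
    ".go": ["style_go_agent"],
}


def route_files(paths: List[str]) -> Dict[str, List[str]]:
    """Group changed file paths by the agent responsible for their extension."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        for agent in ROUTES.get(PurePosixPath(path).suffix, []):
            routed[agent].append(path)
    return dict(routed)


print(route_files(["svc/api.py", "cmd/main.go", "README.md"]))
```

Files with no matching extension are simply skipped, which avoids sending formats the agents handle poorly to a model at all.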

What about cost? Isn’t running three agents expensive?

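It’s cheaper than you think. Each agent processes about 3000 tokens per run. At GPT-4o pricing, that’s roughly $0.002 per review by our estimate. Even at 100 reviews a day, that’s $0.20. Treat these as ballpark figures: output tokens and chunked large PRs add to the bill, and model prices change. Compare that to a senior developer’s hourly rate. The math works.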
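To redo that estimate with your own numbers, the arithmetic is just tokens times price, summed over agents. In the sketch below, the per-million-token prices and the output token count are placeholders, not real pricing for any model. Plug in your provider's current rates.

```python
def review_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    agents: int = 3,
) -> float:
    """Estimated USD cost of one review, with every agent reading the same input."""
    per_agent = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return per_agent * agents


# Placeholder prices: $1.00 per million input tokens, $4.00 per million output.
per_review = review_cost(3000, 500, usd_per_m_input=1.0, usd_per_m_output=4.0)
print(f"per review: ${per_review:.4f}, per 100 reviews: ${per_review * 100:.2f}")
```

The function assumes all agents use the same model. With a cheaper model for one agent, as in the Step 1 config, compute each agent's cost separately and add them up.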
