We Built a Custom AI Coding Tools Evaluation Pipeline: Here’s the Architecture That Cut Our Production Bug Rate by 67%
AI coding tools write a lot of code. Some of it is brilliant. Some of it is a subtle disaster waiting to happen.
We’ve been using Claude Code, Cursor, and GitHub Copilot across several client projects with our Vietnamese development team in Ho Chi Minh City. And honestly? The productivity boost is real—roughly 3x on boilerplate and well-known patterns. But here’s the problem: every once in a while, the AI generates something that looks perfect but is actually wrong. A race condition hiding in plain sight. A SQL injection that bypasses your ORM. A performance regression that only shows up under load.
You can’t rely on code review alone. Humans miss AI-generated bugs at a higher rate because the code looks “natural.” We needed a systematic way to catch these before they hit production.
So we built an evaluation pipeline. This isn’t your typical CI/CD gate. It’s a dedicated harness that runs every AI-suggested code change through a gauntlet of tests, static analysis, and performance benchmarks. It then scores the change and either auto-accepts, flags for review, or blocks deployment.
Here’s the exact architecture. (And yes, it’s open-source—you can grab it from our GitHub.)
The Three Gates of Code Evaluation
The pipeline consists of three stages. Each gate produces a score. The final composite score determines the fate of the AI-generated code.
Gate 1: Static Analysis & Convention Compliance
We run every code snippet through a custom linter that enforces our project’s specific conventions. Tools like ESLint, Pylint, or `golangci-lint` are fine for generic rules, but they miss project-specific patterns.
We built a Python-based analysis module that checks for:
- Naming conventions (e.g., camelCase vs snake_case per team rules)
- Import order and grouping
- Docstring coverage (minimum 85% for new functions)
- Hardcoded secrets or credentials
- Deprecated API usage
Here’s the core logic:
```python
# evaluator/static_analysis.py
import ast

from pylint.lint import Run


class ConventionAnalyzer:
    def __init__(self, project_rules: dict):
        self.rules = project_rules

    def score(self, file_path: str) -> float:
        # 1. Run project-specific AST checks.
        #    _check_ast (not shown) returns a list of violations.
        with open(file_path, encoding="utf-8") as f:
            tree = ast.parse(f.read())
        violations = self._check_ast(tree)

        # 2. Run pylint with a custom rcfile. Recent pylint releases use
        #    exit=False; do_exit is the older, deprecated spelling.
        pylint_results = Run([file_path, "--rcfile=.pylintrc"], exit=False)
        pylint_score = pylint_results.linter.stats.global_note  # 0-10 scale

        # 3. Combine scores (70% pylint, 30% custom), clamped to [0, 1]
        max_violations = max(self.rules["max_violations"], 1)
        custom_score = 1 - len(violations) / max_violations
        final = 0.7 * (pylint_score / 10) + 0.3 * custom_score
        return max(0.0, min(final, 1.0))
```
We also run `bandit` for Python and `semgrep` for cross-language pattern matching. If the static score drops below 0.8, the change is auto-blocked and sent to a human reviewer.
Real stat: After we deployed this gate, we caught 31% of AI-generated code that contained insecure patterns like hardcoded API keys or use of `eval()`.
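The project-specific checks behind `_check_ast` are not shown here. As a rough illustration only, the sketch below covers three of the listed checks (docstring coverage, `eval()` usage, and credential-like string assignments) with the standard `ast` module. The function name `check_ast`, the `SECRET_NAME` pattern, and the message formats are assumptions made for this example, not the pipeline's actual code.

```python
import ast
import re

# Hypothetical pattern: variable names that suggest a credential
SECRET_NAME = re.compile(r"(api_?key|secret|password|token)", re.IGNORECASE)


def check_ast(source: str, min_docstring_coverage: float = 0.85) -> list[str]:
    """Return human-readable violations found in `source`."""
    tree = ast.parse(source)
    violations = []
    functions = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(node)
        # Direct calls to the eval() builtin
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Name)
              and node.func.id == "eval"):
            violations.append(f"line {node.lineno}: use of eval()")
        # String literals assigned to credential-like names
        elif (isinstance(node, ast.Assign)
              and isinstance(node.value, ast.Constant)
              and isinstance(node.value.value, str)):
            for target in node.targets:
                if isinstance(target, ast.Name) and SECRET_NAME.search(target.id):
                    violations.append(
                        f"line {node.lineno}: possible hardcoded secret in '{target.id}'"
                    )

    # Docstring coverage across all functions defined in the file
    if functions:
        documented = sum(1 for fn in functions if ast.get_docstring(fn))
        coverage = documented / len(functions)
        if coverage < min_docstring_coverage:
            violations.append(
                f"docstring coverage {coverage:.0%} is below {min_docstring_coverage:.0%}"
            )
    return violations
```

Returning a flat list of messages keeps the result easy to count for scoring and easy to paste into a review comment. Naming and import-order rules would follow the same pattern of walking the tree and appending violations.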
Gate 2: Unit & Integration Test Suite
This is where most AI coding tools fail. They generate code that passes basic syntax checks but breaks existing tests or introduces untested paths.
Our evaluation pipeline does the following:
- Clones the AI’s suggested code branch in an isolated container.
- Runs the full test suite with `pytest --junitxml=results.xml`.
- Measures test coverage delta (via `coverage.py`).
- Runs mutation testing with `mutmut` to check if tests actually detect bugs.
We set strict thresholds:
- Test pass rate: 100% — no tolerance for regressions.
- Coverage delta: Must not decrease by more than 2% overall.
- Mutation score: At least 85% of mutants must be killed.
If any of these fail, the pipeline triggers a detailed report that pinpoints the exact failing tests and coverage drops.
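```yaml
# .github/workflows/eval-pipeline.yml (simplified; triggers and dependency install omitted)
jobs:
  gate2:
    runs-on: ubuntu-latest
    steps:
      # Clone the repository with full history so the AI branch is available
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Checkout AI branch
        run: git checkout origin/ai-generated-branch
      - name: Run tests with coverage
        run: |
          pytest --cov=. --junitxml=results.xml
          coverage report --fail-under=80
      # --paths-to-mutate is the mutmut 2.x CLI flag; check your mutmut version
      - name: Mutation testing
        run: mutmut run --paths-to-mutate src/
```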
Hard truth: Initially, 23% of AI-generated pull requests failed Gate 2. The most common culprit? Missing edge cases in error handling—the AI assumed happy paths.
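The three Gate 2 thresholds can be expressed as one small check. The sketch below is illustrative: it reads the pass rate from the JUnit XML that `pytest --junitxml` writes, and takes the coverage and mutation numbers as plain arguments. The function names `pass_rate` and `gate2_failures` are hypothetical, and how you extract coverage and mutant counts from your own tooling is left as an assumption.

```python
import xml.etree.ElementTree as ET


def pass_rate(junit_xml: str) -> float:
    """Fraction of tests that did not fail or error, from JUnit XML text."""
    root = ET.fromstring(junit_xml)
    total = failed = 0
    # pytest wraps results as <testsuites><testsuite .../></testsuites>;
    # iter() also matches the root if it is itself a <testsuite>
    for suite in root.iter("testsuite"):
        total += int(suite.get("tests", 0))
        failed += int(suite.get("failures", 0)) + int(suite.get("errors", 0))
    return 1.0 if total == 0 else (total - failed) / total


def gate2_failures(rate: float, coverage_before: float, coverage_after: float,
                   mutants_killed: int, mutants_total: int) -> list[str]:
    """Return the list of violated Gate 2 thresholds (empty means pass)."""
    failures = []
    if rate < 1.0:
        failures.append(f"test pass rate {rate:.1%} is below 100%")
    drop = coverage_before - coverage_after  # in percentage points
    if drop > 2.0:
        failures.append(f"coverage dropped by {drop:.1f} points (limit: 2)")
    mutation_score = mutants_killed / mutants_total if mutants_total else 1.0
    if mutation_score < 0.85:
        failures.append(f"mutation score {mutation_score:.1%} is below 85%")
    return failures
```

Collecting every violated threshold, rather than stopping at the first, is what makes a detailed failure report possible.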
Gate 3: Performance & Load Testing
This is the sneaky one. AI coding tools often produce code that works correctly but is slower by a measurable margin. We’ve seen SQL queries without proper indexing, inefficient loops, and excessive memory allocations.
For performance evaluation, we:
- Run a baseline performance test on the current `main` branch using `locust` for HTTP services or `pytest-benchmark` for library code.
- Run the same test on the AI-generated branch.
- Compare latency percentiles (p50, p95, p99), throughput (requests/sec), and memory usage.
If any metric degrades by more than 5%, the change is flagged as “needs review” with a performance regression report.
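```python
# evaluator/performance.py
import json        # used by run_benchmarks (not shown)
import subprocess  # used by run_benchmarks (not shown)

# Lower is better for latency percentiles; higher is better for throughput.
HIGHER_IS_BETTER = {"throughput"}


def compare_performance(baseline_path: str, ai_branch_path: str) -> dict:
    # run_benchmarks (not shown) returns a dict of metric name -> value
    baseline = run_benchmarks(baseline_path)
    ai = run_benchmarks(ai_branch_path)
    results = {"flag": False}
    for metric in ["p50", "p95", "p99", "throughput"]:
        # Percent change relative to the baseline
        diff = (ai[metric] - baseline[metric]) / baseline[metric] * 100
        results[metric] = diff
        # Rising latency is a regression; so is falling throughput
        degradation = -diff if metric in HIGHER_IS_BETTER else diff
        if degradation > 5:
            results["flag"] = True
    return results
```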
Example: In one sprint, Claude Code generated a batch processing function that passed all tests but was 40% slower because it used individual `save()` calls instead of `bulk_create`. Gate 3 caught it.
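Gate 3 compares latency percentiles, so it helps to be explicit about how those are derived from raw measurements. The following is a minimal sketch using only the standard library. The function name `latency_summary` and the assumption that samples arrive as a list of millisecond values are illustrative. Tools like `locust` report their own percentiles, which may use a different interpolation method.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 from raw latency samples (needs at least 2 samples)."""
    # quantiles(n=100) returns 99 cut points; index k-1 holds the k-th percentile
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Computing the same summary for the baseline and the AI branch from samples collected under identical load keeps the comparison like-for-like.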
The Composite Score: How We Gate Deployment
Each gate produces a score between 0 and 1. We combine them with a weighted average:
- Static analysis: 20%
- Unit tests: 50%
- Performance: 30%
If the composite score >= 0.9, the AI-generated code is auto-merged (with a log entry). If 0.7–0.9, it goes to an expedited human review. Below 0.7, it’s blocked with a detailed rejection report.
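As a sketch of the weighting and thresholds just described, the snippet below combines three gate scores and maps the result to a decision. The names `WEIGHTS`, `composite_score`, and `decide`, and the decision labels, are assumptions made for this example.

```python
# Gate weights as described above (they sum to 1.0)
WEIGHTS = {"static": 0.20, "tests": 0.50, "performance": 0.30}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-gate scores, each expected in [0, 1]."""
    return sum(WEIGHTS[gate] * scores[gate] for gate in WEIGHTS)


def decide(score: float) -> str:
    """Map a composite score to a deployment decision."""
    if score >= 0.9:
        return "auto-merge"
    if score >= 0.7:
        return "expedited-review"
    return "blocked"


if __name__ == "__main__":
    example = {"static": 0.8, "tests": 0.9, "performance": 0.7}
    total = composite_score(example)
    print(f"{total:.2f} -> {decide(total)}")
```

For the example scores, the composite is 0.2 × 0.8 + 0.5 × 0.9 + 0.3 × 0.7 = 0.82, which lands in the expedited-review band. Because the test gate carries half the weight, a change with a weak test score cannot reach auto-merge on static and performance scores alone.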
Over 6 months, our production bug rate dropped 67%. Before the pipeline, we had an average of 3.2 production incidents per month attributed to AI-generated code. After, it’s 1.1.
How a Vietnamese AI-Augmented Team Built This
We deployed this pipeline for a US-based SaaS client that relies heavily on AI coding tools. Our team in Ho Chi Minh City, six senior engineers, built the entire pipeline in 4 weeks. We used the ECOA AI Platform ACP to orchestrate the agents that generate the code changes. Then we wrapped the evaluation pipeline around it.
The team’s strength wasn’t just writing Python. They understood the *why* behind each gate. For example, they insisted on mutation testing after seeing that standard code coverage can be gamed.
“Code coverage only tells you that lines were executed. It doesn’t tell you if the tests actually check the right behavior.” — Our senior test engineer in Can Tho.
That’s the level of thinking you get when you hire senior Vietnamese engineers. They don’t just implement specs. They challenge assumptions.
Should You Build One?
If your team regularly uses AI coding tools, yes. The upfront investment is 4–6 weeks of development, but it pays for itself in reduced debugging time and fewer production fires.
You can start with a simpler two-gate version (static + unit tests) and add performance later. But don’t skip mutation testing. That’s where the real value lies.
Key takeaway: AI coding tools are amazing. But they’re also junior developers who never sleep. Treat them that way. Put them through a rigorous code review pipeline.
Want the full source code? We’ve open-sourced the evaluation harness on our GitHub. And if you want a team that can build this for you in weeks, not months—well, you know where to find us.
Frequently Asked Questions
Why did you choose mutation testing for AI code evaluation?
Mutation testing checks if your tests actually detect bugs. AI-generated code often passes standard tests because the tests only cover happy paths. Mutation testing introduces small changes (mutants) and verifies that at least one test fails. If the mutation survives, you’ve found a test gap. This is especially important for AI code, which tends to follow patterns that tests are designed around.
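A toy example makes the idea concrete. The functions below are invented for illustration and do not use `mutmut` itself; they only show, by hand, what a surviving mutant looks like and how a boundary test kills it.

```python
def is_adult(age: int) -> bool:
    """Original implementation."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Mutant: '>=' replaced with '>'."""
    return age > 18


def weak_suite(fn) -> bool:
    """Happy-path checks only; True means every check passed."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    """Same checks plus the boundary case age == 18."""
    return weak_suite(fn) and fn(18) is True


def mutant_killed(suite, mutant) -> bool:
    """A mutant is killed when at least one check in the suite fails on it."""
    return not suite(mutant)


if __name__ == "__main__":
    print("weak suite kills mutant:  ", mutant_killed(weak_suite, is_adult_mutant))
    print("strong suite kills mutant:", mutant_killed(strong_suite, is_adult_mutant))
```

Both suites pass on the original function, and both would report full line coverage, yet only the one with the boundary case detects the changed operator. That gap is what the mutation score measures.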
Can this pipeline work with any AI coding tool?
Yes. The pipeline doesn’t care which tool generated the code—it only looks at the diff. We’ve tested it with Claude Code, Cursor, GitHub Copilot, and raw GPT-4o completions. As long as you can produce a git branch with the suggested changes, the pipeline works.
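Since only the diff matters, the first step for any tool is listing what the suggested branch changed. The sketch below is one way to do that with `git diff --name-only`. The function names and the `.py` filter are assumptions for this example, and it expects to run inside a git checkout that contains both refs.

```python
import subprocess


def parse_changed_files(diff_output: str, suffix: str = ".py") -> list[str]:
    """Keep only paths with the given suffix from `git diff --name-only` output."""
    return [
        line.strip()
        for line in diff_output.splitlines()
        if line.strip().endswith(suffix)
    ]


def changed_files(base: str, head: str) -> list[str]:
    """List Python files changed on `head` since it diverged from `base`."""
    # The three-dot form diffs `head` against its merge base with `base`
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...{head}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_changed_files(out)
```

Calling `changed_files("main", "ai-generated-branch")` would then give the files to feed into the static analysis gate, whichever tool produced the branch.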
What threshold do you recommend for the performance gate?
We use a 5% degradation cutoff based on p95 latency. That’s aggressive but necessary for our real-time applications. For background jobs or batch processes, you might set 10–15%. The key is to measure the metric that matters to *your* users, not just a synthetic benchmark.
How much overhead does this pipeline add to CI/CD?
About 3–5 minutes per evaluation run for a typical Python Django project (300 models, 80 views). That’s minimal compared to the time saved by catching bugs early. The pipeline runs in parallel with other CI jobs, so it doesn’t block the main build. We use caching for virtual environments and test fixtures to speed things up.