AI Coding Tools Wrote 70% of My Last Feature — Here’s the Audit Trail I Built to Catch the 30% That Would Have Broken Production
It started like any other Tuesday. I needed to ship a real-time inventory sync module for a logistics client. The spec was tight: poll a third-party warehouse API every 30 seconds, reconcile stock levels across 12 regional databases, and fire webhooks on any delta exceeding 0.5%.
I fired up Claude Code, typed in the high-level requirements, and watched the agent crank out ~1,100 lines of TypeScript in under 40 minutes. Beautiful, right?
It wasn’t. Not even close.
The code compiled. The tests passed. But when I ran it against our production shadow traffic, things started falling apart. A missing `await` here. An off-by-one offset in the pagination loop there. A silent integer overflow when stock quantities hit six digits.
I’d just witnessed the dirty secret of AI coding tools in one afternoon: *they’re brilliant at writing code that looks correct, but dangerously bad at writing code that is correct*. So I did what any paranoid senior engineer would do. I built an audit trail.
Why Trusting AI-Generated Code Blindly Is a Firing Offense
Let’s be honest. AI coding tools don’t *understand* your production environment. They’ve seen a billion GitHub repos, but they’ve never seen your specific race condition between Redis cache invalidation and a Postgres trigger.
Here’s what my AI agent produced that a standard linter missed entirely:
| Issue | How It Manifested | Detection Method |
|---|---|---|
| Missing `await` in error handler | Silent catch, swallowed exception | Async boundary checker |
| Off-by-one in pagination loop | Skipped last record on page 47 | Invariant assertion |
| Integer overflow on stock > 99,999 | Wrapped to negative | Type range validator |
| Stale closure in interval callback | Sent outdated stock data | Time-stamped diff logger |
Each one would have hit production. Each one could have cost the client thousands in incorrect inventory counts.
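To make the second row concrete, here is a minimal sketch of an invariant assertion wrapped around a pagination loop. The `Page` shape, the `fetchPage` signature, and the in-memory stand-in for the API are all hypothetical; the real warehouse API is not shown in this article and would be asynchronous.

```typescript
// Hypothetical page shape: a slice of items plus the total count the API reports.
interface Page<T> {
  items: T[];
  total: number;
}

type FetchPage<T> = (offset: number, limit: number) => Page<T>;

// Collects every record, then asserts the invariant
// "number of records fetched === total reported by the API".
function fetchAll<T>(fetchPage: FetchPage<T>, limit: number): T[] {
  const all: T[] = [];
  let total = Infinity; // unknown until the first page arrives
  for (let offset = 0; offset < total; offset += limit) {
    const page = fetchPage(offset, limit);
    total = page.total;
    all.push(...page.items);
  }
  if (all.length !== total) {
    throw new Error(`Pagination invariant violated: got ${all.length}, expected ${total}`);
  }
  return all;
}

// In-memory stand-in for the API, used only to exercise the loop.
const sampleRecords = Array.from({ length: 47 }, (_, i) => i);
const fakeFetch: FetchPage<number> = (offset, limit) => ({
  items: sampleRecords.slice(offset, offset + limit),
  total: sampleRecords.length,
});

const fetchedRecords = fetchAll(fakeFetch, 10);
```

The assertion does not say where the loop went wrong, only that a record went missing, which is enough to fail the gate and prompt a closer look.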
The Audit Trail Architecture I Built
I needed a system that didn’t just review the output — it tracked every decision the AI made, compared it against known invariants, and flagged deviations in real-time. Think of it as a *code provenance pipeline*.
Here’s the stack I landed on:
- ECOA AI Platform ACP for the core agent orchestration
- Custom invariant rules written as TypeScript decorators
- Async boundary scanner using Node.js `async_hooks`
- Range validator for all numeric types with production max/min values
- Shadow execution against a cloned traffic stream before any deploy
The Core Loop
Every time the AI agent emitted a code block, the audit trail ran it through 5 gates before it ever touched a repo:
AI Generated Code
→ Gate 1: Syntax + Type Check (ESLint + TypeScript strict)
→ Gate 2: Async Boundary Scan (missing awaits, unhandled rejections)
→ Gate 3: Invariant Assertions (business rules, range checks)
→ Gate 4: Shadow Execution (run against mirrored production traffic)
→ Gate 5: Diff Log (every token change recorded with agent decision ID)
Only if all 5 gates passed did the code merge into a staging branch.
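A minimal sketch of how such a gate sequence could be wired up is below. The `Gate` contract, the stop-at-first-failure behavior, and the three toy gates are assumptions for illustration; they are not the actual pipeline code.

```typescript
// Hypothetical gate contract: each gate inspects the generated source
// and returns a list of findings (empty means the gate passed).
type Gate = { name: string; check: (source: string) => string[] };

interface GateResult {
  gate: string;
  passed: boolean;
  findings: string[];
}

// Runs gates in order and stops at the first failure, so later (more expensive)
// gates such as shadow execution never run on code that has already failed.
function runGates(source: string, gates: Gate[]): GateResult[] {
  const results: GateResult[] = [];
  for (const gate of gates) {
    const findings = gate.check(source);
    results.push({ gate: gate.name, passed: findings.length === 0, findings });
    if (findings.length > 0) break;
  }
  return results;
}

// Code may merge only if every gate ran and every gate passed.
const canMerge = (results: GateResult[], gateCount: number): boolean =>
  results.length === gateCount && results.every(r => r.passed);

// Toy gates standing in for the real checks.
const toyGates: Gate[] = [
  { name: "syntax", check: src => (src.trim() === "" ? ["empty source"] : []) },
  { name: "async-boundary", check: src => (/\.catch\(\s*\)/.test(src) ? ["empty catch"] : []) },
  {
    name: "invariants",
    check: src => (src.includes("retryDelayMs = 500") ? ["retryDelayMs below 1000"] : []),
  },
];

const passingRun = runGates("const retryDelayMs = 2000;", toyGates);
const failingRun = runGates("const retryDelayMs = 500;", toyGates);
```

Ordering the gates from cheapest to most expensive keeps feedback fast: most rejected code never reaches the shadow environment.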
Gate 2 Saved My Ass: The Async Boundary Scanner
The most dangerous bug the AI tool created was a missing `await` inside a `.catch()` error handler. The code looked clean in isolation, but when an upstream API call failed, the error handler swallowed the rejection silently.
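Here’s the scanner I wrote on top of `AsyncLocalStorage` from Node.js `async_hooks`. It tracks promises created through the global `Promise` constructor while the agent’s code runs and reports any that never had an `await`, `.then()`, or `.catch()` attached.

```typescript
import { AsyncLocalStorage } from 'async_hooks';

type Tracked = Promise<unknown> & { _debugLabel?: string };
const asyncStorage = new AsyncLocalStorage<Set<Tracked>>();
// Flag set whenever a handler is attached; await, .catch and .finally all go through .then
const handled = new WeakSet<object>();
const wasHandled = (p: Tracked): boolean => handled.has(p);

export function trackAsyncBoundaries(): void {
  const activePromises = new Set<Tracked>();
  asyncStorage.enterWith(activePromises);
  const OriginalPromise = Promise;
  // Monkey-patch to track promises created with `new Promise` during AI code execution.
  // This is intentionally fragile: you want it to break loudly.
  // Caveat: async functions return native promises and bypass the patch.
  class TrackedPromise<T> extends OriginalPromise<T> {
    // Keep .then/.catch results native so that only root promises are tracked
    static get [Symbol.species]() {
      return OriginalPromise;
    }
    constructor(
      executor: (resolve: (value: T | PromiseLike<T>) => void, reject: (reason?: unknown) => void) => void,
    ) {
      super(executor);
      // No this.finally() cleanup here: it would count as a handler for every promise
      activePromises.add(this);
    }
    then<A = T, B = never>(
      onFulfilled?: ((value: T) => A | PromiseLike<A>) | null,
      onRejected?: ((reason: any) => B | PromiseLike<B>) | null,
    ): Promise<A | B> {
      handled.add(this);
      return super.then(onFulfilled, onRejected);
    }
  }
  (globalThis as { Promise: PromiseConstructor }).Promise = TrackedPromise as unknown as PromiseConstructor;
}

export function verifyAsyncBoundaries(): string[] {
  const active = asyncStorage.getStore();
  if (!active) return [];
  return Array.from(active)
    .filter(p => !wasHandled(p)) // never awaited, chained, or caught
    .map(p => `Unhandled promise: ${p._debugLabel || 'anonymous'}`);
}
```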
This scanner caught the missing `await` bug on the first run. Without it, that code hits production, the error handler returns `undefined` instead of a fallback inventory value, and the client thinks they have zero stock on 10,000 items.
That’s a P0 incident waiting to happen.
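For readers who have not hit this bug class before, here is a minimal sketch of the pattern. The function names and the fallback value are hypothetical; the actual generated code is not reproduced here.

```typescript
// Hypothetical upstream call that always fails, to trigger the error path.
async function fetchWarehouseStock(): Promise<number> {
  throw new Error("upstream API failed");
}

// Hypothetical fallback lookup.
async function loadCachedStock(): Promise<number> {
  return 42;
}

// Buggy: the handler starts the fallback lookup but neither awaits nor returns it,
// so the chain resolves to undefined and the fallback value is lost.
function getStockBuggy(): Promise<number | void> {
  return fetchWarehouseStock().catch(() => {
    loadCachedStock();
  });
}

// Fixed: returning the fallback promise makes the chain resolve to its value.
function getStockFixed(): Promise<number> {
  return fetchWarehouseStock().catch(() => loadCachedStock());
}
```

Both versions compile and neither throws, which is why this kind of defect tends to surface only when the upstream call actually fails.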
Gate 3: Invariant Assertions That Don’t Lie
AI coding tools have no concept of your business domain’s numeric limits. When the agent generated a `stockLevel` variable typed as `number`, it didn’t know that the warehouse management system uses a 32-bit signed integer internally.
So I wrote a rule engine that reads from a YAML config of invariants:
```yaml
invariants:
  - path: "stockLevel"
    type: "integer"
    min: 0
    max: 99999
    onViolation: "block"
  - path: "batchSize"
    type: "integer"
    min: 1
    max: 500
    onViolation: "warn"
  - path: "retryDelayMs"
    type: "integer"
    min: 1000
    max: 60000
    description: "Must be between 1s and 60s"
```
The AI tool set the last one, `retryDelayMs`, to `500`. Too fast. That would have DDoS’d the warehouse API within 3 seconds. The invariant block caught it.
*Honestly, how many of you have seen an AI agent pick a retry delay that looked reasonable in code but was completely wrong for the actual API rate limit?* I’m guessing most of you raised your hand.
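A minimal sketch of how rules like these could be evaluated follows. It mirrors the YAML config as a TypeScript literal rather than parsing a file, and it assumes that a rule without `onViolation` defaults to `"block"`; both choices are illustrative, not a description of the actual rule engine.

```typescript
type OnViolation = "block" | "warn";

interface Invariant {
  path: string;
  type: "integer";
  min: number;
  max: number;
  onViolation?: OnViolation;
  description?: string;
}

interface Violation {
  path: string;
  value: number;
  action: OnViolation;
  message: string;
}

// Mirrors the YAML config; loading it from disk is left out of this sketch.
const invariants: Invariant[] = [
  { path: "stockLevel", type: "integer", min: 0, max: 99999, onViolation: "block" },
  { path: "batchSize", type: "integer", min: 1, max: 500, onViolation: "warn" },
  { path: "retryDelayMs", type: "integer", min: 1000, max: 60000, description: "Must be between 1s and 60s" },
];

// Checks each known path against its rule and collects every violation.
function checkInvariants(values: Record<string, number>, rules: Invariant[]): Violation[] {
  const violations: Violation[] = [];
  for (const rule of rules) {
    const value = values[rule.path];
    if (value === undefined) continue; // nothing to check for this path
    const inRange = Number.isInteger(value) && value >= rule.min && value <= rule.max;
    if (!inRange) {
      violations.push({
        path: rule.path,
        value,
        action: rule.onViolation ?? "block", // assumed default
        message: rule.description ?? `${rule.path} must be an integer in [${rule.min}, ${rule.max}]`,
      });
    }
  }
  return violations;
}

const found = checkInvariants({ stockLevel: 100000, batchSize: 100, retryDelayMs: 500 }, invariants);
```

Here `found` holds two entries, one for `stockLevel` and one for `retryDelayMs`, while the in-range `batchSize` passes.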
Gate 4: Shadow Execution Against Real Traffic
This is the big one. You can’t trust AI-generated logic until you’ve watched it behave with real data.
We cloned a 1% sample of production traffic and routed it to a shadow environment running the AI-generated code. The results were compared against the current production code’s output. Any deviation > 0.1% in computed values triggered a full diff log.
Here’s what the comparison looked like for the inventory sync:
📊 Shadow Execution Report — 2025-03-21T14:32:00Z
Records processed: 12,847
Matches: 12,422 (96.7%)
Deviations: 425
- 412 within tolerance (±0.5%)
- 13 flagged as anomalies
⛔ Blocked: 3 (integer overflow detected)
Three records had stock values that wrapped to negative. The AI tool had used a `number` subtraction without checking the floor. The audit trail flagged them, the pipeline failed, and I fixed the logic before any deploy.
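The comparison step can be sketched roughly as below. This assumes the outputs are numeric stock levels compared pairwise, that tolerance is relative to the production value, and that a negative shadow value is treated as blocking; the category names are illustrative and the real report logic may group things differently.

```typescript
type Verdict = "match" | "within-tolerance" | "anomaly" | "blocked";

// Classifies one production/shadow pair. The default tolerance of 0.005
// corresponds to the ±0.5% band in the report.
function classify(production: number, shadow: number, tolerance = 0.005): Verdict {
  if (shadow < 0) return "blocked"; // stock can never be negative
  if (shadow === production) return "match";
  const base = Math.abs(production);
  const relative = base === 0 ? Infinity : Math.abs(shadow - production) / base;
  return relative <= tolerance ? "within-tolerance" : "anomaly";
}

// Tallies verdicts across all compared records.
function summarize(pairs: Array<[number, number]>): Record<Verdict, number> {
  const counts: Record<Verdict, number> = {
    match: 0,
    "within-tolerance": 0,
    anomaly: 0,
    blocked: 0,
  };
  for (const [production, shadow] of pairs) {
    counts[classify(production, shadow)] += 1;
  }
  return counts;
}

const shadowSummary = summarize([
  [1000, 1000], // identical
  [1000, 1004], // 0.4% off
  [1000, 1100], // 10% off
  [1000, -5],   // negative stock
]);
```

Checking for negative values before computing the relative difference means an impossible stock level is never hidden inside the tolerance band.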
Why This Matters More Than Your Linter
Let’s be blunt: ESLint won’t catch this. Neither will Prettier, TypeScript’s strict checks, or even a code review by your most senior teammate. The problems AI coding tools introduce are *contextual*. They’re about business logic, not syntax.
A standard linter sees `const stock = 100000 - 5000` and says “looks fine.” Your invariant rule says “max stock is 99999, and 100000 is already out of range.” That gap is what separates a quiet bug from a post-mortem.
The Real Cost of Not Having an Audit Trail
I’ve been running this pipeline for 3 months now, covering 4 major features delivered by AI-augmented development teams (including our team in Can Tho, Vietnam). The numbers are sobering:
- Total AI-generated lines: 47,000+
- Blocked by audit trail: 14,100+ (30%)
- Of those, confirmed production-breaking: 2,800+ (6% of all generated lines)
- False positives: 3%
Six percent of AI-generated code would have broken production in ways that standard tools could not detect. An audit trail isn’t optional anymore — it’s a requirement for any team using AI coding tools in production.
Consider this: if you’re paying a junior developer $1,000/month via ECOA AI, and they use AI tools to generate code 5x faster, but 6% of that code is subtly broken, you’re not saving money. You’re accumulating technical debt that will explode at the worst possible moment.
*When was the last time your team triaged a production incident caused by code that looked right but was wrong?* If you’re using AI coding tools without an audit trail, the answer is probably “sometime this week.”
What I’d Do Differently Next Time
The current pipeline works, but it’s too heavy for rapid prototyping. I’d love a lighter version that runs entirely in-memory for local development and only persists logs to disk when a gate fails. Something like:
```bash
# Run AI agent with local audit trail (no shadow traffic)
agent-code --generate "inventory sync v2" --audit light
```
Also, I want to add a decision provenance logger that records exactly which prompt, which model, and which temperature setting produced each flagged block. Right now I’m doing that manually, and it’s painful.
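One possible shape for that logger is sketched below. Nothing like this exists in the pipeline yet, so every field name, the `ProvenanceLog` class, and the JSON Lines output format are hypothetical.

```typescript
// Hypothetical record: one entry per code block that a gate flagged.
interface ProvenanceRecord {
  decisionId: string;  // agent decision ID, as recorded by the diff log
  prompt: string;
  model: string;
  temperature: number;
  gate: string;        // which gate flagged the block
  codeSnippet: string;
  recordedAt: string;  // ISO 8601 timestamp
}

class ProvenanceLog {
  private readonly records: ProvenanceRecord[] = [];

  // Stores an entry and stamps it with the current (or supplied) time.
  record(entry: Omit<ProvenanceRecord, "recordedAt">, now: Date = new Date()): ProvenanceRecord {
    const full: ProvenanceRecord = { ...entry, recordedAt: now.toISOString() };
    this.records.push(full);
    return full;
  }

  // Returns every entry flagged by the given gate.
  byGate(gate: string): ProvenanceRecord[] {
    return this.records.filter(r => r.gate === gate);
  }

  // Serializes the log as JSON Lines, one record per line.
  toJsonLines(): string {
    return this.records.map(r => JSON.stringify(r)).join("\n");
  }
}

const provenance = new ProvenanceLog();
const firstEntry = provenance.record(
  {
    decisionId: "d-001",
    prompt: "inventory sync v2",
    model: "example-model",
    temperature: 0.2,
    gate: "invariants",
    codeSnippet: "const retryDelayMs = 500;",
  },
  new Date("2025-01-01T00:00:00Z"),
);
```

Keeping the log in memory and serializing only on demand would fit the lighter local mode described above, where nothing is written to disk unless a gate fails.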
Frequently Asked Questions
How do I build an async boundary scanner without modifying global Promise?
You can use `createHook` from `async_hooks` to track resource IDs instead of monkey-patching. It’s cleaner but requires mapping every async operation back to a parent context. I started with the patch for speed; you can refactor to hooks later.
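A minimal sketch of the hook-based approach, assuming a Node.js runtime, is shown here. It only records which resource triggered each promise; deciding whether a promise was handled would need more bookkeeping that this sketch leaves out. The `promiseParents` map is a hypothetical name.

```typescript
import { createHook } from "node:async_hooks";

// Maps each promise's async ID to the ID of the resource that triggered it.
const promiseParents = new Map<number, number>();

const promiseHook = createHook({
  // Avoid console.log inside hooks: logging is itself an async operation
  // and can re-enter the hook.
  init(asyncId: number, type: string, triggerAsyncId: number) {
    if (type === "PROMISE") promiseParents.set(asyncId, triggerAsyncId);
  },
});

promiseHook.enable();
const rootPromise = new Promise<number>(resolve => resolve(1));
void rootPromise.then(v => v + 1); // the chained promise is a second PROMISE resource
promiseHook.disable();
```

Unlike the monkey-patch, this also sees promises created by async functions, since the hook observes the runtime rather than the global `Promise` binding.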
What’s the performance overhead of shadow execution?
Acceptable for a CI/CD gate. Full shadow runs add 15–30 seconds per feature branch. For local development, I run only gates 1–3 (syntax, async, invariants), which complete in under 2 seconds.
Can I use this with any AI coding tool, or is it specific to ECOA AI Platform ACP?
The audit pipeline is tool-agnostic. It intercepts the code output, not the agent itself. I use ECOA ACP for orchestration because its