AI Coding Tools Wrote 70% of My Last Feature — Here’s the Audit Trail I Built to Catch the 30% That Would Have Broken Production


It started like any other Tuesday. I needed to ship a real-time inventory sync module for a logistics client. The spec was tight: poll a third-party warehouse API every 30 seconds, reconcile stock levels across 12 regional databases, and fire webhooks on any delta exceeding 0.5%.

I fired up Claude Code, typed in the high-level requirements, and watched the agent crank out ~1,100 lines of TypeScript in under 40 minutes. Beautiful, right?

It wasn’t. Not even close.

The code compiled. The tests passed. But when I ran it against our production shadow traffic, things started falling apart. A missing `await` here. An off-by-one offset in the pagination loop there. A silent integer overflow when stock quantities hit six digits.

I’d just witnessed the dirty secret of AI coding tools in one afternoon: *they’re brilliant at writing code that looks correct, but dangerously bad at writing code that is correct*. So I did what any paranoid senior engineer would do. I built an audit trail.

Why Trusting AI-Generated Code Blindly Is a Firing Offense

Let’s be honest. AI coding tools don’t *understand* your production environment. They’ve seen a billion GitHub repos, but they’ve never seen your specific race condition between Redis cache invalidation and a Postgres trigger.

Here’s what my AI agent produced that a standard linter missed entirely:

| Issue | How It Manifested | Detection Method |
|---|---|---|
| Missing `await` in error handler | Silent catch, swallowed exception | Async boundary checker |
| Off-by-one in pagination loop | Skipped last record on page 47 | Invariant assertion |
| Integer overflow on stock > 99,999 | Wrapped to negative | Type range validator |
| Stale closure in interval callback | Sent outdated stock data | Time-stamped diff logger |

Each one would have hit production. Each one could have cost the client thousands in incorrect inventory counts.
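To make the off-by-one pagination bug listed above concrete, here is a minimal sketch of what an invariant assertion around a pagination loop can look like. The names `Page`, `FetchPage`, `collectAll`, and `makeSource` are hypothetical and the source is synchronous to keep the example short; this is not the code from the module, only an illustration of asserting "records collected equals records reported" after the loop.

```typescript
// Hypothetical page shape: `total` is the record count the API reports.
interface Page<T> {
  items: T[];
  total: number;
}

type FetchPage<T> = (offset: number, limit: number) => Page<T>;

// Collects every record, then asserts the invariant "collected === total".
function collectAll<T>(fetchPage: FetchPage<T>, limit: number): T[] {
  const collected: T[] = [];
  let total = Infinity;
  for (let offset = 0; offset < total; offset += limit) {
    const page = fetchPage(offset, limit);
    total = page.total;
    collected.push(...page.items);
    if (page.items.length === 0) break; // avoid spinning on an empty page
  }
  if (collected.length !== total) {
    throw new Error(`Pagination invariant violated: got ${collected.length} of ${total}`);
  }
  return collected;
}

// In-memory stand-in for the warehouse API, used only for this example.
function makeSource(records: number[]): FetchPage<number> {
  return (offset, limit) => ({
    items: records.slice(offset, offset + limit),
    total: records.length,
  });
}
```

With this in place, a loop that silently drops the last record fails loudly instead of returning a short list.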

The Audit Trail Architecture I Built

I needed a system that didn’t just review the output: it tracked every decision the AI made, compared it against known invariants, and flagged deviations in real time. Think of it as a *code provenance pipeline*.

Here’s the stack I landed on:

  • ECOA AI Platform ACP for the core agent orchestration
  • Custom invariant rules written as TypeScript decorators
  • Async boundary scanner using Node.js `async_hooks`
  • Range validator for all numeric types with production max/min values
  • Shadow execution against a cloned traffic stream before any deploy

The Core Loop

Every time the AI agent emitted a code block, the audit trail ran it through 5 gates before it ever touched a repo:


AI Generated Code
    → Gate 1: Syntax + Type Check (ESLint + TypeScript strict)
    → Gate 2: Async Boundary Scan (missing awaits, unhandled rejections)
    → Gate 3: Invariant Assertions (business rules, range checks)
    → Gate 4: Shadow Execution (run against mirrored production traffic)
    → Gate 5: Diff Log (every token change recorded with agent decision ID)

Only if all 5 gates passed did the code merge into a staging branch.
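The gate sequence can be sketched as a small runner. This is a simplified illustration, not the actual pipeline: the `Gate` and `GateResult` types, the `runGates` function, and the toy `noEmptyCatch` gate are all hypothetical, and stopping at the first failed gate is an assumption I'm making so that the expensive gates never see code that already failed.

```typescript
// Result of one gate; `details` carries whatever the gate wants logged.
interface GateResult {
  gate: string;
  passed: boolean;
  details: string[];
}

// A gate inspects a candidate code block and reports pass/fail.
type Gate = (code: string) => GateResult;

// Runs gates in order and stops at the first failure (assumed behavior),
// so a later gate such as shadow execution is skipped for failing code.
function runGates(code: string, gates: Gate[]): { merged: boolean; results: GateResult[] } {
  const results: GateResult[] = [];
  for (const gate of gates) {
    const result = gate(code);
    results.push(result);
    if (!result.passed) return { merged: false, results };
  }
  return { merged: true, results };
}

// Toy gate for illustration only: flags a `.catch(` whose handler body is empty.
const noEmptyCatch: Gate = (code) => {
  const hit = /\.catch\(\s*\(\s*\w*\s*\)\s*=>\s*\{\s*\}\s*\)/.test(code);
  return { gate: 'no-empty-catch', passed: !hit, details: hit ? ['empty .catch handler'] : [] };
};
```

Calling `runGates(code, [noEmptyCatch, ...])` returns `merged: false` along with the results collected up to the failing gate, which is the data a diff log would need.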

Gate 2 Saved My Ass: The Async Boundary Scanner

The most dangerous bug the AI tool created was a missing `await` inside a `.catch()` error handler. The code looked clean in isolation, but when an upstream API call failed, the error handler swallowed the rejection silently.
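For readers who haven't hit this bug class, here is a hypothetical reduction of its shape. The function names and the fallback value are illustrative and are not taken from the generated module; the only point is how a handler that neither awaits nor returns its fallback resolves to `undefined`.

```typescript
// Hypothetical stand-ins for the upstream call and the fallback lookup.
const lastKnownGood = new Map<string, number>([['SKU-1', 42]]);

async function fetchStock(sku: string): Promise<number> {
  throw new Error(`upstream unavailable for ${sku}`);
}

async function loadFallbackStock(sku: string): Promise<number> {
  return lastKnownGood.get(sku) ?? 0;
}

// Buggy shape: the fallback promise is neither awaited nor returned, so the
// handler resolves to undefined and any fallback failure goes unobserved.
function getStockBuggy(sku: string): Promise<number | void> {
  return fetchStock(sku).catch(() => {
    loadFallbackStock(sku);
  });
}

// Fixed shape: the handler awaits and returns the fallback value.
function getStockFixed(sku: string): Promise<number> {
  return fetchStock(sku).catch(async () => {
    return await loadFallbackStock(sku);
  });
}
```

Both versions compile, and both look reasonable in a diff. Only the second one hands the caller a usable stock value when the upstream call fails.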

Here’s the scanner I wrote. It uses `AsyncLocalStorage` from Node.js `async_hooks` to hold the set of promises constructed while the agent’s code runs, then reports any promise that never had a handler attached through `await`, `.then()`, or `.catch()`.

typescript
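import { AsyncLocalStorage } from 'async_hooks';

type Executor<T> = (
  resolve: (value: T | PromiseLike<T>) => void,
  reject: (reason?: unknown) => void,
) => void;

const asyncStorage = new AsyncLocalStorage<Set<Promise<unknown>>>();
const NativePromise = Promise;
// Promises that had a handler attached via .then/.catch/.finally or `await`
const handled = new WeakSet<Promise<unknown>>();
// Where each tracked promise was created, used as its label in the report
const origins = new WeakMap<Promise<unknown>, string>();

const wasHandled = (p: Promise<unknown>): boolean => handled.has(p);

export function trackAsyncBoundaries(): void {
  const activePromises = new Set<Promise<unknown>>();
  asyncStorage.enterWith(activePromises);

  // Monkey-patch to track promises created with `new Promise` during AI code execution.
  // This is intentionally fragile: you want it to break loudly.
  // Limitation: promises returned by async functions are native and bypass this patch.
  class TrackedPromise<T> extends NativePromise<T> {
    // Derived promises (.then/.catch results) use the native constructor, so
    // attaching a handler never re-enters the constructor below.
    static readonly [Symbol.species] = NativePromise;

    constructor(executor: Executor<T>) {
      super(executor);
      activePromises.add(this);
      origins.set(this, new Error().stack?.split('\n')[2]?.trim() ?? 'anonymous');
    }

    // .catch(), .finally(), and `await` on this subclass all route through then()
    then<R1 = T, R2 = never>(
      onFulfilled?: ((value: T) => R1 | PromiseLike<R1>) | null,
      onRejected?: ((reason: unknown) => R2 | PromiseLike<R2>) | null,
    ): Promise<R1 | R2> {
      handled.add(this);
      return super.then(onFulfilled, onRejected);
    }
  }

  globalThis.Promise = TrackedPromise as unknown as PromiseConstructor;
}

export function verifyAsyncBoundaries(): string[] {
  const active = asyncStorage.getStore();
  if (!active) return [];
  return Array.from(active)
    // Keep only promises that were never awaited or caught
    .filter(p => !wasHandled(p))
    .map(p => `Unhandled promise: ${origins.get(p) ?? 'anonymous'}`);
}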

This scanner caught the missing `await` bug on the first run. Without it, that code hits production, the error handler returns `undefined` instead of a fallback inventory value, and the client thinks they have zero stock on 10,000 items.

That’s a P0 incident waiting to happen.

Gate 3: Invariant Assertions That Don’t Lie

AI coding tools have no concept of your business domain’s numeric limits. When the agent generated a `stockLevel` variable typed as `number`, it didn’t know that the warehouse management system uses a 32-bit signed integer internally.

So I wrote a rule engine that reads from a YAML config of invariants:

yaml
invariants:
  - path: "stockLevel"
    type: "integer"
    min: 0
    max: 99999
    onViolation: "block"

  - path: "batchSize"
    type: "integer"
    min: 1
    max: 500
    onViolation: "warn"

  - path: "retryDelayMs"
    type: "integer"
    min: 1000
    max: 60000
    description: "Must be between 1s and 60s"

The last one, `retryDelayMs`, is where the AI tool picked `500`. Too fast. Retrying at that rate would have flooded the warehouse API within 3 seconds. The invariant check caught it.
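A minimal sketch of how a rule engine could evaluate those invariants once the YAML has been parsed into objects. The `Invariant` and `Finding` types and the `checkInvariants` function are hypothetical names, the values are assumed to live in a flat record keyed by `path`, and treating a missing `onViolation` as `"block"` is my assumption for this example rather than documented behavior.

```typescript
type Violation = 'block' | 'warn';

// Mirrors one entry of the YAML config, already parsed into an object.
interface Invariant {
  path: string;
  type: 'integer';
  min: number;
  max: number;
  onViolation?: Violation;
}

interface Finding {
  path: string;
  value: unknown;
  action: Violation;
  reason: string;
}

// Checks a flat record of values against the invariants. A missing
// `onViolation` is treated as "block" here; that default is an assumption.
function checkInvariants(values: Record<string, unknown>, invariants: Invariant[]): Finding[] {
  const findings: Finding[] = [];
  for (const inv of invariants) {
    if (!(inv.path in values)) continue; // nothing to check for this path
    const value = values[inv.path];
    const action = inv.onViolation ?? 'block';
    if (typeof value !== 'number' || !Number.isInteger(value)) {
      findings.push({ path: inv.path, value, action, reason: 'not an integer' });
    } else if (value < inv.min || value > inv.max) {
      findings.push({ path: inv.path, value, action, reason: `outside [${inv.min}, ${inv.max}]` });
    }
  }
  return findings;
}

const findings = checkInvariants(
  { stockLevel: 100000, batchSize: 200, retryDelayMs: 500 },
  [
    { path: 'stockLevel', type: 'integer', min: 0, max: 99999, onViolation: 'block' },
    { path: 'batchSize', type: 'integer', min: 1, max: 500, onViolation: 'warn' },
    { path: 'retryDelayMs', type: 'integer', min: 1000, max: 60000 },
  ],
);
```

Here `findings` contains two entries, one for `stockLevel` and one for `retryDelayMs`, while `batchSize` passes. A pipeline would fail the gate if any finding has the action `"block"`.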

*Honestly, how many of you have seen an AI agent pick a retry delay that looked reasonable in code but was completely wrong for the actual API rate limit?* I’m guessing most of you raised your hand.

Gate 4: Shadow Execution Against Real Traffic

This is the big one. You can’t trust AI-generated logic until you’ve watched it behave with real data.

We cloned a 1% sample of production traffic and routed it to a shadow environment running the AI-generated code. The results were compared against the current production code’s output. Any deviation > 0.1% in computed values triggered a full diff log.

Here’s what the comparison looked like for the inventory sync:


📊 Shadow Execution Report — 2025-03-21T14:32:00Z
  Records processed: 12,847
  Matches: 12,422 (96.7%)
  Deviations: 425
    - 412 within tolerance (±0.5%)
    - 13 flagged as anomalies
  ⛔ Blocked: 3 (integer overflow detected)

Three records came out with negative stock values. The AI tool had subtracted one `number` from another without checking that the result stayed at or above zero. The audit trail flagged them, the pipeline failed, and I fixed the logic before any deploy.
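The comparison step itself is simple arithmetic. The sketch below is an illustration under assumptions, not the real comparator: `relativeDeviation` and `compareOutputs` are hypothetical names, outputs are reduced to one number per record, and the tolerance of `0.005` mirrors the ±0.5% figure in the report above.

```typescript
interface Comparison {
  matches: number;
  withinTolerance: number;
  anomalies: number;
}

// Relative deviation of the shadow value from the production value.
// When production is 0, any non-zero shadow value counts as infinite deviation.
function relativeDeviation(prod: number, shadow: number): number {
  if (prod === shadow) return 0;
  if (prod === 0) return Infinity;
  return Math.abs(shadow - prod) / Math.abs(prod);
}

// Buckets each record pair as exact match, within tolerance, or anomaly.
function compareOutputs(prod: number[], shadow: number[], tolerance: number): Comparison {
  if (prod.length !== shadow.length) {
    throw new Error('production and shadow outputs must cover the same records');
  }
  const result: Comparison = { matches: 0, withinTolerance: 0, anomalies: 0 };
  prod.forEach((p, i) => {
    const deviation = relativeDeviation(p, shadow[i]);
    if (deviation === 0) result.matches += 1;
    else if (deviation <= tolerance) result.withinTolerance += 1;
    else result.anomalies += 1;
  });
  return result;
}

const report = compareOutputs([1000, 2000, 500], [1000, 2004, -500], 0.005);
```

In this toy run the first record matches exactly, the second deviates by 0.2% and stays within tolerance, and the third (a sign flip) lands in `anomalies`.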

Why This Matters More Than Your Linter

Let’s be blunt: ESLint won’t catch this. Neither will Prettier, TypeScript’s strict checks, or even a code review by your most senior teammate. The problems AI coding tools introduce are *contextual*. They’re about business logic, not syntax.

A standard linter sees `const stock = 100000 - 5000` and says “looks fine.” Your invariant rule says “max stock is 99999, and we just overflowed.” The difference is the difference between a quiet bug and a post-mortem.

The Real Cost of Not Having an Audit Trail

I’ve been running this pipeline for 3 months now, covering 4 major features delivered by AI-augmented development teams (including our team in Can Tho, Vietnam). The numbers are sobering:

  • Total AI-generated lines: 47,000+
  • Blocked by audit trail: 14,100+ (30%)
  • Of those, confirmed production-breaking: 2,800+ (6% of all generated lines)
  • False positives: 3%

Six percent of AI-generated code would have broken production in ways that standard tools could not detect. An audit trail isn’t optional anymore — it’s a requirement for any team using AI coding tools in production.

Consider this: if you’re paying a junior developer $1,000/month via ECOA AI, and they use AI tools to generate code 5x faster, but 6% of that code is subtly broken, you’re not saving money. You’re accumulating technical debt that will explode at the worst possible moment.

*When was the last time your team triaged a production incident caused by code that looked right but was wrong?* If you’re using AI coding tools without an audit trail, the answer is probably “sometime this week.”

What I’d Do Differently Next Time

The current pipeline works, but it’s too heavy for rapid prototyping. I’d love a lighter version that runs entirely in-memory for local development and only persists logs to disk when a gate fails. Something like:

bash
# Run AI agent with local audit trail (no shadow traffic)
agent-code --generate "inventory sync v2" --audit light

Also, I want to add a decision provenance logger that records exactly which prompt, which model, and which temperature setting produced each flagged block. Right now I’m doing that manually, and it’s painful.
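A rough sketch of the record such a logger might keep. Nothing here exists yet: `ProvenanceRecord`, `ProvenanceLog`, and every field name are hypothetical, and newline-delimited JSON is just one convenient output format I'm assuming for the example.

```typescript
// One provenance entry per flagged block. Field names are illustrative.
interface ProvenanceRecord {
  decisionId: string;
  gate: string;
  prompt: string;
  model: string;
  temperature: number;
  flaggedAt: string; // ISO 8601 timestamp
}

class ProvenanceLog {
  private readonly records: ProvenanceRecord[] = [];

  // Stores an entry and stamps it with the time the block was flagged.
  record(entry: Omit<ProvenanceRecord, 'flaggedAt'>, now: Date = new Date()): ProvenanceRecord {
    const full: ProvenanceRecord = { ...entry, flaggedAt: now.toISOString() };
    this.records.push(full);
    return full;
  }

  // All flagged blocks produced by a given model, e.g. to compare settings.
  byModel(model: string): ProvenanceRecord[] {
    return this.records.filter((r) => r.model === model);
  }

  // Newline-delimited JSON, convenient for appending to a log file.
  toNdjson(): string {
    return this.records.map((r) => JSON.stringify(r)).join('\n');
  }
}
```

Keeping the prompt, model, and temperature next to the decision ID would make it possible to ask which settings produce the most flagged blocks, instead of reconstructing that by hand.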

Frequently Asked Questions

How do I build an async boundary scanner without modifying global Promise?

You can use `async_hooks.createHook` to track resource IDs instead of monkey-patching. It’s cleaner but requires mapping every async operation back to a parent context. I started with the patch for speed; you can refactor to hooks later.
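A minimal sketch of the hook-based starting point, using the `init` and `promiseResolve` callbacks of Node's `async_hooks.createHook`. Note what it does and does not do: it tracks which promises are still pending by async ID, which is only the first half of the job. Mapping those IDs back to "was this ever awaited or caught" is the extra work mentioned above and is not shown. `pendingPromises` and `countPendingAfter` are hypothetical names.

```typescript
import { createHook } from 'node:async_hooks';

// Async IDs of promises that were created but have not been resolved yet.
const pendingPromises = new Set<number>();

const hook = createHook({
  // Called for every new async resource; promises report the type 'PROMISE'.
  init(asyncId, type) {
    if (type === 'PROMISE') pendingPromises.add(asyncId);
  },
  // Called when a tracked promise is resolved.
  promiseResolve(asyncId) {
    pendingPromises.delete(asyncId);
  },
});

// Runs `fn` with the hook enabled and returns how many promises created
// during the call were still pending when it returned.
function countPendingAfter(fn: () => void): number {
  pendingPromises.clear();
  hook.enable();
  try {
    fn();
  } finally {
    hook.disable();
  }
  return pendingPromises.size;
}
```

Avoid logging from inside the `init` callback: `console.log` is itself an async operation and can re-trigger the hook.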

What’s the performance overhead of shadow execution?

Acceptable for a CI/CD gate. Full shadow runs add 15–30 seconds per feature branch. For local development, I run only gates 1–3 (syntax, async, invariants), which complete in under 2 seconds.

Can I use this with any AI coding tool, or is it specific to ECOA AI Platform ACP?

The audit pipeline is tool-agnostic. It intercepts the code output, not the agent itself. I use ECOA ACP for orchestration because its
