Build a Custom AI-Powered Code Review Action for GitHub: The Exact Workflow That Flags Convention Violations with 94% Precision

Let’s be real. Code reviews are valuable, but nobody enjoys spending 30 minutes flagging missing docstrings or inconsistent variable naming. Your senior engineers definitely don’t.

We hit this wall hard at ECOA AI. Our distributed team—spread across Ho Chi Minh City, Can Tho, and client timezones—was spending nearly 40% of review cycles on mechanical issues. Style stuff. Convention violations. The kind of feedback that makes developers roll their eyes.

So we automated it.

I’m going to walk you through the exact GitHub Action we built. It uses the OpenAI API to review every pull request, catches convention violations with 94% precision, and frees up our senior devs to focus on actual architecture problems.

Why You Need This (Even If You Think You Don’t)

Here’s the thing. AI coding tools are great at generating code. They’re terrible at generating *consistent* code. Every model has its own “style”—and none of them match your team’s conventions out of the box.

We ran an audit on 500 PRs across 3 projects. The result? 23% of all review comments were about style violations that could be automated. That’s nearly a quarter of your senior engineer’s attention wasted on things a linter should catch.

But linters have limits. They can’t tell you “this function is too complex” or “you’re missing error handling for this edge case.” That’s where our AI action comes in.

The Architecture: Simple, Stateless, and Cheap

Here’s the design philosophy: keep it dumb. The action runs on every PR, sends the diff to an LLM with a strict prompt, and posts the results as a PR comment. No database. No state. No complex orchestration.

Cost per review: About $0.03 with GPT-4o-mini. For a team doing 50 PRs a week, that’s $6/month.
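
As a quick sanity check on that figure, the arithmetic can be written out. The per-review cost and PR volume are the numbers quoted above; the four-weeks-per-month factor is an assumption, so a longer month lands slightly higher.

```python
# Back-of-the-envelope monthly cost, kept in cents to avoid float rounding.
COST_PER_REVIEW_CENTS = 3   # ~$0.03 per review, as quoted above
PRS_PER_WEEK = 50
WEEKS_PER_MONTH = 4         # assumption: a four-week month


def monthly_cost_dollars(cost_cents: int, prs_per_week: int, weeks: int) -> float:
    """Return the estimated monthly spend in dollars."""
    return cost_cents * prs_per_week * weeks / 100


print(monthly_cost_dollars(COST_PER_REVIEW_CENTS, PRS_PER_WEEK, WEEKS_PER_MONTH))  # 6.0
```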

Latency: 8-15 seconds per review. Fast enough to feel instant.

Step 1: The Action Definition

First, let’s create the action metadata. This goes in `.github/actions/ai-code-review/action.yml`:

yaml
name: 'AI Code Review'
description: 'Automated code review using LLM to catch convention violations and anti-patterns'
inputs:
  openai-api-key:
    description: 'OpenAI API key'
    required: true
  github-token:
    description: 'GitHub token for posting comments'
    required: true
  model:
    description: 'OpenAI model to use'
    required: false
    default: 'gpt-4o-mini'
  review-depth:
    description: 'How thorough: quick, standard, deep'
    required: false
    default: 'standard'
runs:
  using: 'composite'
  steps:
    - run: python ${{ github.action_path }}/review.py
      shell: bash
      env:
        OPENAI_API_KEY: ${{ inputs.openai-api-key }}
        GITHUB_TOKEN: ${{ inputs.github-token }}
        MODEL: ${{ inputs.model }}
        REVIEW_DEPTH: ${{ inputs.review-depth }}

Simple, right? We’re just defining inputs and pointing to a Python script.

Step 2: The Workflow File

Now wire it into your workflow. Create `.github/workflows/ai-review.yml`:

yaml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
      checks: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      
      - name: Install dependencies
        run: pip install openai PyGithub
      
      - name: Run AI Code Review
        uses: ./.github/actions/ai-code-review
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
          review-depth: 'standard'

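Notice the checkout step: it is what makes the local action at `./.github/actions/ai-code-review` available to the job. The script itself pulls the diff from the GitHub API, so `fetch-depth: 0` is not strictly required; keep it only if you plan to compute diffs locally with `git`.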

Step 3: The Core Review Script

This is where the magic happens. `review.py` inside the action directory:

python
#!/usr/bin/env python3
import os
import json
from github import Github
from openai import OpenAI

# Config
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
MODEL = os.environ.get("MODEL", "gpt-4o-mini")
DEPTH = os.environ.get("REVIEW_DEPTH", "standard")

# Get PR context from GitHub environment
with open(os.environ["GITHUB_EVENT_PATH"]) as f:
    event = json.load(f)

repo_name = event["repository"]["full_name"]
pr_number = event["pull_request"]["number"]

# Init clients
g = Github(GITHUB_TOKEN)
repo = g.get_repo(repo_name)
pr = repo.get_pull(pr_number)

# Get the diff
diffs = []
for file in pr.get_files():
    if file.status == "removed":
        continue
    # Skip lock files and minified bundles (substring match, so no glob syntax)
    if any(skip in file.filename for skip in ["package-lock.json", "yarn.lock", ".min.js"]):
        continue
    patch = file.patch  # None for binary files
    if patch and len(patch) > 5000:  # Too large to send; leave a placeholder instead
        diffs.append(f"## {file.filename}\n[Diff too large, skipped]")
    elif patch:
        diffs.append(f"## {file.filename}\n```diff\n{patch}\n```")

if not diffs:
    print("No reviewable diffs found")
    raise SystemExit(0)

# Build the prompt
system_prompt = f"""You are a senior code reviewer. Review the following pull request diff.

Focus ONLY on:
1. Missing error handling (try/except, null checks)
2. Security issues (SQL injection, XSS, hardcoded secrets)
3. Performance problems (N+1 queries, memory leaks)
4. Convention violations (naming, structure, patterns)
5. Missing type hints or docstrings

Do NOT comment on:
- Whitespace or formatting (linters handle this)
- Personal style preferences
- Changes that are clearly refactoring with no logic change

Review depth: {DEPTH}

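For each issue, provide:
- severity: critical, warning, or info
- file: path of the file
- line: line number in the new version of the file
- description: what is wrong
- suggestion: suggested fix

Respond with a JSON object of the form {{"issues": [...]}}, using exactly the keys listed above. If there are no issues, return {{"issues": []}}."""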
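
A note on the skip list: a plain substring check does not understand glob syntax, so a pattern like `*.min.js` needs real pattern matching. The following is a minimal sketch using the standard library's `fnmatch`; the `is_skipped` helper and the pattern list are illustrative names, not part of our production script.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical pattern list; adjust to your repository.
SKIP_PATTERNS = ["package-lock.json", "yarn.lock", "*.min.js"]


def is_skipped(filename: str, patterns=SKIP_PATTERNS) -> bool:
    """Return True if the file's base name matches any skip pattern."""
    base = PurePosixPath(filename).name  # GitHub reports paths with forward slashes
    return any(fnmatchcase(base, pattern) for pattern in patterns)
```

Inside the loop over `pr.get_files()`, the check would then read `if is_skipped(file.filename): continue`.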

Wait, there’s more. Let’s add the actual API call and comment posting:

python
client = OpenAI(api_key=OPENAI_API_KEY)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "\n\n".join(diffs[:20])}  # Limit to 20 files
    ],
    response_format={"type": "json_object"},
    temperature=0.1,
    max_tokens=2000
)

try:
    review_data = json.loads(response.choices[0].message.content)
except json.JSONDecodeError:
    print("Failed to parse AI response")
    exit(1)

issues = review_data.get("issues", [])
if not issues:
    pr.create_issue_comment("✅ AI Review: No issues found.")
    exit(0)

# Build markdown comment
comment_parts = ["## 🤖 AI Code Review Results\n"]
critical = [i for i in issues if i.get("severity") == "critical"]
warnings = [i for i in issues if i.get("severity") == "warning"]
infos = [i for i in issues if i.get("severity") == "info"]

# Use .get() throughout: the model may omit a key, and a KeyError would fail the job
if critical:
    comment_parts.append(f"### 🔴 Critical ({len(critical)})")
    for issue in critical[:5]:
        comment_parts.append(f"- **{issue.get('file', '?')}:{issue.get('line', '?')}** - {issue.get('description', '')}")

if warnings:
    comment_parts.append(f"\n### 🟡 Warnings ({len(warnings)})")
    for issue in warnings[:10]:
        comment_parts.append(f"- **{issue.get('file', '?')}:{issue.get('line', '?')}** - {issue.get('description', '')}")

if infos:
    comment_parts.append(f"\n### 🔵 Info ({len(infos)})")
    for issue in infos[:5]:
        comment_parts.append(f"- {issue.get('description', '')}")

comment_parts.append(f"\n---\n*Review by {MODEL} | {len(issues)} total issues*")

pr.create_issue_comment("\n".join(comment_parts))
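
Model output is not guaranteed to match the shape the script expects, so it can help to validate it before building the comment. Here is a minimal sketch of such a step; `normalize_issues` and its fallback values are assumptions for illustration, not something the script above already contains.

```python
import json

ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def normalize_issues(raw_text: str) -> list[dict]:
    """Parse the model's reply and return a cleaned list of issue dicts."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return []
    # Accept either {"issues": [...]} or a bare list.
    items = data.get("issues", []) if isinstance(data, dict) else data
    if not isinstance(items, list):
        return []
    cleaned = []
    for item in items:
        if not isinstance(item, dict) or not item.get("description"):
            continue  # drop entries with no usable description
        severity = str(item.get("severity", "info")).lower()
        cleaned.append({
            "severity": severity if severity in ALLOWED_SEVERITIES else "info",
            "file": item.get("file", "unknown file"),
            "line": item.get("line", "?"),
            "description": item["description"],
        })
    return cleaned
```

With this in place, `issues = normalize_issues(response.choices[0].message.content)` would stand in for the parse step, and the formatting loop can rely on every key being present.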

The Prompt Engineering That Made It Work

We went through 12 iterations of the system prompt before getting this right. The key insight? Be extremely specific about what NOT to do.

Our first version was too loose. The AI flagged formatting issues that Prettier already handles. It complained about variable names that matched our team’s conventions. It was noisy.

We added the “Do NOT comment on” section and precision jumped from 67% to 94%. More importantly, developer trust went up. Nobody ignores a review that’s actually useful.

What We Caught in the First Week

Here’s a real example from our team in Can Tho. A junior dev submitted a PR with a `try/except` block that was swallowing all exceptions silently:

python
try:
    process_payment(user_id, amount)
except Exception:
    pass

Our AI action flagged it as critical. The suggested fix? Log the exception and re-raise or handle it properly. That’s not something a linter catches. But it’s the kind of bug that causes production outages at 2 AM.
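
For reference, the suggested fix looks roughly like the sketch below. `process_payment` here is a stand-in stub and `charge` is a hypothetical wrapper name; the point is only the log-then-re-raise pattern.

```python
import logging

logger = logging.getLogger("payments")


def process_payment(user_id: int, amount: float) -> None:
    """Hypothetical stand-in for the real payment call; fails on bad amounts."""
    if amount <= 0:
        raise ValueError(f"invalid amount: {amount}")


def charge(user_id: int, amount: float) -> None:
    """Log the failure with its traceback, then let the caller decide what to do."""
    try:
        process_payment(user_id, amount)
    except Exception:
        logger.exception("Payment failed for user %s", user_id)
        raise
```

The bare `raise` preserves the original exception and traceback, so the failure stays visible to the caller and to monitoring instead of disappearing.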

First week stats:

47 PRs reviewed
134 issues flagged
12 critical bugs caught before code review
94% precision (developers agreed with 126 out of 134 flags)
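
To be explicit about how that last number is computed: precision is the share of flags developers agreed with. It says nothing about violations the action missed, which would require a labeled set of PRs to measure. A small sketch:

```python
def precision(agreed_flags: int, total_flags: int) -> float:
    """Share of flags that developers agreed with."""
    if total_flags == 0:
        raise ValueError("no flags to evaluate")
    return agreed_flags / total_flags


print(f"{precision(126, 134):.1%}")  # 94.0%
```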

Tuning for Your Team

Every team has different conventions. You’ll want to customize the prompt. Here’s how we handle it:

Add your style guide to the system prompt as a reference document
Set the temperature to 0.1 for consistency; a low temperature reduces run-to-run variation, though it does not make reviews fully deterministic
Adjust the review depth based on team maturity
Start with `quick` mode and gradually increase to `standard`
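
One way to wire the first and third points together is to assemble the system prompt from parts. This is a sketch under assumptions: the `docs/STYLE_GUIDE.md` path, the character cap, and both function names are hypothetical, and the base prompt is shortened here for brevity.

```python
from pathlib import Path

BASE_PROMPT = "You are a senior code reviewer. Review the following pull request diff."
MAX_GUIDE_CHARS = 4000  # assumption: cap the guide so it does not crowd out the diff


def build_system_prompt(depth: str, style_guide: str = "") -> str:
    """Combine the base instructions, review depth, and an optional style guide."""
    if depth not in {"quick", "standard", "deep"}:
        raise ValueError(f"unknown review depth: {depth}")
    parts = [BASE_PROMPT, f"Review depth: {depth}"]
    guide = style_guide.strip()[:MAX_GUIDE_CHARS]
    if guide:
        parts.append("Team style guide (treat as the source of truth for conventions):\n" + guide)
    return "\n\n".join(parts)


def load_style_guide(path: str = "docs/STYLE_GUIDE.md") -> str:
    """Read the guide if the (hypothetical) file exists, else return an empty string."""
    guide_path = Path(path)
    return guide_path.read_text(encoding="utf-8") if guide_path.is_file() else ""
```

Calling `build_system_prompt("quick", load_style_guide())` would then produce the prompt for a team that is just starting out.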

Honestly, the `deep` mode is overkill for most teams. It catches more issues but takes 30+ seconds and costs 3x more. We only use it for security-critical PRs.

The Trade-offs You Need to Know

This isn’t perfect. Here’s what we’ve learned:

False positives happen. About 6% of flags are wrong. Your team needs to trust their judgment over the AI.

Context matters. The AI doesn’t understand your full codebase. It might flag something that’s intentional technical debt.

It’s not a replacement for human review. This catches mechanical issues. Architecture decisions, trade-off discussions, and team knowledge sharing still need humans.

But for catching the boring stuff? It’s a game changer.

Frequently Asked Questions

Can I run this with a local LLM instead of OpenAI?

Yes. Swap the OpenAI client for Ollama or any OpenAI-compatible local endpoint. We tested with Llama 3 70B and got 82% precision, which is good enough for most teams, but reviews took 4-8x longer. For $6/month, GPT-4o-mini is hard to beat.
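
As a rough sketch of that swap: the OpenAI Python client accepts a `base_url` argument, so pointing it at a compatible server is mostly a configuration change. The `LLM_BASE_URL` variable name and the `client_kwargs` helper are assumptions for illustration.

```python
import os


def client_kwargs(env=os.environ) -> dict:
    """Build keyword arguments for the OpenAI client constructor.

    LLM_BASE_URL is a hypothetical variable name. When set (for example to
    Ollama's OpenAI-compatible endpoint, http://localhost:11434/v1), requests
    go to that server instead of api.openai.com.
    """
    # Local servers usually ignore the key, but the client requires a non-empty one.
    kwargs = {"api_key": env.get("OPENAI_API_KEY", "not-needed-locally")}
    base_url = env.get("LLM_BASE_URL", "").strip()
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs


# In review.py this would replace the client construction:
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs())
```

The model server has to be reachable from the runner, which in practice likely means a self-hosted runner. Support for JSON mode also varies between local models, so validating the output matters more here.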

How do I prevent the AI from reviewing every single PR?

Add a label filter. Modify the workflow to skip PRs with a `skip-ai-review` label. Or only run on PRs that modify certain directories. We use path filters to skip documentation-only changes.
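
Both filters can also live in the script rather than the workflow. The sketch below reads the label from the `pull_request` event payload and treats a PR as documentation-only based on file suffixes; the suffix set and both helper names are assumptions.

```python
from pathlib import PurePosixPath

SKIP_LABEL = "skip-ai-review"
DOC_SUFFIXES = {".md", ".rst", ".txt"}  # assumption: what counts as documentation


def has_skip_label(event: dict, label: str = SKIP_LABEL) -> bool:
    """True if the pull_request event payload carries the skip label."""
    labels = event.get("pull_request", {}).get("labels", [])
    return any(item.get("name") == label for item in labels)


def is_docs_only(filenames: list[str]) -> bool:
    """True if every changed file is documentation (an empty list is not)."""
    return bool(filenames) and all(
        PurePosixPath(name).suffix.lower() in DOC_SUFFIXES for name in filenames
    )
```

In `review.py`, a check such as `if has_skip_label(event) or is_docs_only([f.filename for f in pr.get_files()]): raise SystemExit(0)` placed before the API call would skip the review without spending any tokens.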

What about security? We’re sending code to OpenAI.

Valid concern. Two options: use Azure OpenAI with data residency guarantees, or run a local model. For most teams, the risk is minimal since you’re only sending diffs, not your entire codebase. But if you’re in fintech or healthcare, definitely go local.

How do I handle large PRs with 50+ files?

The script above sends only the first 20 files. For larger PRs, we instead pick the most impactful files based on change count. You can also split the review across multiple API calls, but honestly, if someone’s PR has 50 files, you’ve got bigger problems than code review.
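
Selecting by change count could look something like this. It is a minimal sketch, not our exact selection logic: `pick_files` is a hypothetical helper, and the sample objects stand in for PyGithub file objects, which expose the same `filename`, `status`, and `changes` attributes.

```python
from types import SimpleNamespace


def pick_files(files, limit: int = 20):
    """Return up to `limit` non-removed files, largest change count first."""
    candidates = [f for f in files if f.status != "removed"]
    return sorted(candidates, key=lambda f: f.changes, reverse=True)[:limit]


# Stand-in objects for illustration only.
sample = [
    SimpleNamespace(filename="a.py", status="modified", changes=4),
    SimpleNamespace(filename="b.py", status="added", changes=120),
    SimpleNamespace(filename="c.py", status="removed", changes=300),
]
print([f.filename for f in pick_files(sample, limit=2)])  # ['b.py', 'a.py']
```

In the review script, iterating over `pick_files(pr.get_files())` instead of `pr.get_files()` would apply the same ordering before diffs are collected.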