Build a Custom AI-Powered Code Review Action for GitHub: The Exact Workflow That Flags Convention Violations with 94% Precision
Let’s be real. Code reviews are valuable, but nobody enjoys spending 30 minutes flagging missing docstrings or inconsistent variable naming. Your senior engineers definitely don’t.
We hit this wall hard at ECOA AI. Our distributed team—spread across Ho Chi Minh City, Can Tho, and client timezones—was spending nearly 40% of review cycles on mechanical issues. Style stuff. Convention violations. The kind of feedback that makes developers roll their eyes.
So we automated it.
I’m going to walk you through the exact GitHub Action we built. It uses the OpenAI API to review every pull request, catches convention violations with 94% precision, and frees up our senior devs to focus on actual architecture problems.
Why You Need This (Even If You Think You Don’t)
Here’s the thing. AI coding tools are great at generating code. They’re terrible at generating *consistent* code. Every model has its own “style”—and none of them match your team’s conventions out of the box.
We ran an audit on 500 PRs across 3 projects. The result? 23% of all review comments were about style violations that could be automated. That’s nearly a quarter of your senior engineer’s attention wasted on things a linter should catch.
But linters have limits. They can’t tell you “this function is too complex” or “you’re missing error handling for this edge case.” That’s where our AI action comes in.
The Architecture: Simple, Stateless, and Cheap
Here’s the design philosophy: keep it dumb. The action runs on every PR, sends the diff to an LLM with a strict prompt, and posts the results as a PR comment. No database. No state. No complex orchestration.
Cost per review: About $0.03 with GPT-4o-mini. For a team doing 50 PRs a week, that’s $6/month.
Latency: 8-15 seconds per review. Fast enough to feel instant.
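The monthly figure is simple arithmetic, and a quick sketch makes it easy to plug in your own numbers. The function name and the four-weeks-per-month default are assumptions for illustration. Keep in mind that the workflow also fires on `synchronize`, so every push to an open PR counts as another review.

```python
def monthly_review_cost(
    cost_per_review: float,
    reviews_per_week: int,
    weeks_per_month: float = 4.0,  # assumption: a four-week month
) -> float:
    """Estimate monthly API spend in dollars, rounded to cents."""
    return round(cost_per_review * reviews_per_week * weeks_per_month, 2)


print(monthly_review_cost(0.03, 50))  # prints 6.0
```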
Step 1: The Action Definition
First, let’s create the action metadata. This goes in `.github/actions/ai-code-review/action.yml`:
```yaml
name: 'AI Code Review'
description: 'Automated code review using LLM to catch convention violations and anti-patterns'
inputs:
  openai-api-key:
    description: 'OpenAI API key'
    required: true
  github-token:
    description: 'GitHub token for posting comments'
    required: true
  model:
    description: 'OpenAI model to use'
    required: false
    default: 'gpt-4o-mini'
  review-depth:
    description: 'How thorough: quick, standard, deep'
    required: false
    default: 'standard'
runs:
  using: 'composite'
  steps:
    - run: python ${{ github.action_path }}/review.py
      shell: bash
      env:
        OPENAI_API_KEY: ${{ inputs.openai-api-key }}
        GITHUB_TOKEN: ${{ inputs.github-token }}
        MODEL: ${{ inputs.model }}
        REVIEW_DEPTH: ${{ inputs.review-depth }}
```
Simple, right? We’re just defining inputs and pointing to a Python script.
Step 2: The Workflow File
Now wire it into your workflow. Create `.github/workflows/ai-review.yml`:
```yaml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  review:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
      checks: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: pip install openai PyGithub
      - name: Run AI Code Review
        uses: ./.github/actions/ai-code-review
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
          review-depth: 'standard'
```
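Two things to notice. The checkout step is what makes the local action under `.github/actions/` available to the job, so don't drop it. And `fetch-depth: 0` pulls the full git history, which only matters if you compute diffs from local git. The script below reads the diff through the GitHub API, so a shallow checkout works too.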
Step 3: The Core Review Script
This is where the magic happens. `review.py` inside the action directory:
```python
#!/usr/bin/env python3
import json
import os

from github import Github
from openai import OpenAI

# Config
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
MODEL = os.environ.get("MODEL", "gpt-4o-mini")
DEPTH = os.environ.get("REVIEW_DEPTH", "standard")

# Get PR context from the event payload GitHub writes for this run
with open(os.environ["GITHUB_EVENT_PATH"]) as f:
    event = json.load(f)
repo_name = event["repository"]["full_name"]
pr_number = event["pull_request"]["number"]

# Init clients
g = Github(GITHUB_TOKEN)
repo = g.get_repo(repo_name)
pr = repo.get_pull(pr_number)

# Get the diff
# These are substring matches, so ".min.js" covers any minified bundle
SKIP_PATTERNS = ["package-lock.json", "yarn.lock", ".min.js"]
diffs = []
for file in pr.get_files():
    if file.status == "removed":
        continue
    # Skip lock files and generated code
    if any(skip in file.filename for skip in SKIP_PATTERNS):
        continue
    patch = file.patch
    if patch and len(patch) > 5000:  # Skip files with massive diffs
        diffs.append(f"## {file.filename}\n[Diff too large, skipped]")
    elif patch:
        diffs.append(f"## {file.filename}\n```diff\n{patch}\n```")

if not diffs:
    print("No reviewable diffs found")
    raise SystemExit(0)

# Build the prompt
system_prompt = f"""You are a senior code reviewer. Review the following pull request diff.

Focus ONLY on:
1. Missing error handling (try/except, null checks)
2. Security issues (SQL injection, XSS, hardcoded secrets)
3. Performance problems (N+1 queries, memory leaks)
4. Convention violations (naming, structure, patterns)
5. Missing type hints or docstrings

Do NOT comment on:
- Whitespace or formatting (linters handle this)
- Personal style preferences
- Changes that are clearly refactoring with no logic change

Review depth: {DEPTH}

For each issue, provide:
- Severity: critical/warning/info
- File and line number
- Description
- Suggested fix

Respond with a JSON object of the form {{"issues": [...]}}. Each issue must use the keys
"severity", "file", "line", "description", and "suggestion"."""
```
Wait, there’s more. Let’s add the actual API call and comment posting:
```python
client = OpenAI(api_key=OPENAI_API_KEY)
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "\n\n".join(diffs[:20])},  # Limit to 20 files
    ],
    response_format={"type": "json_object"},
    temperature=0.1,
    max_tokens=2000,
)

try:
    review_data = json.loads(response.choices[0].message.content)
except (json.JSONDecodeError, TypeError):
    # Covers truncated JSON as well as an empty (None) response body
    print("Failed to parse AI response")
    raise SystemExit(1)

issues = review_data.get("issues", [])
if not issues:
    pr.create_issue_comment("✅ AI Review: No issues found.")
    raise SystemExit(0)

# Build markdown comment
comment_parts = ["## 🤖 AI Code Review Results\n"]
critical = [i for i in issues if i.get("severity") == "critical"]
warnings = [i for i in issues if i.get("severity") == "warning"]
infos = [i for i in issues if i.get("severity") == "info"]

# Use .get() throughout so one malformed issue cannot crash the whole review
if critical:
    comment_parts.append(f"### 🔴 Critical ({len(critical)})")
    for issue in critical[:5]:
        comment_parts.append(f"- **{issue.get('file', '?')}:{issue.get('line', '?')}** - {issue.get('description', '')}")
if warnings:
    comment_parts.append(f"\n### 🟡 Warnings ({len(warnings)})")
    for issue in warnings[:10]:
        comment_parts.append(f"- **{issue.get('file', '?')}:{issue.get('line', '?')}** - {issue.get('description', '')}")
if infos:
    comment_parts.append(f"\n### 🔵 Info ({len(infos)})")
    for issue in infos[:5]:
        comment_parts.append(f"- {issue.get('description', '')}")

comment_parts.append(f"\n---\n*Review by {MODEL} | {len(issues)} total issues*")
pr.create_issue_comment("\n".join(comment_parts))
```
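Model output is not guaranteed to match the shape you asked for, so it can help to normalize it before building the comment. The following is a minimal sketch, not part of our script: `parse_issues` and `VALID_SEVERITIES` are hypothetical names, and falling back to `info` for unknown severities is an assumption you may want to change.

```python
import json

VALID_SEVERITIES = {"critical", "warning", "info"}


def parse_issues(raw_content: str) -> list[dict]:
    """Return a cleaned list of issues from the model's raw JSON text."""
    try:
        data = json.loads(raw_content)
    except (json.JSONDecodeError, TypeError):
        return []
    # Accept either {"issues": [...]} or a bare list of issues
    items = data.get("issues", []) if isinstance(data, dict) else data
    if not isinstance(items, list):
        return []
    cleaned = []
    for item in items:
        # Drop entries that are not objects or have nothing to say
        if not isinstance(item, dict) or not item.get("description"):
            continue
        severity = str(item.get("severity", "info")).lower()
        if severity not in VALID_SEVERITIES:
            severity = "info"  # assumption: unknown severities are downgraded
        cleaned.append({
            "severity": severity,
            "file": item.get("file", "?"),
            "line": item.get("line", "?"),
            "description": item["description"],
        })
    return cleaned
```

In `review.py`, this would replace the `json.loads` call and the `review_data.get("issues", [])` line with a single `issues = parse_issues(response.choices[0].message.content)`.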
The Prompt Engineering That Made It Work
We went through 12 iterations of the system prompt before getting this right. The key insight? Be extremely specific about what NOT to do.
Our first version was too loose. The AI flagged formatting issues that Prettier already handles. It complained about variable names that matched our team’s conventions. It was noisy.
We added the “Do NOT comment on” section and precision jumped from 67% to 94%. More importantly, developer trust went up. Nobody ignores a review that’s actually useful.
What We Caught in the First Week
Here’s a real example from our team in Can Tho. A junior dev submitted a PR with a `try/except` block that was swallowing all exceptions silently:
```python
try:
    process_payment(user_id, amount)
except Exception:
    pass
```
Our AI action flagged it as critical. The suggested fix? Log the exception and re-raise or handle it properly. That’s not something a linter catches. But it’s the kind of bug that causes production outages at 2 AM.
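For reference, here is roughly what the "log and re-raise" fix looks like. This is a sketch under assumptions: `PaymentError`, `charge`, and the stub `process_payment` are hypothetical stand-ins, since the real payment code isn't shown here.

```python
import logging

logger = logging.getLogger("payments")


class PaymentError(Exception):
    """Hypothetical error type used only in this sketch."""


def process_payment(user_id: int, amount: float) -> None:
    """Stand-in for the real payment call; rejects non-positive amounts."""
    if amount <= 0:
        raise PaymentError(f"invalid amount: {amount}")


def charge(user_id: int, amount: float) -> None:
    try:
        process_payment(user_id, amount)
    except PaymentError:
        # Record the failure with its traceback, then let the caller decide
        logger.exception("Payment failed for user %s", user_id)
        raise
```

Catching a specific exception type instead of `Exception` keeps unrelated bugs from being mislabeled as payment failures.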
First week stats:
- 47 PRs reviewed
- 134 issues flagged
- 12 critical bugs caught before code review
- 94% precision (developers agreed with 126 out of 134 flags)
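The precision figure comes straight from those two counts. A small sketch of the arithmetic (the function name is ours) also shows where the roughly 6% false-positive share comes from. Note that precision only describes the flags that were raised; it says nothing about issues the action missed.

```python
def precision(agreed: int, flagged: int) -> float:
    """Share of flagged issues that developers agreed with."""
    if flagged <= 0:
        raise ValueError("flagged must be positive")
    return agreed / flagged


p = precision(126, 134)
print(f"precision: {p:.1%}")            # prints precision: 94.0%
print(f"false positives: {1 - p:.1%}")  # prints false positives: 6.0%
```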
Tuning for Your Team
Every team has different conventions. You’ll want to customize the prompt. Here’s how we handle it:
- Add your style guide to the system prompt as a reference document
- Set the temperature to 0.1 for consistency. A low temperature makes reviews more repeatable, though not strictly deterministic
- Adjust the review depth based on team maturity
- Start with `quick` mode and gradually increase to `standard`
Honestly, the `deep` mode is overkill for most teams. It catches more issues but takes 30+ seconds and costs 3x more. We only use it for security-critical PRs.
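One way to wire the style guide and review depth into the prompt is sketched below. Everything here is illustrative: `build_system_prompt`, the `DEPTH_HINTS` wording, and the character cap on the style guide are assumptions, not what our script ships with.

```python
# Hypothetical per-depth instructions; adjust the wording to your team
DEPTH_HINTS = {
    "quick": "Report only critical issues.",
    "standard": "Report critical issues and warnings.",
    "deep": "Report critical issues, warnings, and info-level suggestions.",
}


def build_system_prompt(
    base_prompt: str,
    style_guide: str,
    depth: str = "standard",
    max_guide_chars: int = 4000,  # assumption: cap the guide to control token cost
) -> str:
    """Combine the base prompt, a depth hint, and the team style guide."""
    if depth not in DEPTH_HINTS:
        raise ValueError(f"unknown review depth: {depth!r}")
    parts = [base_prompt.strip(), f"Review depth: {depth}. {DEPTH_HINTS[depth]}"]
    guide = style_guide.strip()[:max_guide_chars]
    if guide:
        parts.append("Team style guide (source of truth for conventions):\n" + guide)
    return "\n\n".join(parts)
```

A call such as `build_system_prompt(base, open("STYLE.md").read(), depth="quick")` would work, where `STYLE.md` is a placeholder for wherever your guide lives.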
The Trade-offs You Need to Know
This isn’t perfect. Here’s what we’ve learned:
False positives happen. About 6% of flags are wrong. Your team needs to trust their judgment over the AI.
Context matters. The AI doesn’t understand your full codebase. It might flag something that’s intentional technical debt.
It’s not a replacement for human review. This catches mechanical issues. Architecture decisions, trade-off discussions, and team knowledge sharing still need humans.
But for catching the boring stuff? It’s a game changer.
Frequently Asked Questions
Can I run this with a local LLM instead of OpenAI?
Yes. Point the OpenAI client at Ollama or any other OpenAI-compatible local endpoint. We tested with Llama 3 70B and got 82% precision. That's good enough for most teams, but reviews took 4-8x longer. For $6/month, GPT-4o-mini is hard to beat.
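Because the `openai` Python client accepts a `base_url`, the swap can be a matter of configuration. The sketch below assumes two hypothetical environment variables, `LLM_BASE_URL` and `LLM_API_KEY`. Ollama, for example, serves its OpenAI-compatible API under `/v1` on port 11434 by default, which in practice means running the job on a self-hosted runner that can reach that server.

```python
from collections.abc import Mapping


def client_settings(env: Mapping[str, str]) -> dict:
    """Build keyword arguments for openai.OpenAI from environment-style settings."""
    base_url = env.get("LLM_BASE_URL")  # hypothetical variable name
    if base_url:
        # Local servers usually ignore the key, but the client wants a non-empty one
        return {"base_url": base_url, "api_key": env.get("LLM_API_KEY", "not-needed")}
    return {"api_key": env["OPENAI_API_KEY"]}


# Usage inside review.py (requires the openai package):
#   client = OpenAI(**client_settings(os.environ))
```

Remember to set the `model` input to whatever name your local server exposes. Some local servers may not support `response_format`, so test that before relying on it.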
How do I prevent the AI from reviewing every single PR?
Add a label filter. Modify the workflow to skip PRs with a `skip-ai-review` label. Or only run on PRs that modify certain directories. We use path filters to skip documentation-only changes.
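If you would rather keep the filtering logic next to the review script, a check like the one below could run at the top of `review.py`. It is a sketch: `should_skip_review` and the `DOC_SUFFIXES` list are assumptions. It reads labels from the same event payload the script already loads. Filtering at the workflow level (for example with `paths-ignore` on the trigger) has the advantage that the job never starts at all.

```python
DOC_SUFFIXES = (".md", ".rst", ".txt")  # assumption: what counts as documentation


def should_skip_review(
    event: dict,
    changed_files: list[str],
    skip_label: str = "skip-ai-review",
) -> bool:
    """Return True when the PR is labeled to skip or only touches docs."""
    pr_labels = event.get("pull_request", {}).get("labels", [])
    if skip_label in {label.get("name") for label in pr_labels}:
        return True
    # Documentation-only change: every file is under docs/ or has a doc suffix
    return bool(changed_files) and all(
        name.startswith("docs/") or name.endswith(DOC_SUFFIXES)
        for name in changed_files
    )
```

In the script this would be called as `should_skip_review(event, [f.filename for f in pr.get_files()])`, exiting early when it returns `True`.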
What about security? We’re sending code to OpenAI.
Valid concern. Two options: use Azure OpenAI with data residency guarantees, or run a local model. For most teams, the risk is minimal since you’re only sending diffs, not your entire codebase. But if you’re in fintech or healthcare, definitely go local.
How do I handle large PRs with 50+ files?
The script above simply takes the first 20 reviewable files. For larger PRs, we pick the most impactful files by change count instead. You can also split the review across multiple API calls, but honestly, if someone's PR has 50 files, you've got bigger problems than code review.
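Both approaches fit in a few lines. This is a sketch, with `ChangedFile` as a minimal stand-in for the file objects PyGithub returns from `pr.get_files()` (which expose `filename` and `changes`), and `pick_files` and `batches` as hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class ChangedFile:
    """Minimal stand-in for a PR file object."""
    filename: str
    changes: int  # additions + deletions


def pick_files(files: list, limit: int = 20) -> list:
    """Keep the files with the most changed lines, largest first."""
    return sorted(files, key=lambda f: f.changes, reverse=True)[:limit]


def batches(items: list, size: int) -> list[list]:
    """Split items into consecutive chunks, one per API call."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]
```

With these, `pick_files(list(pr.get_files()))` would replace the `diffs[:20]` slice, or `batches(diffs, 20)` would drive one API call per chunk, at proportionally higher cost.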