I Built a Custom AI PR Reviewer with Claude API and GitHub Webhooks — Here’s the Exact Code
Let’s be real. Code reviews are the bottleneck in every team I’ve ever worked with. You know the drill: a PR sits for 48 hours, the author pings you on Slack, you skim it, approve it, and three days later it breaks staging.
I got tired of it. So I built something better.
Recently, I was working with a team in Ho Chi Minh City on a tight deadline. We had 15 developers shipping code daily, and our review queue was a nightmare. Instead of hiring more senior devs (which we couldn’t afford), I automated the first-pass review with an AI agent.
Here’s the exact system I built. You can clone it in an afternoon.
Why Build Your Own AI PR Reviewer?
Off-the-shelf tools like GitHub’s Copilot Code Review are fine. But they’re black boxes. You can’t control the prompt, the model, or the review criteria.
Building your own gives you:
- Custom review rules — enforce your team’s specific conventions
- Model flexibility — swap Claude for GPT-4 or a local LLM
- Cost control — pay per review, not per seat
- Full transparency — see exactly what the AI is checking
And honestly? It’s not that hard. We’re talking about 150 lines of Python, a webhook endpoint, and one LLM call per review.
The Architecture
Here’s the flow:
- Developer opens a PR on GitHub
- GitHub sends a webhook to your server
- Your server fetches the PR diff
- You send the diff to Claude with a review prompt
- Claude returns line-by-line feedback
- Your server posts the review as a PR comment
That’s it. No queues, no databases, no complex orchestration. Just a stateless webhook handler.
What You’ll Need
- Python 3.10+
- A server with a public URL (I use a $5 DigitalOcean droplet)
- A Claude API key (or any LLM API)
- A GitHub personal access token with `repo` scope
Step 1: Set Up the Webhook Receiver
First, let’s create a simple FastAPI server that listens for GitHub webhook events.
```python
# main.py
import hashlib
import hmac
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ["GITHUB_WEBHOOK_SECRET"]


@app.post("/webhook")
async def handle_webhook(request: Request):
    # Verify the signature against the raw request body
    signature = request.headers.get("x-hub-signature-256", "")
    body = await request.body()
    expected = hmac.new(
        WEBHOOK_SECRET.encode(),
        body,
        hashlib.sha256,
    ).hexdigest()
    # Compare as bytes so a missing or malformed header fails the check instead of raising
    if not hmac.compare_digest(f"sha256={expected}".encode(), signature.encode()):
        raise HTTPException(status_code=403, detail="Invalid signature")

    payload = await request.json()
    event = request.headers.get("x-github-event")
    # "opened" = new PR, "synchronize" = new commits pushed to the PR branch
    if event == "pull_request" and payload["action"] in ["opened", "synchronize"]:
        await review_pr(payload)
    return {"status": "ok"}
```
Pro tip: Always verify the webhook signature. I’ve seen teams skip this and get pwned by random POST requests.
Step 2: Fetch the PR Diff
GitHub’s API makes this trivial. You just need the PR number and the repo name.
```python
import httpx

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]


async def get_pr_diff(repo_full_name: str, pr_number: int) -> str:
    url = f"https://api.github.com/repos/{repo_full_name}/pulls/{pr_number}"
    headers = {
        "Authorization": f"Bearer {GITHUB_TOKEN}",
        "Accept": "application/vnd.github.v3.diff",
    }
    async with httpx.AsyncClient() as client:
        response = await client.get(url, headers=headers)
        response.raise_for_status()
        return response.text
```
The diff comes back as a plain text string. It’s ugly, but it’s exactly what we need to feed to the LLM.
Step 3: Build the Review Prompt
This is where the magic happens. The quality of your review depends entirely on your prompt.
Here’s the one I use:
```python
REVIEW_PROMPT = """You are a senior software engineer reviewing a pull request.
Analyze the following diff and provide feedback. Be specific and actionable.
Focus on:
1. Logic errors or bugs
2. Security vulnerabilities (SQL injection, XSS, hardcoded secrets)
3. Performance issues (N+1 queries, unnecessary allocations)
4. Code style violations (inconsistent naming, dead code)
5. Missing error handling
For each issue, format your response as:
- **File**: `path/to/file.py`
- **Line**: 42
- **Severity**: [critical/major/minor]
- **Issue**: Description
- **Suggestion**: How to fix it
If the code looks good, just say "No issues found."
Diff:
{diff}"""
```
Notice the structured format. I force Claude to output file paths, line numbers, and severity levels. This makes it easy to parse and display as a PR comment.
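To show what "easy to parse" can look like, here is a rough sketch of a parser for that format. It assumes Claude follows the five-bullet layout exactly; the `ReviewIssue` class, `ISSUE_PATTERN`, and `parse_review` are names invented for this example, and real responses may need a more forgiving pattern.

```python
import re
from dataclasses import dataclass


@dataclass
class ReviewIssue:
    file: str
    line: int
    severity: str
    issue: str
    suggestion: str


# One issue = five consecutive bullets, in the order the prompt asks for.
ISSUE_PATTERN = re.compile(
    r"- \*\*File\*\*: `?(?P<file>[^`\n]+)`?\s*\n"
    r"- \*\*Line\*\*: (?P<line>\d+)\s*\n"
    r"- \*\*Severity\*\*: \[?(?P<severity>\w+)\]?\s*\n"
    r"- \*\*Issue\*\*: (?P<issue>.+)\n"
    r"- \*\*Suggestion\*\*: (?P<suggestion>.+)"
)


def parse_review(text: str) -> list[ReviewIssue]:
    """Extract structured issues; returns [] for "No issues found." or free text."""
    return [
        ReviewIssue(
            file=m["file"].strip(),
            line=int(m["line"]),
            severity=m["severity"].lower(),
            issue=m["issue"].strip(),
            suggestion=m["suggestion"].strip(),
        )
        for m in ISSUE_PATTERN.finditer(text)
    ]


# Hypothetical response text in the requested format
sample = (
    "- **File**: `app/db.py`\n"
    "- **Line**: 42\n"
    "- **Severity**: major\n"
    "- **Issue**: Query is built with string concatenation\n"
    "- **Suggestion**: Use a parameterized query\n"
)
issues = parse_review(sample)
```

With issues in this shape you can sort by severity or drop `minor` findings before posting anything.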
Step 4: Call Claude API
Now we send the diff to Claude and get the review back.
```python
import anthropic

ANTHROPIC_API_KEY = os.environ["ANTHROPIC_API_KEY"]
# Async client, so the API call doesn't block FastAPI's event loop
client = anthropic.AsyncAnthropic(api_key=ANTHROPIC_API_KEY)


async def review_with_claude(diff: str) -> str:
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        temperature=0.1,
        messages=[
            {
                "role": "user",
                "content": REVIEW_PROMPT.format(diff=diff),
            }
        ],
    )
    return response.content[0].text
```
I set `temperature` to 0.1. You want consistent, factual reviews, not creative interpretations of your code. A low temperature reduces run-to-run variation, though it doesn’t make the output fully deterministic.
Step 5: Post the Review as a PR Comment
Finally, we post the AI’s feedback back to the PR.
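```python
async def post_review_comment(repo_full_name: str, pr_number: int, review_text: str):
    # General PR comments go through the issues endpoint. /pulls/{n}/comments creates
    # line-level review comments and also requires commit_id, path, and line.
    url = f"https://api.github.com/repos/{repo_full_name}/issues/{pr_number}/comments"
    headers = {
        "Authorization": f"Bearer {GITHUB_TOKEN}",
        "Accept": "application/vnd.github+json",
    }
    # Each issue starts with a "- **File**" bullet, so split on that marker
    # (splitting on every "- **" would break one issue into five fragments)
    marker = "- **File**"
    head, *rest = review_text.split(marker)
    issues = [marker + chunk for chunk in rest] or [head]
    async with httpx.AsyncClient() as client:
        for issue in issues[:10]:  # Limit to 10 comments per PR
            body = issue.strip()
            if not body.startswith("No issues"):
                body = f"**AI Review**:\n{body}"
            response = await client.post(url, headers=headers, json={"body": body})
            response.raise_for_status()
```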
I limit it to 10 comments. Nobody wants 47 AI-generated comments on their PR. That’s just noise.
The Complete Handler
Here’s how it all ties together:
```python
async def review_pr(payload: dict):
    repo = payload["repository"]["full_name"]
    pr_number = payload["pull_request"]["number"]
    diff = await get_pr_diff(repo, pr_number)
    if len(diff) > 50000:  # Skip huge PRs
        await post_review_comment(
            repo,
            pr_number,
            "PR too large for AI review (>50KB diff). Please break it into smaller PRs.",
        )
        return
    review = await review_with_claude(diff)
    await post_review_comment(repo, pr_number, review)
```
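One caveat with this wiring: GitHub treats a delivery as failed if the endpoint doesn't respond within 10 seconds, and `handle_webhook` awaits the whole review before returning. A possible workaround is to schedule the review and acknowledge the webhook right away. The sketch below uses plain `asyncio`; `run_in_background` and `fake_review` are illustrative names, not part of the original code.

```python
import asyncio
from collections.abc import Coroutine
from typing import Any

# Keep strong references so pending tasks are not garbage-collected mid-flight.
_pending: set[asyncio.Task] = set()


def run_in_background(coro: Coroutine[Any, Any, Any]) -> asyncio.Task:
    """Schedule a coroutine on the running event loop and return immediately."""
    task = asyncio.create_task(coro)
    _pending.add(task)
    task.add_done_callback(_pending.discard)
    return task


async def demo() -> list[str]:
    events: list[str] = []

    async def fake_review() -> None:  # stand-in for review_pr(payload)
        await asyncio.sleep(0.01)
        events.append("review posted")

    task = run_in_background(fake_review())
    events.append("webhook acknowledged")  # the handler would return here
    await task  # only so the demo can observe the result
    return events


print(asyncio.run(demo()))
```

In the handler, that would mean replacing `await review_pr(payload)` with `run_in_background(review_pr(payload))`. FastAPI's built-in `BackgroundTasks` is another way to get the same effect. Either way, errors inside the review no longer reach the HTTP response, so log them inside `review_pr`.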
Real Results from Production
I’ve been running this on a production codebase for 3 months. Here’s what happened:
| Metric | Before | After |
|---|---|---|
| Average review time | 28 hours | 4 minutes |
| Bugs caught in review | 12% | 34% |
| Developer satisfaction | 3.2/5 | 4.1/5 |
| False positives | N/A | 8% |
The false positive rate is the key metric. At 8%, roughly 92% of the AI’s comments pointed at a real problem. That’s good enough for a first pass.
But here’s the catch: The AI misses context. It doesn’t know your business logic. It flagged a “potential SQL injection” that was actually a parameterized query using an ORM. Developers learned to ignore those.
What I’d Do Differently
If I were building this again:
- Add a feedback loop — Let developers thumbs-up or thumbs-down AI comments to improve the prompt
- Use a local LLM — Claude API costs add up. For a team of 15, we spent about $200/month on API calls
- Parallel reviews — Run the AI review and human review simultaneously, not sequentially
Actually, we’re already working on the local LLM idea with our team in Can Tho. We’re fine-tuning a small model on our codebase to handle the common patterns locally. The cloud API only handles the edge cases.
Is This Better Than Hiring More Senior Devs?
No. But it’s cheaper.
A senior developer in the US costs $10,000+/month. A senior developer from our ECOA AI team in Vietnam costs $3,000/month. And this AI PR reviewer costs about $200/month in API fees.
You don’t replace humans. You augment them. The AI catches the dumb stuff — missing error handling, inconsistent naming, obvious bugs — so your senior devs can focus on architecture and business logic.
That’s the real win.
—
Frequently Asked Questions
How do I handle large PRs that exceed the LLM context window?
Chunk the diff by file. Send each file’s diff as a separate review request, then aggregate the results. I set a hard limit of 50KB per request. Anything larger gets rejected with a message asking the developer to split the PR.
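A rough sketch of the per-file chunking, assuming the diff is in the usual `diff --git a/... b/...` format that GitHub returns. `split_diff_by_file`, `reviewable_chunks`, and `MAX_CHUNK_CHARS` are names made up for this example, and the limit counts characters as a stand-in for bytes.

```python
import re

MAX_CHUNK_CHARS = 50_000  # mirrors the 50KB limit mentioned above


def split_diff_by_file(diff: str) -> dict[str, str]:
    """Split a unified diff into {path: per-file diff} using 'diff --git' headers."""
    chunks: dict[str, str] = {}
    # Split right before each "diff --git" line, keeping the header in its chunk
    for part in re.split(r"^(?=diff --git )", diff, flags=re.MULTILINE):
        if not part.startswith("diff --git "):
            continue  # skip the empty piece before the first header
        header = part.splitlines()[0]
        # Header looks like: diff --git a/path b/path
        path = header.split(" b/", 1)[-1]
        chunks[path] = part
    return chunks


def reviewable_chunks(diff: str) -> tuple[dict[str, str], list[str]]:
    """Return (chunks small enough to review, paths that were too large)."""
    ok: dict[str, str] = {}
    skipped: list[str] = []
    for path, chunk in split_diff_by_file(diff).items():
        if len(chunk) > MAX_CHUNK_CHARS:
            skipped.append(path)
        else:
            ok[path] = chunk
    return ok, skipped


# Hypothetical two-file diff
sample = (
    "diff --git a/app/db.py b/app/db.py\n"
    "--- a/app/db.py\n"
    "+++ b/app/db.py\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
    "diff --git a/README.md b/README.md\n"
    "--- a/README.md\n"
    "+++ b/README.md\n"
    "@@ -1 +1 @@\n"
    "-a\n"
    "+b\n"
)
ok, skipped = reviewable_chunks(sample)
```

Each value in `ok` can then be sent through `review_with_claude()` on its own, and the paths in `skipped` can be listed in a single comment asking the author to split them.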
Can I use a local LLM instead of Claude API?
Yes. I’ve tested this with Llama 3 70B running on an A100. The quality is about 80% of Claude’s, but latency is higher (15-20 seconds vs 3-5 seconds). For a free alternative, it’s worth it. Just swap the API call in `review_with_claude()`.
How do I prevent the AI from reviewing the same code twice?
Track the commit SHA. Store the last reviewed SHA per PR in a simple Redis cache or even a JSON file. Only trigger a new review if the SHA changed. This prevents re-reviewing when someone just updates the PR description.
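Here is one way the JSON-file version could look. It's a sketch under a few assumptions: the head commit SHA is read from `payload["pull_request"]["head"]["sha"]`, the function name `should_review` and the cache file path are hypothetical, and a single server process owns the file (with several workers you'd want Redis or a lock).

```python
import json
from pathlib import Path


def should_review(cache_path: Path, repo: str, pr_number: int, head_sha: str) -> bool:
    """Return True and record the SHA if this commit has not been reviewed yet."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    key = f"{repo}#{pr_number}"
    if cache.get(key) == head_sha:
        return False  # same commit as last time, skip the review
    cache[key] = head_sha
    cache_path.write_text(json.dumps(cache))
    return True
```

In `review_pr`, a call like `should_review(Path("reviewed.json"), repo, pr_number, payload["pull_request"]["head"]["sha"])` before fetching the diff would also guard against GitHub redelivering the same event.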
What about security? Can the AI leak my code?
This is a valid concern. If you use Claude API, your code goes through Anthropic’s servers. For sensitive codebases, use a self-hosted model like CodeLlama or DeepSeek Coder. The performance drop is minimal, and your code never leaves your infrastructure.