Build a Custom AI-Powered PR Summarizer: A Developer’s Practical Guide

Let’s be honest. Reviewing pull requests is where good teams separate from average ones. But reading through a 400-line diff, context-switching into the problem domain, and mentally reconstructing the author’s intent? That’s pure overhead.

We got tired of it. So we built a bot that does the heavy lifting.

Here’s the exact architecture we used to build an AI-powered PR summarizer that posts structured summaries to every new pull request. It cut our team’s code review prep time by roughly 70%. Not bad for a weekend project.

Why Summarize PRs with AI?

I know what you’re thinking. *Another AI bot that clutters up your repo?*

Hear me out. The problem isn’t the code review itself. It’s the context gathering. You open a PR, scan the title and description (if they wrote one), then start reading diff hunks, trying to piece together what changed and why. That’s the expensive part.

A good PR summarizer does three things:

Extracts the structural changes: new files, deleted files, changed APIs.
Infers the intent: not just *what* changed, but *why*.
Points out risk areas: large deletions, dependency changes, test modifications.

We needed this across 15+ active microservices. Manual summarization was dead on arrival.

The Architecture: Simple, Stateless, Cheap

Here’s our setup. It’s intentionally boring. Boring is reliable.

Trigger: GitHub webhook on `pull_request.opened` and `pull_request.synchronize`.
Processor: A lightweight Flask app (deployed as a Cloud Run function).
Diff fetcher: Direct GitHub API call using the repo’s default token.
Summarizer: OpenAI GPT-4o-mini (costs pennies per summary).
Poster: GitHub API to create a comment or update an existing one.

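No databases. No queues for this use case. If the function fails, the delivery shows up as failed in the webhook's delivery log, where it can be redelivered by hand or through the API (GitHub does not retry failed deliveries on its own), and the next push to the PR triggers a fresh summary anyway. That’s good enough.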
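One thing a stateless public endpoint like this needs is a way to tell real GitHub deliveries from anyone else posting to the URL. GitHub signs each delivery with the webhook's secret and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. Here is a minimal sketch of the check, assuming the secret is configured on the webhook and passed in by the caller; the function name is ours, not part of any library.

```python
import hashlib
import hmac
from typing import Optional


def is_valid_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Return True if the X-Hub-Signature-256 header matches the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    # GitHub computes an HMAC-SHA256 over the raw body using the webhook secret.
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest runs in constant time, so it does not leak how much matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

In a Flask handler this would be called with `request.get_data()` and `request.headers.get("X-Hub-Signature-256")` before the JSON is parsed, returning a 401 when it fails. The check has to run on the raw bytes, not on re-serialized JSON.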

The Code: Your Starting Point

Here’s the core script. It’s concise, with basic error handling, and it reads its credentials from environment variables.

```python
import os
import httpx
from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")

MAX_DIFF_CHARS = 15000  # rough guard against token overload

SYSTEM_PROMPT = """You are an expert code reviewer's assistant. Given a git diff, write a concise summary of the pull request.

Include:
1. **Summary**: What does this PR do in one sentence?
2. **Key Changes**: Bullet list of the most impactful file changes.
3. **Risk Flags**: Any large deletions, dependency changes, or security-sensitive modifications.
4. **Review Focus**: What the reviewer should pay attention to.

Keep it under 250 words. Use Markdown formatting."""

def fetch_diff(payload):
    """Get the unified diff for a PR."""
    headers = {"Authorization": f"Bearer {GITHUB_TOKEN}", "Accept": "application/vnd.github.v3.diff"}
    # Use the PR's REST API URL: the diff media type above is honored there.
    # `diff_url` is a github.com web URL that can redirect (httpx does not
    # follow redirects by default) and does not reliably accept API tokens.
    pr_api_url = payload["pull_request"]["url"]
    resp = httpx.get(pr_api_url, headers=headers, timeout=30.0)
    resp.raise_for_status()

    # Truncate extremely large diffs to avoid token overload
    diff_text = resp.text
    if len(diff_text) > MAX_DIFF_CHARS:
        diff_text = diff_text[:MAX_DIFF_CHARS] + "\n\n... [diff truncated due to size]"
    return diff_text

def generate_summary(diff_text):
    """Call the LLM to produce a structured summary."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Here is the diff:\n\n{diff_text}"}
        ],
        temperature=0.3,
        max_tokens=500
    )
    return resp.choices[0].message.content

def post_comment(payload, summary):
    """Post the summary as a new comment on the PR."""
    headers = {"Authorization": f"Bearer {GITHUB_TOKEN}", "Accept": "application/vnd.github.v3+json"}
    comments_url = payload["pull_request"]["comments_url"]
    body = {
        "body": f"## 🤖 AI PR Summary\n\n{summary}\n\n---\n*Generated automatically. Please verify critical changes.*"
    }
    resp = httpx.post(comments_url, json=body, headers=headers, timeout=10.0)
    # Surface failed posts (bad token, missing permission) instead of ignoring them
    resp.raise_for_status()

@app.route("/webhook", methods=["POST"])
def webhook():
    # silent=True returns None for a missing or malformed JSON body instead of raising
    payload = request.get_json(silent=True) or {}
    # Other event types (e.g. issues) also send action "opened", so check for the PR object too
    if "pull_request" not in payload or payload.get("action") not in ["opened", "synchronize"]:
        return jsonify({"status": "ignored"}), 200

    try:
        diff = fetch_diff(payload)
        summary = generate_summary(diff)
        post_comment(payload, summary)
        return jsonify({"status": "ok"}), 200
    except Exception as e:
        # Log to your observability tool
        print(f"PR summarizer failed: {e}")
        return jsonify({"status": "error", "message": str(e)}), 500
```

That’s it. The heavy lifting is in the prompt design and the diff truncation strategy.

Prompt Engineering: The Secret Sauce

We iterated on the system prompt for about a week. Here’s what we learned:

Be specific about format. If you ask for a “summary” you get a paragraph. If you ask for a structured breakdown with headers, you get something actually useful.

Set a length budget. We cap the summary at 250 words in the prompt, with `max_tokens=500` on the API call as a hard backstop. Any longer and reviewers start skimming instead of reading.

Include a confidence disclaimer. That line at the bottom of the comment? Non-negotiable. AI gets things wrong. Reviewers should never blindly trust a summary.

Handling Edge Cases (Because You Will Hit Them)

In production, we ran into three issues:

Very large diffs. Some PRs touch 50+ files. The full diff can exceed 30k tokens. Our truncation strategy (first 15k characters) isn’t perfect, but it works for 90% of cases. The remaining 10%? Those PRs probably shouldn’t exist anyway.

Binary file changes. Diffs for images or compiled binaries are useless. We added a quick filter to skip non-textual diffs before sending to the LLM.
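A filter like this can be sketched in a few lines. It assumes the input is a standard git diff, where each file's section starts with a `diff --git` header and binary changes are marked with a `Binary files ... differ` line or a `GIT binary patch` block. The function names are illustrative, not taken from our production code.

```python
import re

# Each file's section in a git diff starts with a "diff --git" header line.
_FILE_HEADER = re.compile(r"^diff --git ", re.MULTILINE)


def split_diff_by_file(diff_text: str) -> list:
    """Split a git diff into one chunk per file."""
    starts = [m.start() for m in _FILE_HEADER.finditer(diff_text)]
    if not starts:
        # No file headers found: keep the text as a single chunk, if any.
        return [diff_text] if diff_text.strip() else []
    ends = starts[1:] + [len(diff_text)]
    return [diff_text[s:e] for s, e in zip(starts, ends)]


def is_binary_section(section: str) -> bool:
    """True if git marked this file's change as binary."""
    # Content lines inside hunks start with ' ', '+' or '-', so only git's
    # own marker lines can match these prefixes.
    return any(
        line.startswith(("Binary files ", "GIT binary patch"))
        for line in section.splitlines()
    )


def drop_binary_sections(diff_text: str) -> str:
    """Return the diff with binary file sections removed."""
    kept = [s for s in split_diff_by_file(diff_text) if not is_binary_section(s)]
    return "".join(kept)
```

Running this before truncation also means binary noise no longer eats into the 15k-character budget.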

Race conditions. If a developer pushes twice quickly, the bot might comment twice. We added a dedup check: if the bot’s previous comment exists, edit it instead of creating a new one.
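The dedup check can be sketched as an "upsert": list the PR's comments, look for one carrying a hidden marker, and either edit it or create a new one. The marker string and function names below are our own choices for illustration. `client` is assumed to be anything with httpx-style `get`/`post`/`patch` methods, such as an `httpx.Client`.

```python
MARKER = "<!-- ai-pr-summary -->"  # hypothetical hidden tag identifying the bot's comment


def find_bot_comment(comments, marker=MARKER):
    """Return the first comment whose body contains the marker, or None."""
    for comment in comments:
        if marker in (comment.get("body") or ""):
            return comment
    return None


def upsert_comment(client, comments_url, headers, text, marker=MARKER):
    """Edit the bot's existing comment if there is one, otherwise create it."""
    body = {"body": f"{marker}\n{text}"}
    # This sketch only checks the first page (up to 100 comments).
    resp = client.get(comments_url, headers=headers, params={"per_page": 100})
    resp.raise_for_status()
    existing = find_bot_comment(resp.json(), marker)
    if existing:
        # Each issue comment object carries its own API `url`, which accepts PATCH.
        resp = client.patch(existing["url"], json=body, headers=headers)
    else:
        resp = client.post(comments_url, json=body, headers=headers)
    resp.raise_for_status()
    return "updated" if existing else "created"
```

This narrows the window but does not close it: two deliveries handled at the same instant can both see no comment and both create one. For a summary bot, the next push cleans that up by editing the first match.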

How This Fits Into Our Team Workflow

We deployed this for a client team out of Ho Chi Minh City. They handle a high volume of PRs across multiple repos. The summarizer runs silently in the background.

The effect was immediate. Instead of reading a diff to understand the scope of a PR, reviewers read the summary first. If the summary says “This PR refactors the payment gateway adapter” they already have the mental model loaded. The actual code review becomes verification, not exploration.

That’s the shift you want.

—

Frequently Asked Questions

Q: How much does this cost to run?

A: We use GPT-4o-mini. Each summary costs roughly $0.01 to $0.03 depending on diff size. For a team averaging 20 PRs per day, the high end works out to 20 × $0.03 × 30 days, or about $18/month. Cheaper than the coffee the reviewers drink.

Q: Does this work with GitHub Enterprise or self-hosted instances?

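A: Yes. Configure the webhook on your instance to point at your endpoint, the same way you would on GitHub Cloud. The script reads its URLs from the webhook payload, so they already point at your instance and nothing needs to change there. If you build API URLs by hand instead, note that GitHub Enterprise Server serves the REST API under `/api/v3` on your own hostname rather than at `api.github.com`.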

Q: How do I handle diffs that are mostly configuration or generated code?

A: Add a pre-filter. If 70%+ of the diff is in `package-lock.json`, `yarn.lock`, or generated protobuf files, skip the LLM call and post a simple “This PR appears to be dependency updates only” comment. Saves tokens and avoids noise.
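Here is a rough sketch of that pre-filter. It counts added and removed lines per file in a git diff and checks what share falls in lockfiles or generated files. The suffix list and function names are illustrative assumptions; adjust them to whatever your repos generate.

```python
# Illustrative suffixes; extend for your own lockfiles and generated code.
NOISE_SUFFIXES = ("package-lock.json", "yarn.lock", ".pb.go", "_pb2.py")
THRESHOLD = 0.7  # the 70% cutoff mentioned above


def changed_lines_by_file(diff_text):
    """Map each file path to its number of added/removed lines."""
    counts = {}
    current, in_hunk = None, False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            # Header looks like: diff --git a/<path> b/<path>
            current = line.rsplit(" b/", 1)[-1]
            counts[current] = 0
            in_hunk = False
        elif line.startswith("@@"):
            # Only count lines after the first hunk header, so the
            # "--- a/..." and "+++ b/..." file headers are not counted.
            in_hunk = True
        elif in_hunk and current is not None and line.startswith(("+", "-")):
            counts[current] += 1
    return counts


def is_mostly_noise(diff_text, suffixes=NOISE_SUFFIXES, threshold=THRESHOLD):
    """True if at least `threshold` of changed lines are in noise files."""
    counts = changed_lines_by_file(diff_text)
    total = sum(counts.values())
    if total == 0:
        return False
    noise = sum(n for path, n in counts.items() if path.endswith(suffixes))
    return noise / total >= threshold
```

When `is_mostly_noise` returns true, the handler can skip the LLM call and post the canned comment instead. The path parsing is deliberately simple and may misread unusual file names, such as paths containing ` b/`.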