We Built a Multi-Agent Open Source Bot on GitHub Actions — Here’s the Exact Architecture
Open source maintenance is a thankless job. You ship code, but you spend your weekends triaging issues, reviewing PRs from strangers, and updating docs nobody reads. It’s burnout fuel.
We got tired of it. So we built a bot.
Not a simple auto-responder. Not a Dependabot clone. We built a three-agent system that runs entirely on GitHub Actions, costs nothing in infrastructure, and handles about 80% of the busywork on a 5,000-star open source project we maintain. It’s been running for six months. Here’s exactly how it works.
The Problem: OSS Maintenance Doesn’t Scale
Let’s be real. Most open source projects die because the maintainer runs out of energy, not because the code is bad. You start with enthusiasm. Six months later, you have 47 open issues, 12 stale PRs, and a sinking feeling every time you open GitHub.
I’ve been there. Our project, a Python data pipeline library, hit 5,000 stars last year. The community grew. So did the noise. We needed a way to:
- Triage issues without reading every single one
- Review PRs for obvious problems before a human looks at them
- Keep docs in sync with code changes
A single automation script wasn’t enough. Each task requires different context, different tools, different failure modes. We needed agents.
The Architecture: Three Specialized Agents, One Workflow
Here’s the high-level design. It’s not fancy. It works.
```text
┌─────────────────────────────────────────────────────┐
│                    GitHub Event                     │
│        (Issue Opened, PR Created, PR Merged)        │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│               GitHub Actions Workflow               │
│      (Orchestrator - routes to correct agent)       │
└──────┬───────────────────┬───────────────────┬──────┘
       │                   │                   │
       ▼                   ▼                   ▼
┌─────────────┐    ┌──────────────┐    ┌──────────────┐
│   Agent 1   │    │   Agent 2    │    │   Agent 3    │
│    Issue    │    │  PR Review   │    │  Docs Sync   │
│   Triage    │    │              │    │              │
└─────────────┘    └──────────────┘    └──────────────┘
```
Each agent is a Python script that calls OpenAI’s API (GPT-4o-mini where cost matters, GPT-4o for complex reviews). The agents share no state: each run is stateless, intended to be idempotent, and cheap, at about $0.01 to $0.05 per invocation.
Agent 1: The Issue Triage Bot
This one handles the highest volume. Every new issue triggers it. The agent reads the issue body, labels it, and either asks for more info or closes it as a duplicate.
Here’s the core logic:
```python
import json

import openai
from github import Github  # PyGithub; the caller uses it to fetch `issue` and `repo`


def triage_issue(issue, repo):
    # issue.body is None when the reporter leaves the description empty
    body = (issue.body or "")[:2000]
    prompt = f"""
    You are a senior maintainer for the open source project '{repo.name}'.
    Analyze this issue and respond with a JSON object:
    {{
      "labels": ["bug", "needs-repro"],
      "confidence": 0.85,
      "action": "ask_for_repro",
      "comment": "Thanks for reporting! Could you provide a minimal reproduction script?"
    }}
    Issue title: {issue.title}
    Issue body: {body}
    Rules:
    - If it's a feature request, label "enhancement"
    - If it's a bug without reproduction steps, label "needs-repro" and ask
    - If it's clearly a duplicate of common issues, label "duplicate" and link
    - If it's a support question, label "question" and redirect to Discord
    - Confidence below 0.7 means flag for human review
    """
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    result = json.loads(response.choices[0].message.content)

    if result["confidence"] >= 0.7:
        # Confident enough to act without a maintainer
        issue.add_to_labels(*result["labels"])
        issue.create_comment(result["comment"])
        if result["action"] == "close":
            issue.edit(state="closed")
    else:
        # Below the threshold: hand the issue to a human
        issue.add_to_labels("needs-human-review")
        issue.create_comment("Flagged for human review due to low confidence.")
```
Honestly, the confidence threshold was the hardest part. We started at 0.9, but it flagged almost everything. Dropped it to 0.7. That’s the sweet spot for a 5K-star repo. You’ll need to tune yours.
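Because the threshold is the part you will tune, it can help to pull the decision out of the API-calling function so it can be tested offline. The sketch below is one possible shape, not the code we run: `TriageDecision`, `decide`, and the `ALLOWED_LABELS` set are hypothetical names, and dropping labels the model invents is an assumption about what you would want.

```python
from dataclasses import dataclass, field

# Labels taken from the prompt rules; anything else the model returns is dropped.
ALLOWED_LABELS = {"bug", "needs-repro", "enhancement", "duplicate", "question"}


@dataclass
class TriageDecision:
    labels: list = field(default_factory=list)
    comment: str = ""
    close: bool = False
    needs_human: bool = False


def decide(result: dict, threshold: float = 0.7) -> TriageDecision:
    """Turn the model's JSON into a decision, falling back to human review."""
    try:
        confidence = float(result.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0  # an unparseable confidence is treated as no confidence

    raw_labels = result.get("labels")
    if not isinstance(raw_labels, list):
        raw_labels = []
    labels = [label for label in raw_labels if label in ALLOWED_LABELS]

    if confidence < threshold or not labels:
        return TriageDecision(labels=["needs-human-review"], needs_human=True)
    return TriageDecision(
        labels=labels,
        comment=str(result.get("comment", "")),
        close=result.get("action") == "close",
    )
```

With this split, trying a different threshold is a one-argument change (`decide(result, threshold=0.8)`), and you can replay old model outputs through it before touching the live bot.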
Results after 3 months:
- 62% of issues resolved without human intervention
- Average time to first response: 4 minutes (was 14 hours)
- False positive rate: 8%
Agent 2: The PR Review Bot
This one’s trickier. PRs have code, and code is nuanced. We don’t let it approve or merge. That’s still human territory. But it catches the low-hanging fruit.
```yaml
# .github/workflows/pr-review.yml
name: PR Review Agent
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run PR Review
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install openai PyGithub
          python pr_reviewer.py
```
The Python script grabs the diff, sends it to GPT-4o (not the mini model—code review needs the big brain), and posts inline comments.
What it checks for:
- Missing error handling
- Hardcoded secrets or API keys
- Logic that contradicts existing tests
- Performance red flags (nested loops in hot paths)
It doesn’t check style. That’s what linters are for. It focuses on *semantic* problems.
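To make the "grab the diff, build a prompt" step concrete, here is a minimal sketch of how a script like `pr_reviewer.py` might split a unified diff per file and assemble the review prompt. It is an illustration under assumptions, not our actual script: `split_diff_by_file` and `build_review_prompt` are hypothetical helpers, and the header regex does not handle paths that contain spaces.

```python
import re

# The four checks listed above, reused verbatim in the prompt.
CHECKS = [
    "Missing error handling",
    "Hardcoded secrets or API keys",
    "Logic that contradicts existing tests",
    "Performance red flags (nested loops in hot paths)",
]

# Matches the per-file header that `git diff` emits.
_FILE_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def split_diff_by_file(diff_text: str) -> dict:
    """Split a unified git diff into {new_path: diff_chunk}."""
    chunks = {}
    current = None
    for line in diff_text.splitlines():
        match = _FILE_HEADER.match(line)
        if match:
            current = match.group(2)
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "\n".join(lines) for path, lines in chunks.items()}


def build_review_prompt(path: str, chunk: str) -> str:
    """Build a prompt that asks for semantic problems only, not style."""
    checklist = "\n".join(f"- {item}" for item in CHECKS)
    return (
        f"Review the following diff for {path}.\n"
        "Report only semantic problems; ignore style.\n"
        f"Check for:\n{checklist}\n\n"
        f"Diff:\n{chunk}"
    )
```

Reviewing one file at a time keeps each prompt small and makes it easier to attach a comment to the right file, at the cost of losing cross-file context.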
We had a PR last week where a contributor tried to add a retry loop with exponential backoff but forgot the jitter. The bot caught it, commented with a code suggestion, and the contributor fixed it in 10 minutes. That’s a win.
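For readers who have not met the term, jitter means randomizing each retry delay so that many clients do not retry in lockstep. The snippet below is a generic illustration of the "full jitter" variant, not the suggestion the bot actually posted; `backoff_delays` and its parameters are hypothetical.

```python
import random


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0, rng=None):
    """Yield one randomized ("full jitter") delay per retry attempt.

    The ceiling doubles on every attempt up to `cap`, and the actual delay is
    drawn uniformly between 0 and that ceiling.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0, ceiling)
```

A retry loop would call `time.sleep(delay)` for each value it pulls from `backoff_delays(5)` before trying again.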
Stats:
- 34% of PRs get at least one suggestion
- 70% of suggestions are accepted by contributors
- Average review time: 12 seconds
Agent 3: The Docs Sync Bot
This one runs on every merge to `main`. It checks if the PR changed any public API surface (function signatures, class names, new modules) and updates the corresponding docs.
```python
def check_api_changes(diff):
    prompt = f"""
    Analyze this git diff and extract all public API changes.
    Return a JSON array of changes:
    [
      {{
        "type": "new_function",
        "name": "process_batch",
        "file": "src/processor.py",
        "signature": "def process_batch(items: list, timeout: int = 30) -> dict",
        "doc_needed": true
      }}
    ]
    Diff:
    {diff[:3000]}
    """
    # ... API call and parsing ...
```
If it detects changes, it opens a new PR with the doc updates. Human reviews that PR. We don’t auto-merge docs either. That’s where hallucinations bite you.
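The snippet above elides the parsing step. As a hedged sketch of what defensive parsing could look like (this is not the project's code), the hypothetical `parse_api_changes` below keeps only entries that are public and marked `doc_needed`. It accepts either a bare array or an object wrapping the array under a `changes` key; that key is an assumption, useful if you call the API in JSON-object mode as the triage agent does.

```python
import json


def parse_api_changes(raw: str) -> list:
    """Return the public API changes that need docs; empty list on bad input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []

    # Accept {"changes": [...]} as well as a bare [...] (the key is an assumption).
    if isinstance(data, dict):
        data = data.get("changes", [])
    if not isinstance(data, list):
        return []

    changes = []
    for item in data:
        if not isinstance(item, dict):
            continue
        name = str(item.get("name", ""))
        # Skip unnamed entries and private names (leading underscore).
        if not name or name.startswith("_"):
            continue
        if item.get("doc_needed") is True:
            changes.append(item)
    return changes
```

Returning an empty list on malformed output means a bad model response results in no docs PR rather than a crash or a garbage PR.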
The Workflow Orchestrator
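All three agents are triggered by a single GitHub Actions workflow file. The `on:` block declares which events start the workflow, and each job’s `if:` condition on `github.event_name` routes the event to the right agent: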
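```yaml
name: OSS Maintenance Bot
on:
  issues:
    types: [opened]
  pull_request:
    types: [opened, synchronize]
  push:
    branches: [main]
jobs:
  # Dependency install and secrets are omitted here for brevity;
  # see the env and pip steps in pr-review.yml.
  triage:
    if: github.event_name == 'issues'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # the agent scripts live in the repo
      - run: python agents/triage.py
  review:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python agents/review.py
  docs:
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python agents/docs_sync.py
```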
Simple. No complex orchestration framework. No Kubernetes. Just YAML and Python.
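On the Python side, each agent has to find out which issue or PR triggered it. GitHub Actions exposes the event name in `GITHUB_EVENT_NAME` and writes the webhook payload to the JSON file named by `GITHUB_EVENT_PATH`. A minimal sketch of reading them follows; `load_event` and `issue_number` are hypothetical helpers, not part of our scripts.

```python
import json
import os


def load_event(env=os.environ):
    """Return (event_name, payload) for the current workflow run."""
    name = env.get("GITHUB_EVENT_NAME", "")
    path = env.get("GITHUB_EVENT_PATH")
    if not path:
        # Not running inside Actions (for example, a local test run)
        return name, {}
    with open(path, encoding="utf-8") as handle:
        return name, json.load(handle)


def issue_number(payload: dict):
    """Issue events carry the number under payload["issue"]["number"]."""
    return payload.get("issue", {}).get("number")
```

Passing `env` as a parameter keeps the function testable: locally you can hand it a plain dict that points at a saved payload file.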
What We Learned (The Hard Way)
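1. Rate limits are real. GitHub API calls made from a workflow count toward a rate limit; with the built-in `GITHUB_TOKEN` that limit is 1,000 requests per hour per repository. We hit it twice in the first week. Solution: add a 2-second delay between API calls and use a personal access token, which gets a higher limit (5,000 requests per hour).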
2. GPT-4o-mini is great for triage, terrible for code. We tried using the cheap model for PR review. It missed obvious bugs. Don’t skimp on code review. Use the full model.
3. You need a human-in-the-loop for anything with write access. The triage bot once labeled a critical security vulnerability as “question” because the reporter wrote it poorly. Confidence was 0.65, so it flagged for human review. Good thing.
4. Context window matters. Issue bodies can be huge, so we truncate them to 2000 characters. PR diffs are capped at 3000 lines: if a diff is bigger, the bot skips it and comments “Diff too large for automated review.” (The `diff[:3000]` slice in the docs agent is a separate limit and counts characters, not lines.)
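The size limits from point 4 fit in a few lines. This is a sketch under the numbers quoted above, with hypothetical names (`clip_issue_body`, `diff_or_skip`), not a copy of our scripts.

```python
MAX_ISSUE_CHARS = 2000
MAX_DIFF_LINES = 3000
SKIP_COMMENT = "Diff too large for automated review."


def clip_issue_body(body):
    """Return at most MAX_ISSUE_CHARS characters; tolerate an empty (None) body."""
    return (body or "")[:MAX_ISSUE_CHARS]


def diff_or_skip(diff: str):
    """Return (diff, None) if the diff is reviewable, else (None, SKIP_COMMENT)."""
    if len(diff.splitlines()) > MAX_DIFF_LINES:
        return None, SKIP_COMMENT
    return diff, None
```

Skipping an oversized diff outright, rather than reviewing a truncated one, avoids comments based on half a change.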
Why This Matters for Your Team
You don’t need a dedicated DevOps person to run this. You don’t need a multi-agent platform. You need:
- A GitHub Actions workflow
- A Python script per agent
- An OpenAI API key
Total cost: about $20/month for a moderately active repo.
But here’s the thing—this approach works because each agent has a narrow, well-defined job. That’s the secret to multi-agent systems. Not swarms. Not complex routing. Just focused agents that do one thing well.
At ECOA AI, we’ve been applying this same pattern for client projects. We have a team of senior developers in Ho Chi Minh City who specialize in building these kinds of automation pipelines. They’ve taken this open source bot pattern and adapted it for internal CI/CD, customer support triage, and even automated documentation generation for enterprise clients.
The Stack Summary
| Component | Technology | Cost |
|---|---|---|
| Orchestration | GitHub Actions | Free (public repos) |
| Agent runtime | Python 3.12 | Free |
| LLM for triage | GPT-4o-mini | ~$0.01/call |
| LLM for code review | GPT-4o | ~$0.05/call |
| GitHub API | PyGithub | Free |
| Secret management | GitHub Secrets | Free |
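To see how the per-call prices relate to the roughly $20/month figure, here is a back-of-the-envelope estimator. The call volumes in the usage note are made up for illustration, and the docs agent is left out because the table gives no per-call price for it.

```python
TRIAGE_COST = 0.01  # approx. dollars per GPT-4o-mini triage call (from the table)
REVIEW_COST = 0.05  # approx. dollars per GPT-4o review call (from the table)


def monthly_cost(triage_calls: int, review_calls: int) -> float:
    """Estimate monthly LLM spend in dollars, rounded to cents."""
    return round(triage_calls * TRIAGE_COST + review_calls * REVIEW_COST, 2)
```

For example, a hypothetical month with 500 triage calls and 300 review calls comes to 500 × $0.01 + 300 × $0.05 = $5 + $15 = $20, so `monthly_cost(500, 300)` returns `20.0`. Note that the `synchronize` trigger fires a review on every push to an open PR, so review calls can outnumber PRs.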
Give It a Try
Clone our open source bot template (yes, we open-sourced the bot that maintains our open source project—meta, I know). Modify the prompts for your repo. Tune the thresholds. Watch your weekends free up.
One last thing: don’t try to automate everything. Keep the human in the loop for merges, sensitive issues, and anything that requires taste. The bot handles the boring stuff. You handle the interesting stuff.
—
Frequently Asked Questions
Q: Can I run this with a local LLM instead of OpenAI?
Yes, but expect worse results for code review. We tested with Llama 3.1 70B via Ollama. It worked for triage but missed subtle bugs in PRs. For cost-sensitive setups, use GPT-4o-mini for triage and local models for docs sync. Keep GPT-4o for code review.
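One way to express that split is a small routing table keyed by agent. The sketch below is an assumption-laden illustration: `ROUTES` and `settings_for` are hypothetical names, the URL is Ollama's default local OpenAI-compatible endpoint, and `llama3.1:70b` is the Ollama tag for the model we tested.

```python
# Ollama serves an OpenAI-compatible API under /v1 on its default port.
OLLAMA_BASE_URL = "http://localhost:11434/v1"

# base_url=None means "use the OpenAI default endpoint".
ROUTES = {
    "triage": {"base_url": None, "model": "gpt-4o-mini"},
    "review": {"base_url": None, "model": "gpt-4o"},
    "docs": {"base_url": OLLAMA_BASE_URL, "model": "llama3.1:70b"},
}


def settings_for(agent: str) -> dict:
    """Return a copy of the endpoint and model settings for one agent."""
    try:
        return dict(ROUTES[agent])
    except KeyError:
        raise ValueError(f"unknown agent: {agent!r}") from None
```

An agent would then build its client from these settings, for instance `openai.OpenAI(base_url=settings["base_url"], api_key=...)`. Ollama ignores the key, but the client still requires a non-empty value.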
Q: How do I handle repos with multiple languages?
The bot is language-agnostic. The prompts don’t reference specific languages. But the code review agent works best on Python, JavaScript, and Go. For niche languages like Elixir or Rust, you’ll need to add language-specific rules to the prompt.
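Language-specific rules can be bolted on by file extension. The sketch below shows the mechanism only; the rule texts are hypothetical examples, not rules we ship, and `extra_rules` is an invented helper name.

```python
from pathlib import PurePosixPath

# Hypothetical example rules, keyed by file extension.
_ELIXIR_RULE = "Elixir: flag `{:ok, value} = call()` matches that crash on error tuples."
LANGUAGE_RULES = {
    ".rs": "Rust: flag unwrap() or expect() on fallible results in library code.",
    ".ex": _ELIXIR_RULE,
    ".exs": _ELIXIR_RULE,
}


def extra_rules(paths: list) -> str:
    """Return prompt lines for the languages present in a PR, without duplicates."""
    seen = []
    for path in paths:
        rule = LANGUAGE_RULES.get(PurePosixPath(path).suffix.lower())
        if rule and rule not in seen:
            seen.append(rule)
    return "\n".join(f"- {rule}" for rule in seen)
```

The returned text can be appended to the review prompt; for a PR that touches only Python files it is an empty string, so the base prompt stays unchanged.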
Q: What happens if the bot makes a mistake?
Every action is reversible. The triage bot adds labels and comments and, at most, closes an issue, which a maintainer can reopen; it never deletes anything. The PR bot only comments, never approves. The docs bot opens a PR that requires human review. We designed it so the worst case is a wrong label or a bad comment, not corrupted data.
Q: How do I prevent the bot from going over OpenAI rate limits?
Use `tenacity` for retries with exponential backoff, for example `stop=stop_after_attempt(3)` together with `wait=wait_exponential(min=1)`. Also, cache responses for identical inputs using a simple SQLite database. We cache triage results for 24 hours to avoid re-processing the same issue.
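A 24-hour cache like that needs only the standard library. The following is a minimal sketch, not our production code: `TriageCache`, its table layout, and the choice to key on a hash of title and body are all assumptions.

```python
import hashlib
import json
import sqlite3
import time

TTL_SECONDS = 24 * 60 * 60  # cache entries expire after 24 hours


class TriageCache:
    def __init__(self, path: str = "triage_cache.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL, created REAL NOT NULL)"
        )

    @staticmethod
    def key_for(title: str, body: str) -> str:
        """Identical inputs hash to the same key."""
        return hashlib.sha256(f"{title}\n{body}".encode("utf-8")).hexdigest()

    def get(self, key: str, now=None):
        """Return the cached result, or None if missing or older than the TTL."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT value, created FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or now - row[1] > TTL_SECONDS:
            return None
        return json.loads(row[0])

    def put(self, key: str, result: dict, now=None) -> None:
        """Store or overwrite the result for this key."""
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO cache (key, value, created) VALUES (?, ?, ?)",
            (key, json.dumps(result), now),
        )
        self.db.commit()
```

One caveat: GitHub-hosted runners start from a clean machine on every run, so the database file has to be persisted between runs (for example with `actions/cache`) for the cache to have any effect.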