GitHub PR review bot powered by Claude
PR review is one of the two LLM use-cases (alongside support) that companies actively spend on in 2026. Building one teaches webhook handling, tool use, prompt iteration against real diffs, and the messy reality of 'is this comment useful?' evals.
Resume bullet (when finished)
“Built a GitHub App that posts Claude-authored PR review comments inline, calling tools to fetch diffs and file context; ran on 120 internal PRs with a 71% reviewer-thumbs-up rate.”
Locked tech stack
No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.
Milestones (6 · ~25h)
- M1~3h
GitHub App + webhook receiver
FastAPI endpoint verifies HMAC signature, parses `pull_request` events.
CHECK BEFORE MOVING ON:
- Why HMAC verification first, before anything else?
- What's the difference between GitHub Apps and OAuth Apps for this use case?
$ git commit -m "feat(bot): webhook receiver with HMAC verification" - M2~3h
Fetch PR diff + context
Use PyGithub to fetch the unified diff + the files touched + the README/CONTRIBUTING for repo context.
CHECK BEFORE MOVING ON:
- Why include CONTRIBUTING.md in context, not just the diff?
- What's the size limit you should enforce and why?
$ git commit -m "feat(context): diff + repo context fetcher" - M3~6h
Claude tool use loop
Two tools: `fetch_file(path)` and `search_code(query)`. Claude can call them while drafting comments.
CHECK BEFORE MOVING ON:
- Why give Claude tools rather than dumping the whole repo into context?
- What's the failure mode if a tool returns 30k tokens?
$ git commit -m "feat(claude): tool use loop with fetch/search" - M4~4h
Post inline review comments
Each Claude finding becomes an inline GitHub review comment via the PullsAPI review endpoint.
CHECK BEFORE MOVING ON:
- Inline vs general review comments — when each?
- What happens if Claude points at a line that no longer exists?
$ git commit -m "feat(review): inline review comments" - M5~5h
Golden eval suite
30 fixture PRs with hand-graded expected comments. Suite gives a 0–1 score per PR. CI runs it on every change to the prompt.
CHECK BEFORE MOVING ON:
- What does 'golden eval' mean for a non-deterministic LLM?
- Why does the prompt need a CI gate?
$ git commit -m "test(eval): 30-PR golden eval suite" - M6~4h
Feedback loop
Reviewers can 👍 / 👎 the bot's comments. Aggregate score lands in a Postgres table read by the eval suite.
CHECK BEFORE MOVING ON:
- Why save the reaction, not just the count?
- What's the right place to surface the score back to the prompt author?
$ git commit -m "feat: thumbs-up/down feedback loop + metrics"
60-second demo storyboard
What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.
- 0-5s: 'GitHub PR bot — Claude reads the diff, calls tools, posts inline comments.'
- 5-25s: open a real PR, watch the bot leave 4 inline comments live.
- 25-45s: show the golden eval suite + CI run gating prompt changes.
- 45-60s: 71% thumbs-up rate, 120 PRs.
STAR talking points for behavioral round
STAR — EVAL DISCIPLINE
Situation: every prompt tweak felt right in spot-checks but I had no idea if I was actually improving. Task: build an offline eval. Action: assembled 30 fixture PRs, hand-graded the ideal comment set, scored each prompt version. Result: I caught a regression on iteration #14 that would have shipped silently — the 'helpful'-feeling prompt scored 0.42 vs the previous 0.61.
STAR — TOOL DESIGN
Situation: Claude was guessing at file contents instead of asking. Task: make 'ask' easier than 'guess'. Action: gave it `fetch_file` + `search_code` and explicitly listed examples of when to use them. Result: hallucinated-context comments dropped from 18% to under 4%.
Production references — how grown-up systems do this
GitHub Copilot →
Copilot for PRs is the consumer-facing example of this exact pattern — the public docs describe the tool-use approach.
Anthropic →
Anthropic's tool use cookbook is the canonical reference for safe agentic loops.
Greptile →
Greptile (YC W24) is the leading 'AI code review' product — their public posts on eval design are required reading.
Self-review rubric (before you claim done)
Correctness
- Webhook signature verification mandatory.
- Reviews post inline on the right line/file.
- Bot never reviews its own commits.
- Failure modes (rate limit, expired token) handled gracefully.
Code quality
- Prompt lives in a versioned file, not a string in code.
- Tool definitions and handlers separated.
- Async throughout — webhooks return in <2s, work done in background.
- Eval suite is committed and reproducible.
Testing
- Golden eval suite committed.
- Webhook handler tested with fixture payloads.
- Integration test exercises the full diff→tool-loop→comment path.
Docs
- README explains the eval methodology.
- Architecture diagram with the tool-use loop.
- Section: 'how would you adapt this for a private GHE instance?'
✱ AI code review
Get a senior-style review before you call it done
Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.
Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.
Need Python first? Start Foundations →