L3 · AI Engineering · Developer tools · 18-30h total

GitHub PR review bot powered by Claude

PR review is one of the two LLM use-cases (alongside support) that companies actively spend on in 2026. Building one teaches webhook handling, tool use, prompt iteration against real diffs, and the messy reality of 'is this comment useful?' evals.

Resume bullet (when finished)

Built a GitHub App that posts Claude-authored PR review comments inline, calling tools to fetch diffs and file context; ran on 120 internal PRs with a 71% reviewer-thumbs-up rate.

Locked tech stack

No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.

Python 3.12 · FastAPI · Anthropic SDK · PyGithub · PostgreSQL · Docker · GitHub Webhooks

Milestones (6 · ~25h)

  1. M1 · ~3h

    GitHub App + webhook receiver

    FastAPI endpoint verifies HMAC signature, parses `pull_request` events.

    CHECK BEFORE MOVING ON:

    • Why HMAC verification first, before anything else?
    • What's the difference between GitHub Apps and OAuth Apps for this use case?
    $ git commit -m "feat(bot): webhook receiver with HMAC verification"
  2. M2 · ~3h

    Fetch PR diff + context

    Use PyGithub to fetch the unified diff + the files touched + the README/CONTRIBUTING for repo context.

    CHECK BEFORE MOVING ON:

    • Why include CONTRIBUTING.md in context, not just the diff?
    • What's the size limit you should enforce and why?
    $ git commit -m "feat(context): diff + repo context fetcher"
  3. M3 · ~6h

    Claude tool use loop

    Two tools: `fetch_file(path)` and `search_code(query)`. Claude can call them while drafting comments.

    CHECK BEFORE MOVING ON:

    • Why give Claude tools rather than dumping the whole repo into context?
    • What's the failure mode if a tool returns 30k tokens?
    $ git commit -m "feat(claude): tool use loop with fetch/search"
  4. M4 · ~4h

    Post inline review comments

    Each Claude finding becomes an inline GitHub review comment via the Pulls API review endpoint.

    CHECK BEFORE MOVING ON:

    • Inline vs general review comments — when each?
    • What happens if Claude points at a line that no longer exists?
    $ git commit -m "feat(review): inline review comments"
  5. M5 · ~5h

    Golden eval suite

    30 fixture PRs with hand-graded expected comments. Suite gives a 0–1 score per PR. CI runs it on every change to the prompt.

    CHECK BEFORE MOVING ON:

    • What does 'golden eval' mean for a non-deterministic LLM?
    • Why does the prompt need a CI gate?
    $ git commit -m "test(eval): 30-PR golden eval suite"
  6. M6 · ~4h

    Feedback loop

    Reviewers can 👍 / 👎 the bot's comments. Aggregate score lands in a Postgres table read by the eval suite.

    CHECK BEFORE MOVING ON:

    • Why save the reaction, not just the count?
    • What's the right place to surface the score back to the prompt author?
    $ git commit -m "feat: thumbs-up/down feedback loop + metrics"
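
For M1, GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the hex digest in the `X-Hub-Signature-256` header, prefixed with `sha256=`. A minimal sketch of the check, using only the standard library; the function name `verify_signature` is an assumption for illustration, and in the real build it would sit at the top of the FastAPI handler, before any JSON parsing:

```python
import hashlib
import hmac
from typing import Optional


def verify_signature(secret: bytes, body: bytes, signature_header: Optional[str]) -> bool:
    """Return True only if the header equals 'sha256=' + HMAC-SHA256(secret, raw body)."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False  # missing or malformed header: reject without computing anything
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes constant time, so response timing does not leak the digest.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The digest has to be computed over the exact bytes received, so the handler should read the raw body rather than re-serializing parsed JSON.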

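For M6, one reason to store each reaction as its own row is that you can re-aggregate later, for example per prompt version, which a running count cannot do. A rough sketch under stated assumptions: the `Reaction` fields and `thumbs_up_rate_by_version` are hypothetical names, and the list stands in for rows read from the Postgres table:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    comment_id: int
    reviewer: str
    prompt_version: str  # hypothetical column: which prompt produced the comment
    thumbs_up: bool


def thumbs_up_rate_by_version(reactions):
    """Thumbs-up rate per prompt version, counting each reviewer once per comment."""
    latest = {}
    for r in reactions:  # assumes rows arrive oldest first, so later votes win
        latest[(r.comment_id, r.reviewer)] = r
    ups, totals = defaultdict(int), defaultdict(int)
    for r in latest.values():
        totals[r.prompt_version] += 1
        ups[r.prompt_version] += int(r.thumbs_up)
    return {version: ups[version] / totals[version] for version in totals}
```

With only a stored count, a reviewer who flips 👍 to 👎 would be indistinguishable from two different reviewers.
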
60-second demo storyboard

What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.

  1. 0-5s: 'GitHub PR bot — Claude reads the diff, calls tools, posts inline comments.'
  2. 5-25s: open a real PR, watch the bot leave 4 inline comments live.
  3. 25-45s: show the golden eval suite + CI run gating prompt changes.
  4. 45-60s: 71% thumbs-up rate, 120 PRs.

STAR talking points for behavioral round

STAR — EVAL DISCIPLINE

Situation: every prompt tweak felt right in spot-checks but I had no idea if I was actually improving. Task: build an offline eval. Action: assembled 30 fixture PRs, hand-graded the ideal comment set, scored each prompt version. Result: I caught a regression on iteration #14 that would have shipped silently — the 'helpful'-feeling prompt scored 0.42 vs the previous 0.61.
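
The source does not say how the 0–1 score is computed. One plausible choice, shown here purely as an assumption, is F1 between the hand-graded findings and the bot's findings, with each finding reduced to a `(path, line)` pair. `score_pr` and `suite_score` are hypothetical names:

```python
def score_pr(expected: set, produced: set) -> float:
    """F1 between expected and produced (path, line) findings, in [0, 1]."""
    if not expected and not produced:
        return 1.0  # clean PR and the bot stayed quiet
    hits = len(expected & produced)
    if hits == 0:
        return 0.0
    precision = hits / len(produced)  # how many bot comments were wanted
    recall = hits / len(expected)     # how many wanted comments the bot made
    return 2 * precision * recall / (precision + recall)


def suite_score(cases: list) -> float:
    """Mean per-PR score over (expected, produced) pairs from the fixture PRs."""
    return sum(score_pr(expected, produced) for expected, produced in cases) / len(cases)
```

Because the model's output varies between runs, a CI gate built on this would likely compare the suite score to the previous version's score with some tolerance, rather than expect identical comments.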

STAR — TOOL DESIGN

Situation: Claude was guessing at file contents instead of asking. Task: make 'ask' easier than 'guess'. Action: gave it `fetch_file` + `search_code` and explicitly listed examples of when to use them. Result: hallucinated-context comments dropped from 18% to under 4%.
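
A sketch of what the two tools might look like. The `name` / `description` / `input_schema` shape and the `tool_result` block follow the Anthropic Messages API tool format; everything else is an assumption for illustration: the in-memory `repo` dict stands in for real GitHub fetches, and `MAX_TOOL_CHARS` is a made-up cap that addresses the oversized-tool-output failure mode:

```python
MAX_TOOL_CHARS = 8_000  # hypothetical cap; tune against your context budget

TOOLS = [
    {
        "name": "fetch_file",
        "description": "Return the full contents of one file at the PR head. "
                       "Use it when the diff references code you cannot see.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "search_code",
        "description": "Find lines containing a literal string, e.g. to locate "
                       "callers of a function the PR changes.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]


def fetch_file(repo: dict, path: str) -> str:
    return repo.get(path, f"ERROR: no such file: {path}")


def search_code(repo: dict, query: str) -> str:
    hits = [f"{path}:{n}: {line}"
            for path, text in sorted(repo.items())
            for n, line in enumerate(text.splitlines(), 1) if query in line]
    return "\n".join(hits) or "no matches"


HANDLERS = {"fetch_file": fetch_file, "search_code": search_code}


def run_tool(repo: dict, tool_use: dict) -> dict:
    """Dispatch one tool call and wrap the (possibly truncated) output as a tool_result."""
    handler = HANDLERS.get(tool_use["name"])
    try:
        output = handler(repo, **tool_use["input"]) if handler else f"ERROR: unknown tool {tool_use['name']}"
    except TypeError:
        output = "ERROR: bad arguments"  # the model sent keys the handler does not accept
    if len(output) > MAX_TOOL_CHARS:
        output = output[:MAX_TOOL_CHARS] + "\n[truncated]"
    return {"type": "tool_result", "tool_use_id": tool_use["id"], "content": output}
```

Returning errors as text instead of raising lets the model see the failure and try a different path or query.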

Production references — how grown-up systems do this

GitHub Copilot

Copilot for PRs is the consumer-facing example of this exact pattern — the public docs describe the tool-use approach.

Anthropic

Anthropic's tool use cookbook is the canonical reference for safe agentic loops.

Greptile

Greptile (YC W24) is the leading 'AI code review' product — their public posts on eval design are required reading.

Self-review rubric (before you claim done)

Correctness

  • Webhook signature verification mandatory.
  • Reviews post inline on the right line/file.
  • Bot never reviews its own commits.
  • Failure modes (rate limit, expired token) handled gracefully.
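
One way to keep comments on the right line is to check each finding against the lines that actually appear in the file's diff before posting, since an inline comment on a line outside the diff is likely to be rejected. A sketch that parses a unified-diff patch for one file; `commentable_lines` and `split_findings` are hypothetical helpers, and the finding dicts are an assumed shape:

```python
import re

# Hunk header: @@ -old_start,old_len +new_start,new_len @@ (lengths are optional)
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def commentable_lines(patch: str) -> set:
    """New-file line numbers (added or context) that appear in a unified-diff patch."""
    lines = set()
    new_line = None
    for row in patch.splitlines():
        match = HUNK_RE.match(row)
        if match:
            new_line = int(match.group(1))  # numbering restarts at each hunk
            continue
        if new_line is None:
            continue  # file headers before the first hunk
        if row.startswith("-") or row.startswith("\\"):
            continue  # removed lines and "\ No newline" markers have no new-side number
        lines.add(new_line)  # "+" (added) or " " (context)
        new_line += 1
    return lines


def split_findings(findings: list, patch: str) -> tuple:
    """Separate findings that can be posted inline from those that cannot."""
    valid = commentable_lines(patch)
    inline = [f for f in findings if f["line"] in valid]
    fallback = [f for f in findings if f["line"] not in valid]
    return inline, fallback
```

Findings in `fallback` could be folded into the general review body instead of being dropped.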

Code quality

  • Prompt lives in a versioned file, not a string in code.
  • Tool definitions and handlers separated.
  • Async throughout — webhooks return in <2s, work done in background.
  • Eval suite is committed and reproducible.

Testing

  • Golden eval suite committed.
  • Webhook handler tested with fixture payloads.
  • Integration test exercises the full diff→tool-loop→comment path.

Docs

  • README explains the eval methodology.
  • Architecture diagram with the tool-use loop.
  • Section: 'how would you adapt this for a private GHE instance?'

✱ AI code review

Get a senior-style review before you call it done

Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.

Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.
