Skip to main content
← All projects
L3AI Engineering· 25-40h total

AI Telegram bot with conversation memory & tool use

AI Engineering is one of the fastest-growing junior areas right now, and the entry signal teams look for is 'has shipped a real LLM-backed system, not a demo'. This project covers the whole stack — API integration, memory, tools, deployment, evals — in one resume bullet.

Open in GitHub Codespaces· free 60h/moOpen in Gitpod

Resume bullet (when finished)

Shipped a production AI Telegram bot with multi-turn conversation memory, tool use (web search + calculator), Redis persistence, webhook deployment, and a small eval suite. 99.9% uptime over the first month.

Locked tech stack

No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.

Python 3.12Anthropic SDK (Claude Opus 4.6)python-telegram-bot v21Redis (conversation memory + rate limit)FastAPI (webhook receiver)Fly.io (deploy)pytest + httpx

Milestones (7 · ~29h)

  1. M1~3h

    Local echo bot — Telegram webhook reaches FastAPI

    Set up a bot via @BotFather. ngrok tunnels localhost. /api/telegram/webhook receives Update payloads and echoes the text back.

    CHECK BEFORE MOVING ON:

    • What's the difference between long-polling and webhooks for a Telegram bot?
    • Why must your webhook always respond <5s, even if the LLM call takes longer?
    $ git commit -m "feat: telegram webhook receiver + echo behaviour"
  2. M2~3h

    Single-turn Claude call

    Replace the echo with a Claude API call (claude-opus-4-6, streaming off for simplicity). Reply text comes back to Telegram via the bot API.

    CHECK BEFORE MOVING ON:

    • Why call the API server-side instead of from the user's device?
    • What's `max_tokens` for and what happens if you set it too high?
    $ git commit -m "feat(ai): single-turn Claude response"
  3. M3~5h

    Multi-turn memory via Redis

    Store conversation history keyed by `chat_id` in Redis. Pass last 10 messages (or last 4000 tokens — whichever is smaller) on each turn.

    CHECK BEFORE MOVING ON:

    • Why does the system message live OUTSIDE the rolling history?
    • What's the failure mode if you forget to truncate history?
    $ git commit -m "feat(memory): redis-backed multi-turn conversation context"
  4. M4~6h

    Tool use — calculator + web search

    Declare two tools (`calculator(expression: str)`, `web_search(query: str)`). On `tool_use` stop reason, execute, append `tool_result`, loop until the model emits text.

    CHECK BEFORE MOVING ON:

    • Why does the spec require you to send tool_result back in the SAME ordering as tool_use blocks?
    • What's the safety risk of letting Claude call a calculator that uses `eval()` directly?
    $ git commit -m "feat(tools): calculator + web_search tool use loop"
  5. M5~3h

    Rate limit + cost cap per chat_id

    5 messages/min per chat. 200 messages/day per chat. Friendly throttle message when hit. Each chat capped at $0.20/day in API spend (rough token counting).

    CHECK BEFORE MOVING ON:

    • Why per-chat rate limits matter even for an internal bot — what's the abuse vector?
    • What's a cheap way to estimate Claude input/output tokens without a tokenizer dep?
    $ git commit -m "feat: per-chat rate limit + daily cost cap"
  6. M6~5h

    Eval suite — 30 golden conversations

    Build a `golden.jsonl` of 30 hand-crafted prompts with `must_include` / `must_not_include` checks. CI runs the eval; merge blocked on regression.

    CHECK BEFORE MOVING ON:

    • Why eval BEFORE you optimize the prompt, not after?
    • What's a good mix of happy-path / edge / adversarial in a 30-item eval set?
    $ git commit -m "test: 30-item golden eval suite + CI gate"
  7. M7~4h

    Deploy to Fly.io with secrets + uptime probe

    Multi-stage Dockerfile, fly.toml with health check on /health, Telegram webhook re-registered to the public URL. UptimeRobot pings /health every 5 min.

    CHECK BEFORE MOVING ON:

    • Why store the Anthropic key in `fly secrets set` instead of fly.toml env?
    • What's a good action when /health flaps but uptime says 99.9%?
    $ git commit -m "ops: fly.io deploy with secrets + uptime probe"

60-second demo storyboard

What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.

  1. 0-5s: 'I built an AI Telegram bot — conversation memory, tools, evals, the works.'
  2. 5-15s: Send it 3 messages in a Telegram clip — it remembers context from message 1 in message 3.
  3. 15-30s: 'Calculate 27 * 31 + sqrt(144).' — show tool_use roundtrip, the bot returns 849.
  4. 30-45s: Show the eval suite running in CI — 28/30 passing, the 2 failing are flagged for review.
  5. 45-55s: One architectural decision (e.g. 'I cap each chat at $0.20/day because…') in plain English.
  6. 55-60s: 'Repo + deployment URL. Would love your feedback on the tool-loop error handling.'

STAR talking points for behavioral round

STAR — PRODUCTION INCIDENT

Situation: bot started replying with 'I cannot help with that' to ~10% of valid requests. Task: figure out why. Action: added per-message logging of full Claude response, found the safety classifier was firing on a system-prompt phrase ('act as'). Result: rewrote the system message in instructional tone, false-refusal rate dropped from 10% to under 1%, eval suite caught the regression next iteration.

STAR — DESIGN TRADE-OFF

Situation: had to choose between trimming history by message count vs by token count. Task: pick the right one. Action: chose token count with a 4000-token budget. Reason: long messages from one user shouldn't push out 20 short messages from someone else. Result: more even cost per chat, no truncation surprises in high-volume threads.

STAR — EVALUATION DISCIPLINE

Situation: a prompt change felt better in manual testing but I wasn't sure. Task: prove it objectively. Action: ran both prompts through the 30-item golden eval, compared per-item scores. Result: the 'better' prompt actually regressed on 4 edge cases — caught before deploy. Lesson: evals catch what intuition misses.

Production references — how grown-up systems do this

Anthropic

Anthropic's tool use docs are the canonical reference for the request/response loop and stop_reason handling.

Telegram

Bot API docs — read the Webhook section twice, it's the source of 80% of first-time-bot bugs.

Vercel AI SDK

Different language, same shape — Vercel's AI SDK documents the multi-turn + tool-use pattern in a way that maps cleanly to your Python code.

Self-review rubric (before you claim done)

Correctness

  • Bot remembers the last N user turns and demonstrably uses them.
  • Tool use loops until `stop_reason: end_turn`; partial tool_use never reaches the user.
  • Rate limiter triggers cleanly (visible 'slow down' message) at 6 messages/min.
  • Eval suite passes ≥27/30 on the latest commit; CI fails on regression.

Code quality

  • System message + tool schemas in a separate module — not inline in the handler.
  • Anthropic SDK calls wrapped with retries on 529 (overloaded) and rate-limit responses.
  • All env var access goes through a single config module with validation.
  • Per-chat metric counters (messages, tokens, cost) — observable in logs.

Testing

  • Golden eval suite committed; running it locally takes <2 minutes.
  • At least 2 unit tests for the tool-execution layer (calculator happy + bad-input).
  • Mock Claude responses in unit tests — no real API calls in CI.
  • Integration test exercises the full webhook → Claude → tool → reply roundtrip with a fake Telegram client.

Docs

  • README explains how to provision a bot via @BotFather and where each secret comes from.
  • Architecture diagram: Telegram → webhook → FastAPI → Claude + Redis → Telegram.
  • Three design decisions written up in plain English (history strategy, rate-limit math, deploy choice).
  • Eval suite README explains how to add a new golden item.

✱ AI code review

Get a senior-style review before you call it done

Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.

Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.

Need Python first? Start Foundations →