AI Telegram bot with conversation memory & tool use
AI Engineering is one of the fastest-growing junior areas right now, and the entry signal teams look for is 'has shipped a real LLM-backed system, not a demo'. This project covers the whole stack — API integration, memory, tools, deployment, evals — in one resume bullet.
Resume bullet (when finished)
“Shipped a production AI Telegram bot with multi-turn conversation memory, tool use (web search + calculator), Redis persistence, webhook deployment, and a small eval suite. 99.9% uptime over the first month.”
Locked tech stack
No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.
Milestones (7 · ~29h)
- M1~3h
Local echo bot — Telegram webhook reaches FastAPI
Set up a bot via @BotFather. ngrok tunnels localhost. /api/telegram/webhook receives Update payloads and echoes the text back.
CHECK BEFORE MOVING ON:
- What's the difference between long-polling and webhooks for a Telegram bot?
- Why must your webhook always respond <5s, even if the LLM call takes longer?
$ git commit -m "feat: telegram webhook receiver + echo behaviour" - M2~3h
Single-turn Claude call
Replace the echo with a Claude API call (claude-opus-4-6, streaming off for simplicity). Reply text comes back to Telegram via the bot API.
CHECK BEFORE MOVING ON:
- Why call the API server-side instead of from the user's device?
- What's `max_tokens` for and what happens if you set it too high?
$ git commit -m "feat(ai): single-turn Claude response" - M3~5h
Multi-turn memory via Redis
Store conversation history keyed by `chat_id` in Redis. Pass last 10 messages (or last 4000 tokens — whichever is smaller) on each turn.
CHECK BEFORE MOVING ON:
- Why does the system message live OUTSIDE the rolling history?
- What's the failure mode if you forget to truncate history?
$ git commit -m "feat(memory): redis-backed multi-turn conversation context" - M4~6h
Tool use — calculator + web search
Declare two tools (`calculator(expression: str)`, `web_search(query: str)`). On `tool_use` stop reason, execute, append `tool_result`, loop until the model emits text.
CHECK BEFORE MOVING ON:
- Why does the spec require you to send tool_result back in the SAME ordering as tool_use blocks?
- What's the safety risk of letting Claude call a calculator that uses `eval()` directly?
$ git commit -m "feat(tools): calculator + web_search tool use loop" - M5~3h
Rate limit + cost cap per chat_id
5 messages/min per chat. 200 messages/day per chat. Friendly throttle message when hit. Each chat capped at $0.20/day in API spend (rough token counting).
CHECK BEFORE MOVING ON:
- Why per-chat rate limits matter even for an internal bot — what's the abuse vector?
- What's a cheap way to estimate Claude input/output tokens without a tokenizer dep?
$ git commit -m "feat: per-chat rate limit + daily cost cap" - M6~5h
Eval suite — 30 golden conversations
Build a `golden.jsonl` of 30 hand-crafted prompts with `must_include` / `must_not_include` checks. CI runs the eval; merge blocked on regression.
CHECK BEFORE MOVING ON:
- Why eval BEFORE you optimize the prompt, not after?
- What's a good mix of happy-path / edge / adversarial in a 30-item eval set?
$ git commit -m "test: 30-item golden eval suite + CI gate" - M7~4h
Deploy to Fly.io with secrets + uptime probe
Multi-stage Dockerfile, fly.toml with health check on /health, Telegram webhook re-registered to the public URL. UptimeRobot pings /health every 5 min.
CHECK BEFORE MOVING ON:
- Why store the Anthropic key in `fly secrets set` instead of fly.toml env?
- What's a good action when /health flaps but uptime says 99.9%?
$ git commit -m "ops: fly.io deploy with secrets + uptime probe"
60-second demo storyboard
What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.
- 0-5s: 'I built an AI Telegram bot — conversation memory, tools, evals, the works.'
- 5-15s: Send it 3 messages in a Telegram clip — it remembers context from message 1 in message 3.
- 15-30s: 'Calculate 27 * 31 + sqrt(144).' — show tool_use roundtrip, the bot returns 849.
- 30-45s: Show the eval suite running in CI — 28/30 passing, the 2 failing are flagged for review.
- 45-55s: One architectural decision (e.g. 'I cap each chat at $0.20/day because…') in plain English.
- 55-60s: 'Repo + deployment URL. Would love your feedback on the tool-loop error handling.'
STAR talking points for behavioral round
STAR — PRODUCTION INCIDENT
Situation: bot started replying with 'I cannot help with that' to ~10% of valid requests. Task: figure out why. Action: added per-message logging of full Claude response, found the safety classifier was firing on a system-prompt phrase ('act as'). Result: rewrote the system message in instructional tone, false-refusal rate dropped from 10% to under 1%, eval suite caught the regression next iteration.
STAR — DESIGN TRADE-OFF
Situation: had to choose between trimming history by message count vs by token count. Task: pick the right one. Action: chose token count with a 4000-token budget. Reason: long messages from one user shouldn't push out 20 short messages from someone else. Result: more even cost per chat, no truncation surprises in high-volume threads.
STAR — EVALUATION DISCIPLINE
Situation: a prompt change felt better in manual testing but I wasn't sure. Task: prove it objectively. Action: ran both prompts through the 30-item golden eval, compared per-item scores. Result: the 'better' prompt actually regressed on 4 edge cases — caught before deploy. Lesson: evals catch what intuition misses.
Production references — how grown-up systems do this
Anthropic →
Anthropic's tool use docs are the canonical reference for the request/response loop and stop_reason handling.
Telegram →
Bot API docs — read the Webhook section twice, it's the source of 80% of first-time-bot bugs.
Vercel AI SDK →
Different language, same shape — Vercel's AI SDK documents the multi-turn + tool-use pattern in a way that maps cleanly to your Python code.
Self-review rubric (before you claim done)
Correctness
- Bot remembers the last N user turns and demonstrably uses them.
- Tool use loops until `stop_reason: end_turn`; partial tool_use never reaches the user.
- Rate limiter triggers cleanly (visible 'slow down' message) at 6 messages/min.
- Eval suite passes ≥27/30 on the latest commit; CI fails on regression.
Code quality
- System message + tool schemas in a separate module — not inline in the handler.
- Anthropic SDK calls wrapped with retries on 529 (overloaded) and rate-limit responses.
- All env var access goes through a single config module with validation.
- Per-chat metric counters (messages, tokens, cost) — observable in logs.
Testing
- Golden eval suite committed; running it locally takes <2 minutes.
- At least 2 unit tests for the tool-execution layer (calculator happy + bad-input).
- Mock Claude responses in unit tests — no real API calls in CI.
- Integration test exercises the full webhook → Claude → tool → reply roundtrip with a fake Telegram client.
Docs
- README explains how to provision a bot via @BotFather and where each secret comes from.
- Architecture diagram: Telegram → webhook → FastAPI → Claude + Redis → Telegram.
- Three design decisions written up in plain English (history strategy, rate-limit math, deploy choice).
- Eval suite README explains how to add a new golden item.
✱ AI code review
Get a senior-style review before you call it done
Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.
Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.
Need Python first? Start Foundations →