Is CodeMentor AI free?

The first 15 Python lessons are free with no signup and no credit card. After that, the 7-day Pro trial unlocks every track; cancel anytime. Pro is $12/month or $89/year.

Can I learn Python without installing anything?

Yes. Every lesson runs Python in your browser — Skulpt for lightweight lessons and Pyodide (full CPython) on the playground. No Anaconda, no pyenv, no terminal commands. Open the page and hit Run.

Is CodeMentor AI good for complete beginners?

Yes — the Foundations track starts with print('Hello, World!') and assumes zero programming background. The first 15 lessons are free to verify the difficulty curve matches you before any signup.

Does the AI tutor replace a human mentor?

It replaces 80% of "I'm stuck at 21:00 and Stack Overflow scared me" moments. You get hints calibrated to your code + a chat for follow-up questions. For project review and career advice the team also answers support@learnpython.academy directly.

Can I learn Python for an AI engineering job?

Yes. The AI Engineering track covers production patterns the US dev community uses in 2026 — Claude/LLM APIs, tool use, RAG, agent loops, prompt caching, evals, voice agents. Build production AI features end-to-end.

Are the courses available in languages other than English?

Yes — the platform UI and most lessons are translated into 18 languages including Ukrainian, Russian, Polish, German, French, Spanish, Portuguese, and more. Pick yours in the language switcher.


L4 · Backend · infra · 30-50h total

Distributed task queue with retries, dead-letter, and metrics

Every non-trivial backend has 'do this thing eventually' — emails, image processing, retries. Building a real queue (not 'I once used celery_send_task') teaches idempotency, fan-out vs fan-in, retry semantics, observability, and the messy reality of jobs that fail in production at 3am.

Resume bullet (when finished)

“Built a Celery-based distributed task queue running on 4 worker nodes processing 12k jobs/min, with idempotent retries, dead-letter routing, Prometheus metrics, and a flake-investigation runbook used in real incidents.”

Locked tech stack

No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.

Python 3.12 · Celery · Redis · PostgreSQL · Flower · Prometheus · Docker Compose

Milestones (6 · ~30h)

M1 · ~4h
Celery skeleton + first job
FastAPI enqueues, Celery worker consumes. Compose stack: API, worker, Redis, Postgres.
CHECK BEFORE MOVING ON:
- Why Celery over RQ or Dramatiq here?
- Where does the broker live and why?
$ git commit -m "feat(queue): Celery + first job"
M2 · ~5h
Idempotent retries
Jobs accept a `job_id`; the worker uses it as a Postgres uniqueness key. Re-running a retried job is a no-op.
CHECK BEFORE MOVING ON:
- Why idempotency keys instead of 'check before write'?
- What's at-least-once vs exactly-once and which do you really have?
$ git commit -m "feat: idempotency keys + safe retries"
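The uniqueness-key idea in M2 can be sketched as follows. This is a minimal illustration, not the project's reference solution: it uses the standard-library `sqlite3` module as a stand-in for Postgres so it runs anywhere, and the table names (`processed_jobs`, `emails_sent`) and the function `send_email_job` are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory SQLite database standing in for Postgres (assumption)."""
    conn = sqlite3.connect(":memory:")
    # The primary key is the uniqueness constraint that makes retries safe.
    conn.execute("CREATE TABLE processed_jobs (job_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE emails_sent (job_id TEXT, recipient TEXT)")
    return conn


def send_email_job(conn: sqlite3.Connection, job_id: str, recipient: str) -> bool:
    """Return True if the work ran, False if this job_id was already processed."""
    try:
        with conn:  # one transaction: claim the key and record the work together
            conn.execute(
                "INSERT INTO processed_jobs (job_id) VALUES (?)", (job_id,)
            )
            conn.execute(
                "INSERT INTO emails_sent (job_id, recipient) VALUES (?, ?)",
                (job_id, recipient),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate job_id: the transaction is rolled back and the retry is a no-op.
        return False
```

Calling `send_email_job` twice with the same `job_id` writes one row to `emails_sent`. Because the database enforces uniqueness atomically, two workers racing on the same job cannot both pass, which is the weakness of a separate "check before write" query. A side effect outside the database, such as actually sending the email, is not covered by this transaction and needs its own handling.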
M3 · ~5h
Exponential backoff + max retries
Failed jobs retry after 2s, 4s, 8s, 16s, and 32s, each with jitter. If all 5 retries fail, the job goes to the dead-letter queue (DLQ).
CHECK BEFORE MOVING ON:
- Why jitter and not pure exponential?
- What's the trade-off in choosing max-retries?
$ git commit -m "feat: exponential backoff with jitter + DLQ"
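One way to compute the M3 schedule is sketched below. The 2s base, doubling factor, and 5-retry limit come from the milestone; the jitter style (a uniform ±25% spread) and the names `BackoffPolicy` and `next_delay` are assumptions made for illustration. Keeping the numbers in a small data object, rather than scattered through task code, also fits the "backoff config is data" rubric item.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackoffPolicy:
    base_seconds: float = 2.0      # delay before the first retry
    factor: float = 2.0            # multiplier applied on each further retry
    max_retries: int = 5           # after this many retries, give up
    jitter_fraction: float = 0.25  # assumed: spread each delay by +/-25%


def next_delay(
    policy: BackoffPolicy,
    retry_number: int,
    rng: Optional[random.Random] = None,
) -> Optional[float]:
    """Delay in seconds before retry `retry_number` (1-based).

    Returns None once retries are exhausted, meaning the job should be
    dead-lettered instead of retried.
    """
    if retry_number < 1:
        raise ValueError("retry_number is 1-based")
    if retry_number > policy.max_retries:
        return None
    if rng is None:
        rng = random.Random()
    nominal = policy.base_seconds * policy.factor ** (retry_number - 1)
    spread = nominal * policy.jitter_fraction
    # Jitter keeps many jobs that failed together from retrying in lockstep.
    return nominal + rng.uniform(-spread, spread)
```

In a Celery task the returned value would typically be passed as the `countdown` of `self.retry(...)`, and a `None` result would route the job to the DLQ.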
M4 · ~5h
Dead-letter routing + replay tool
DLQ stored in Postgres. CLI tool `dlq-tool list / replay / drop`. Each entry records the reason it failed.
CHECK BEFORE MOVING ON:
- Why store the DLQ in Postgres and not Redis?
- What's the difference between 'replay' and 'drop'?
$ git commit -m "feat(dlq): Postgres DLQ + replay CLI"
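The three DLQ operations might look roughly like this. It is a sketch under stated assumptions: `sqlite3` stands in for Postgres, the `dead_letters` schema is invented for the example, and `enqueue` is a hypothetical callback representing whatever puts a job back on the queue.

```python
import sqlite3
from typing import Callable, List, Tuple


def make_dlq() -> sqlite3.Connection:
    """In-memory SQLite table standing in for the Postgres DLQ (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dead_letters ("
        " job_id TEXT PRIMARY KEY, payload TEXT NOT NULL, reason TEXT NOT NULL)"
    )
    return conn


def dead_letter(conn: sqlite3.Connection, job_id: str, payload: str, reason: str) -> None:
    """Store a job that exhausted its retries, with the reason it failed."""
    with conn:
        conn.execute(
            "INSERT INTO dead_letters (job_id, payload, reason) VALUES (?, ?, ?)",
            (job_id, payload, reason),
        )


def list_entries(conn: sqlite3.Connection) -> List[Tuple[str, str, str]]:
    return conn.execute(
        "SELECT job_id, payload, reason FROM dead_letters ORDER BY job_id"
    ).fetchall()


def replay(conn: sqlite3.Connection, job_id: str, enqueue: Callable[[str, str], None]) -> bool:
    """Re-enqueue a dead-lettered job under its original job_id, then remove it."""
    row = conn.execute(
        "SELECT payload FROM dead_letters WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return False
    with conn:
        # Enqueue first: if this raises, the DELETE never runs and the row survives.
        enqueue(job_id, row[0])
        conn.execute("DELETE FROM dead_letters WHERE job_id = ?", (job_id,))
    return True


def drop(conn: sqlite3.Connection, job_id: str) -> bool:
    """Discard a dead-lettered job without running it again."""
    with conn:
        cur = conn.execute("DELETE FROM dead_letters WHERE job_id = ?", (job_id,))
    return cur.rowcount == 1
```

Replaying with the original `job_id` means the idempotency check still applies, so a replay that races with a late retry cannot run the work twice.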
M5 · ~6h
Prometheus + Grafana dashboard
Throughput, error rate, retry rate, DLQ depth, P95 job duration. Dashboard JSON committed.
CHECK BEFORE MOVING ON:
- Which metric do you wake up at 3am for?
- What's the difference between a counter, a gauge, and a histogram?
$ git commit -m "ops: Prometheus + Grafana dashboard"
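To make the counter / gauge / histogram distinction concrete, here is a small pure-Python sketch. The classes are illustrative stand-ins written for this example, not the `prometheus_client` API, and the bucket boundaries are arbitrary. The quantile estimate interpolates linearly inside the bucket that contains the target rank, which is the general approach bucketed histograms rely on for figures such as P95 job duration.

```python
from bisect import bisect_left
from typing import List, Sequence


class Counter:
    """Only ever increases: jobs processed, retries, failures."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters never decrease")
        self.value += amount


class Gauge:
    """A current level that can go up or down: queue depth, DLQ depth."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket so quantiles can be estimated later."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        self.upper_bounds: List[float] = sorted(upper_bounds)
        # One slot per finite bucket plus a final overflow slot.
        self.counts: List[int] = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value: float) -> None:
        # A value lands in the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.upper_bounds, value)] += 1

    def quantile(self, q: float) -> float:
        """Estimate the q-quantile (0 <= q <= 1) from bucket counts."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations")
        rank = q * total
        seen = 0
        for i, count in enumerate(self.counts):
            if count and seen + count >= rank:
                if i == len(self.upper_bounds):
                    # Overflow bucket: the best available answer is the top bound.
                    return self.upper_bounds[-1]
                lower = self.upper_bounds[i - 1] if i > 0 else 0.0
                upper = self.upper_bounds[i]
                return lower + (upper - lower) * (rank - seen) / count
            seen += count
        return self.upper_bounds[-1]
```

The estimate is only as precise as the buckets: with bounds of 0.1s and 0.5s, any P95 that falls between them is reported as an interpolated value in that range, so bucket boundaries are worth choosing around the durations you care about.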
M6 · ~5h
Runbook + load test
12k jobs/min on 4 workers. Runbook for 'queue depth growing', 'worker crashloop', 'DLQ flooding'.
CHECK BEFORE MOVING ON:
- Why does the runbook live in the repo, not Confluence?
- What's the first command you run when queue depth is growing?
$ git commit -m "docs: runbook + load test results"

60-second demo storyboard

What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.

0-5s: 'Distributed task queue: 12k jobs/min, idempotent retries, DLQ + replay.'
5-25s: live demo — submit 1000 jobs, watch the Grafana dashboard light up.
25-45s: intentionally crash a worker, show retries on the dashboard, no jobs lost.
45-60s: walk through one DLQ entry and the replay command.

STAR talking points for behavioral round

STAR — IDEMPOTENCY

Situation: a payment-side job was being retried by Celery on transient DB failures — and the 2nd run was double-charging users. Task: make every job safe to retry. Action: every job started accepting an idempotency_key, recorded the (key, outcome) tuple in a Postgres table, and returned the cached outcome on duplicates. Result: zero double-charges in the next 90 days, even with a daily-rotated set of forced-retry chaos jobs.

STAR — OBSERVABILITY

Situation: queue depth grew from ~50 to ~30k overnight and no one noticed. Task: make 'depth' a paging signal. Action: added a Prometheus gauge for queue depth, a Grafana alert that fires when depth grows faster than X per minute, and a runbook entry. Result: the next slow-consumer issue was detected and triaged in 18 minutes instead of overnight.

Production references — how grown-up systems do this

Celery →

Celery's official docs are the source of truth for routing, retries, and acks-late semantics.

Shopify →

Shopify's blog on idempotency keys is the canonical reference — same shape used here.

Honeycomb →

Honeycomb's pieces on production observability are excellent reading for the metrics + runbook discipline this project teaches.

Self-review rubric (before you claim done)

Correctness

Every job idempotent by key.
Retries respect backoff + jitter.
DLQ never loses a row.
Replay re-uses the original idempotency key.

Code quality

Tasks live in a `tasks/` package, not in routes.
Backoff config is data, not code.
No `time.sleep` in workers — `self.retry(countdown=…)` only.
Logs include job_id + idempotency_key on every line.

Testing

Property test: 'idempotent under N random retries'.
Chaos test: kill workers mid-job, verify nothing lost.
DLQ tool tests cover list/replay/drop.
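A property-style test for 'idempotent under N random retries' could be shaped like the sketch below. It uses only the standard-library `random` module rather than a property-testing library, and the `Ledger` class is a hypothetical in-memory handler that caches a (key, outcome) pair per job; a real test would target your own task and database.

```python
import random
from typing import Dict


class Ledger:
    """Toy handler: charges once per idempotency key and caches the outcome."""

    def __init__(self) -> None:
        self.outcomes: Dict[str, str] = {}  # idempotency_key -> outcome
        self.balance = 0

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self.outcomes:
            # Duplicate delivery: return the cached outcome, change nothing.
            return self.outcomes[idempotency_key]
        self.balance += amount
        outcome = f"charged:{amount}"
        self.outcomes[idempotency_key] = outcome
        return outcome


def check_idempotent_under_retries(trials: int = 200, seed: int = 0) -> None:
    """For random job sets, deliver each job 1-5 times in shuffled order and
    assert the final state matches delivering every job exactly once."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(trials):
        jobs = [(f"job-{i}", rng.randint(1, 100)) for i in range(rng.randint(1, 10))]
        deliveries = [job for job in jobs for _ in range(rng.randint(1, 5))]
        rng.shuffle(deliveries)

        ledger = Ledger()
        for key, amount in deliveries:
            ledger.charge(key, amount)

        assert ledger.balance == sum(amount for _, amount in jobs)
        assert len(ledger.outcomes) == len(jobs)
```

Running `check_idempotent_under_retries()` raises `AssertionError` if any duplicate delivery changes the final state. Removing the cache lookup in `charge` is a quick way to confirm the test can actually fail.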

Docs

Runbook is a markdown file at /docs/runbook.md.
Dashboard JSON committed.
Architecture diagram + 'how would this fail in multi-region?' section.

✱ AI code review

Get a senior-style review before you call it done

Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.

Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.
