Distributed task queue with retries, dead-letter, and metrics
Every non-trivial backend has a 'do this thing eventually' path: emails, image processing, retries. Building a real queue (not 'I once used celery_send_task') teaches idempotency, fan-out vs fan-in, retry semantics, observability, and the messy reality of jobs that fail in production at 3am.
Resume bullet (when finished)
“Built a Celery-based distributed task queue running on 4 worker nodes processing 12k jobs/min, with idempotent retries, dead-letter routing, Prometheus metrics, and a flake-investigation runbook used in real incidents.”
Locked tech stack
No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.
Milestones (6 · ~30h)
- M1 · ~4h
Celery skeleton + first job
FastAPI enqueues, Celery worker consumes. Compose stack: API, worker, Redis, Postgres.
CHECK BEFORE MOVING ON:
- Why Celery over RQ or Dramatiq here?
- Where does the broker live and why?
$ git commit -m "feat(queue): Celery + first job"
- M2 · ~5h
Idempotent retries
Jobs accept a `job_id`; the worker uses it as a Postgres uniqueness key. Re-running a retried job is a no-op.
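A minimal sketch of the uniqueness-key idea, using the standard library's `sqlite3` as a stand-in for Postgres so it runs anywhere. The table name `processed_jobs` and the helper `run_once` are hypothetical names chosen for illustration. The point it shows: the key insert and the job's own writes commit in one transaction, so a retried job hits the uniqueness constraint and becomes a no-op.

```python
import sqlite3
from typing import Callable

# Assumption: one row per completed job, keyed by job_id.
SCHEMA = "CREATE TABLE IF NOT EXISTS processed_jobs (job_id TEXT PRIMARY KEY)"


def run_once(
    conn: sqlite3.Connection,
    job_id: str,
    work: Callable[[sqlite3.Connection], None],
) -> bool:
    """Run `work` at most once per job_id. Returns False on a duplicate."""
    with conn:  # commits on success, rolls back if an exception escapes
        try:
            # Claim the key first; the PRIMARY KEY rejects a second claim.
            conn.execute(
                "INSERT INTO processed_jobs (job_id) VALUES (?)", (job_id,)
            )
        except sqlite3.IntegrityError:
            return False  # already processed: the retry is a no-op
        # Same transaction as the claim: if work() raises, the claim is
        # rolled back too, so a later retry can still run the job.
        work(conn)
    return True
```

This only protects writes that live in the same database transaction. Side effects outside it (sending an email, calling a payment API) need their own idempotency story.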
CHECK BEFORE MOVING ON:
- Why idempotency keys instead of 'check before write'?
- What's at-least-once vs exactly-once and which do you really have?
$ git commit -m "feat: idempotency keys + safe retries"
- M3 · ~5h
Exponential backoff + max retries
Failed jobs retry at 2s, 4s, 8s, 16s, 32s with jitter. After 5 fails → dead-letter.
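The schedule above can be sketched as a small pure function. This is one possible shape, not a prescribed one: `BackoffPolicy` and `next_retry_delay` are hypothetical names, and the jitter style (adding up to 25% on top of the nominal delay) is an assumption, since the milestone does not say which kind of jitter to use. Keeping the numbers in a dataclass also keeps backoff config as data rather than code.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackoffPolicy:
    base_seconds: float = 2.0     # first retry waits about 2s
    max_retries: int = 5          # 2s, 4s, 8s, 16s, 32s, then dead-letter
    jitter_fraction: float = 0.25  # assumption: up to +25% random extra


def next_retry_delay(
    retries_so_far: int, policy: BackoffPolicy, rng: random.Random
) -> Optional[float]:
    """Seconds to wait before the next retry, or None to dead-letter.

    `retries_so_far` is 0 when the job has failed once and never retried.
    """
    if retries_so_far >= policy.max_retries:
        return None  # retries exhausted: caller routes the job to the DLQ
    nominal = policy.base_seconds * (2 ** retries_so_far)
    # Jitter spreads out jobs that failed together so they do not all
    # come back at the same instant.
    return nominal + rng.uniform(0.0, nominal * policy.jitter_fraction)
```

In a Celery task, the returned value would typically be passed as the `countdown` of `self.retry(...)`, with `self.request.retries` as the retry count.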
CHECK BEFORE MOVING ON:
- Why jitter and not pure exponential?
- What's the trade-off in choosing max-retries?
$ git commit -m "feat: exponential backoff with jitter + DLQ"
- M4 · ~5h
Dead-letter routing + replay tool
DLQ stored in Postgres. CLI tool `dlq-tool list / replay / drop`. Reasons captured.
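A rough sketch of what `dlq-tool list / replay / drop` could look like, again with `sqlite3` standing in for Postgres. The table layout, the `status` column, and the `enqueue` callback are assumptions made for illustration. Rows are marked rather than deleted so the DLQ keeps a full history, and replay re-sends the job under its original `job_id`.

```python
import argparse
import sqlite3
from typing import Callable, List, Sequence, Tuple

# Assumption: one row per dead-lettered job, with the failure reason.
SCHEMA = """
CREATE TABLE IF NOT EXISTS dead_letters (
    job_id  TEXT PRIMARY KEY,
    payload TEXT NOT NULL,
    reason  TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'dead'
)
"""

Enqueue = Callable[[str, str], None]  # (job_id, payload) -> None


def list_dead(conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    return conn.execute(
        "SELECT job_id, reason FROM dead_letters "
        "WHERE status = 'dead' ORDER BY job_id"
    ).fetchall()


def replay(conn: sqlite3.Connection, job_id: str, enqueue: Enqueue) -> bool:
    row = conn.execute(
        "SELECT payload FROM dead_letters WHERE job_id = ? AND status = 'dead'",
        (job_id,),
    ).fetchone()
    if row is None:
        return False
    enqueue(job_id, row[0])  # same job_id, so the idempotency key is reused
    with conn:
        conn.execute(
            "UPDATE dead_letters SET status = 'replayed' WHERE job_id = ?",
            (job_id,),
        )
    return True


def drop(conn: sqlite3.Connection, job_id: str) -> bool:
    with conn:
        cur = conn.execute(
            "UPDATE dead_letters SET status = 'dropped' "
            "WHERE job_id = ? AND status = 'dead'",
            (job_id,),
        )
    return cur.rowcount == 1


def main(argv: Sequence[str], conn: sqlite3.Connection, enqueue: Enqueue) -> int:
    parser = argparse.ArgumentParser(prog="dlq-tool")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list")
    for name in ("replay", "drop"):
        sub.add_parser(name).add_argument("job_id")
    args = parser.parse_args(argv)
    if args.command == "list":
        for job_id, reason in list_dead(conn):
            print(f"{job_id}\t{reason}")
        return 0
    if args.command == "replay":
        ok = replay(conn, args.job_id, enqueue)
    else:
        ok = drop(conn, args.job_id)
    return 0 if ok else 1  # non-zero exit when the job is not in the DLQ
```

Passing `conn` and `enqueue` into `main` keeps the tool testable without a running broker; a real entry point would build both from configuration.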
CHECK BEFORE MOVING ON:
- Why store the DLQ in Postgres and not Redis?
- What's the difference between 'replay' and 'drop'?
$ git commit -m "feat(dlq): Postgres DLQ + replay CLI"
- M5 · ~6h
Prometheus + Grafana dashboard
Throughput, error rate, retry rate, DLQ depth, P95 job duration. Dashboard JSON committed.
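P95 job duration on a dashboard is normally estimated from histogram buckets, not from raw samples. The toy class below is a stdlib-only illustration of that idea and is not the Prometheus client API: `DurationHistogram` is a made-up name, and the linear interpolation inside a bucket is an assumption modeled on how Prometheus's `histogram_quantile` is documented to behave. It helps build intuition for why bucket boundaries limit the precision of the number you see in Grafana.

```python
from bisect import bisect_left
from typing import List, Sequence


class DurationHistogram:
    """Toy bucketed histogram of job durations (illustration only)."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        self.upper_bounds: List[float] = sorted(upper_bounds)
        # One count per bucket, plus a final overflow slot for values
        # above the highest bound (the "+Inf" bucket).
        self.counts: List[int] = [0] * (len(self.upper_bounds) + 1)

    def observe(self, seconds: float) -> None:
        # bisect_left gives the first bucket whose upper bound >= seconds.
        self.counts[bisect_left(self.upper_bounds, seconds)] += 1

    def quantile(self, q: float) -> float:
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be between 0 and 1")
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations")
        rank = q * total
        cumulative = 0
        for i, count in enumerate(self.counts):
            previous = cumulative
            cumulative += count
            if count > 0 and cumulative >= rank:
                if i == len(self.upper_bounds):
                    # Overflow bucket: the highest finite bound is the
                    # best answer the buckets can give.
                    return self.upper_bounds[-1]
                lower = self.upper_bounds[i - 1] if i > 0 else 0.0
                upper = self.upper_bounds[i]
                # Assume observations are spread evenly inside the bucket.
                return lower + (upper - lower) * (rank - previous) / count
        return self.upper_bounds[-1]  # not reached when total > 0
```

For example, with bounds of 0.1s, 0.5s, 1s, and 5s, 90 fast jobs and 10 jobs of 0.8s give an estimated P95 of 0.75s, even though no job took exactly that long.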
CHECK BEFORE MOVING ON:
- Which metric do you wake up at 3am for?
- What's the difference between a counter, a gauge, and a histogram?
$ git commit -m "ops: Prometheus + Grafana dashboard"
- M6 · ~5h
Runbook + load test
12k jobs/min on 4 workers. Runbook for 'queue depth growing', 'worker crashloop', 'DLQ flooding'.
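Before running the load test, it helps to turn the target into a per-job time budget. The arithmetic below is a back-of-the-envelope sketch; the function names are hypothetical, and the concurrency of 8 slots per worker is an assumed example value, not part of the project spec.

```python
def required_rate_per_worker(jobs_per_minute: float, workers: int) -> float:
    """Jobs per second each worker node must sustain."""
    return jobs_per_minute / 60.0 / workers


def max_mean_job_seconds(
    jobs_per_minute: float, workers: int, concurrency_per_worker: int
) -> float:
    """Longest average job duration that still meets the target.

    Each concurrency slot finishes 1 / mean_duration jobs per second, so
    concurrency / mean_duration must be at least the per-worker rate.
    """
    return concurrency_per_worker / required_rate_per_worker(
        jobs_per_minute, workers
    )


# 12k jobs/min on 4 workers is 200 jobs/s overall, 50 jobs/s per worker.
per_worker = required_rate_per_worker(12_000, 4)

# Assumption: 8 concurrent slots per worker, so each job may average
# at most 8 / 50 = 0.16 seconds.
budget = max_mean_job_seconds(12_000, 4, 8)
```

If the measured mean job duration is above that budget, the target needs more concurrency, more workers, or faster jobs.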
CHECK BEFORE MOVING ON:
- Why does the runbook live in the repo, not Confluence?
- What's the first command you run when queue depth is growing?
$ git commit -m "docs: runbook + load test results"
60-second demo storyboard
What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.
- 0-5s: 'Distributed task queue: 12k jobs/min, idempotent retries, DLQ + replay.'
- 5-25s: live demo — submit 1000 jobs, watch the Grafana dashboard light up.
- 25-45s: intentionally crash a worker, show retries on the dashboard, no jobs lost.
- 45-60s: walk through one DLQ entry and the replay command.
STAR talking points for behavioral round
STAR — IDEMPOTENCY
Situation: a payment-side job was being retried by Celery on transient DB failures — and the 2nd run was double-charging users. Task: make every job safe to retry. Action: every job started accepting an idempotency_key, recorded the (key, outcome) tuple in a Postgres table, and returned the cached outcome on duplicates. Result: zero double-charges in the next 90 days, even with a daily-rotated set of forced-retry chaos jobs.
STAR — OBSERVABILITY
Situation: queue depth grew from ~50 to ~30k overnight and no one noticed. Task: make 'depth' a paging signal. Action: added a Prometheus gauge for queue depth, a Grafana alert that fires when depth grows faster than X per minute, and a runbook entry. Result: the next slow-consumer issue was detected and triaged in 18 minutes instead of overnight.
Production references — how grown-up systems do this
Celery →
Celery's official docs are the source of truth for routing, retries, and acks-late semantics.
Shopify →
Shopify's blog on idempotency keys is the canonical reference — same shape used here.
Honeycomb →
Honeycomb's pieces on production observability are excellent reading for the metrics + runbook discipline this project teaches.
Self-review rubric (before you claim done)
Correctness
- Every job idempotent by key.
- Retries respect backoff + jitter.
- DLQ never loses a row.
- Replay re-uses the original idempotency key.
Code quality
- Tasks live in a `tasks/` package, not in routes.
- Backoff config is data, not code.
- No `time.sleep` in workers — `self.retry(countdown=…)` only.
- Logs include job_id + idempotency_key on every line.
Testing
- Property test: 'idempotent under N random retries'.
- Chaos test: kill workers mid-job, verify nothing lost.
- DLQ tool tests cover list/replay/drop.
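One way to approach the 'idempotent under N random retries' property, sketched with only the standard library. The `Ledger` class is a hypothetical in-memory stand-in for a real job handler. The property checked is that delivering every job a random number of times, in random order, leaves the same state as delivering each job exactly once. A property-testing library such as Hypothesis could generate the cases instead of the hand-rolled loop.

```python
import random
from typing import List, Set, Tuple


class Ledger:
    """Hypothetical handler: applies each job_id's amount at most once."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.balance = 0

    def apply(self, job_id: str, amount: int) -> None:
        if job_id in self.seen:
            return  # duplicate delivery: no-op
        self.seen.add(job_id)
        self.balance += amount


def check_idempotent_under_retries(trials: int = 200, seed: int = 0) -> int:
    """Return the number of random trials that passed; raise on failure."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        jobs: List[Tuple[str, int]] = [
            (f"job-{i}", rng.randint(1, 100))
            for i in range(rng.randint(1, 10))
        ]
        # Deliver every job 1 to 4 times, interleaved in random order.
        deliveries = [job for job in jobs for _ in range(rng.randint(1, 4))]
        rng.shuffle(deliveries)

        ledger = Ledger()
        for job_id, amount in deliveries:
            ledger.apply(job_id, amount)

        expected = sum(amount for _, amount in jobs)
        assert ledger.balance == expected, (deliveries, ledger.balance)
    return trials
```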
Docs
- Runbook is a markdown file at /docs/runbook.md.
- Dashboard JSON committed.
- Architecture diagram + 'how would this fail in multi-region?' section.
✱ AI code review
Get a senior-style review before you call it done
Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.
Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.