L4 · Backend · infra · 30-50h total

Distributed task queue with retries, dead-letter, and metrics

Every non-trivial backend has 'do this thing eventually' work: emails, image processing, retries. Building a real queue (not 'I once called `send_task`') teaches idempotency, fan-out vs fan-in, retry semantics, observability, and the messy reality of jobs that fail in production at 3am.

Resume bullet (when finished)

Built a Celery-based distributed task queue running on 4 worker nodes processing 12k jobs/min, with idempotent retries, dead-letter routing, Prometheus metrics, and a flake-investigation runbook used in real incidents.

Locked tech stack

No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.

Python 3.12 · Celery · Redis · PostgreSQL · Flower · Prometheus · Docker Compose

Milestones (6 · ~30h)

  1. M1 · ~4h

    Celery skeleton + first job

    FastAPI enqueues, Celery worker consumes. Compose stack: API, worker, Redis, Postgres.

    CHECK BEFORE MOVING ON:

    • Why Celery over RQ or Dramatiq here?
    • Where does the broker live and why?
    $ git commit -m "feat(queue): Celery + first job"
  2. M2 · ~5h

    Idempotent retries

    Jobs accept a `job_id`; the worker writes it to a Postgres column with a unique constraint, so re-running a job that already completed is a no-op.

    CHECK BEFORE MOVING ON:

    • Why idempotency keys instead of 'check before write'?
    • What's at-least-once vs exactly-once and which do you really have?
    $ git commit -m "feat: idempotency keys + safe retries"
  3. M3 · ~5h

    Exponential backoff + max retries

    Failed jobs retry after roughly 2s, 4s, 8s, 16s, and 32s (with jitter). If the fifth retry also fails → dead-letter.

    CHECK BEFORE MOVING ON:

    • Why jitter and not pure exponential?
    • What's the trade-off in choosing max-retries?
    $ git commit -m "feat: exponential backoff with jitter + DLQ"
  4. M4 · ~5h

    Dead-letter routing + replay tool

    DLQ stored in Postgres. CLI tool `dlq-tool list / replay / drop`. Each entry records why the job failed.

    CHECK BEFORE MOVING ON:

    • Why store the DLQ in Postgres and not Redis?
    • What's the difference between 'replay' and 'drop'?
    $ git commit -m "feat(dlq): Postgres DLQ + replay CLI"
  5. M5 · ~6h

    Prometheus + Grafana dashboard

    Throughput, error rate, retry rate, DLQ depth, P95 job duration. Dashboard JSON committed.

    CHECK BEFORE MOVING ON:

    • Which metric do you wake up at 3am for?
    • What's the difference between a counter, a gauge, and a histogram?
    $ git commit -m "ops: Prometheus + Grafana dashboard"
  6. M6 · ~5h

    Runbook + load test

    Load test: sustain 12k jobs/min on 4 workers. Runbook entries for 'queue depth growing', 'worker crashloop', 'DLQ flooding'.

    CHECK BEFORE MOVING ON:

    • Why does the runbook live in the repo, not Confluence?
    • What's the first command you run when queue depth is growing?
    $ git commit -m "docs: runbook + load test results"
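
The backoff schedule from milestone 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's required implementation: the `BACKOFF` dictionary, its keys, the 25% jitter fraction, and the `retry_delay` helper are all assumptions made for the example. Only the 2s to 32s schedule and the five-retry limit come from the milestone text.

```python
import random

# Hypothetical config, kept as plain data so it could live in a settings file.
BACKOFF = {"base_seconds": 2, "factor": 2, "max_retries": 5, "jitter_fraction": 0.25}


def retry_delay(retry_number, config=BACKOFF, rng=random):
    """Return the delay in seconds before retry `retry_number` (1-based).

    Returns None once retries are exhausted, meaning the job should be
    routed to the dead-letter queue instead of being retried again.
    """
    if retry_number > config["max_retries"]:
        return None
    # Nominal schedule: 2s, 4s, 8s, 16s, 32s for retries 1 through 5.
    nominal = config["base_seconds"] * config["factor"] ** (retry_number - 1)
    # Jitter spreads retries out so jobs that failed together
    # do not all hit the downstream service again at the same instant.
    spread = nominal * config["jitter_fraction"]
    return nominal + rng.uniform(-spread, spread)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, retry_delay(n))
```

In a Celery task, the returned value would be the number passed as `countdown` to `self.retry`, and a `None` result would be the signal to write the job to the DLQ.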

60-second demo storyboard

What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.

  1. 0-5s: 'Distributed task queue: 12k jobs/min, idempotent retries, DLQ + replay.'
  2. 5-25s: live demo — submit 1000 jobs, watch the Grafana dashboard light up.
  3. 25-45s: intentionally crash a worker, show retries on the dashboard, no jobs lost.
  4. 45-60s: walk through one DLQ entry and the replay command.

STAR talking points for behavioral round

STAR — IDEMPOTENCY

Situation: a payment-side job was being retried by Celery on transient DB failures — and the 2nd run was double-charging users. Task: make every job safe to retry. Action: every job started accepting an idempotency_key, recorded the (key, outcome) tuple in a Postgres table, and returned the cached outcome on duplicates. Result: zero double-charges in the next 90 days, even with a daily-rotated set of forced-retry chaos jobs.
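
The Action step above can be sketched as follows. To keep the example runnable anywhere, `sqlite3` stands in for Postgres. The table name `job_outcomes`, its columns, and the `run_once` helper are hypothetical names chosen for illustration. The idea is to claim the key under a uniqueness constraint, do the work, and store the outcome in the same transaction, so a duplicate delivery gets the stored outcome instead of running the work again.

```python
import json
import sqlite3

# Assumption: sqlite3 is used here only as a stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_outcomes (idempotency_key TEXT PRIMARY KEY, outcome TEXT)"
)


def run_once(conn, idempotency_key, work):
    """Run `work` at most once per key; duplicates return the stored outcome."""
    with conn:  # commits on success, rolls back if `work` raises
        try:
            # Claim the key. The PRIMARY KEY constraint rejects duplicates.
            conn.execute(
                "INSERT INTO job_outcomes (idempotency_key, outcome) VALUES (?, NULL)",
                (idempotency_key,),
            )
        except sqlite3.IntegrityError:
            # Key already recorded: return the cached outcome, do no work.
            row = conn.execute(
                "SELECT outcome FROM job_outcomes WHERE idempotency_key = ?",
                (idempotency_key,),
            ).fetchone()
            return json.loads(row[0])
        outcome = work()
        conn.execute(
            "UPDATE job_outcomes SET outcome = ? WHERE idempotency_key = ?",
            (json.dumps(outcome), idempotency_key),
        )
    return outcome
```

If `work` raises, the transaction rolls back and the key is released, so a later retry runs the work again. Side effects that happen outside the database (a call to a payment provider, for example) are not covered by this transaction and need their own idempotency handling.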

STAR — OBSERVABILITY

Situation: queue depth grew from ~50 to ~30k overnight and no one noticed. Task: make 'depth' a paging signal. Action: added a Prometheus gauge for queue depth, a Grafana alert that fires when depth grows faster than X per minute, and a runbook entry. Result: the next slow-consumer issue was detected and triaged in 18 minutes instead of overnight.
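
The arithmetic behind a growth-rate alert like this one is simple enough to sketch in plain Python. In the real stack the calculation would be a query over the Prometheus gauge rather than application code. The function names and the sample format are assumptions for illustration, and the threshold is left as a parameter because the source does not state a value for X.

```python
def growth_per_minute(samples):
    """Average queue-depth growth per minute across the sample window.

    `samples` is a list of (unix_seconds, queue_depth) pairs, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, d0), (t1, d1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (d1 - d0) * 60.0 / (t1 - t0)


def should_page(samples, threshold_per_minute):
    """True when depth is growing faster than the chosen threshold."""
    return growth_per_minute(samples) > threshold_per_minute
```

Alerting on growth rate rather than on absolute depth catches a slow consumer early, while a queue that is deep but draining does not page anyone.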

Production references — how grown-up systems do this

Celery

Celery's official docs are the source of truth for routing, retries, and acks-late semantics.

Shopify

Shopify's blog on idempotency keys is the canonical reference — same shape used here.

Honeycomb

Honeycomb's pieces on production observability are excellent reading for the metrics + runbook discipline this project teaches.

Self-review rubric (before you claim done)

Correctness

  • Every job idempotent by key.
  • Retries respect backoff + jitter.
  • DLQ never loses a row.
  • Replay re-uses the original idempotency key.
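
One way to satisfy the last two items is sketched below, again with `sqlite3` standing in for Postgres so the example runs anywhere. The `dead_letters` table, its columns, the `status` values, and the `enqueue` callable are hypothetical. Marking rows as replayed or dropped instead of deleting them is an assumption about how "never loses a row" could be met, not something the rubric prescribes.

```python
import sqlite3

# Assumption: sqlite3 is used here only as a stand-in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE dead_letters (
        id INTEGER PRIMARY KEY,
        idempotency_key TEXT NOT NULL,
        payload TEXT NOT NULL,
        reason TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'dead'  -- dead | replayed | dropped
    )"""
)


def dlq_list(conn):
    """Return entries still waiting for a decision."""
    return conn.execute(
        "SELECT id, idempotency_key, reason FROM dead_letters "
        "WHERE status = 'dead' ORDER BY id"
    ).fetchall()


def dlq_replay(conn, row_id, enqueue):
    """Re-enqueue one entry under its ORIGINAL idempotency key."""
    with conn:
        row = conn.execute(
            "SELECT idempotency_key, payload FROM dead_letters "
            "WHERE id = ? AND status = 'dead'",
            (row_id,),
        ).fetchone()
        if row is None:
            return False
        # `enqueue` is a hypothetical callable that submits the job again.
        enqueue(idempotency_key=row[0], payload=row[1])
        conn.execute(
            "UPDATE dead_letters SET status = 'replayed' WHERE id = ?", (row_id,)
        )
    return True


def dlq_drop(conn, row_id):
    """Mark an entry as dropped; the row itself is kept for auditing."""
    with conn:
        cur = conn.execute(
            "UPDATE dead_letters SET status = 'dropped' "
            "WHERE id = ? AND status = 'dead'",
            (row_id,),
        )
    return cur.rowcount == 1
```

Because replay passes the stored key through unchanged, a job that partly succeeded before it was dead-lettered is still deduplicated by the idempotency table on its next run.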

Code quality

  • Tasks live in a `tasks/` package, not in routes.
  • Backoff config is data, not code.
  • No `time.sleep` in workers — `self.retry(countdown=…)` only.
  • Logs include job_id + idempotency_key on every line.

Testing

  • Property test: 'idempotent under N random retries'.
  • Chaos test: kill workers mid-job, verify nothing lost.
  • DLQ tool tests cover list/replay/drop.

Docs

  • Runbook is a markdown file at /docs/runbook.md.
  • Dashboard JSON committed.
  • Architecture diagram + 'how would this fail in multi-region?' section.

✱ AI code review

Get a senior-style review before you call it done

Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.

Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.