L3 · Backend · realtime · 20-32h total

Realtime chat with WebSockets, rooms, and presence

Realtime is the feature that separates 'I built CRUD' from 'I built systems'. WebSockets force you to think about presence, backpressure, fan-out, and message ordering — every one of which is an interview question.

Resume bullet (when finished)

Built a multi-room realtime chat backend in FastAPI + Redis pub/sub, supporting presence, typing indicators, message history, and 500 concurrent WebSocket connections on a single 1-CPU container.

Locked tech stack

No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.

Python 3.12 · FastAPI · WebSockets · Redis pub/sub · PostgreSQL · asyncio

Milestones (6 · ~23h)

  1. M1 · ~3h

    WebSocket echo + auth

    `/ws/{room}` accepts WebSocket, validates JWT in query string, echoes messages.

    CHECK BEFORE MOVING ON:

    • Why JWT in query, not header?
    • What's wrong with persistent unauth'd connections?
    $ git commit -m "feat(ws): echo endpoint with JWT auth"
  2. M2 · ~3h

    Multi-room broadcast

    Messages from one client fan out to everyone in the same room via an in-memory dict.

    CHECK BEFORE MOVING ON:

    • What's the fan-out cost as N rooms × M users grows?
    • Where does in-memory fall apart?
    $ git commit -m "feat(ws): multi-room broadcast (in-memory)"
  3. M3 · ~5h

    Redis pub/sub for multi-node

    Replace in-memory fan-out with Redis pub/sub so two API pods can share rooms.

    CHECK BEFORE MOVING ON:

    • What does pub/sub guarantee — and not?
    • Why pub/sub here and not Redis Streams?
    $ git commit -m "feat(ws): Redis pub/sub fan-out"
  4. M4 · ~4h

    Persistent history + paging

    Messages are saved to Postgres. `GET /rooms/{id}/messages?before=<cursor>` returns the 50 most recent messages older than the cursor.

    CHECK BEFORE MOVING ON:

    • Why cursor pagination, not offset?
    • What's the right index on the messages table?
    $ git commit -m "feat: persistent history with cursor paging"
  5. M5 · ~4h

    Presence + typing

    Clients send a heartbeat every 10s; a user is marked offline after 30s without one. Typing events use a separate channel with a TTL.

    CHECK BEFORE MOVING ON:

    • Why heartbeats vs server-side detection?
    • What can go wrong if typing events are persisted?
    $ git commit -m "feat: presence + typing indicators"
  6. M6 · ~4h

    Load test + observability

    500 concurrent connections sustained on a 1-CPU container. Prometheus tracks open connections, broadcast lag, and dropped messages.

    CHECK BEFORE MOVING ON:

    • Why is broadcast lag the right SLO?
    • What does p99 of zero typically mean?
    $ git commit -m "ops: load test + Prometheus metrics"
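
The presence rule from milestone 5 (heartbeat every 10s, expire after 30s) can be sketched with nothing but a dict of last-seen timestamps. This is a minimal single-process illustration, not the project's reference solution: `PresenceTracker`, its method names, and the injectable `clock` are hypothetical. With more than one pod, the state would have to live somewhere shared, for example Redis keys with an expiry.

```python
import time


class PresenceTracker:
    """In-memory presence sketch: a user is online while heartbeats keep arriving.

    Hypothetical helper for illustration; names are not from the project spec.
    """

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._last_seen: dict[tuple[str, str], float] = {}

    def heartbeat(self, room: str, user: str) -> None:
        """Record that `user` is alive in `room` right now."""
        self._last_seen[(room, user)] = self.clock()

    def online(self, room: str) -> set[str]:
        """Return users in `room` whose last heartbeat is younger than the TTL."""
        now = self.clock()
        # Purge expired entries so the dict does not grow without bound.
        expired = [key for key, seen in self._last_seen.items() if now - seen >= self.ttl]
        for key in expired:
            del self._last_seen[key]
        return {user for (r, user) in self._last_seen if r == room}
```

Because the server only learns about a dead connection when heartbeats stop, presence can be stale by up to one TTL; that is the "heartbeat window" the rubric refers to.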

60-second demo storyboard

What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.

  1. 0-5s: 'Realtime chat in FastAPI — 500 concurrent connections, multi-room, presence.'
  2. 5-25s: open 3 browser tabs, demo typing, presence, and a message round-trip.
  3. 25-45s: kill one API pod live, show the other pod still receiving messages (Redis pub/sub).
  4. 45-60s: show the 500-connection load-test graph in Grafana.

STAR talking points for behavioral round

STAR — SCALING REALTIME

Situation: in-memory fan-out worked locally but broke the moment I scaled to two pods. Task: same room, two pods. Action: Redis pub/sub — each pod subscribes to the rooms it has clients in; broadcasts go via PUBLISH. Result: cross-pod messages delivered with <5ms added latency at 500 conns.
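
The subscription rule in that Action (a pod subscribes only to rooms where it has local clients, and every broadcast goes through PUBLISH) can be sketched without a Redis server. `InProcessBroker` below is a stand-in that only mimics publish/subscribe semantics; `Pod`, the `room:<name>` channel convention, and the list-based inboxes are all assumptions made for illustration.

```python
from collections import defaultdict


class InProcessBroker:
    """Stand-in for Redis pub/sub: delivers only to current subscribers."""

    def __init__(self):
        self._subs = defaultdict(list)  # channel -> list of callbacks

    def subscribe(self, channel, callback):
        self._subs[channel].append(callback)

    def unsubscribe(self, channel, callback):
        self._subs[channel].remove(callback)

    def publish(self, channel, message):
        # Like Redis PUBLISH, report how many subscribers received the message.
        receivers = list(self._subs[channel])
        for callback in receivers:
            callback(channel, message)
        return len(receivers)


class Pod:
    """One API process holding local clients (hypothetical sketch)."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.rooms = defaultdict(dict)  # room -> {client_id: inbox list}

    @staticmethod
    def channel(room):
        return f"room:{room}"  # assumed naming convention

    def join(self, room, client_id):
        first_local_client = not self.rooms[room]
        inbox = []
        self.rooms[room][client_id] = inbox
        if first_local_client:
            self.broker.subscribe(self.channel(room), self._on_message)
        return inbox

    def leave(self, room, client_id):
        if room not in self.rooms:
            return
        self.rooms[room].pop(client_id, None)
        if not self.rooms[room]:
            # Last local client left: stop paying for this room's traffic.
            self.broker.unsubscribe(self.channel(room), self._on_message)
            del self.rooms[room]

    def send(self, room, text):
        # No direct local delivery: the pod hears its own PUBLISH back through
        # its subscription, so each client gets a single copy.
        return self.broker.publish(self.channel(room), text)

    def _on_message(self, channel, message):
        room = channel.removeprefix("room:")
        for inbox in self.rooms.get(room, {}).values():
            inbox.append(message)
```

Routing local delivery through the subscription as well keeps one code path for same-pod and cross-pod recipients, at the cost of a broker round-trip even when sender and receiver share a pod.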

STAR — BACKPRESSURE

Situation: one slow client could block the whole broadcast loop. Task: isolate slow consumers. Action: per-client outbound queue with a bounded size; if it fills, drop the client. Result: a single slow tab no longer hurt anyone else.
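
A minimal sketch of that Action, assuming one bounded `asyncio.Queue` per connection and one writer task draining it. `Client`, `Room`, `writer`, and the default queue size of 100 are hypothetical names and values chosen for illustration; in a real handler, dropping a client would also close its WebSocket.

```python
import asyncio


class Client:
    def __init__(self, name: str, max_queued: int = 100):
        self.name = name
        # Bounded outbox: the bound is what turns "slow" into a detectable event.
        self.outbox: asyncio.Queue[str] = asyncio.Queue(maxsize=max_queued)


class Room:
    def __init__(self):
        self.clients: dict[str, Client] = {}
        self.dropped: list[str] = []

    def add(self, client: Client) -> None:
        self.clients[client.name] = client

    def broadcast(self, message: str) -> None:
        """Enqueue for every client without awaiting any of them."""
        for client in list(self.clients.values()):
            try:
                client.outbox.put_nowait(message)
            except asyncio.QueueFull:
                # Slow consumer: drop it instead of stalling the whole room.
                del self.clients[client.name]
                self.dropped.append(client.name)


async def writer(client: Client, send) -> None:
    """One task per connection; drains the queue at that client's own pace."""
    while True:
        message = await client.outbox.get()
        await send(message)
```

In a FastAPI handler, `send` would typically be the connection's `websocket.send_text`. The broadcast loop itself never awaits, so its cost no longer depends on the slowest socket in the room.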

Production references — how grown-up systems do this

Discord

Discord's engineering blog on their gateway service is the canonical writeup for production WebSocket fan-out at scale.

FastAPI

FastAPI's WebSocket docs cover the async patterns this project relies on.

Self-review rubric (before you claim done)

Correctness

  • Messages arrive in order within a room.
  • Presence accurate within heartbeat window.
  • History paging stable under inserts.
  • Multi-pod fan-out delivers each message to every connected client exactly once, with no duplicates when a room spans pods.

Code quality

  • Per-client outbound queue with bounded size.
  • No `asyncio.create_task` without registering for cleanup.
  • Pub/sub channel names follow a documented convention.
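
One way to satisfy the `asyncio.create_task` item is a small registry that holds a strong reference to every task and cancels whatever is still running at shutdown. `TaskRegistry` and its method names are hypothetical; the pattern of keeping a set and discarding in a done-callback follows the note in the asyncio documentation that the event loop keeps only weak references to tasks.

```python
import asyncio


class TaskRegistry:
    """Tracks background tasks so none is garbage-collected or leaked."""

    def __init__(self):
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)  # strong reference keeps the task alive
        task.add_done_callback(self._tasks.discard)  # self-cleaning on finish
        return task

    async def shutdown(self) -> None:
        """Cancel everything still running and wait for the cancellations."""
        tasks = list(self._tasks)
        for task in tasks:
            task.cancel()
        # return_exceptions=True so one failing task does not hide the others.
        await asyncio.gather(*tasks, return_exceptions=True)

    def __len__(self) -> int:
        return len(self._tasks)
```

A per-connection writer task would be started with `registry.spawn(...)` and the registry's `shutdown()` awaited when the application stops.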

Testing

  • Property test: 'no message lost under fan-out'.
  • Load test scenario committed.
  • Auth tests cover expired + missing JWT.

Docs

  • Architecture diagram for fan-out.
  • SLO definition: broadcast lag P99 < X ms.
  • Section: 'when would you move to a managed service like Pusher or Ably?'
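
To make the SLO line checkable, the P99 can be computed directly from recorded lag samples. This sketch uses the nearest-rank definition of a percentile; `percentile` and `meets_slo` are hypothetical helpers, and the threshold stays a caller-supplied value because the rubric leaves X open.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks stay exact.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(lags_ms: list[float], threshold_ms: float) -> bool:
    """True when the P99 broadcast lag is strictly below the chosen threshold."""
    return percentile(lags_ms, 99) < threshold_ms
```

An empty sample list raises instead of returning zero, so a silent measurement gap cannot be mistaken for a perfect result.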

✱ AI code review

Get a senior-style review before you call it done

Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.

Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.
