Realtime chat with WebSockets, rooms, and presence
Realtime is the feature that separates 'I built CRUD' from 'I built systems'. WebSockets force you to think about presence, backpressure, fan-out, and message ordering — every one of which is an interview question.
Resume bullet (when finished)
“Built a multi-room realtime chat backend in FastAPI + Redis pub/sub, supporting presence, typing indicators, message history, and 500 concurrent WebSocket connections on a single 1-CPU container.”
Locked tech stack
No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.
Milestones (6 · ~23h)
- M1 · ~3h
WebSocket echo + auth
`/ws/{room}` accepts a WebSocket connection, validates the JWT passed in the query string, and echoes messages back.
CHECK BEFORE MOVING ON:
- Why JWT in query, not header?
- What's wrong with persistent unauth'd connections?
$ git commit -m "feat(ws): echo endpoint with JWT auth"
- M2 · ~3h
Multi-room broadcast
A message sent by one client fans out to everyone in the same room via an in-memory dict.
CHECK BEFORE MOVING ON:
- What's the fan-out cost as N rooms × M users grows?
- Where does in-memory fall apart?
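One way the in-memory dict could look is sketched below. `RoomHub` and `FakeSocket` are hypothetical names; `FakeSocket` only mimics the `send_text` coroutine that a real WebSocket object exposes, so the sketch runs without a server.

```python
import asyncio
from collections import defaultdict


class FakeSocket:
    """Hypothetical stand-in for a WebSocket; records what it was sent."""

    def __init__(self, name: str):
        self.name = name
        self.received = []

    async def send_text(self, text: str) -> None:
        self.received.append(text)


class RoomHub:
    """Maps room name -> set of connected sockets, all in one process."""

    def __init__(self):
        self.rooms = defaultdict(set)

    def join(self, room: str, sock) -> None:
        self.rooms[room].add(sock)

    def leave(self, room: str, sock) -> None:
        members = self.rooms.get(room)
        if members is None:
            return
        members.discard(sock)
        if not members:
            # Drop empty rooms so the dict does not grow forever.
            del self.rooms[room]

    async def broadcast(self, room: str, text: str) -> int:
        # Copy the membership so joins/leaves during the loop are safe.
        members = list(self.rooms.get(room, ()))
        for sock in members:
            # Sequential awaits: one slow socket delays everyone after it.
            await sock.send_text(text)
        return len(members)
```

Each broadcast does work proportional to the number of sockets in that one room, and the dict lives in a single process, which is exactly the limit the questions above are probing.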
$ git commit -m "feat(ws): multi-room broadcast (in-memory)"
- M3 · ~5h
Redis pub/sub for multi-node
Replace in-memory fan-out with Redis pub/sub so two API pods can share rooms.
CHECK BEFORE MOVING ON:
- What does pub/sub guarantee — and not?
- Why pub/sub here and not Redis Streams?
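To reason about those guarantees before wiring up Redis, it can help to model the semantics in plain Python. The sketch below is a toy in-process broker, not Redis and not a Redis client: `ToyBroker`, `Pod`, and the `room:<name>` channel convention are all hypothetical. It imitates fire-and-forget delivery, where a message published while nobody is subscribed is simply gone.

```python
from collections import defaultdict


def channel_for(room: str) -> str:
    # Hypothetical naming convention; document whichever one you pick.
    return f"room:{room}"


class ToyBroker:
    """In-process stand-in for a pub/sub server: no storage, no replay."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel: str, callback) -> None:
        self._subscribers[channel].append(callback)

    def publish(self, channel: str, payload: str) -> int:
        callbacks = list(self._subscribers.get(channel, ()))
        for callback in callbacks:
            callback(payload)
        # Like Redis PUBLISH, report how many subscribers received it.
        return len(callbacks)


class Pod:
    """One API process: subscribes only to rooms it has local clients in."""

    def __init__(self, name: str, broker: ToyBroker):
        self.name = name
        self.broker = broker
        self.local_clients = defaultdict(list)  # room -> list of inboxes

    def join(self, room: str, inbox: list) -> None:
        if room not in self.local_clients:
            # First local client in this room: subscribe exactly once.
            self.broker.subscribe(
                channel_for(room), lambda payload: self._deliver(room, payload)
            )
        self.local_clients[room].append(inbox)

    def send(self, room: str, payload: str) -> int:
        # Even the sending pod receives its own message via the broker.
        return self.broker.publish(channel_for(room), payload)

    def _deliver(self, room: str, payload: str) -> None:
        for inbox in self.local_clients[room]:
            inbox.append(payload)
```

A client that joins after a message was published never sees it, which is the gap the persistent history in the next milestone has to cover.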
$ git commit -m "feat(ws): Redis pub/sub fan-out"
- M4 · ~4h
Persistent history + paging
Messages are saved to Postgres. `GET /rooms/{id}/messages?before=<cursor>` returns up to 50 messages immediately older than the cursor (or the latest 50 when no cursor is given).
CHECK BEFORE MOVING ON:
- Why cursor pagination, not offset?
- What's the right index on the messages table?
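The paging query can be tried out locally before touching Postgres. The sketch below uses `sqlite3` purely as a runnable stand-in; the table layout, the integer `id` cursor, and the function names are assumptions for illustration, not a prescribed schema.

```python
import sqlite3

PAGE_SIZE = 50


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " room_id TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    # Composite index: filter by room, then walk ids in order.
    db.execute("CREATE INDEX idx_messages_room_id ON messages (room_id, id)")
    return db


def add_message(db: sqlite3.Connection, room_id: str, body: str) -> int:
    cur = db.execute(
        "INSERT INTO messages (room_id, body) VALUES (?, ?)", (room_id, body)
    )
    return cur.lastrowid


def page_before(db, room_id: str, before=None, limit: int = PAGE_SIZE):
    """Return (rows, next_cursor); rows are (id, body), newest first."""
    if before is None:
        rows = db.execute(
            "SELECT id, body FROM messages WHERE room_id = ?"
            " ORDER BY id DESC LIMIT ?",
            (room_id, limit),
        ).fetchall()
    else:
        # Keyset condition: only rows strictly older than the cursor.
        rows = db.execute(
            "SELECT id, body FROM messages WHERE room_id = ? AND id < ?"
            " ORDER BY id DESC LIMIT ?",
            (room_id, before, limit),
        ).fetchall()
    # A full page may have more behind it; a short page is the end.
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor
```

Because each page is defined by "ids below the cursor" rather than "skip N rows", messages inserted while a user scrolls back do not shift or duplicate the older pages.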
$ git commit -m "feat: persistent history with cursor paging"
- M5 · ~4h
Presence + typing
Online users send a heartbeat every 10s and are marked offline after 30s without one. Typing events travel on a separate channel and expire after a short TTL.
CHECK BEFORE MOVING ON:
- Why heartbeats vs server-side detection?
- What can go wrong if typing events are persisted?
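The heartbeat-and-expire rule is small enough to sketch in memory. `Presence` is a hypothetical class; the clock is injectable so the 10s/30s timings can be tested without sleeping. In the multi-pod build you would likely keep this state somewhere shared (for example Redis keys with an expiry) rather than in one process.

```python
import time

HEARTBEAT_INTERVAL = 10.0  # seconds between client heartbeats
EXPIRY = 30.0              # seconds of silence before a user is offline


class Presence:
    """Tracks the last heartbeat per (room, user) and expires stale ones."""

    def __init__(self, ttl: float = EXPIRY, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._last_seen = {}  # (room, user) -> timestamp of last heartbeat

    def heartbeat(self, room: str, user: str) -> None:
        self._last_seen[(room, user)] = self.clock()

    def online(self, room: str) -> set:
        now = self.clock()
        # Purge anyone who has been silent for the full TTL or longer.
        expired = [
            key for key, seen in self._last_seen.items()
            if now - seen >= self.ttl
        ]
        for key in expired:
            del self._last_seen[key]
        return {user for (r, user) in self._last_seen if r == room}
```

With a 10s heartbeat and a 30s expiry, a client can miss two heartbeats in a row before it is shown as offline, which absorbs brief network hiccups.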
$ git commit -m "feat: presence + typing indicators"
- M6 · ~4h
Load test + observability
500 concurrent connections sustained on a 1-CPU container. Prometheus tracks open connections, broadcast lag, and dropped messages.
CHECK BEFORE MOVING ON:
- Why is broadcast lag the right SLO?
- What does p99 of zero typically mean?
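Before trusting a dashboard number, it is worth computing broadcast lag by hand from load-test output once. The sketch below assumes the load test records a send timestamp and a receive timestamp per message id, both as integer milliseconds; `lags_ms` and `percentile` are hypothetical helpers, and the percentile uses the nearest-rank method.

```python
import math


def lags_ms(sent_at: dict, received_at: dict):
    """Return (lags, dropped): per-message lag in ms and the unreceived count.

    Both dicts map message id -> timestamp in integer milliseconds.
    """
    lags = []
    dropped = 0
    for msg_id, sent in sent_at.items():
        received = received_at.get(msg_id)
        if received is None:
            dropped += 1  # never arrived: counts as dropped, not as lag 0
        else:
            lags.append(received - sent)
    return lags, dropped


def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest sample with at least p% of
    samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

Keeping dropped messages out of the lag list matters: folding them in as zeros, or measuring at the wrong point, is one way to end up with a suspiciously perfect percentile.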
$ git commit -m "ops: load test + Prometheus metrics"
60-second demo storyboard
What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.
- 0-5s: 'Realtime chat in FastAPI — 500 concurrent connections, multi-room, presence.'
- 5-25s: open 3 browser tabs, demo typing, presence, and a message round-trip.
- 25-45s: kill one API pod live, show the other pod still receiving messages (Redis pub/sub).
- 45-60s: load test 500-conn graph in Grafana.
STAR talking points for behavioral round
STAR — SCALING REALTIME
Situation: in-memory fan-out worked locally but broke the moment I scaled to two pods. Task: keep a single room working across both pods. Action: Redis pub/sub, with each pod subscribing only to the rooms it has clients in and broadcasting via PUBLISH. Result: cross-pod messages delivered with <5ms added latency at 500 conns.
STAR — BACKPRESSURE
Situation: one slow client could block the whole broadcast loop. Task: isolate slow consumers. Action: per-client outbound queue with a bounded size; if it fills, drop the client. Result: a single slow tab no longer hurt anyone else.
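The action in that story can be sketched with `asyncio.Queue`. The names `Client`, `fan_out`, and `writer` are hypothetical, and the `send` coroutine passed to `writer` stands in for whatever actually writes to the socket. The key move is that the broadcast path never awaits a client: it enqueues with `put_nowait` and drops any client whose bounded queue is already full.

```python
import asyncio


class Client:
    """One connection with its own bounded outbound queue."""

    def __init__(self, name: str, max_queue: int = 100):
        self.name = name
        self.outbox = asyncio.Queue(maxsize=max_queue)


def fan_out(clients: set, text: str) -> list:
    """Enqueue text for every client without awaiting any of them.

    Clients whose queue is full are removed from the set and returned,
    so the caller can close their sockets.
    """
    dropped = []
    for client in list(clients):
        try:
            client.outbox.put_nowait(text)
        except asyncio.QueueFull:
            clients.discard(client)
            dropped.append(client)
    return dropped


async def writer(client: Client, send) -> None:
    """Per-client task: drain the outbox at that client's own pace."""
    while True:
        text = await client.outbox.get()
        await send(text)
```

Each client gets one `writer` task, so a stalled socket only ever blocks its own queue; the queue size is the knob that trades memory per connection against tolerance for brief stalls.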
Self-review rubric (before you claim done)
Correctness
- Messages arrive in order within a room.
- Presence accurate within heartbeat window.
- History paging stable under inserts.
- Multi-pod fan-out delivers each message to every connected client once, with no duplicates.
Code quality
- Per-client outbound queue with bounded size.
- No `asyncio.create_task` without registering for cleanup.
- Pub/sub channel names follow a documented convention.
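For the `asyncio.create_task` item, one common pattern is to keep a strong reference to every task and remove it when the task finishes. The `TaskRegistry` class below is a hypothetical sketch of that pattern, with a `shutdown` method that cancels whatever is still running.

```python
import asyncio


class TaskRegistry:
    """Holds references to background tasks so none is lost or leaked."""

    def __init__(self):
        self._tasks = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        # Keep a strong reference so the task is not garbage-collected.
        self._tasks.add(task)
        # Remove it automatically once it finishes, fails, or is cancelled.
        task.add_done_callback(self._tasks.discard)
        return task

    def __len__(self) -> int:
        return len(self._tasks)

    async def shutdown(self) -> None:
        """Cancel every live task and wait for all of them to unwind."""
        tasks = list(self._tasks)
        for task in tasks:
            task.cancel()
        # return_exceptions=True keeps one failure from hiding the others.
        await asyncio.gather(*tasks, return_exceptions=True)
```

Calling `shutdown` when a connection closes (and again at application shutdown) gives every per-client writer or subscriber task a defined end of life.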
Testing
- Property test: 'no message lost under fan-out'.
- Load test scenario committed.
- Auth tests cover expired + missing JWT.
Docs
- Architecture diagram for fan-out.
- SLO definition: broadcast lag P99 < X ms.
- Section: 'when would you move to a managed service like Pusher or Ably?'
AI code review
Get a senior-style review before you call it done
Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.
Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.