News feed at scale — fan-out, caching, 10K writes/s ready
This is a classic large-scale system-design interview problem you'll see at top engineering teams. Building it for real — not just whiteboarding — turns 'I've read the book' into 'I've shipped it'. A strong signal a junior can carry into a more senior-leaning interview.
Resume bullet (when finished)
“Designed and built a Twitter-style news feed service: fan-out-on-write timeline assembly, Redis caching, Postgres-as-source-of-truth, load-tested to 10K writes/s with p95 latency under 50 ms. Handles the celebrity-user fan-out edge case explicitly.”
Locked tech stack
No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.
Milestones (8 · ~56h)
- M1~8h
Schema + basic Post/Follow/User CRUD
PostgreSQL schema: users, follows (composite PK), posts (with author_id, created_at index). REST endpoints for create-user, follow, create-post.
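One possible shape for that schema, sketched with Python's built-in sqlite3 module so it runs without a database server. SQLite is only a stand-in for PostgreSQL here. The `handle` and `body` columns, the index name, and the `make_db` and `follow` helpers are assumptions for illustration, not part of the project spec.

```python
import sqlite3

# Assumed column and index names; adapt the types when porting to PostgreSQL.
SCHEMA = """
CREATE TABLE users (
    id     INTEGER PRIMARY KEY,
    handle TEXT NOT NULL UNIQUE
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(id),
    followee_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (follower_id, followee_id)  -- composite PK: one row per pair
);
CREATE TABLE posts (
    id         INTEGER PRIMARY KEY,
    author_id  INTEGER NOT NULL REFERENCES users(id),
    body       TEXT NOT NULL,
    created_at REAL NOT NULL
);
CREATE INDEX posts_author_created ON posts (author_id, created_at DESC);
"""


def make_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def follow(conn, follower_id, followee_id):
    """Insert a follow edge; return False if the pair already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO follows (follower_id, followee_id) VALUES (?, ?)",
                (follower_id, followee_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Because the pair itself is the primary key, a repeated follow is rejected by the database rather than by application code.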
CHECK BEFORE MOVING ON:
- Why is the follows table's primary key (follower_id, followee_id) instead of a synthetic id?
- What index do you need for 'show me the latest 50 posts by user X'?
$ git commit -m "feat: users + follows + posts schema and CRUD"
- M2~4h
Naive timeline — fan-out on read
GET /timeline/me returns 50 most recent posts from people I follow. Naive query joins follows + posts ORDER BY created_at DESC LIMIT 50.
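The naive query could look roughly like the sketch below. It uses sqlite3 as a stand-in for PostgreSQL, and the `demo_db` helper, the `body` column, and the function name are assumptions made for the example.

```python
import sqlite3

# Fan-out on read: join the caller's follow edges to posts at request time.
NAIVE_TIMELINE_SQL = """
SELECT p.id, p.author_id, p.body, p.created_at
FROM follows AS f
JOIN posts AS p ON p.author_id = f.followee_id
WHERE f.follower_id = ?
ORDER BY p.created_at DESC
LIMIT ?
"""


def demo_db():
    """Hypothetical minimal tables, just enough to run the query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE follows (
            follower_id INTEGER, followee_id INTEGER,
            PRIMARY KEY (follower_id, followee_id)
        );
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY, author_id INTEGER,
            body TEXT, created_at REAL
        );
        """
    )
    return conn


def naive_timeline(conn, user_id, limit=50):
    """Return the newest posts from everyone user_id follows."""
    return conn.execute(NAIVE_TIMELINE_SQL, (user_id, limit)).fetchall()
```

Every read pays for the join and the sort, which is the cost the next milestone moves to write time.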
CHECK BEFORE MOVING ON:
- What's the worst-case query cost for a user who follows 10000 people?
- Why does this fall over before you hit 1000 concurrent users?
$ git commit -m "feat(timeline): naive fan-out-on-read query"
- M3~10h
Fan-out on write — Redis sorted set per user
On post-create: enqueue a job. Worker writes post-id into each follower's Redis ZSET (score = created_at). Timeline read becomes ZREVRANGE on caller's ZSET.
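A minimal sketch of the write and read paths. To keep it runnable without a Redis server it uses `InMemoryZSets`, a hypothetical stand-in that mimics the shape of the `zadd` and `zrevrange` calls; the `timeline:{user_id}` key format and the function names are likewise assumptions. A real client would typically return members as bytes or strings rather than ints.

```python
class InMemoryZSets:
    """Hypothetical stand-in for the two sorted-set commands used below."""

    def __init__(self):
        self._sets = {}  # key -> {member: score}

    def zadd(self, key, mapping):
        # Re-adding an existing member only updates its score.
        self._sets.setdefault(key, {}).update(mapping)

    def zrevrange(self, key, start, stop):
        items = self._sets.get(key, {}).items()
        ordered = sorted(items, key=lambda kv: (kv[1], kv[0]), reverse=True)
        members = [member for member, _score in ordered]
        # Like Redis, stop is inclusive and -1 means "through the end".
        return members[start:] if stop == -1 else members[start:stop + 1]


def timeline_key(user_id):
    return f"timeline:{user_id}"  # assumed key naming


def fan_out(store, post_id, created_at, follower_ids):
    """Worker step: push one post id into every follower's sorted set."""
    for follower_id in follower_ids:
        store.zadd(timeline_key(follower_id), {post_id: created_at})


def read_timeline(store, user_id, limit=50):
    """Read path: newest post ids first, at most `limit` of them."""
    return store.zrevrange(timeline_key(user_id), 0, limit - 1)
```

Since a sorted set holds each member once, replaying the same fan-out job leaves the timeline unchanged.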
CHECK BEFORE MOVING ON:
- Why a sorted set (ZSET) instead of a list?
- What's the worst case if an author has 100M followers (celebrity problem)?
$ git commit -m "feat(timeline): redis ZSET fan-out-on-write"
- M4~8h
Celebrity hybrid — pull-on-read for users with >5K followers
Posts by 'celebrity' authors (>5K followers) skip fan-out. Reader merges celebrity-pull with own ZSET on read. Documented threshold.
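The read-side merge might be sketched as follows. It assumes both inputs arrive as lists of `(created_at, post_id)` tuples already sorted newest first; the function names and that tuple shape are illustrative choices, not a prescribed interface.

```python
import heapq

CELEBRITY_THRESHOLD = 5_000  # the documented threshold from this milestone


def should_fan_out(follower_count, threshold=CELEBRITY_THRESHOLD):
    """Authors above the threshold skip fan-out and are pulled on read."""
    return follower_count <= threshold


def merge_timeline(pushed, pulled, limit=50):
    """Merge the caller's own ZSET entries with pulled celebrity posts.

    Both inputs are lists of (created_at, post_id), newest first.
    """
    seen = set()
    merged = []
    # reverse=True tells heapq.merge the inputs are sorted descending.
    for _created_at, post_id in heapq.merge(pushed, pulled, reverse=True):
        if post_id in seen:
            continue  # a post may show up in both sources
        seen.add(post_id)
        merged.append(post_id)
        if len(merged) == limit:
            break
    return merged
```

The merge stops as soon as it has `limit` ids, so the reader does not walk a celebrity's full history.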
CHECK BEFORE MOVING ON:
- Where does the 5K threshold come from — what's the trade-off math?
- How do you detect 'celebrity' status without a stop-the-world re-bucket?
$ git commit -m "feat(timeline): hybrid celebrity pull-on-read"
- M5~6h
Hot post cache + dedup
LRU cache (Redis) for post bodies referenced by N+ timelines. Dedup: a post can't appear twice in the same timeline.
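To make the LRU behaviour concrete, here is a small in-process sketch built on `OrderedDict`. It is not the Redis-backed cache the milestone asks for, and it leaves out the "referenced by N+ timelines" admission rule; the class name and the `load_from_db` callback are assumptions.

```python
from collections import OrderedDict


class PostBodyCache:
    """Read-through LRU cache for post bodies (in-process illustration)."""

    def __init__(self, capacity, load_from_db):
        self._capacity = capacity
        self._load = load_from_db  # hypothetical callback: post_id -> body
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, post_id):
        if post_id in self._items:
            self._items.move_to_end(post_id)  # mark as most recently used
            self.hits += 1
            return self._items[post_id]
        self.misses += 1
        body = self._load(post_id)
        self._items[post_id] = body
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict least recently used
        return body
```

Timelines hold only post ids, so one cached body can serve every timeline that references it.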
CHECK BEFORE MOVING ON:
- Why cache the post body separately from the timeline ZSET?
- What's a good eviction policy for the post-body cache?
$ git commit -m "feat(cache): hot post body cache + per-timeline dedup"
- M6~8h
Load test with Locust — 10K writes/s sustained
Synthetic users post, follow, read. Ramp to 10K writes/s. Capture p50/p95/p99 latency for /timeline read. Document where the system bends.
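Locust reports percentiles for you, but it helps to know what the numbers mean. The sketch below computes them with the nearest-rank method from a list of latency samples; the function names and the sample data are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples_ms):
    """Summary stats for a list of request latencies in milliseconds."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


# Made-up data: 98 fast requests and 2 very slow ones.
report = summarize([10] * 98 + [1000] * 2)
```

In this made-up sample the mean is about 30 ms and both p50 and p95 are 10 ms, while p99 is 1000 ms: the slow tail is only visible at the high percentile.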
CHECK BEFORE MOVING ON:
- Why is p99 the metric that matters, not the average?
- What's the difference between 'breaks' and 'degrades gracefully' in a load test?
$ git commit -m "perf: locust suite + 10K write/s baseline"
- M7~6h
Observability — Grafana dashboard
Prometheus metrics for: fan-out queue depth, ZSET write rate, timeline read p95, celebrity hit rate. Grafana dashboard with screenshots in README.
CHECK BEFORE MOVING ON:
- Which of those metrics is the leading indicator vs lagging indicator?
- What single metric would page you at 3 AM?
$ git commit -m "ops: prom metrics + grafana dashboard"
- M8~6h
Architecture decision log + chaos test
ADRs for: write-through vs write-back, celebrity threshold, eviction policy. Chaos test: kill Redis mid-load, document the failure mode, decide retry vs degrade.
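If you choose "degrade" over "retry", the read path might look like this sketch: try the cache-backed timeline first and fall back to the slow database query when the cache is unreachable. The two reader callbacks are hypothetical, and the built-in `ConnectionError` stands in for whatever exception your Redis client actually raises.

```python
def read_timeline_with_fallback(read_from_cache, read_from_db, user_id, limit=50):
    """Return (post_ids, source), degrading to the database if the cache is down.

    read_from_cache and read_from_db are hypothetical callables that take
    (user_id, limit) and return a list of post ids.
    """
    try:
        return read_from_cache(user_id, limit), "cache"
    except ConnectionError:
        # Assumption: a real client raises its own connection exception type;
        # catch that one instead of the built-in in production code.
        return read_from_db(user_id, limit), "db-fallback"
```

Returning the source alongside the data makes it easy to count fallbacks during the chaos run and record the number in the report.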
CHECK BEFORE MOVING ON:
- Why are ADRs important even on a one-person project?
- What's the difference between availability and consistency in this design?
$ git commit -m "docs: ADRs + chaos test report"
60-second demo storyboard
What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.
- 0-5s: 'I built a Twitter-style feed at FAANG-interview scale — 10K writes/sec, 50ms p95 reads.'
- 5-20s: Show the architecture diagram (fan-out, celebrity hybrid, hot-post cache). Talk through it in 15 seconds.
- 20-35s: Live Locust run — 10K writes/s sustained, p95 latency dashboard.
- 35-50s: Walk one design decision (e.g. 'I picked the 5K celebrity threshold because…').
- 50-60s: 'Repo, dashboard, load-test report. Would love to talk about how you'd extend it for retweets.'
STAR talking points for behavioral round
STAR — SCALING DECISION
Situation: naive fan-out-on-read died at ~500 concurrent timeline reads. Task: explain why and fix it. Action: profiled the query and found that the JOIN of follows + posts was scanning a huge slice. Switched to fan-out-on-write with Redis ZSETs. Result: timeline read became a single ZREVRANGE for 50 entries (O(log N + 50) on a ZSET of N entries), sub-10ms p95.
STAR — CELEBRITY EDGE CASE
Situation: pure fan-out-on-write meant a celebrity post created 10M ZSET writes — would have taken minutes. Task: avoid the storm. Action: hybrid: skip fan-out for authors above 5K followers, have readers merge celebrity posts on read. Result: celebrity post-create latency went from minutes to milliseconds; reader work went up by maybe 1ms.
STAR — OBSERVABILITY
Situation: load test passed but I had no visibility into where time went. Task: instrument it. Action: added Prometheus metrics for queue depth, ZSET write rate, celebrity hit ratio. Built one Grafana dashboard. Result: caught a slow consumer in the fan-out worker pool in 30 seconds instead of bisecting code for an hour.
Production references — how grown-up systems do this
Twitter / X →
Twitter's 2010s engineering blog has the canonical fan-out architecture writeup — read it before starting milestone 3.
Instagram →
Instagram's feed-ranking writeup explains the practical version of celebrity hybrid + ranking layered on top.
Discord →
Discord's 'How we built our chat infrastructure' is a different shape but uses the same fan-out trade-offs — great compare-and-contrast.
Self-review rubric (before you claim done)
Correctness
- Timeline returns exactly 50 posts in created_at DESC order, no duplicates.
- Follows are visible from both sides: 'who does X follow' and 'who follows X' both return results consistent with the same follows rows.
- Celebrity posts appear in followers' timelines within 1 second of posting.
- Load test (Locust) reproducibly sustains 10K writes/s without queue-backup explosion.
Code quality
- Fan-out worker is idempotent — re-running a job doesn't duplicate timeline entries.
- Redis client uses connection pooling; no per-request connect.
- Schema migrations versioned with Alembic.
- No naked SQL strings — everything via SQLAlchemy or asyncpg parameterized queries.
Testing
- Unit tests for the fan-out worker (happy + retry + celebrity skip).
- Locust suite committed; one-command rerun.
- Chaos test documented: kill Redis mid-load, capture behavior.
- p50/p95/p99 latency reported in the README — actual numbers from your run.
Docs
- ADR for each major decision (fan-out direction, celebrity threshold, eviction policy).
- Grafana dashboard JSON + screenshots checked in.
- README has the architecture diagram up top — not buried halfway down.
- 'How would you extend this for retweets?' section with concrete answer.
✱ AI code review
Get a senior-style review before you call it done
Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.
Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.