L4 · Distributed Systems · 60-100h total

News feed at scale — fan-out, caching, 10K writes/s ready

This is a classic large-scale system-design interview problem you'll see at top engineering teams. Building it for real — not just whiteboarding — turns 'I've read the book' into 'I've shipped it'. A strong signal a junior can carry into a more senior-leaning interview.

Resume bullet (when finished)

Designed and built a Twitter-style news feed service: fan-out-on-write timeline assembly, Redis caching, Postgres as source of truth, load-tested to 10K writes/s with p95 latency under 50 ms. Handles the celebrity-user fan-out edge case explicitly.

Locked tech stack

No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.

Python 3.12 · FastAPI · PostgreSQL (writes) · Redis (timeline cache + dedupe) · RQ / Celery (background fan-out) · asyncpg · Locust (load testing) · Grafana + Prometheus

Milestones (8 · ~56h)

  1. M1 · ~8h

    Schema + basic Post/Follow/User CRUD

    PostgreSQL schema: users, follows (composite PK), posts (with author_id, created_at index). REST endpoints for create-user, follow, create-post.

    CHECK BEFORE MOVING ON:

    • Why is the follows table's primary key (follower_id, followee_id) instead of a synthetic id?
    • What index do you need for 'show me the latest 50 posts by user X'?
    $ git commit -m "feat: users + follows + posts schema and CRUD"
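The schema described in this milestone can be sketched as below. This is a minimal illustration, not the project's reference schema: it uses the standard-library `sqlite3` module as a stand-in so it runs anywhere, whereas the project itself targets PostgreSQL. The column names beyond `author_id` and `created_at`, the index name `posts_author_created`, and the helpers `follow` and `latest_posts` are assumptions made for the example.

```python
import sqlite3

# Stand-in only: the project targets PostgreSQL; sqlite3 keeps the sketch runnable.
SCHEMA = """
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    handle TEXT NOT NULL UNIQUE
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(id),
    followee_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (follower_id, followee_id)  -- one row per relationship
);
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES users(id),
    body TEXT NOT NULL,
    created_at INTEGER NOT NULL  -- epoch seconds in this sketch
);
-- Serves "latest N posts by author X" straight from the index.
CREATE INDEX posts_author_created ON posts (author_id, created_at DESC);
"""


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def follow(conn, follower_id, followee_id):
    """Return True if the follow was created, False if it already existed."""
    try:
        conn.execute("INSERT INTO follows VALUES (?, ?)", (follower_id, followee_id))
        return True
    except sqlite3.IntegrityError:
        return False


def latest_posts(conn, author_id, limit=50):
    rows = conn.execute(
        "SELECT id FROM posts WHERE author_id = ? ORDER BY created_at DESC LIMIT ?",
        (author_id, limit),
    )
    return [row[0] for row in rows]


conn = make_db()
conn.executemany("INSERT INTO users (id, handle) VALUES (?, ?)", [(1, "ada"), (2, "bob")])
conn.executemany(
    "INSERT INTO posts (id, author_id, body, created_at) VALUES (?, ?, ?, ?)",
    [(10, 2, "first", 100), (11, 2, "second", 200)],
)
first_follow = follow(conn, 1, 2)
second_follow = follow(conn, 1, 2)  # rejected by the composite primary key
```

The composite key makes a duplicate follow impossible at the database level, so the application does not need a separate "already following?" check.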
  2. M2 · ~4h

    Naive timeline — fan-out on read

    GET /timeline/me returns 50 most recent posts from people I follow. Naive query joins follows + posts ORDER BY created_at DESC LIMIT 50.

    CHECK BEFORE MOVING ON:

    • What's the worst-case query cost for a user who follows 10000 people?
    • Why does this fall over before you hit 1000 concurrent users?
    $ git commit -m "feat(timeline): naive fan-out-on-read query"
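One possible shape for the naive query is sketched below, again with `sqlite3` standing in for PostgreSQL so the example is runnable. The table layout is trimmed to the columns the query touches, and the function name `naive_timeline` is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.executescript("""
CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER,
                      PRIMARY KEY (follower_id, followee_id));
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, created_at INTEGER);
""")
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(10, 2, 100), (11, 3, 150), (12, 4, 175), (13, 2, 200)],
)

# Fan-out on read: all the work happens at query time, on every request.
NAIVE_TIMELINE = """
SELECT p.id
FROM posts AS p
JOIN follows AS f ON f.followee_id = p.author_id
WHERE f.follower_id = ?
ORDER BY p.created_at DESC
LIMIT ?
"""


def naive_timeline(conn, user_id, limit=50):
    return [row[0] for row in conn.execute(NAIVE_TIMELINE, (user_id, limit))]


timeline = naive_timeline(conn, 1)  # user 1 follows authors 2 and 3, not 4
```

Every read repeats the join and the sort across all posts by all followees, which is the cost the next milestone moves to write time.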
  3. M3 · ~10h

    Fan-out on write — Redis sorted set per user

    On post-create: enqueue a job. Worker writes post-id into each follower's Redis ZSET (score = created_at). Timeline read becomes ZREVRANGE on caller's ZSET.

    CHECK BEFORE MOVING ON:

    • Why a sorted set (ZSET) instead of a list?
    • What's the worst case if an author has 100M followers (celebrity problem)?
    $ git commit -m "feat(timeline): redis ZSET fan-out-on-write"
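A minimal sketch of the write and read paths follows. To stay runnable without a Redis server it uses a small in-memory class, `InMemoryZSets`, which is hypothetical and only mimics the three calls used here; in redis-py the corresponding methods are `zadd(name, mapping)`, `zrevrange(name, start, end)` and `zremrangebyrank(name, min, max)`. The key format `timeline:{user_id}` and the `TIMELINE_CAP` value are assumptions, not part of the brief.

```python
class InMemoryZSets:
    """Illustrative stand-in for the sorted-set calls used below (not a real client)."""

    def __init__(self):
        self._sets = {}  # key -> {member: score}

    def _ascending(self, name):
        zset = self._sets.get(name, {})
        return sorted(zset, key=lambda member: (zset[member], member))

    def zadd(self, name, mapping):
        zset = self._sets.setdefault(name, {})
        added = sum(1 for member in mapping if member not in zset)
        zset.update(mapping)  # re-adding a member only updates its score
        return added

    def zrevrange(self, name, start, end):
        # This sketch supports non-negative indexes only.
        return self._ascending(name)[::-1][start:end + 1]

    def zremrangebyrank(self, name, lo, hi):
        zset = self._sets.get(name, {})
        ordered = self._ascending(name)
        n = len(ordered)
        lo = lo if lo >= 0 else max(n + lo, 0)
        hi = hi if hi >= 0 else n + hi
        doomed = ordered[lo:hi + 1] if hi >= lo else []
        for member in doomed:
            del zset[member]
        return len(doomed)


TIMELINE_CAP = 800  # assumption: maximum entries kept per timeline


def timeline_key(user_id):
    return f"timeline:{user_id}"


def fan_out(client, post_id, created_at, follower_ids, cap=TIMELINE_CAP):
    """Write one post id into every follower's sorted set, then trim the oldest."""
    for follower_id in follower_ids:
        key = timeline_key(follower_id)
        client.zadd(key, {str(post_id): created_at})
        client.zremrangebyrank(key, 0, -(cap + 1))


def read_timeline(client, user_id, limit=50):
    """Newest-first post ids for one user."""
    return client.zrevrange(timeline_key(user_id), 0, limit - 1)


client = InMemoryZSets()
fan_out(client, post_id=10, created_at=100.0, follower_ids=[1, 2], cap=2)
fan_out(client, post_id=11, created_at=200.0, follower_ids=[1], cap=2)
fan_out(client, post_id=11, created_at=200.0, follower_ids=[1], cap=2)  # retried job
fan_out(client, post_id=12, created_at=300.0, follower_ids=[1], cap=2)
```

Because a sorted set stores each member once, re-running a fan-out job overwrites the same entry instead of appending a second one, which is one reason to prefer it over a list. Against a real Redis the loop would usually be batched through a pipeline, and members come back as bytes unless the client decodes responses.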
  4. M4 · ~8h

    Celebrity hybrid — pull-on-read for users with >5K followers

    Posts by 'celebrity' authors (>5K followers) skip fan-out. On read, the reader merges recent posts pulled from the celebrities it follows with its own ZSET. Document the threshold and how you chose it.

    CHECK BEFORE MOVING ON:

    • Where does the 5K threshold come from — what's the trade-off math?
    • How do you detect 'celebrity' status without a stop-the-world re-bucket?
    $ git commit -m "feat(timeline): hybrid celebrity pull-on-read"
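The read-side merge might look like the sketch below. It assumes each source is already sorted newest-first as `(created_at, post_id)` pairs: one list from the reader's own ZSET and one list per followed celebrity. The function names `merge_timeline` and `is_celebrity` are hypothetical; only the >5K threshold comes from the milestone text.

```python
import heapq

CELEBRITY_THRESHOLD = 5_000  # from the milestone: authors above this skip fan-out


def is_celebrity(follower_count, threshold=CELEBRITY_THRESHOLD):
    return follower_count > threshold


def merge_timeline(fanned_out, celebrity_feeds, limit=50):
    """Merge newest-first (created_at, post_id) lists, dropping duplicate post ids.

    fanned_out: entries from the reader's own precomputed timeline.
    celebrity_feeds: one newest-first list per followed celebrity.
    """
    # heapq.merge streams the inputs lazily; reverse=True expects descending inputs.
    merged = heapq.merge(fanned_out, *celebrity_feeds, reverse=True)
    seen = set()
    result = []
    for _created_at, post_id in merged:
        if post_id in seen:
            continue
        seen.add(post_id)
        result.append(post_id)
        if len(result) == limit:
            break
    return result


own = [(300, "p3"), (100, "p1")]
celeb_a = [(250, "c2"), (50, "c1")]
celeb_b = [(400, "d1"), (300, "p3")]  # p3 also reached the reader via fan-out
page = merge_timeline(own, [celeb_a, celeb_b], limit=4)
```

The merge stops as soon as the page is full, so reader cost grows with the number of followed celebrities rather than with their total post count.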
  5. M5 · ~6h

    Hot post cache + dedup

    LRU cache (Redis) for post bodies referenced by N+ timelines. Dedup: a post can't appear twice in the same timeline.

    CHECK BEFORE MOVING ON:

    • Why cache the post body separately from the timeline ZSET?
    • What's a good eviction policy for the post-body cache?
    $ git commit -m "feat(cache): hot post body cache + per-timeline dedup"
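As a rough illustration of the two ideas in this milestone, the sketch below pairs a read-through LRU cache with a dedup pass over post ids. It is an in-process stand-in built on `OrderedDict`, not a Redis client; in Redis itself, eviction would more typically come from a server-side `maxmemory-policy` such as `allkeys-lru`. The names `PostBodyCache` and `hydrate` are assumptions, and the sketch caches every requested body rather than only those referenced by N+ timelines.

```python
from collections import OrderedDict


class PostBodyCache:
    """In-process LRU stand-in for a Redis cache of post bodies."""

    def __init__(self, capacity, load_body):
        self.capacity = capacity
        self.load_body = load_body  # fallback loader, e.g. a Postgres lookup
        self._items = OrderedDict()
        self.misses = 0

    def get(self, post_id):
        if post_id in self._items:
            self._items.move_to_end(post_id)  # mark as most recently used
            return self._items[post_id]
        self.misses += 1
        body = self.load_body(post_id)
        self._items[post_id] = body
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
        return body

    def __contains__(self, post_id):
        return post_id in self._items


def hydrate(post_ids, cache):
    """Turn timeline post ids into bodies, emitting each post at most once."""
    seen, bodies = set(), []
    for post_id in post_ids:
        if post_id not in seen:
            seen.add(post_id)
            bodies.append(cache.get(post_id))
    return bodies


store = {"10": "hello", "11": "world", "12": "again"}  # pretend source of truth
cache = PostBodyCache(capacity=2, load_body=store.__getitem__)
bodies = hydrate(["10", "11", "10"], cache)  # duplicate id is skipped
cache.get("10")  # hit: "10" becomes most recently used
cache.get("12")  # miss: evicts "11", the least recently used
misses_so_far = cache.misses
```

Keeping bodies out of the timeline ZSET means one cached body serves every timeline that references the post.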
  6. M6 · ~8h

    Load test with Locust — 10K writes/s sustained

    Synthetic users post, follow, read. Ramp to 10K writes/s. Capture p50/p95/p99 latency for /timeline read. Document where the system bends.

    CHECK BEFORE MOVING ON:

    • Why is p99 the metric that matters, not the average?
    • What's the difference between 'breaks' and 'degrades gracefully' in a load test?
    $ git commit -m "perf: locust suite + 10K write/s baseline"
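To see why the tail percentiles matter more than the average, here is a small worked example using the nearest-rank method. The latency samples are invented for illustration and are not results from any run; Locust reports its own percentiles, so this helper is only a way to reason about the numbers.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 95 fast requests and 5 requests that hit a slow path.
latencies_ms = [10] * 95 + [1000] * 5

summary = {
    "mean": sum(latencies_ms) / len(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
```

Here the mean is 59.5 ms, a latency no request actually experienced, while p50 and p95 are both 10 ms and p99 is 1000 ms. Only the p99 figure reveals that one request in twenty takes a full second.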
  7. M7 · ~6h

    Observability — Grafana dashboard

    Prometheus metrics for: fan-out queue depth, ZSET write rate, timeline read p95, celebrity hit rate. Grafana dashboard with screenshots in README.

    CHECK BEFORE MOVING ON:

    • Which of those metrics is the leading indicator vs lagging indicator?
    • What single metric would page you at 3 AM?
    $ git commit -m "ops: prom metrics + grafana dashboard"
  8. M8 · ~6h

    Architecture decision log + chaos test

    ADRs for: write-through vs write-back, celebrity threshold, eviction policy. Chaos test: kill Redis mid-load, document the failure mode, decide retry vs degrade.

    CHECK BEFORE MOVING ON:

    • Why are ADRs important even on a one-person project?
    • What's the difference between availability and consistency in this design?
    $ git commit -m "docs: ADRs + chaos test report"
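For the chaos test, the "degrade" option could be sketched as a read path that falls back to the database when the cache is unreachable. Everything named here is hypothetical: `CacheUnavailable` stands in for whatever connection error your Redis client raises, and `read_cache` / `read_db` are placeholders for the ZSET read and the naive query from earlier milestones.

```python
class CacheUnavailable(Exception):
    """Hypothetical error type; a real client raises its own connection errors."""


def read_timeline_degraded(user_id, read_cache, read_db, limit=50):
    """Return (post_ids, source), falling back to the database if the cache is down."""
    try:
        return read_cache(user_id, limit), "cache"
    except CacheUnavailable:
        # Slower, but the reader still gets a timeline instead of an error.
        return read_db(user_id, limit), "db"


def healthy_cache(user_id, limit):
    return ["12", "11"][:limit]


def dead_cache(user_id, limit):
    raise CacheUnavailable("connection refused")


def database(user_id, limit):
    return ["12", "11", "10"][:limit]


normal = read_timeline_degraded(1, healthy_cache, database)
degraded = read_timeline_degraded(1, dead_cache, database)
```

Returning the source alongside the data makes it easy to count fallbacks, which is the number to watch during the chaos run. Whether the database can absorb that fallback traffic under load is exactly what the test should document.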

60-second demo storyboard

What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.

  1. 0-5s: 'I built a Twitter-style feed at FAANG-interview scale — 10K writes/sec, 50ms p95 reads.'
  2. 5-20s: Show the architecture diagram (fan-out, celebrity hybrid, hot-post cache). Talk through it in 15 seconds.
  3. 20-35s: Live Locust run — 10K writes/s sustained, p95 latency dashboard.
  4. 35-50s: Walk one design decision (e.g. 'I picked the 5K celebrity threshold because…').
  5. 50-60s: 'Repo, dashboard, load-test report. Would love to talk about how you'd extend it for retweets.'

STAR talking points for behavioral round

STAR — SCALING DECISION

Situation: naive fan-out-on-read died at ~500 concurrent timeline reads. Task: explain why and fix it. Action: profiled the query and found the JOIN of follows + posts was scanning a huge slice. Switched to fan-out-on-write with Redis ZSETs. Result: timeline read became a single ZREVRANGE for 50 entries, O(log N + 50), with sub-10ms p95.

STAR — CELEBRITY EDGE CASE

Situation: pure fan-out-on-write meant a celebrity post created 10M ZSET writes — would have taken minutes. Task: avoid the storm. Action: hybrid: skip fan-out for authors above 5K followers, have readers merge celebrity posts on read. Result: celebrity post-create latency went from minutes to milliseconds; reader work went up by maybe 1ms.

STAR — OBSERVABILITY

Situation: load test passed but I had no visibility into where time went. Task: instrument it. Action: added Prometheus metrics: a gauge for queue depth, counters for ZSET writes and celebrity hits. Built one Grafana dashboard. Result: caught a slow consumer in the fan-out worker pool in 30 seconds instead of bisecting code for an hour.

Production references — how grown-up systems do this

Twitter / X

Twitter's 2010s engineering blog has the canonical fan-out architecture writeup — read it before starting milestone 3.

Instagram

Instagram's feed-ranking writeup explains the practical version of celebrity hybrid + ranking layered on top.

Discord

Discord's 'How we built our chat infrastructure' is a different shape but uses the same fan-out trade-offs — great compare-and-contrast.

Self-review rubric (before you claim done)

Correctness

  • Timeline returns exactly 50 posts in created_at DESC order, no duplicates.
  • Follows are visible from both sides: 'who do I follow' and 'who follows me' queries return consistent results for the same relationship.
  • Celebrity posts appear in followers' timelines within 1 second of posting.
  • Load test (Locust) reproducibly sustains 10K writes/s without queue-backup explosion.

Code quality

  • Fan-out worker is idempotent — re-running a job doesn't duplicate timeline entries.
  • Redis client uses connection pooling; no per-request connect.
  • Schema migrations versioned with Alembic.
  • No naked SQL strings — everything via SQLAlchemy or asyncpg parameterized queries.

Testing

  • Unit tests for the fan-out worker (happy + retry + celebrity skip).
  • Locust suite committed; one-command rerun.
  • Chaos test documented: kill Redis mid-load, capture behavior.
  • p50/p95/p99 latency reported in the README — actual numbers from your run.

Docs

  • ADR for each major decision (fan-out direction, celebrity threshold, eviction policy).
  • Grafana dashboard JSON + screenshots checked in.
  • README has the architecture diagram up top — not buried halfway down.
  • 'How would you extend this for retweets?' section with concrete answer.

✱ AI code review

Get a senior-style review before you call it done

Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.

Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.
