Recommender system — collaborative filtering at million-row scale
Recommenders are everywhere — Netflix, Spotify, Pinterest, Amazon. Building one end-to-end teaches the rare combination of ML modelling AND production system design. This is the L4-tier portfolio piece that signals 'ready for senior IC' on resumes.
Resume bullet (when finished)
“Designed and deployed a hybrid (collaborative-filtering + content-based) recommender for a 1M-user × 100K-item dataset; served via FastAPI with vector lookups in <30 ms p95, evaluated offline against MAP@10 + nDCG, A/B-tested online via shadow deployment.”
Locked tech stack
No "choose your language": analysis paralysis kills completion. On your first build, follow the stack the milestones name (PyTorch, FAISS, FastAPI, Redis, MLflow, Grafana) to the letter.
Milestones (8 · ~62h)
- M1~6h
Dataset + baseline (popularity recommender)
Load MovieLens 1M (roughly one million ratings) or a similar dataset. Build a baseline that recommends the K most popular items the user hasn't seen. Measure MAP@10.
CHECK BEFORE MOVING ON:
- Why start with a popularity baseline before fancy ML?
- What does MAP@10 capture that simple accuracy doesn't?
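A minimal sketch of this milestone in plain Python, assuming interactions are already loaded into dicts of user id to item-id sets. The function names and the toy data are hypothetical, and the AP@K normalisation used here (dividing by min(K, number of relevant items)) is one common convention, not the only one.

```python
from collections import Counter


def popularity_ranking(train):
    """Rank item ids by how many users interacted with them in `train`."""
    counts = Counter(item for items in train.values() for item in items)
    # Sort by count (descending), then by item id so ties are deterministic.
    return [item for item, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def recommend_popular(user, train, ranking, k=10):
    """Return the k most popular items the user has not interacted with."""
    seen = train.get(user, set())
    recs = []
    for item in ranking:
        if item not in seen:
            recs.append(item)
            if len(recs) == k:
                break
    return recs


def average_precision_at_k(recs, relevant, k=10):
    """AP@k: precision at each hit position, averaged over min(k, |relevant|)."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recs[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(k, len(relevant))


def map_at_k(train, test, k=10):
    """Mean AP@k over users that have at least one held-out item."""
    ranking = popularity_ranking(train)
    users = [u for u, items in test.items() if items]
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(recommend_popular(u, train, ranking, k), test[u], k)
        for u in users
    )
    return total / len(users)


# Toy data (hypothetical): user -> set of item ids.
train = {"u1": {"a", "b"}, "u2": {"a", "c"}, "u3": {"a", "b", "d"}}
test = {"u1": {"c"}, "u2": {"b"}, "u3": {"e"}}
baseline_map = map_at_k(train, test, k=10)
```

Because AP@K rewards hits near the top of the list, two recommenders with the same hit count can score differently, which is what plain accuracy misses.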
$ git commit -m "feat: MovieLens loader + popularity baseline + MAP@10 eval"
- M2~10h
Matrix factorization with PyTorch
Train user and item embeddings (dim=64) on implicit feedback, using either ALS or a BPR loss. Beat the popularity baseline's MAP@10 by ≥20%.
CHECK BEFORE MOVING ON:
- Why treat this dataset as implicit feedback, and is the underlying signal ratings or clicks?
- What's the 'cold-start' problem and how does MF handle it?
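To make the BPR option concrete, here is a sketch written in plain NumPy rather than PyTorch, so the gradients are visible by hand; in the PyTorch build, autograd and an optimizer would replace `bpr_step`. The toy sizes, learning rate, regularisation strength and helper names are all assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def bpr_loss(P, Q, u, i, j):
    """-log sigmoid(p_u . (q_i - q_j)): low when seen item i outscores unseen item j."""
    x = P[u] @ (Q[i] - Q[j])
    return float(-np.log(sigmoid(x)))


def bpr_step(P, Q, u, i, j, lr=0.05, reg=0.01):
    """One SGD step on a (user, seen item, unseen item) triple, updating P and Q in place."""
    p, qi, qj = P[u].copy(), Q[i].copy(), Q[j].copy()
    g = 1.0 - sigmoid(p @ (qi - qj))  # magnitude of the loss gradient w.r.t. the score gap
    P[u] += lr * (g * (qi - qj) - reg * p)
    Q[i] += lr * (g * p - reg * qi)
    Q[j] += lr * (-g * p - reg * qj)


def mean_loss(P, Q, seen, n_items):
    """Average BPR loss over every (user, seen, unseen) triple in the toy data."""
    losses = [
        bpr_loss(P, Q, u, i, j)
        for u, items in seen.items()
        for i in items
        for j in range(n_items)
        if j not in items
    ]
    return sum(losses) / len(losses)


rng = np.random.default_rng(0)
n_users, n_items, dim = 4, 6, 8  # toy sizes; the milestone uses dim=64
P = rng.normal(scale=0.1, size=(n_users, dim))  # user embeddings
Q = rng.normal(scale=0.1, size=(n_items, dim))  # item embeddings
seen = {0: {0, 1}, 1: {1, 2}, 2: {3, 4}, 3: {4, 5}}  # hypothetical interactions

loss_before = mean_loss(P, Q, seen, n_items)
for _ in range(2000):
    u = int(rng.integers(n_users))
    i = int(rng.choice(sorted(seen[u])))
    j = int(rng.integers(n_items))
    while j in seen[u]:  # resample until j is an item the user has not seen
        j = int(rng.integers(n_items))
    bpr_step(P, Q, u, i, j)
loss_after = mean_loss(P, Q, seen, n_items)
```

Note that a user or item with no interactions never appears in a triple, so its embedding stays at its random initial value; that is the cold-start gap the next milestone addresses.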
$ git commit -m "feat(ml): MF embeddings via PyTorch"
- M3~8h
Content-based hybrid component
Build tag/genre embeddings for items. Score each item as a weighted mix of its CF and content scores, with the mixing weight tuned on the validation set.
CHECK BEFORE MOVING ON:
- When does CF beat content-based, and vice-versa?
- Why hybrid for cold-start items?
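One possible shape for the blend, assuming each user already has a vector of CF scores and a vector of content scores over the same items. Min-max scaling before mixing, recall@K as the tuning metric and the 11-point weight grid are illustrative choices, not requirements of the milestone.

```python
import numpy as np


def minmax(scores):
    """Scale scores to [0, 1] so CF and content scores are comparable."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:
        return np.zeros_like(scores, dtype=float)
    return (scores - lo) / (hi - lo)


def blend(cf, content, w):
    """Weighted mix: w=1 is pure CF, w=0 is pure content."""
    return w * minmax(cf) + (1.0 - w) * minmax(content)


def recall_at_k(scores, relevant, k):
    """Fraction of the relevant item indices that land in the top-k."""
    top = np.argsort(-scores)[:k]
    return len(set(top.tolist()) & relevant) / len(relevant)


def tune_weight(val_users, k=10, grid=None):
    """Pick the blend weight with the best mean recall@k on validation users."""
    grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
    best_w, best_recall = 0.0, -1.0
    for w in grid:
        mean_recall = float(
            np.mean([recall_at_k(blend(cf, ct, w), rel, k) for cf, ct, rel in val_users])
        )
        if mean_recall > best_recall:
            best_w, best_recall = float(w), mean_recall
    return best_w, best_recall


# Toy validation set (hypothetical): (cf_scores, content_scores, relevant item indices).
val_users = [
    # CF finds item 0; content alone would miss it.
    (np.array([0.9, 0.1, 0.0, 0.0]), np.array([0.0, 0.8, 0.5, 0.1]), {0}),
    # Item 3 is cold-start (CF score 0); only content can surface it.
    (np.array([0.5, 0.4, 0.3, 0.0]), np.array([0.1, 0.0, 0.2, 0.9]), {3}),
]
best_w, best_recall = tune_weight(val_users, k=2)
```

On this toy data neither pure CF nor pure content recovers both users' items, while an intermediate weight does, which is the argument for the hybrid.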
$ git commit -m "feat(ml): content-based blend"
- M4~8h
FAISS ANN index — sub-30ms top-K lookup
Build a FAISS index on item embeddings. Per-user top-K served by FAISS search. Measure p95 latency.
CHECK BEFORE MOVING ON:
- Why ANN instead of exact k-NN at million-item scale?
- What's the trade-off — recall@K vs latency vs index size?
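The recall-versus-latency trade-off can be measured with a small harness like the one below. `TinyIVF` is a toy NumPy stand-in for an inverted-file index and is not FAISS: it picks random items as centroids (FAISS trains them with k-means) and exists only so the sketch runs without extra dependencies. In the real build you would pass the FAISS index's search into the same `recall_at_k` and `p95_latency_ms` helpers, which are hypothetical names.

```python
import time

import numpy as np


def exact_top_k(item_vecs, query, k):
    """Brute-force top-k by inner product: the ground truth for recall."""
    scores = item_vecs @ query
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]


class TinyIVF:
    """Toy inverted-file index (NOT FAISS): bucket items by nearest centroid."""

    def __init__(self, item_vecs, nlist, rng):
        self.items = item_vecs
        picks = rng.choice(len(item_vecs), size=nlist, replace=False)
        self.centroids = item_vecs[picks]
        assign = np.argmax(item_vecs @ self.centroids.T, axis=1)
        self.lists = [np.flatnonzero(assign == c) for c in range(nlist)]

    def search(self, query, k, nprobe):
        """Score only the items in the nprobe buckets closest to the query."""
        probe = np.argsort(-(self.centroids @ query))[:nprobe]
        cand = np.concatenate([self.lists[c] for c in probe])
        scores = self.items[cand] @ query
        return cand[np.argsort(-scores)[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Share of the exact top-k that the approximate search also returned."""
    exact = set(np.asarray(exact_ids).tolist())
    return len(set(np.asarray(approx_ids).tolist()) & exact) / len(exact)


def p95_latency_ms(search_fn, queries):
    """95th-percentile wall-clock latency of search_fn over the queries."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        timings.append((time.perf_counter() - start) * 1000.0)
    return float(np.percentile(timings, 95))


rng = np.random.default_rng(0)
items = rng.normal(size=(2000, 64))
items /= np.linalg.norm(items, axis=1, keepdims=True)  # unit vectors: dot = cosine
queries = rng.normal(size=(50, 64))
index = TinyIVF(items, nlist=20, rng=rng)


def mean_recall(nprobe, k=10):
    return float(
        np.mean([recall_at_k(index.search(q, k, nprobe), exact_top_k(items, q, k)) for q in queries])
    )


recall_full = mean_recall(nprobe=20)  # probing every bucket is an exact search
recall_fast = mean_recall(nprobe=2)   # fewer buckets: faster, possibly lower recall
p95_fast = p95_latency_ms(lambda q: index.search(q, 10, 2), queries)
```

Sweeping `nprobe` and recording recall and p95 together gives the curve you need to justify whichever operating point you ship.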
$ git commit -m "feat(perf): FAISS ANN index for top-K serving"
- M5~8h
FastAPI serving layer + Redis feature store
GET /recommend?user_id=<id>&k=10 → 10 item ids. User embedding pulled from Redis, FAISS search, response under 30ms p95.
CHECK BEFORE MOVING ON:
- Why Redis between FastAPI and the model?
- What goes wrong if you load embeddings from disk per request?
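A sketch of the request path with the web framework and Redis left out, so it runs anywhere. `InMemoryStore` mimics only the `get`/`set` calls a Redis client would provide, the key format `user_emb:<id>` is made up, and the popularity fallback for unknown users is an assumption. In the real service this function would sit behind the FastAPI route and the brute-force scoring would be the FAISS search.

```python
import numpy as np


class InMemoryStore:
    """Stand-in for a Redis client: values are bytes, missing keys return None."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def put_user_embedding(store, user_id, vec):
    """Serialise the embedding as raw float32 bytes under a per-user key."""
    store.set(f"user_emb:{user_id}", np.asarray(vec, dtype=np.float32).tobytes())


def get_user_embedding(store, user_id):
    """Read the bytes back into a float32 vector, or None if the user is unknown."""
    raw = store.get(f"user_emb:{user_id}")
    return None if raw is None else np.frombuffer(raw, dtype=np.float32)


def recommend(user_id, k, store, item_vecs, item_ids, fallback):
    """Top-k item ids for a user; unknown users get the fallback list."""
    vec = get_user_embedding(store, user_id)
    if vec is None:
        return fallback[:k]
    scores = item_vecs @ vec  # brute force here; an ANN index search in production
    top = np.argsort(-scores)[:k]
    return [item_ids[i] for i in top]


# Toy catalogue (hypothetical). Item vectors are loaded once, not per request.
item_ids = ["a", "b", "c"]
item_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], dtype=np.float32)
popular = ["b", "a", "c"]

store = InMemoryStore()
put_user_embedding(store, 7, [1.0, 0.0])
known_user_recs = recommend(7, 2, store, item_vecs, item_ids, popular)
unknown_user_recs = recommend(999, 2, store, item_vecs, item_ids, popular)
```

Keeping the item matrix in process memory and fetching only the small per-user vector on each request is what keeps the per-request work small.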
$ git commit -m "feat(serving): FastAPI + Redis feature store"
- M6~6h
MLflow experiment tracking + model registry
Every train run logged in MLflow with hyperparams + metrics + artifact. /recommend reads which model id is current.
CHECK BEFORE MOVING ON:
- Why register models with versions instead of pickling?
- How would you roll back a bad release?
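To illustrate why versions plus a "current" pointer make rollback cheap, here is a toy registry backed by a JSON file. It is deliberately not MLflow's API: the class, method names and file layout are invented for this sketch, and in the project MLflow's tracking and model registry would play this role.

```python
import json
import tempfile
from pathlib import Path


class TinyRegistry:
    """Toy model registry (NOT MLflow): append-only versions plus a live pointer."""

    def __init__(self, path):
        self.path = Path(path)
        if not self.path.exists():
            self._save({"versions": [], "current": None})

    def _load(self):
        return json.loads(self.path.read_text())

    def _save(self, state):
        self.path.write_text(json.dumps(state, indent=2))

    def register(self, artifact_uri, params, metrics):
        """Record a trained model with its hyperparams and metrics; return its version."""
        state = self._load()
        version = len(state["versions"]) + 1
        state["versions"].append(
            {"version": version, "artifact_uri": artifact_uri, "params": params, "metrics": metrics}
        )
        self._save(state)
        return version

    def promote(self, version):
        """Point 'current' at an existing version. Rollback is the same call."""
        state = self._load()
        if not any(v["version"] == version for v in state["versions"]):
            raise ValueError(f"unknown model version: {version}")
        state["current"] = version
        self._save(state)

    def current(self):
        """Return the record the serving layer should load, or None."""
        state = self._load()
        for v in state["versions"]:
            if v["version"] == state["current"]:
                return v
        return None


with tempfile.TemporaryDirectory() as tmp:
    registry = TinyRegistry(Path(tmp) / "registry.json")
    # Artifact paths and metric values below are placeholders, not real results.
    v1 = registry.register("models/mf-v1.pt", {"dim": 64}, {"map_at_10": 0.10})
    v2 = registry.register("models/mf-v2.pt", {"dim": 64}, {"map_at_10": 0.09})
    registry.promote(v2)
    live_after_release = registry.current()["version"]
    registry.promote(v1)  # rollback: move the pointer, no retraining or redeploy
    live_after_rollback = registry.current()["version"]
```

The serving layer only ever asks "which version is current?", so a bad release is undone by moving the pointer back.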
$ git commit -m "feat(mlops): MLflow tracking + registry"
- M7~8h
Offline eval suite + drift dashboard
Nightly batch computes MAP@10 + nDCG on held-out data. Grafana panel for drift (mean recommendation entropy week-over-week).
CHECK BEFORE MOVING ON:
- What's drift in recommenders and why does it matter?
- Why nDCG in addition to MAP@10?
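A small sketch of the two numbers the nightly job would emit, assuming binary relevance for nDCG. Reading "recommendation entropy" as the Shannon entropy of how often each item appears across all served lists is one interpretation; the function names are hypothetical.

```python
import math
from collections import Counter


def ndcg_at_k(recs, relevant, k=10):
    """Binary-relevance nDCG@k: discounted gain of hits divided by the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recs[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def recommendation_entropy(rec_lists):
    """Shannon entropy (bits) of item exposure across all recommendation lists."""
    counts = Counter(item for recs in rec_lists for item in recs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


perfect = ndcg_at_k(["a", "b", "c"], {"a", "b"}, k=3)       # hits at ranks 1 and 2
late_hit = ndcg_at_k(["c", "a"], {"a"}, k=2)                # single hit at rank 2
diverse = recommendation_entropy([["a", "b"], ["c", "d"]])  # four items, equal exposure
collapsed = recommendation_entropy([["a", "a"], ["a", "a"]])  # one item gets everything
```

A week-over-week drop in entropy means the recommender is concentrating on fewer items, which can happen while MAP@10 stays flat.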
$ git commit -m "feat(eval): nightly metric job + drift dashboard"
- M8~8h
Shadow deployment + A/B test plan
New model serves shadow traffic for 24h. Statistical test outline (sample size, MDE, primary metric, guardrails) in docs.
CHECK BEFORE MOVING ON:
- Why shadow before rolling out?
- What's a minimum-detectable-effect calc and why do PMs care?
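For the test-plan doc, the sample-size calculation can be sketched with the standard normal-approximation formula for comparing two proportions. The primary metric (a click-through rate), the 10% baseline and the alpha and power values are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Users needed in each arm to detect an absolute lift of mde_abs.

    Two-sided test on two proportions, normal approximation:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / mde_abs^2
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical plan: 10% baseline click-through, detect a 1-point absolute lift.
n_one_point = sample_size_per_arm(0.10, 0.01)
# Halving the detectable effect roughly quadruples the traffic required.
n_half_point = sample_size_per_arm(0.10, 0.005)
```

This is why the MDE is a product decision as much as a statistical one: the smaller the lift you insist on detecting, the longer the test has to run.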
$ git commit -m "ops: shadow deploy + A/B test design doc"
60-second demo storyboard
What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.
- 0-5s: 'I built an end-to-end recommender — model, serving, eval, the works.'
- 5-20s: Show the architecture diagram (offline train → MLflow → registry → FAISS index → FastAPI → Redis).
- 20-40s: Live latency demo: send 1,000 GET /recommend requests, show p95 < 30ms in the Grafana panel.
- 40-55s: Walk through one MLOps decision (e.g., MLflow registry vs git tags) and the shadow-deploy plan.
- 55-60s: 'Repo + writeup. Would love to talk about how you'd extend to multi-objective ranking.'
STAR talking points for behavioral round
STAR — SCALING DECISION
Situation: exact k-NN over 100K items × 64-dim embeddings was costing 280ms p95, too slow for serving. Task: get under 30ms. Action: built a FAISS IVF index with nprobe=20 and measured recall@10 against exact search: 0.97, acceptable for a recommender. Result: 22ms p95 (a roughly 12.7× speedup) for a 3% recall loss that nobody noticed downstream.
STAR — EVAL DISCIPLINE
Situation: a model retrain claimed +8% MAP@10 in the training logs. Task: figure out whether that was a real lift. Action: ran the nightly offline eval suite on a true holdout slice and found the +8% was leakage (the validation split overlapped the training window in time). Result: caught an overstated model before it shipped, fixed the split, and measured the real lift at +2.3%.
STAR — SHADOW DEPLOY
Situation: new model needed validation before going live. Task: avoid an A/B test that would expose users to a bad model. Action: shadow deploy — new model serves traffic in parallel but response is logged only, not returned. Compared offline-equivalent metrics over 24h. Result: caught a regression in long-tail item coverage that wouldn't have shown in MAP@10 alone. Held release.
Production references — how grown-up systems do this
Netflix →
Netflix Tech Blog's writeups on their recommendation system are the gold standard — hybrid architecture, offline eval discipline, multi-objective ranking.
Spotify →
Spotify Engineering on Annoy, its open-source ANN library. Read it alongside the FAISS documentation from Meta AI Research for ANN at production scale.
MLflow →
MLflow's model registry docs — the simplest production-ready version of 'tracked experiments + versioned models'.
Self-review rubric (before you claim done)
Correctness
- Offline MAP@10 beats the popularity baseline by ≥20%.
- /recommend p95 latency < 30ms under realistic concurrency.
- Cold-start items return non-empty recommendations.
- Nightly eval job runs reproducibly and gates model promotion.
Code quality
- Train + serve paths share a single embedding-loader module.
- FAISS index is rebuilt by a versioned script committed to the repo.
- All hyperparams via config file or env, never hard-coded.
- Tests cover the feature-store reads + the ranking logic.
Testing
- Offline eval suite committed + runnable in <5min.
- Drift dashboard shows at least 7 days of historical data.
- Integration test exercises the full serve path with synthetic data.
Docs
- Architecture diagram up-top in README.
- Numbers in the README come from a single live training run (not 'made up for the resume').
- Section: 'how would you extend this for multi-objective ranking (relevance + diversity + recency)?'
- Shadow-deploy + A/B test design doc lives in /docs.
Tick the rubric items honestly, write the README, push to GitHub, and get a senior-style code review on the PR. Once it's clean, email support@learnpython.academy with the repo link; we feature the best ones on /success-stories.