Is CodeMentor AI free?

The first 15 Python lessons are free with no signup and no credit card. After that, the 7-day Pro trial unlocks every track; cancel anytime. Pro is $12/month or $89/year.

Can I learn Python without installing anything?

Yes. Every lesson runs Python in your browser — Skulpt for lightweight lessons and Pyodide (full CPython) on the playground. No Anaconda, no pyenv, no terminal commands. Open the page and hit Run.

Is CodeMentor AI good for complete beginners?

Yes. The Foundations track starts with print('Hello, World!') and assumes zero programming background. The first 15 lessons are free, so you can check that the difficulty curve suits you before signing up.

Does the AI tutor replace a human mentor?

It replaces 80% of "I'm stuck at 21:00 and Stack Overflow scared me" moments. You get hints calibrated to your code + a chat for follow-up questions. For project review and career advice the team also answers support@learnpython.academy directly.

Can I learn Python for an AI engineering job?

Yes. The AI Engineering track covers production patterns the US dev community uses in 2026 — Claude/LLM APIs, tool use, RAG, agent loops, prompt caching, evals, voice agents. Build production AI features end-to-end.

Are the courses available in languages other than English?

Yes — the platform UI and most lessons are translated into 18 languages including Ukrainian, Russian, Polish, German, French, Spanish, Portuguese, and more. Pick yours in the language switcher.


L4 · ML · Distributed Systems · 50-85h total

Recommender system — collaborative filtering at million-row scale

Recommenders are everywhere — Netflix, Spotify, Pinterest, Amazon. Building one end-to-end teaches the rare combination of ML modelling AND production system design. This is the L4-tier portfolio piece that signals 'ready for senior IC' on resumes.

Resume bullet (when finished)

“Designed and deployed a hybrid (collaborative-filtering + content-based) recommender for a 1M-user × 100K-item dataset; served via FastAPI with vector lookups in <30 ms p95, evaluated offline against MAP@10 + nDCG, A/B-tested online via shadow deployment.”

Locked tech stack

No "choose your language" — analysis paralysis kills completion. Follow the stack to the letter on your first build.

Python 3.12 · PyTorch (matrix factorization) · FAISS (ANN search) · FastAPI · Redis (feature store) · MLflow · Docker

Milestones (8 · ~62h)

M1 · ~6h
Dataset + baseline (popularity recommender)
Load MovieLens 1M or similar. Build a 'recommend the most popular K items the user hasn't seen' baseline. Measure MAP@10.
CHECK BEFORE MOVING ON:
- Why start with a popularity baseline before fancy ML?
- What does MAP@10 capture that simple accuracy doesn't?
$ git commit -m "feat: MovieLens loader + popularity baseline + MAP@10 eval"
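A minimal stdlib sketch of the M1 baseline and its metric, assuming interactions are stored as a dict mapping each user to a set of item ids. The function names and the data shape are illustrative, not part of the project brief, and AP@k is normalized by min(|relevant|, k), which is one common convention among several.

```python
from collections import Counter


def popularity_recommend(train, user, k=10):
    """Recommend the k most-interacted items the user has not seen yet."""
    counts = Counter(item for items in train.values() for item in items)
    seen = train.get(user, set())
    # Sort by count descending, then by item id so ties are deterministic.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return [item for item in ranked if item not in seen][:k]


def average_precision_at_k(recommended, relevant, k=10):
    """AP@k: sum of precision@rank at each hit, divided by min(|relevant|, k)."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)


def map_at_k(train, held_out, k=10):
    """Mean AP@k over users that have at least one held-out item."""
    users = [u for u in held_out if held_out[u]]
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(popularity_recommend(train, u, k), held_out[u], k)
        for u in users
    )
    return total / len(users)
```

Because AP rewards hits near the top of the list, a model that finds the right item at rank 2 scores lower than one that finds it at rank 1, which plain accuracy would not distinguish.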
M2 · ~10h
Matrix factorization with PyTorch
Train user and item embeddings (dim=64) on implicit feedback, using BPR loss or the weighted squared-error objective that ALS optimizes. Beat popularity by ≥20% MAP@10.
CHECK BEFORE MOVING ON:
- Why treat this dataset as implicit feedback (clicks) rather than explicit ratings?
- What's the 'cold-start' problem and how does MF handle it?
$ git commit -m "feat(ml): MF embeddings via PyTorch"
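To make the BPR objective from M2 concrete, here is a small sketch of the loss for a single (user, positive item, negative item) triple in plain Python. It is only an illustration of the formula, with hypothetical function names; in the actual milestone the same quantity would be computed on PyTorch tensors over batches (for example via `torch.nn.functional.logsigmoid`).

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """BPR loss for one triple: -log(sigmoid(score_pos - score_neg))."""
    diff = dot(user_vec, pos_item_vec) - dot(user_vec, neg_item_vec)
    # -log(sigmoid(diff)) equals softplus(-diff); this form avoids overflow
    # in exp() when |diff| is large.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss only depends on the score difference, so the model is trained to rank an interacted item above a non-interacted one rather than to predict a rating.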
M3 · ~8h
Content-based hybrid component
Tag/genre embeddings for items. Compute a weighted mix of CF and content scores per item, with the weight tuned on the validation set.
CHECK BEFORE MOVING ON:
- When does CF beat content-based, and vice-versa?
- Why hybrid for cold-start items?
$ git commit -m "feat(ml): content-based blend"
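One way the M3 blend could look, sketched with plain dicts of per-item scores. The grid of candidate weights, the hit-count criterion, and all function names are assumptions for illustration; the milestone itself only asks for a weight tuned on the validation set, so any validation metric (MAP@10, for instance) could take the place of the hit count.

```python
def blend_scores(cf_scores, content_scores, weight):
    """Per-item mix: weight * CF score + (1 - weight) * content score."""
    items = set(cf_scores) | set(content_scores)
    return {
        item: weight * cf_scores.get(item, 0.0)
        + (1.0 - weight) * content_scores.get(item, 0.0)
        for item in items
    }


def top_k(scores, k):
    """Item ids with the k highest scores; ties broken by item id."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def tune_weight(cf_scores, content_scores, relevant, k,
                grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return the first grid weight that maximizes validation hits in the top-k."""
    def hits(weight):
        ranked = top_k(blend_scores(cf_scores, content_scores, weight), k)
        return sum(1 for item in ranked if item in relevant)
    return max(grid, key=hits)
```

An item with no CF score (a cold-start item) still receives a non-zero blended score whenever the weight is below 1, which is the reason for the hybrid.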
M4 · ~8h
FAISS ANN index — sub-30ms top-K lookup
Build a FAISS index on item embeddings. Per-user top-K served by FAISS search. Measure p95 latency.
CHECK BEFORE MOVING ON:
- Why ANN instead of exact k-NN at million-item scale?
- What's the trade-off — recall@K vs latency vs index size?
$ git commit -m "feat(perf): FAISS ANN index for top-K serving"
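For the "measure p95 latency" step in M4, a small stdlib timing harness can help. In this sketch `search` is any callable standing in for the index lookup (it does not call FAISS itself), and the nearest-rank percentile is one of several common definitions; both the names and that choice are assumptions.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with >= pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ms(search, queries):
    """Time each search(query) call and return per-call latency in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

A typical use would be `percentile(measure_latencies_ms(search, queries), 95)`, ideally after a few warm-up calls so that one-off setup costs do not dominate the tail.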
M5 · ~8h
FastAPI serving layer + Redis feature store
GET /recommend?user_id=<id>&k=10 → 10 item ids. User embedding pulled from Redis, FAISS search, response under 30ms p95.
CHECK BEFORE MOVING ON:
- Why Redis between FastAPI and the model?
- What goes wrong if you load embeddings from disk per request?
$ git commit -m "feat(serving): FastAPI + Redis feature store"
M6 · ~6h
MLflow experiment tracking + model registry
Every train run logged in MLflow with hyperparams + metrics + artifact. /recommend reads the current model id from the registry.
CHECK BEFORE MOVING ON:
- Why register models with versions instead of pickling?
- How would you roll back a bad release?
$ git commit -m "feat(mlops): MLflow tracking + registry"
M7 · ~8h
Offline eval suite + drift dashboard
Nightly batch computes MAP@10 + nDCG on held-out data. Grafana panel for drift (mean recommendation entropy week-over-week).
CHECK BEFORE MOVING ON:
- What's drift in recommenders and why does it matter?
- Why nDCG in addition to MAP@10?
$ git commit -m "feat(eval): nightly metric job + drift dashboard"
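The two M7 quantities can be sketched in a few lines of stdlib Python. This assumes binary relevance for nDCG and reads "recommendation entropy" as the Shannon entropy of the item distribution across all served lists; the brief does not pin down either definition, so treat both as one reasonable interpretation.

```python
import math
from collections import Counter


def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking over the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def recommendation_entropy(recommendation_lists):
    """Shannon entropy in bits of item frequencies across all served lists."""
    counts = Counter(item for recs in recommendation_lists for item in recs)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A week-over-week drop in entropy suggests the model is concentrating on fewer items, which is the kind of drift a ranking metric alone may not reveal.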
M8 · ~8h
Shadow deployment + A/B test plan
New model serves shadow traffic for 24h. Statistical test outline (sample size, MDE, primary metric, guardrails) in docs.
CHECK BEFORE MOVING ON:
- Why shadow before rolling out?
- What's a minimum-detectable-effect calc and why do PMs care?
$ git commit -m "ops: shadow deploy + A/B test design doc"
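As a starting point for the sample-size part of the M8 test outline, the sketch below uses the standard normal-approximation formula for a two-sided two-proportion z-test. It assumes the primary metric is a rate (such as click-through) and that the MDE is an absolute lift; the function name and defaults are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Users needed per arm to detect an absolute lift of mde_abs.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / mde_abs^2
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

Since the MDE enters squared in the denominator, halving the effect you want to detect roughly quadruples the traffic required, which is why the MDE drives how long a test has to run.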

60-second demo storyboard

What you say in the recruiter screen when they ask "tell me about your latest project." Practice it out loud.

0-5s: 'I built an end-to-end recommender — model, serving, eval, the works.'
5-20s: Show the architecture diagram (offline train → MLflow → registry → FAISS index → FastAPI → Redis).
20-40s: Live latency demo: send 1000 GET requests to /recommend, show p95 < 30ms in the Grafana panel.
40-55s: Walk through one MLOps decision (e.g. MLflow registry vs git tags) and the shadow-deploy plan.
55-60s: 'Repo + writeup. Would love to talk about how you'd extend to multi-objective ranking.'

STAR talking points for behavioral round

STAR — SCALING DECISION

Situation: exact k-NN over 100K items × 64-dim embeddings was costing 280ms p95, too slow for serving. Task: get under 30ms. Action: built a FAISS IVF index with nprobe=20 and measured recall@10 against exact search at 0.97, acceptable for a recommender. Result: 22ms p95, a roughly 13× speedup, for a 3% recall sacrifice that nobody noticed downstream.
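The recall@10 comparison in this story can be reproduced with a brute-force ground truth. The sketch below is stdlib only and uses inner-product scoring; the approximate ids would come from whatever ANN index is under test, and the function names are hypothetical.

```python
def exact_top_k(query, item_vectors, k):
    """Brute-force inner-product search: the ground truth for an ANN index."""
    scores = {
        item: sum(q * v for q, v in zip(query, vec))
        for item, vec in item_vectors.items()
    }
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)
```

Averaging `recall_at_k` over a sample of query vectors gives a single number to weigh against the latency gain when choosing index parameters.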

STAR — EVAL DISCIPLINE

Situation: a retrained model claimed +8% MAP@10 in the training logs. Task: figure out whether the lift was real. Action: ran the nightly offline eval suite on a true holdout slice and found the +8% was leakage (the val split overlapped train in time). Result: caught a model that would otherwise have shipped, fixed the split, and measured the real lift at +2.3%.
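A time-based split is one simple guard against the leakage described here. This sketch assumes interactions are (user, item, timestamp) tuples and that a single cutoff timestamp separates train from validation; the helper names are illustrative.

```python
def temporal_split(interactions, cutoff):
    """Split (user, item, timestamp) rows: train strictly before cutoff, val from cutoff on."""
    train = [row for row in interactions if row[2] < cutoff]
    val = [row for row in interactions if row[2] >= cutoff]
    return train, val


def has_time_leakage(train, val):
    """True if any training row is as late as the earliest validation row."""
    if not train or not val:
        return False
    return max(row[2] for row in train) >= min(row[2] for row in val)
```

Running `has_time_leakage` as an assertion inside the eval job is a cheap way to make an overlapping split fail loudly instead of inflating the metric.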

STAR — SHADOW DEPLOY

Situation: new model needed validation before going live. Task: avoid an A/B test that would expose users to a bad model. Action: shadow deploy — new model serves traffic in parallel but response is logged only, not returned. Compared offline-equivalent metrics over 24h. Result: caught a regression in long-tail item coverage that wouldn't have shown in MAP@10 alone. Held release.
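The shadow pattern in this story can be sketched as a thin wrapper around two model callables. Everything here is an assumption for illustration: the models are plain functions from user id to a list of item ids, the log is an in-memory list, and catalog coverage is used as a simple proxy for the long-tail check mentioned above.

```python
def serve_with_shadow(user_id, primary_model, shadow_model, shadow_log):
    """Return the primary model's answer; run the shadow model only to log it."""
    response = primary_model(user_id)
    try:
        shadow_log.append(
            {"user_id": user_id, "primary": response, "shadow": shadow_model(user_id)}
        )
    except Exception as exc:
        # A failing shadow model must never affect the user-facing response.
        shadow_log.append(
            {"user_id": user_id, "primary": response, "shadow_error": repr(exc)}
        )
    return response


def catalog_coverage(recommendation_lists, catalog_size):
    """Share of the catalog that appears in at least one served list."""
    served = {item for recs in recommendation_lists for item in recs}
    return len(served) / catalog_size
```

Comparing `catalog_coverage` over the logged primary and shadow lists would surface a model that ranks well on average but recommends a much narrower slice of the catalog.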

Production references — how grown-up systems do this

Netflix →

Netflix Tech Blog's writeups on their recommendation system are the gold standard — hybrid architecture, offline eval discipline, multi-objective ranking.

Spotify →

Spotify Engineering on Annoy, its open-source ANN library; read alongside FAISS (from Meta) as the canonical references for ANN at production scale.

MLflow →

MLflow's model registry docs — the simplest production-ready version of 'tracked experiments + versioned models'.

Self-review rubric (before you claim done)

Correctness

Offline MAP@10 beats the popularity baseline by ≥20%.
/recommend p95 latency < 30ms under realistic concurrency.
Cold-start items return non-empty recommendations.
Nightly eval job runs reproducibly and gates model promotion.

Code quality

Train + serve paths share a single embedding-loader module.
FAISS index is rebuilt by a versioned script committed to the repo.
All hyperparams via config file or env, never hard-coded.
Tests cover the feature-store reads + the ranking logic.

Testing

Offline eval suite committed + runnable in <5min.
Drift dashboard shows at least 7 days of historical data.
Integration test exercises the full serve path with synthetic data.

Docs

Architecture diagram at the top of the README.
Numbers in the README come from a single live training run (not 'made up for the resume').
Section: 'how would you extend this for multi-objective ranking (relevance + diversity + recency)?'
Shadow-deploy + A/B test design doc lives in /docs.

✱ AI code review

Get a senior-style review before you call it done

Push your finished work to GitHub, open a PR, paste the PR URL below. Claude reviews the diff against this project's rubric and replies with strengths, must-fix items, and one teachable principle.

Tick the rubric items honestly, write the README, push to GitHub, get the AI review above. Once it's clean, email support@learnpython.academy with the repo link — we feature the best ones on /success-stories.
