Task
Build `eval_run(prompts, expected, model)` β the smallest eval harness that's still useful.
1. Mock the model: `mock(i, p) = expected[i] if i % 2 == 0 else "WRONG"`. (Real eval would call the SDK here; we mock so it's deterministic.)
2. For each prompt, compare model output to `expected[i]`. On miss, append `(prompt, expected, got)` to `failures`.
3. Return `{"pass_rate": <float>, "failures": [...], "model": model}`.
The harness runs 4 prompts with even indices passing and odd indices failing; you should see pass_rate=0.5 and two failure tuples.