Skip to main content
all_articles
AI Engineering2026-05-16 · 14 min read

AI Engineering with Python in 2026: the stack production teams actually ship

If you opened a Python repo at any decent AI-using company today, here's what you'd find. Not what tutorials say, not what conference talks promise — what teams actually run in production. This is a 2026 snapshot.

The default backbone

▶ PYTHONeditable · runs in your browser

The Anthropic SDK is the default for serious work. claude-opus-4-6 is the workhorse model — Sonnet 4.6 for cheaper bulk, Haiku 4.5 for triage. Adaptive thinking is on for anything non-trivial; the model decides how much to think per request and you don't need to guess a budget.

What changed from 2024: budget_tokens is deprecated. Don't pass it on 4.6 models — adaptive thinking replaces it. If a tutorial tells you to set budget_tokens=10000, that tutorial is two years old.

Prompt caching: the easiest 90% cost cut

Any prefix used more than once in 5 minutes should be cached:

▶ PYTHONeditable · runs in your browser

Verify with resp.usage.cache_read_input_tokens > 0. If it's zero across repeated requests, you have a silent cache invalidator — usually a timestamp in the system prompt, dict-iteration order changing across runs, or a logger adding trailing whitespace. Audit by hashing the rendered (tools, system) blob across two consecutive identical requests; if the hashes differ you've found your bug.

The minimum cacheable prefix is ~1024 tokens. Below that, the cache silently doesn't engage.

Tool use: where most apps spend their time

Most AI apps in 2026 aren't chat apps. They're agents that call your tools to do real work. The shape:

▶ PYTHONeditable · runs in your browser

The SDK's tool runner handles the agent loop — calls the function, sends the result back, repeats until Claude is done. For maximum control you can write the loop manually, but for 90% of apps the runner is what you want.

Common mistake: treating tools like RPC. A tool's docstring is the prompt. Write it like a senior engineer explaining the API to a smart junior — what it does, what each param means, what edge cases return. Sloppy docstrings = Claude calling the wrong tool with the wrong args.

RAG: when retrieval helps, when it doesn't

The 2026 RAG recipe most teams converge on:

1. Chunk at 400 tokens, 80-token overlap. PDFs need layout-aware splitting; plain text doesn't.

2. Embed with voyage-3 via Anthropic Batches API (50% discount, 24h SLA — fine for offline indexing).

3. Store in pgvector — works at 10M chunks on a single Postgres, beats most dedicated vector DBs in ops simplicity.

4. Retrieve BM25 top-50 + dense top-50, fuse via Reciprocal Rank Fusion (RRF, k=60), rerank top-10 with a cross-encoder, return top-5 to the LLM.

5. MUST filter on tenant_id server-side if you're multi-tenant. One forgotten filter = one cross-tenant leak = breach.

When RAG hurts: if the answer needs cross-document synthesis, RAG retrieves chunks independently and the LLM can't see the relationships. For those cases, 1M context windows + prompt caching often beat RAG — and you don't get the "lost in the middle" problem because Claude 4.6 attention is good across the full window.

Files API: chat with your document

When the same user asks 10 questions about the same 50-page PDF, you don't want to re-upload it 10 times.

▶ PYTHONeditable · runs in your browser

Pairs perfectly with prompt caching — the document reference + system + glossary all become a single cacheable prefix. 5-question follow-up session: pay full price once, 10% on follow-ups.

Streaming: not optional past 1 second

If a request takes more than 1 second of perceived latency, stream it:

▶ PYTHONeditable · runs in your browser

Use get_final_message() to get the same shape as a non-streamed response — your downstream code doesn't need to branch on streaming vs not. The helper handles text + tool blocks interleaved correctly.

Server-Sent Events is the wire format. For voice agents, time-to-first-token is the perceived-quality metric. Streaming + prompt caching + low effort takes Claude TTFT from ~2s to <700ms — the gap between "feels alive" and "feels broken."

Effort + adaptive thinking: cost knob you can actually tune

▶ PYTHONeditable · runs in your browser

Effort levels: low / medium / high / max. Default is high. medium is often the sweet spot — quality usually unchanged, 30-50% fewer tokens. Run your eval suite against effort=high vs effort=medium; if scores match, ship medium and save the money.

max is Opus-only and reserved for correctness-critical paths — code review, legal doc analysis, escalated support.

Multi-agent: fan-out + fan-in

The pattern that comes up over and over:

▶ PYTHONeditable · runs in your browser

Use cases: process 500 customer-support tickets, summarize 50 source documents, write 100 product descriptions. The planner picks the structure (worth Sonnet's quality), Haiku workers do the parallelizable work (cheap, fast), Sonnet synthesizes (worth the quality again).

Guards: max_iterations on each worker, total max_tokens per request, timeout on the whole orchestration. One stuck worker shouldn't melt the budget.

Evals: the only thing that prevents quality drift

"We tweaked the prompt and it feels better" is how senior teams ship regressions. The fix is a versioned eval suite gated in CI:

1. Golden set — 50-300 labeled examples covering happy + adversarial cases.

2. Metric — exact match for classification, embedding similarity for paraphrase, LLM-as-judge for open-ended (calibrated against humans for a subset).

3. Run on every PR. Block merges that regress by >2% without explicit override.

4. Track over time. Score should be flat or rising; flat is fine.

Without this, prompts silently regress over months. Tutorials skip it because it's boring infrastructure. The teams that ship most reliably do nothing else differently — they just actually have evals.

Compaction: agents that run for hours

Beta feature on 4.6 models — set the compact-2026-01-12 beta header and the API summarizes earlier context when total tokens approach 150K. Critical: append the full response.content (not just the text string) back into your messages. Compaction blocks are opaque server-side markers; strip them and the next turn re-sends the uncompacted history and overflows context.

With compaction on, a research agent or coding agent can run for 500+ turns without you implementing manual summarization. Without it, you implement summarization yourself and it's never as good as Claude's.

What we left out (and why)

  • LangChain. Many teams have removed it. The SDK + tool runner does the same thing with less indirection.
  • OpenAI compatibility shims. Either you're an Anthropic shop or you're not. Mixed wrappers paper over real differences (Anthropic content blocks vs OpenAI strings, tool result shape, etc) and you find out the hard way.
  • Custom vector DBs. Pinecone, Weaviate, Qdrant are fine but pgvector is enough for most teams up to ~50M vectors and the ops story is "you already run Postgres."
  • Self-hosted models. Llama 4 is great. But the price/latency/quality/eval-ops total is hard to beat with Claude Sonnet at $3/$15.

The full stack, in one diagram

| Layer | Choice |

|---|---|

| Model routing | Haiku 4.5 (triage) → Sonnet 4.6 (bulk) → Opus 4.6 (hard) |

| Thinking | adaptive (no budget_tokens) |

| Effort | medium default, max for correctness-critical |

| Caching | cache_control at end of system prompt |

| Tools | SDK tool runner, well-documented signatures |

| RAG | pgvector + BM25 + RRF + cross-encoder rerank |

| Documents | Files API + prompt caching |

| Streaming | always, with get_final_message() |

| Multi-step | fan-out planner + Haiku workers + Sonnet synthesizer |

| Long sessions | compaction (beta) |

| Quality | versioned eval suite in CI |

| Offline jobs | Batches API (50% discount, 24h SLA) |

| Voice | STT + streaming Claude + TTS, <700ms TTFT |

Want to learn this end-to-end?

Our AI Engineering track covers 100 lessons across 6 modules — API fundamentals, tool use, RAG, agent loops, production AI, and the frontier topics above (vision, Files API, Skills API, Batches API, voice agents, multi-agent orchestration). Capstone is a multi-modal customer-support agent with all the production constraints in this article. First 15 lessons free, no signup.

cta_title

cta_body

cta_button

Get one Python lesson + one career idea every Friday

No spam, no "buy our course now". Three bullets, every Friday. Unsubscribe with one click.