::synthesis · Tim-Ferriss method

Multi-LLM routing in practice

::minimum effective dose

Multi-LLM routing is the practice of sending different tasks to different models instead of using one model for everything: Claude for long-context work and writing, GPT for general reasoning and tools, Gemini for cheap bulk work and vision, local models for private and high-volume work. The operational model: think of LLMs as compute primitives. Haiku/Flash/Mini for classification, routing, and simple extraction (~$0.25-1/M tokens). Sonnet/4o for the working day's reasoning, drafting, and code (~$3-15/M). Opus/o1/Pro for hard reasoning, novel research, and complex multi-step work (~$15-75/M). Routing logic doesn't need to be sophisticated to win. Three rules cover 90% of the wins: (1) Classification and 'is this important?' checks go to the cheapest model. (2) Default daily work goes to the mid-tier. (3) Escalate to the frontier model only when the mid-tier has failed or the stakes are high. The honest reality: there is NO model that is best at everything. Claude is better at structured long-form writing and instruction-following. GPT is better at tool use and broad tasks. Gemini is better at native multimodal work and price/performance at scale. Local models are better for privacy and offline use. Operators who pick a single model become advocates for that model and never see the gain they're leaving on the table. The gain from routing is typically a 3-5x cost reduction at the same quality, or a significant quality jump at the same cost.
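
The three rules above can be sketched as a plain function with no ML involved. This is a minimal illustration, not a reference implementation: the `Task` fields, the `CHEAP_TASKS` labels, and the choice to check the escalation rule first (so a high-stakes classification still goes to the frontier tier) are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap"        # Haiku / Flash / Mini class
    MID = "mid"            # Sonnet / 4o class
    FRONTIER = "frontier"  # Opus / o1 / Pro class


# Hypothetical task labels that count as "classification-like" work.
CHEAP_TASKS = {"classification", "triage", "routing", "simple_extraction"}


@dataclass
class Task:
    kind: str                      # hypothetical label assigned by the caller
    high_stakes: bool = False      # set by the caller, not inferred
    mid_tier_failed: bool = False  # set after a failed mid-tier attempt


def pick_tier(task: Task) -> Tier:
    # Rule 3: escalate only when the mid-tier failed or stakes are high.
    # Checked first so that escalation overrides the cheap default (an assumption).
    if task.high_stakes or task.mid_tier_failed:
        return Tier.FRONTIER
    # Rule 1: classification-style work goes to the cheapest tier.
    if task.kind in CHEAP_TASKS:
        return Tier.CHEAP
    # Rule 2: everything else defaults to the mid-tier.
    return Tier.MID
```

For instance, `pick_tier(Task("drafting"))` returns `Tier.MID`, and the same task with `mid_tier_failed=True` returns `Tier.FRONTIER`.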

::DiSSS · deconstruction questions

01 What are the three most common tasks I send to LLMs in a week, and is each going to the right tier?
02 Do I have a fallback when my primary provider has an outage, or does my whole workflow stop?
03 What's the routing decision: file type, task type, stakes, latency, cost ceiling? (Pick a primary axis.)
04 How do I evaluate output quality cross-model without bias toward the one I'm used to?
05 What's my cost-per-task if I right-routed, and how far off am I from that?
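
For question 02, one possible shape of a provider fallback is an ordered list of callables that is tried in turn. The sketch below is an assumption-laden illustration: providers are modelled as plain `prompt -> text` functions, and `primary_down` and `backup_ok` are toy stand-ins rather than real SDK calls.

```python
from typing import Callable, List, Tuple

# A provider is modelled as a plain callable: prompt in, text out (an assumption).
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an error."""


def complete_with_fallback(
    prompt: str, providers: List[Tuple[str, Provider]]
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, output) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # deliberately broad in this sketch
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Toy stand-ins used only to exercise the chain.
def primary_down(prompt: str) -> str:
    raise ConnectionError("simulated outage")


def backup_ok(prompt: str) -> str:
    return f"backup answer to: {prompt}"


name, text = complete_with_fallback(
    "hi", [("primary", primary_down), ("backup", backup_ok)]
)
```

Here the simulated outage on the first provider is swallowed and the call is answered by the second. A production version would probably catch narrower exception types and add timeouts.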

::fear-setting

Cost of not learning this: you're either overpaying (running Opus for tasks Haiku handles) or under-delivering (running Haiku for tasks that need real reasoning). You're also single-provider-fragile: when Claude has an outage, your whole stack stops; when OpenAI has an outage, your whole stack stops. Cost of getting it wrong: when you wire model identity deep into your prompts or your product, switching gets painful. A workflow tuned for Claude's structure can break on GPT and vice versa. The fix is to build against model-agnostic interfaces from day one (LiteLLM, OpenRouter, or your own abstraction layer) so you can A/B test models in production and switch when a better one ships next quarter. A better one always ships next quarter.
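
A home-grown abstraction layer can be as small as one interface plus one config mapping. The following is a minimal sketch under stated assumptions: `ChatModel`, `EchoModel`, `REGISTRY`, `CONFIG`, and the `provider_a/mid` style keys are hypothetical names invented for the example, and the adapter only echoes its input where a real one would wrap a provider SDK or a gateway such as LiteLLM or OpenRouter.

```python
from typing import Dict, Protocol


class ChatModel(Protocol):
    """The only surface workflows are allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in adapter; a real one would wrap a provider SDK or gateway call."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


# Hypothetical registry keys, not real model identifiers.
REGISTRY: Dict[str, ChatModel] = {
    "provider_a/mid": EchoModel("provider_a/mid"),
    "provider_b/mid": EchoModel("provider_b/mid"),
}

# The single place where a workflow is bound to a model.
CONFIG: Dict[str, str] = {"summarize_ticket": "provider_a/mid"}


def run(workflow: str, prompt: str, config: Dict[str, str] = CONFIG) -> str:
    """Look up the model for this workflow and call it through the interface."""
    model = REGISTRY[config[workflow]]
    return model.complete(prompt)
```

Swapping providers for `summarize_ticket` then means changing one value in `CONFIG`, or passing an alternate config to `run` for an A/B test. No workflow code mentions a model by name.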

::80 / 20 cut

SKIP: building your own router with embeddings and classification models. Premature. The three-tier rule (cheap/mid/frontier) covers 90% of routing wins without ML. OBSESS OVER: (1) actually trying every frontier model for two weeks on real work, not benchmarks, (2) maintaining a swap-ready abstraction layer (OpenRouter, LiteLLM, or your own) so model choice is one config change, (3) tracking cost-per-task across models for your specific workload — published benchmarks won't predict your reality.

::tribe of mentors · paraphrased stances

Alex Atallah

Co-founder of OpenRouter, runs the largest multi-model routing infrastructure

Alex's stance: model performance per dollar varies wildly by task type. Operators who route by task class see 5-10x cost wins. Operators who pick one model and stick with it are leaving money and quality on the table — usually both.

Hamel Husain

Builds production LLM systems, writes practitioner essays on evaluation and routing

Hamel's stance: don't trust public benchmarks for your routing decisions. Build a 50-case eval on YOUR tasks and run all three frontier models against it. The right model for your workload is rarely the model topping the leaderboard.
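
A workload-specific eval of this kind can start very small. The sketch below is one possible shape, not a method taken from the source: `Case`, `pass_rates`, and the two toy "models" are hypothetical, and a real run would substitute actual model calls and around 50 cases drawn from your own tasks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    prompt: str
    passes: Callable[[str], bool]  # task-specific check applied to the output


def pass_rates(
    cases: List[Case], models: Dict[str, Callable[[str], str]]
) -> Dict[str, float]:
    """Run every case through every model and return the fraction passed per model."""
    rates = {}
    for name, call in models.items():
        passed = sum(1 for case in cases if case.passes(call(case.prompt)))
        rates[name] = passed / len(cases)
    return rates


# Toy cases and toy "models" used only to exercise the harness.
cases = [
    Case("Is 'refund please' a billing request? yes/no",
         lambda out: out.strip().lower() == "yes"),
    Case("Is 'hello' a billing request? yes/no",
         lambda out: out.strip().lower() == "no"),
]
models = {
    "always_yes": lambda prompt: "yes",
    "keyword": lambda prompt: "yes" if "refund" in prompt else "no",
}

rates = pass_rates(cases, models)  # {'always_yes': 0.5, 'keyword': 1.0}
```

The useful part is the per-case `passes` check: it forces a definition of "good enough" for each task before any model name enters the picture.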

Andrej Karpathy

Founding member of OpenAI, ex-Tesla AI director, deeply technical practitioner

Andrej's stance: treat LLMs as a new kind of computer with multiple CPU options. You wouldn't run web servers on the same hardware as ML training; you shouldn't run classification on the same model as deep reasoning.

Simon Willison

Publishes real comparative reviews of every major model release, no provider affiliation

Simon's stance: the leaderboards are noisy. The thing that matters is whether the model does YOUR job well at a price you can afford. He maintains a personal cheat sheet of 'this model for this task' and updates it monthly.

::real-world test · this week

This week: pick five tasks you ran in the last seven days (one classification, one extraction, one drafting, one code, one reasoning). Run each through three models: the cheapest of one provider, the mid-tier of another, the frontier of a third. Score outputs blind (have someone else hide the model names). Calculate cost-per-task. You'll typically find one or two tasks where you've been overpaying 5-20x, and one where you've been underpaying (using a model too small for the job). That's the routing map.
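
The two mechanical parts of this test, blinding and cost-per-task, can be sketched in a few lines. Everything here is illustrative: `cost_per_task` assumes per-million-token pricing with separate input and output rates, `blind` is a hypothetical helper, and the prices in the usage note are example numbers rather than quotes from any provider.

```python
import random
from typing import Dict, Optional, Tuple


def cost_per_task(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost in USD of one task, given token counts and per-million-token prices."""
    return (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000


def blind(
    outputs: Dict[str, str], seed: Optional[int] = None
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Shuffle outputs and relabel them A, B, C, ...

    Returns (label -> output) for the scorer and (label -> model name) as the
    answer key, which stays hidden until scoring is finished.
    """
    names = list(outputs)
    random.Random(seed).shuffle(names)
    labels = [chr(ord("A") + i) for i in range(len(names))]
    shown = {label: outputs[name] for label, name in zip(labels, names)}
    key = dict(zip(labels, names))
    return shown, key
```

As a worked example, a task with 2,000 input tokens and 500 output tokens at an illustrative $3/M input and $15/M output costs (2,000 × 3 + 500 × 15) / 1,000,000 = $0.0135. Running the same token counts through each candidate's prices gives the per-task cost column of the routing map.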

::action items · ranked

01 Sign up for OpenRouter or set up LiteLLM and route ONE workflow through it this week; this proves the abstraction works
02 Build a five-task cross-model eval and run it once a quarter; this is your routing intelligence
03 Tier your existing workflows into cheap/mid/frontier based on actual quality requirement, not habit
04 Document the swap procedure: what would it take to move from Claude to GPT to Gemini tomorrow if needed
05 Track cost-per-task per workflow weekly; the moment you see a tier mismatch, route differently