::synthesis · Tim-Ferriss method

Local models (Ollama setup MED)

::minimum effective dose

Local models run on your own hardware — no API calls, no per-token bill, no data leaving the machine. Ollama is the easiest entry point: one installer, one command to pull a model, OpenAI-compatible API on localhost:11434 out of the box. The honest performance ceiling: a top-tier consumer Mac (M3/M4 Max with 64-128GB unified memory) or a workstation with a 24GB+ GPU can run Llama 3.3 70B, Qwen 2.5 72B, or DeepSeek V2 at usable speeds. These are roughly GPT-3.5-class to GPT-4-class on many tasks — good for drafting, classification, summarization, code completion, extraction. They are not frontier-equivalent on hard reasoning, long context, or novel problems. The MED setup: install Ollama, pull one 8B model (fast, runs on almost anything — try llama3.1:8b or qwen2.5:7b) AND one 70B model if your hardware allows. Wire it to an OpenAI-compatible client (continue.dev for code, Open WebUI for chat, your own scripts via the localhost API). The win is NOT 'local beats frontier' — it's 'local is good enough for the 70% of work that doesn't need frontier, at zero marginal cost, with full privacy, and with no rate limits.' The economic break-even on a $4K machine is usually 3-12 months of equivalent API spend for a heavy operator.

::DiSSS · deconstruction questions

01What's my actual hardware ceiling — GPU VRAM (for CUDA) or unified memory (for Apple Silicon)?
02Which model SIZE matches my hardware — 7B, 14B, 32B, 70B — and at what quantization (Q4, Q5, Q8)?
03What tasks am I willing to run at GPT-3.5-class quality if it's free, private, and unlimited?
04What's my fallback when local can't handle it — and is the handoff seamless or jarring?
05Am I tracking my measured tokens-per-second so I know when to upgrade hardware?

::fear-setting

Cost of not learning this: you'll keep paying retail API rates for tasks that a local 8B model handles fine — classification, summarization, formatting, extraction at volume. You'll also be permanently dependent on internet, provider uptime, and corporate ToS for tasks that should be sovereign. Cost of getting it wrong: most operators get the hardware fit wrong on the first purchase and either over-buy (a $6K GPU rig sitting idle 23 hours a day) or under-buy (a machine that can't run anything above 7B and slowly). The second failure mode is worse: setting expectations on a tiny model, concluding 'local models are bad,' and never trying the size that would actually work. Local is not magic. Hardware matters, model size matters, and the gap between 7B and 70B is enormous.

::80 / 20 cut

SKIP: training your own models, complex MoE setups, exotic quantization debates, the latest research model that just dropped on Hugging Face. OBSESS OVER: (1) getting ONE good local model running and used daily, (2) wiring it into your editor, your shell, and your scripts via the OpenAI-compatible API so it's frictionless, (3) building the habit of asking local first and escalating to frontier only when local fails. The habit is the product.

::tribe of mentors · paraphrased stances

Jeffrey Morgan

Co-creator of Ollama, made local model deployment one command for normal humans

Jeffrey's stance: the bar for 'I can run this on my laptop' has dropped dramatically. Most operators don't realize their existing hardware can already run a model that handles 70% of their tasks; the friction was tooling, and the tooling is now solved.

Georgi Gerganov

Created llama.cpp, the inference engine that made consumer-hardware LLMs possible

Georgi's stance: quantization is the most under-appreciated lever. A 70B model at Q4 fits in 40GB and runs surprisingly fast on consumer hardware; the quality loss vs Q8 is small for most tasks. Most operators should be running larger models at lower quantization, not smaller models at full precision.

Eric Hartford

Maintains a widely-used set of uncensored fine-tunes (Dolphin series), deep practitioner on consumer hardware

Eric's stance: privacy and uncensored exploration are the two real motivators for local models. If neither applies to your workload, you're probably paying a hardware tax for no return. Be honest about why you want local before buying the GPU.

Simon Willison

Runs and reviews local models constantly on consumer hardware, publishes honest performance notes

Willison's stance: local models are now genuinely useful for a large class of tasks, but the expectation-setting is the hard part. Frame them as 'free GPT-3.5' and you'll be delighted; frame them as 'GPT-4 replacement' and you'll be disappointed.

::real-world test · this week

This week: install Ollama, pull llama3.1:8b (about 5GB, runs on most modern laptops). Run `ollama run llama3.1:8b` and have a five-minute conversation about something you'd normally ask Claude or GPT. Then take one specific task you do regularly — summarize a meeting transcript, draft an email, classify some inputs — and run it against the local model. Score the output honestly. If it's adequate, that's a task you can move local permanently. If it's not, you've established the floor and you know to escalate.

::action items · ranked

01Install Ollama and pull one 7-8B model today (15 minutes, no hardware decision required)
02Wire the localhost endpoint into one tool you use daily — editor plugin, shell function, or chat UI
03Run a one-week eval where every prompt goes to local first; track which tasks it handles and which escalate
04Calculate your monthly API spend and divide by the cost of a hardware upgrade — your break-even is your decision
05Pick ONE high-volume privacy-sensitive task (PII extraction, internal document summary) and move it local