::synthesis · Tim-Ferriss method
Vision models (when they help vs distract)
::minimum effective dose
Vision models (Claude with vision, GPT-4o, Gemini multimodal) accept images as input and reason about their contents. The capability is real: OCR (especially of handwritten and structured documents), chart interpretation, screenshot understanding, document layout extraction, UI screenshot debugging, accessibility descriptions, visual QA on consumer products, and medical imaging assistance (within strict regulatory limits).
The MED of when they HELP is any task that is faster for a human to show than to describe: 'what's this error message in my screenshot,' 'extract data from this scanned receipt,' 'is this UI accessible,' 'what's in this chart and what's the trend.'
The MED of when they DISTRACT is any task where the text representation is already cheap and reliable: pasting a screenshot of code when the text was right there, asking the model to count items in an image when a one-line script would count them deterministically, describing complex diagrams when a sketch-to-Mermaid prompt would generate the source.
The cost reality: vision input is typically 5-10x more expensive per unit of useful information than text input. Depending on provider and resolution, a 1024x1024 image is billed at roughly 1,000-2,000 input tokens, yet for most prompts it conveys only 50-200 tokens' worth of useful information. Use vision when the image IS the input (you don't have the data in text form). Don't use vision when text would have been equivalent and up to 10x cheaper.
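The cost arithmetic above can be sketched as a quick estimator. This is a minimal sketch, not any provider's billing formula: the `pixels_per_token=750` divisor is an assumption based on one provider's published rule of thumb (other providers use tile-based schemes), the 4-characters-per-token ratio is a common heuristic, and all function names are hypothetical.

```python
import math


def estimate_image_tokens(width_px: int, height_px: int,
                          pixels_per_token: float = 750.0) -> int:
    """Rough image token estimate: pixel area / assumed pixels per token."""
    return math.ceil(width_px * height_px / pixels_per_token)


def estimate_text_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough text token estimate using a characters-per-token heuristic."""
    return max(1, math.ceil(len(text) / chars_per_token))


def vision_cost_multiplier(width_px: int, height_px: int,
                           equivalent_text: str) -> float:
    """How many times more input tokens the image costs than the same info as text."""
    image_tokens = estimate_image_tokens(width_px, height_px)
    text_tokens = estimate_text_tokens(equivalent_text)
    return image_tokens / text_tokens


if __name__ == "__main__":
    # A 1024x1024 screenshot whose useful content is ~400 characters of text.
    print(estimate_image_tokens(1024, 1024))
    print(round(vision_cost_multiplier(1024, 1024, "x" * 400), 1))
```

Swap in your provider's real token counts from API usage metadata when you have them; the estimator is only for deciding whether a call is worth a closer look.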
::DiSSS · deconstruction questions
- 01 Is the information in this image AVAILABLE as text elsewhere, and would using the text be cheaper and more reliable?
- 02 Is this a vision-native task (chart, screenshot, scan, handwriting) or a text-native task in image clothing?
- 03 What's my cost-per-image vs cost-per-equivalent-text, and am I tracking it?
- 04 Where does the vision model fail on my workload: small text, hand-drawn diagrams, charts with overlapping labels?
- 05 Do I need deterministic output (count, measure, classify) where a script would beat a vision model?
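The questions above can be turned into a small routing check that runs before any image is sent. This is one possible sketch; the `Task` fields, the route labels, and the ordering of the checks are all illustrative assumptions, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    text_available: bool              # the same information exists as text somewhere
    vision_native: bool               # chart, screenshot, scan, handwriting
    needs_deterministic_output: bool  # count, measure, classify


def choose_route(task: Task) -> str:
    """Pick the cheapest input route that can still answer the task."""
    if task.text_available:
        # Text exists: never pay for pixels. Use a script if the answer must be exact.
        return "script" if task.needs_deterministic_output else "text"
    if task.vision_native:
        # Only the image exists: use vision, but verify anything used as ground truth.
        return "vision+verify" if task.needs_deterministic_output else "vision"
    # Neither text nor a vision-native task: a human should look at this one.
    return "review"


if __name__ == "__main__":
    print(choose_route(Task(text_available=True, vision_native=False,
                            needs_deterministic_output=True)))
    print(choose_route(Task(text_available=False, vision_native=True,
                            needs_deterministic_output=False)))
```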
::fear-setting
Cost of not learning this: you'll either ignore vision entirely and miss the OCR/chart/screenshot wins, or overuse vision and pay 5-10x for tasks that text would have handled fine. The overuse mode is much more common. Operators see the demo of 'paste a screenshot and ask anything' and start putting images everywhere — in agent loops that re-screenshot the same UI every step, in pipelines that screenshot text instead of selecting text, in workflows that ask the model to count when a script would count. Cost of getting it wrong: silent vision errors. Vision models hallucinate confidently — they'll tell you a chart shows a trend that's not there, miscount items in an image, misread a number in a receipt. These errors aren't flagged because the model presents them with the same confidence as correct answers. Vision is great for assistance; it's not (yet) great for ground-truth measurement.
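For the agent-loop case, where the same UI gets re-screenshotted every step, one cheap mitigation is to skip the vision call when the screenshot has not changed. The sketch below is an assumption-laden illustration: `ScreenshotCache` is a hypothetical helper, and `describe` stands in for whatever function actually calls your vision model.

```python
import hashlib
from typing import Callable, Dict


class ScreenshotCache:
    """Reuse the previous answer when a screenshot is byte-identical to one already seen."""

    def __init__(self, describe: Callable[[bytes], str]) -> None:
        self._describe = describe          # the (expensive) vision call
        self._cache: Dict[str, str] = {}   # sha256 hex digest -> description
        self.vision_calls = 0              # how many times we actually paid for vision

    def describe(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self.vision_calls += 1
            self._cache[key] = self._describe(image_bytes)
        return self._cache[key]


if __name__ == "__main__":
    # Stub in place of a real vision call, for demonstration only.
    cache = ScreenshotCache(lambda img: f"{len(img)} bytes described")
    for frame in (b"login-screen", b"login-screen", b"dashboard"):
        cache.describe(frame)
    print(cache.vision_calls)  # 2: the repeated frame was served from the cache
```

Exact hashing only catches byte-identical frames; a blinking cursor or a clock in the corner defeats it, so treat this as a floor on savings rather than a complete fix.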
::80 / 20 cut
SKIP: complex multimodal embedding pipelines, exotic vision-language model architectures, image generation for tasks that don't need it. OBSESS OVER: (1) text-first defaults — if you have text, use text, (2) cost tracking per image — vision is a 5-10x cost multiplier most operators don't track, (3) vision for vision-native tasks (OCR, charts, screenshots, scans, accessibility), nothing else.
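Cost tracking per image can start as a few lines of bookkeeping. This is a minimal sketch under stated assumptions: `ModalityCostTracker` is a hypothetical class, the price passed in is a placeholder rather than any provider's real rate, and the token counts are whatever your API responses report.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ModalityCostTracker:
    """Keep image-call and text-call input costs in separate buckets."""

    price_per_million_input_tokens: float  # placeholder price, set from your own bill
    _tokens: dict = field(default_factory=lambda: defaultdict(int))
    _calls: dict = field(default_factory=lambda: defaultdict(int))

    def record(self, modality: str, input_tokens: int) -> None:
        if modality not in ("text", "image"):
            raise ValueError(f"unknown modality: {modality!r}")
        self._tokens[modality] += input_tokens
        self._calls[modality] += 1

    def cost_per_call(self, modality: str) -> float:
        calls = self._calls[modality]
        if calls == 0:
            return 0.0
        total = self._tokens[modality] / 1_000_000 * self.price_per_million_input_tokens
        return total / calls

    def multiplier(self) -> float:
        """Average image-call cost divided by average text-call cost."""
        text_cost = self.cost_per_call("text")
        if text_cost == 0:
            raise ValueError("no text calls recorded yet")
        return self.cost_per_call("image") / text_cost


if __name__ == "__main__":
    tracker = ModalityCostTracker(price_per_million_input_tokens=3.0)
    tracker.record("text", 200)
    tracker.record("text", 400)
    tracker.record("image", 1400)
    tracker.record("image", 1600)
    print(round(tracker.multiplier(), 2))
```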
::tribe of mentors · paraphrased stances
Anthropic vision team
Builds and documents Claude's vision capabilities, with honest capability boundaries
Anthropic's stance: vision works well for OCR, chart understanding, and screenshot interpretation; it works less well for precise counting, fine-grained spatial reasoning, and small-text reading. Match the task to the strength.
Andrej Karpathy
Built and demonstrated multimodal models at scale, deeply technical practitioner
Karpathy's stance: vision is now a first-class input modality, not a novelty. Operators should think in terms of 'what's the cheapest input modality for this information' and choose accordingly: text when text exists, vision when only an image exists.
Simon Willison
Built the llm command-line tool (which accepts image attachments), publishes practical reviews of vision model capabilities and failures
Willison's stance: vision is genuinely useful for the right tasks (OCR, document parsing, accessibility) and a trap for the wrong tasks (counting, precise measurement, replacing text input). Know which side of the line you're on before you bill the call.
Latent Space podcast hosts
Interview vision model practitioners across providers, surface honest production stories
Latent Space stance: vision in production is dominated by document-processing use cases (invoices, receipts, forms, contracts). Most other use cases get demoed loudly and shipped quietly. Follow the production money, not the demo hype.
::real-world test · this week
This week: identify one workflow where you currently paste a screenshot into an LLM. Ask: does this image contain information that exists somewhere as text? If yes, switch to text (copy-paste, OCR upstream, API call) and re-run. Measure the cost difference and the quality difference. As a rough rule of thumb, in 60-80% of operator cases the text equivalent is cheaper and more reliable. The other 20-40% is legitimate vision territory, and you've just identified it precisely.
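The measurement step can be as simple as comparing the two runs against a handful of fields you checked by hand. The sketch below assumes each run is a dict with an `input_tokens` count and a `fields` dict of extracted values; that shape, the function names, and the sample values are all made up for illustration.

```python
def field_accuracy(predicted: dict, expected: dict) -> float:
    """Fraction of hand-checked fields that the run reproduced exactly."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)


def compare_runs(text_run: dict, vision_run: dict, expected: dict) -> dict:
    """Summarize the cost difference and the quality difference between two runs."""
    return {
        "token_ratio": vision_run["input_tokens"] / text_run["input_tokens"],
        "text_accuracy": field_accuracy(text_run["fields"], expected),
        "vision_accuracy": field_accuracy(vision_run["fields"], expected),
    }


if __name__ == "__main__":
    # Made-up sample: the vision run misreads one digit of the total.
    expected = {"total": "42.10", "date": "2024-03-01"}
    text_run = {"input_tokens": 150,
                "fields": {"total": "42.10", "date": "2024-03-01"}}
    vision_run = {"input_tokens": 1400,
                  "fields": {"total": "42.70", "date": "2024-03-01"}}
    print(compare_runs(text_run, vision_run, expected))
```

Exact-match scoring is deliberately strict: it surfaces the silent single-digit misreads that a looser similarity score would hide.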
::action items · ranked
- 01 Audit every vision call you make this week: is the information available as text? If yes, switch.
- 02 Track cost-per-image separately from cost-per-text-call so the multiplier is visible.
- 03 Reserve vision for the vision-native tasks: OCR, charts, screenshots, scans, and accessibility.
- 04 For counting, measuring, and classifying, use deterministic scripts instead of vision model interpretation.
- 05 Build one vision-first workflow (receipt OCR, screenshot debugging, chart extraction) to feel the genuine win.