L34 · Operator~20 min · free · cc-by 4.0

Vision models: when to use them

Vision lets the model see images · powerful for screenshots and diagrams · weak for precise spatial work · know the line.

::TL;DR · the whole lesson in three lines

MOVEVision lets the model see images · powerful for screenshots and diagrams · weak for precise spatial work · know the line.
DRILLYou will run the same task three ways · text-only, image-only, and text-plus-image · to feel which combination wins for your real work.
WINYou have three responses to compare honestly.

jump to drill ↓or read the full concept first →

::concept · what's actually happening

Vision-capable models can take an image as input alongside text and reason about its contents · they read screenshots, interpret diagrams, describe photos, transcribe handwriting, and find UI elements. The capability changed what 'paste this in' can mean.

read full concept · 4 more paragraphs →

The strongest use cases are descriptive and interpretive · 'what does this error message say,' 'summarize this whiteboard photo,' 'is the layout broken in this screenshot,' 'extract the table from this scanned page.' The model reads the image like a literate person scanning a document.

The weakest use cases are precise spatial · 'click exactly here,' 'measure this distance in pixels,' 'count exactly how many widgets are in this photo.' Vision models hallucinate coordinates and miscount past a handful. They are interpreters, not measurement tools.

Mixed text-and-image prompting unlocks workflows that pure text cannot · debugging UI bugs by sharing a screenshot plus describing expected behavior, reviewing design mockups, reading dashboards. The model treats the image as additional context, not as a replacement for instructions.

Privacy gets tricky fast · screenshots often contain incidental PII (email previews, names, account balances), and people upload them without scrubbing. Vision raises the stakes of 'what did I just send?' considerably.

::drill · do the thing

You will run the same task three ways · text-only, image-only, and text-plus-image · to feel which combination wins for your real work.

::L34 drill · copy-paste into any AI chat

I am calibrating when to use vision input. Here is a real task I do: [DESCRIBE THE TASK · e.g. 'debugging why a webpage looks wrong,' 'extracting data from a chart in a PDF']. I will try this three ways and want your honest verdict each time: 1) text-only · I describe the situation in words, you respond. 2) image-only · I paste a screenshot with no description, you respond. 3) text-plus-image · I paste the screenshot AND describe what I want, you respond. For each round, tell me what you can and cannot see clearly, and what you would need from me to do better. After all three, give a one-paragraph verdict on which mode wins for tasks of this shape.

I am calibrating when to use vision input. Here is a real task I do: [DESCRIBE THE TASK · e.g. 'debugging why a webpage looks wrong,' 'extracting data from a chart in a PDF']. I will try this three ways and want your honest verdict each time: 1) text-only · I describe the situation in words, you respond. 2) image-only · I paste a screenshot with no description, you respond. 3) text-plus-image · I paste the screenshot AND describe what I want, you respond. For each round, tell me what you can and cannot see clearly, and what you would need from me to do better. After all three, give a one-paragraph verdict on which mode wins for tasks of this shape.

::or open one in a new tab — then paste

Claude↗ChatGPT↗Gemini↗

::steps

01Pick a real task where vision might help (UI debugging, chart reading, layout review).
02Run round 1 · text-only description.
03Run round 2 · paste screenshot with no description.
04Run round 3 · screenshot plus your description together.
05Read the three responses side by side · note quality and effort.
06Write down which mode you will default to for this task shape.

::outcome · what should be true

You have three responses to compare honestly.
You can name two task types where vision wins and one where it loses.
You have a default mode picked for the task you tested.
You scrubbed PII from the screenshot before sharing (or noticed you didn't).

::trap · the most common failure

Operators paste screenshots like reflex and skip describing what they want · the model then guesses what is wrong with the image, often correctly but sometimes spectacularly wrong. Vision is a context channel, not a mind-reading channel.

::end of the curriculum

You're at Pilot level. There's no Level 6.

The next move is doing the work, not another lesson. If you want operator-grade infrastructure, that's /orangebox. If you want the lab's working journal, /founders-view. If you want to collaborate on the curriculum itself, the source is public on GitHub.

::other lessons at Operator level

L10~30 min

← back to /learn full lesson library →

Vision models: when to use them

You're at Pilot level. There's no Level 6.

Local AI · Ollama — privacy, offline, and the limit of free

Model routing — switching between Claude, GPT, Gemini mid-task

MCP servers — the plug socket that turned AI into a real tool

Agent mode — when AI takes action, not just answers

Computer use — when AI takes the mouse and keyboard

What AI cannot replace — taste, judgment, relationships

Agents 101: model plus tools plus loop

MCP: structured tools for AI

Skill primers: teach a session your context in 30 seconds

Local models with Ollama

Audio and Whisper transcription

RAG vs long context: when to retrieve, when to dump

Embeddings: meaning as numbers

Fine-tuning vs prompt engineering

AI safety in personal use

Multimodal prompting: combining text, image, audio

Chain-of-thought: making the model show its work

Tool use and structured output

Cost optimization: tokens, caching, model selection