Vision models: when to use them
Vision lets the model see images · powerful for screenshots and diagrams · weak for precise spatial work · know the line.
::TL;DR · the whole lesson in three lines
- MOVEVision lets the model see images · powerful for screenshots and diagrams · weak for precise spatial work · know the line.
- DRILLYou will run the same task three ways · text-only, image-only, and text-plus-image · to feel which combination wins for your real work.
- WINYou have three responses to compare honestly.
::concept · what's actually happening
Vision-capable models can take an image as input alongside text and reason about its contents · they read screenshots, interpret diagrams, describe photos, transcribe handwriting, and find UI elements. The capability changed what 'paste this in' can mean.
read full concept · 4 more paragraphs →collapse concept ↑
The strongest use cases are descriptive and interpretive · 'what does this error message say,' 'summarize this whiteboard photo,' 'is the layout broken in this screenshot,' 'extract the table from this scanned page.' The model reads the image like a literate person scanning a document.
The weakest use cases are precise spatial · 'click exactly here,' 'measure this distance in pixels,' 'count exactly how many widgets are in this photo.' Vision models hallucinate coordinates and miscount past a handful. They are interpreters, not measurement tools.
Mixed text-and-image prompting unlocks workflows that pure text cannot · debugging UI bugs by sharing a screenshot plus describing expected behavior, reviewing design mockups, reading dashboards. The model treats the image as additional context, not as a replacement for instructions.
Privacy gets tricky fast · screenshots often contain incidental PII (email previews, names, account balances), and people upload them without scrubbing. Vision raises the stakes of 'what did I just send?' considerably.
::drill · do the thing
You will run the same task three ways · text-only, image-only, and text-plus-image · to feel which combination wins for your real work.
::L34 drill · copy-paste into any AI chat
I am calibrating when to use vision input. Here is a real task I do: [DESCRIBE THE TASK · e.g. 'debugging why a webpage looks wrong,' 'extracting data from a chart in a PDF']. I will try this three ways and want your honest verdict each time: 1) text-only · I describe the situation in words, you respond. 2) image-only · I paste a screenshot with no description, you respond. 3) text-plus-image · I paste the screenshot AND describe what I want, you respond. For each round, tell me what you can and cannot see clearly, and what you would need from me to do better. After all three, give a one-paragraph verdict on which mode wins for tasks of this shape.
::steps
- 01Pick a real task where vision might help (UI debugging, chart reading, layout review).
- 02Run round 1 · text-only description.
- 03Run round 2 · paste screenshot with no description.
- 04Run round 3 · screenshot plus your description together.
- 05Read the three responses side by side · note quality and effort.
- 06Write down which mode you will default to for this task shape.
::outcome · what should be true
- You have three responses to compare honestly.
- You can name two task types where vision wins and one where it loses.
- You have a default mode picked for the task you tested.
- You scrubbed PII from the screenshot before sharing (or noticed you didn't).
::trap · the most common failure
Operators paste screenshots like reflex and skip describing what they want · the model then guesses what is wrong with the image, often correctly but sometimes spectacularly wrong. Vision is a context channel, not a mind-reading channel.
::end of the curriculum
You're at Pilot level. There's no Level 6.
The next move is doing the work, not another lesson. If you want operator-grade infrastructure, that's /orangebox. If you want the lab's working journal, /founders-view. If you want to collaborate on the curriculum itself, the source is public on GitHub.
::other lessons at Operator level
Local AI · Ollama — privacy, offline, and the limit of free
At Operator level you need an honest opinion about local-only AI. Even if you don't use it daily, you should have run it once.
Model routing — switching between Claude, GPT, Gemini mid-task
Operators don't pick one AI. They route each task to the model that does it best. Knowing the strengths is the skill.
MCP servers — the plug socket that turned AI into a real tool
Model Context Protocol is the standard plug. Knowing what plugs in changes what your AI can actually touch — your files, your inbox, your calendar, your repos.
Agent mode — when AI takes action, not just answers
The frontier of useful AI is agents that DO things — browse, click, file, send. The actual skill is the safety pattern, not the magic.
Computer use — when AI takes the mouse and keyboard
Claude in Chrome, ChatGPT Atlas, computer-use beta — the frontier is AI that drives your browser like a human. Knowing the safety pattern is the actual skill.
What AI cannot replace — taste, judgment, relationships
The operators winning in 2026 are the ones who learned what AI is for and what is theirs. Knowing the line is more valuable than any prompt.
Agents 101: model plus tools plus loop
An agent is a model with tools running in a loop until done · know when you need one and when you don't.
MCP: structured tools for AI
Model Context Protocol is the USB-C of AI tooling · learn the shape before you wire anything.
Skill primers: teach a session your context in 30 seconds
A skill is a reusable file that primes a fresh AI session with your project, voice, and rules · stop re-explaining yourself.
Local models with Ollama
Run Llama, Qwen, or Mistral on your own laptop · no API, no logs, no monthly bill for the work that should stay home.
Audio and Whisper transcription
Whisper turns audio into text · meetings, voice memos, interviews · the AI-era replacement for note-taking.
RAG vs long context: when to retrieve, when to dump
RAG fetches the right slice of your data at query time · long context stuffs everything in · know which problem you actually have.
Embeddings: meaning as numbers
An embedding is a list of numbers that captures the meaning of text · learn the shape and you unlock semantic search, deduplication, and clustering.
Fine-tuning vs prompt engineering
For individuals, fine-tuning is almost never worth it · know exactly when it actually is.
AI safety in personal use
PII, NDAs, financial data, and other people's secrets · know the rules of what you do not paste.
Multimodal prompting: combining text, image, audio
The strongest prompts use the medium that fits the question · sometimes you describe, sometimes you show, sometimes you do both.
Chain-of-thought: making the model show its work
Asking the model to reason step-by-step before answering raises accuracy on hard problems · know when it earns its cost.
Tool use and structured output
Function calling makes the model return JSON your code can use · know the contract before you build on it.
Cost optimization: tokens, caching, model selection
AI is metered · the operators who stay profitable measure what they spend and choose the model that fits the task.
::part of the AtomEons /learn curriculum · 45 lessons · 5 levels · cc-by 4.0