built throughORANGEBOX·see what it ships·$1 →

AtomEons / Learn / L34

L34 · Operator~20 min · free · cc-by 4.0

Vision models: when to use them

Vision lets the model see images · powerful for screenshots and diagrams · weak for precise spatial work · know the line.

::TL;DR · the whole lesson in three lines

  • MOVEVision lets the model see images · powerful for screenshots and diagrams · weak for precise spatial work · know the line.
  • DRILLYou will run the same task three ways · text-only, image-only, and text-plus-image · to feel which combination wins for your real work.
  • WINYou have three responses to compare honestly.

::concept · what's actually happening

Vision-capable models can take an image as input alongside text and reason about its contents · they read screenshots, interpret diagrams, describe photos, transcribe handwriting, and find UI elements. The capability changed what 'paste this in' can mean.

read full concept · 4 more paragraphs

The strongest use cases are descriptive and interpretive · 'what does this error message say,' 'summarize this whiteboard photo,' 'is the layout broken in this screenshot,' 'extract the table from this scanned page.' The model reads the image like a literate person scanning a document.

The weakest use cases are precise spatial · 'click exactly here,' 'measure this distance in pixels,' 'count exactly how many widgets are in this photo.' Vision models hallucinate coordinates and miscount past a handful. They are interpreters, not measurement tools.

Mixed text-and-image prompting unlocks workflows that pure text cannot · debugging UI bugs by sharing a screenshot plus describing expected behavior, reviewing design mockups, reading dashboards. The model treats the image as additional context, not as a replacement for instructions.

Privacy gets tricky fast · screenshots often contain incidental PII (email previews, names, account balances), and people upload them without scrubbing. Vision raises the stakes of 'what did I just send?' considerably.

::drill · do the thing

You will run the same task three ways · text-only, image-only, and text-plus-image · to feel which combination wins for your real work.

::L34 drill · copy-paste into any AI chat

I am calibrating when to use vision input. Here is a real task I do: [DESCRIBE THE TASK · e.g. 'debugging why a webpage looks wrong,' 'extracting data from a chart in a PDF']. I will try this three ways and want your honest verdict each time: 1) text-only · I describe the situation in words, you respond. 2) image-only · I paste a screenshot with no description, you respond. 3) text-plus-image · I paste the screenshot AND describe what I want, you respond. For each round, tell me what you can and cannot see clearly, and what you would need from me to do better. After all three, give a one-paragraph verdict on which mode wins for tasks of this shape.

::or open one in a new tab — then paste

::steps

  1. 01Pick a real task where vision might help (UI debugging, chart reading, layout review).
  2. 02Run round 1 · text-only description.
  3. 03Run round 2 · paste screenshot with no description.
  4. 04Run round 3 · screenshot plus your description together.
  5. 05Read the three responses side by side · note quality and effort.
  6. 06Write down which mode you will default to for this task shape.

::outcome · what should be true

  • You have three responses to compare honestly.
  • You can name two task types where vision wins and one where it loses.
  • You have a default mode picked for the task you tested.
  • You scrubbed PII from the screenshot before sharing (or noticed you didn't).

::trap · the most common failure

Operators paste screenshots like reflex and skip describing what they want · the model then guesses what is wrong with the image, often correctly but sometimes spectacularly wrong. Vision is a context channel, not a mind-reading channel.

::end of the curriculum

You're at Pilot level. There's no Level 6.

The next move is doing the work, not another lesson. If you want operator-grade infrastructure, that's /orangebox. If you want the lab's working journal, /founders-view. If you want to collaborate on the curriculum itself, the source is public on GitHub.

::other lessons at Operator level

L10~30 min

Local AI · Ollama — privacy, offline, and the limit of free

At Operator level you need an honest opinion about local-only AI. Even if you don't use it daily, you should have run it once.

L11~25 min

Model routing — switching between Claude, GPT, Gemini mid-task

Operators don't pick one AI. They route each task to the model that does it best. Knowing the strengths is the skill.

L15~25 min

MCP servers — the plug socket that turned AI into a real tool

Model Context Protocol is the standard plug. Knowing what plugs in changes what your AI can actually touch — your files, your inbox, your calendar, your repos.

L16~20 min

Agent mode — when AI takes action, not just answers

The frontier of useful AI is agents that DO things — browse, click, file, send. The actual skill is the safety pattern, not the magic.

L26~22 min

Computer use — when AI takes the mouse and keyboard

Claude in Chrome, ChatGPT Atlas, computer-use beta — the frontier is AI that drives your browser like a human. Knowing the safety pattern is the actual skill.

L27~22 min

What AI cannot replace — taste, judgment, relationships

The operators winning in 2026 are the ones who learned what AI is for and what is theirs. Knowing the line is more valuable than any prompt.

L30~20 min

Agents 101: model plus tools plus loop

An agent is a model with tools running in a loop until done · know when you need one and when you don't.

L31~25 min

MCP: structured tools for AI

Model Context Protocol is the USB-C of AI tooling · learn the shape before you wire anything.

L32~25 min

Skill primers: teach a session your context in 30 seconds

A skill is a reusable file that primes a fresh AI session with your project, voice, and rules · stop re-explaining yourself.

L33~30 min

Local models with Ollama

Run Llama, Qwen, or Mistral on your own laptop · no API, no logs, no monthly bill for the work that should stay home.

L35~25 min

Audio and Whisper transcription

Whisper turns audio into text · meetings, voice memos, interviews · the AI-era replacement for note-taking.

L36~25 min

RAG vs long context: when to retrieve, when to dump

RAG fetches the right slice of your data at query time · long context stuffs everything in · know which problem you actually have.

L37~25 min

Embeddings: meaning as numbers

An embedding is a list of numbers that captures the meaning of text · learn the shape and you unlock semantic search, deduplication, and clustering.

L38~20 min

Fine-tuning vs prompt engineering

For individuals, fine-tuning is almost never worth it · know exactly when it actually is.

L39~20 min

AI safety in personal use

PII, NDAs, financial data, and other people's secrets · know the rules of what you do not paste.

L40~20 min

Multimodal prompting: combining text, image, audio

The strongest prompts use the medium that fits the question · sometimes you describe, sometimes you show, sometimes you do both.

L42~15 min

Chain-of-thought: making the model show its work

Asking the model to reason step-by-step before answering raises accuracy on hard problems · know when it earns its cost.

L43~25 min

Tool use and structured output

Function calling makes the model return JSON your code can use · know the contract before you build on it.

L44~25 min

Cost optimization: tokens, caching, model selection

AI is metered · the operators who stay profitable measure what they spend and choose the model that fits the task.

::part of the AtomEons /learn curriculum · 45 lessons · 5 levels · cc-by 4.0

LAB · ATOMEONS · MARCO ISLAND FLÆONS RESEARCH · 12 PAPERS · CC-BY 4.0ORANGEBOX v1.0.0-beta · TURBO-OPTIMIZE CLAUDE · SHIPPED 2026-05-30B00KMAKR v3.2.0 · AI PUBLISHING COCKPIT · MAC + WINDOWSFREE LAUNCH WEEK · ENDS JUNE 6 · §4A NO-SAAS LOCKFOUNDER'S VIEW · NEXT BROADCAST IN ...CITE THE WORK · FORWARD THE LINK · NO ALGORITHMLAB · ATOMEONS · MARCO ISLAND FLÆONS RESEARCH · 12 PAPERS · CC-BY 4.0ORANGEBOX v1.0.0-beta · TURBO-OPTIMIZE CLAUDE · SHIPPED 2026-05-30B00KMAKR v3.2.0 · AI PUBLISHING COCKPIT · MAC + WINDOWSFREE LAUNCH WEEK · ENDS JUNE 6 · §4A NO-SAAS LOCKFOUNDER'S VIEW · NEXT BROADCAST IN ...CITE THE WORK · FORWARD THE LINK · NO ALGORITHM