Multimodal prompting: combining text, image, audio
The strongest prompts use the medium that fits the question · sometimes you describe, sometimes you show, sometimes you do both.
::TL;DR · the whole lesson in three lines
- MOVEThe strongest prompts use the medium that fits the question · sometimes you describe, sometimes you show, sometimes you do both.
- DRILLYou will take one of your real recent questions and re-ask it three different ways · text-only, image-included, image-plus-text · and rank the answers.
- WINYou have three responses to compare on the same real question.
::concept · what's actually happening
Multimodal prompting is the practice of choosing which inputs to give the model based on what fits the question, not on what you happen to have. A screenshot plus a question beats a 400-word description plus 'you know what I mean.' A voice memo plus a transcript beats either one alone for capturing what you actually meant.
read full concept · 4 more paragraphs →collapse concept ↑
The combination unlock is real · text describes intent, image carries unambiguous reference, audio captures inflection and pacing. A debugging session that pairs 'here is what I expected to happen' (text) with a screenshot of what actually happened (image) gives the model both halves of the gap.
Modalities have asymmetric strengths · text is precise but verbose, images are unambiguous but spatial-only, audio is rich but slow to scan. Match the modality to what you cannot say easily in the others. If your question is 'why is this layout broken,' words are the wrong tool.
Token costs differ across modalities · a high-res image can cost 1000-2000 tokens, audio costs scale with duration, text is cheapest per unit of content. Multimodal prompts can quietly become expensive prompts if you paste in five 4K screenshots without thinking.
The frontier models are still learning · they handle two modalities (text + image, text + audio) reliably, but three-way reasoning ('look at this image, listen to this audio, read these notes') still degrades. Save the three-way prompts for high-value work where the cost is justified.
::drill · do the thing
You will take one of your real recent questions and re-ask it three different ways · text-only, image-included, image-plus-text · and rank the answers.
::L40 drill · copy-paste into any AI chat
I want to practice multimodal prompting on a real question I had this week. The question is: [PASTE OR DESCRIBE THE REAL QUESTION]. The relevant artifact (if any) is [SCREENSHOT / DIAGRAM / RECORDING / PHOTO]. I will ask the same question three ways and want your honest critique each time: Round 1 · text-only, describing the situation as precisely as I can. Round 2 · attach the artifact with minimal text ('what do you see here?'). Round 3 · attach the artifact AND write a sharp text frame around it (here is what I expect, here is what I see, here is what I want to know). For each round, tell me what helped, what was missing, and what I should have included. End with a one-line rule of thumb for when to reach for each mode in tasks of this shape.::steps
- 01Find one real question you had recently that involved a visual or audio artifact.
- 02Run round 1 · text-only version.
- 03Run round 2 · artifact with minimal framing.
- 04Run round 3 · artifact plus tight text frame around it.
- 05Rank the three answers and note where each fell short.
- 06Write down your one-line rule of thumb · 'for X-shaped questions, default to Y.'
::outcome · what should be true
- You have three responses to compare on the same real question.
- You can name which round won and articulate why.
- You wrote down a rule of thumb for that question-shape.
- You noticed at least one modality that did not pull its weight.
::trap · the most common failure
Operators dump every available artifact into every prompt because 'more context is better' · then watch the model get lost trying to reconcile three loosely-related inputs. Pick the modality the question needs. Skip the rest.
::end of the curriculum
You're at Pilot level. There's no Level 6.
The next move is doing the work, not another lesson. If you want operator-grade infrastructure, that's /orangebox. If you want the lab's working journal, /founders-view. If you want to collaborate on the curriculum itself, the source is public on GitHub.
::other lessons at Operator level
Local AI · Ollama — privacy, offline, and the limit of free
At Operator level you need an honest opinion about local-only AI. Even if you don't use it daily, you should have run it once.
Model routing — switching between Claude, GPT, Gemini mid-task
Operators don't pick one AI. They route each task to the model that does it best. Knowing the strengths is the skill.
MCP servers — the plug socket that turned AI into a real tool
Model Context Protocol is the standard plug. Knowing what plugs in changes what your AI can actually touch — your files, your inbox, your calendar, your repos.
Agent mode — when AI takes action, not just answers
The frontier of useful AI is agents that DO things — browse, click, file, send. The actual skill is the safety pattern, not the magic.
Computer use — when AI takes the mouse and keyboard
Claude in Chrome, ChatGPT Atlas, computer-use beta — the frontier is AI that drives your browser like a human. Knowing the safety pattern is the actual skill.
What AI cannot replace — taste, judgment, relationships
The operators winning in 2026 are the ones who learned what AI is for and what is theirs. Knowing the line is more valuable than any prompt.
Agents 101: model plus tools plus loop
An agent is a model with tools running in a loop until done · know when you need one and when you don't.
MCP: structured tools for AI
Model Context Protocol is the USB-C of AI tooling · learn the shape before you wire anything.
Skill primers: teach a session your context in 30 seconds
A skill is a reusable file that primes a fresh AI session with your project, voice, and rules · stop re-explaining yourself.
Local models with Ollama
Run Llama, Qwen, or Mistral on your own laptop · no API, no logs, no monthly bill for the work that should stay home.
Vision models: when to use them
Vision lets the model see images · powerful for screenshots and diagrams · weak for precise spatial work · know the line.
Audio and Whisper transcription
Whisper turns audio into text · meetings, voice memos, interviews · the AI-era replacement for note-taking.
RAG vs long context: when to retrieve, when to dump
RAG fetches the right slice of your data at query time · long context stuffs everything in · know which problem you actually have.
Embeddings: meaning as numbers
An embedding is a list of numbers that captures the meaning of text · learn the shape and you unlock semantic search, deduplication, and clustering.
Fine-tuning vs prompt engineering
For individuals, fine-tuning is almost never worth it · know exactly when it actually is.
AI safety in personal use
PII, NDAs, financial data, and other people's secrets · know the rules of what you do not paste.
Chain-of-thought: making the model show its work
Asking the model to reason step-by-step before answering raises accuracy on hard problems · know when it earns its cost.
Tool use and structured output
Function calling makes the model return JSON your code can use · know the contract before you build on it.
Cost optimization: tokens, caching, model selection
AI is metered · the operators who stay profitable measure what they spend and choose the model that fits the task.
::part of the AtomEons /learn curriculum · 45 lessons · 5 levels · cc-by 4.0