L40 · Operator~20 min · free · cc-by 4.0

Multimodal prompting: combining text, image, audio

The strongest prompts use the medium that fits the question · sometimes you describe, sometimes you show, sometimes you do both.

::TL;DR · the whole lesson in three lines

MOVEThe strongest prompts use the medium that fits the question · sometimes you describe, sometimes you show, sometimes you do both.
DRILLYou will take one of your real recent questions and re-ask it three different ways · text-only, image-included, image-plus-text · and rank the answers.
WINYou have three responses to compare on the same real question.

jump to drill ↓or read the full concept first →

::concept · what's actually happening

Multimodal prompting is the practice of choosing which inputs to give the model based on what fits the question, not on what you happen to have. A screenshot plus a question beats a 400-word description plus 'you know what I mean.' A voice memo plus a transcript beats either one alone for capturing what you actually meant.

read full concept · 4 more paragraphs →

The combination unlock is real · text describes intent, image carries unambiguous reference, audio captures inflection and pacing. A debugging session that pairs 'here is what I expected to happen' (text) with a screenshot of what actually happened (image) gives the model both halves of the gap.

Modalities have asymmetric strengths · text is precise but verbose, images are unambiguous but spatial-only, audio is rich but slow to scan. Match the modality to what you cannot say easily in the others. If your question is 'why is this layout broken,' words are the wrong tool.

Token costs differ across modalities · a high-res image can cost 1000-2000 tokens, audio costs scale with duration, text is cheapest per unit of content. Multimodal prompts can quietly become expensive prompts if you paste in five 4K screenshots without thinking.

The frontier models are still learning · they handle two modalities (text + image, text + audio) reliably, but three-way reasoning ('look at this image, listen to this audio, read these notes') still degrades. Save the three-way prompts for high-value work where the cost is justified.

::drill · do the thing

You will take one of your real recent questions and re-ask it three different ways · text-only, image-included, image-plus-text · and rank the answers.

::L40 drill · copy-paste into any AI chat

I want to practice multimodal prompting on a real question I had this week. The question is: [PASTE OR DESCRIBE THE REAL QUESTION]. The relevant artifact (if any) is [SCREENSHOT / DIAGRAM / RECORDING / PHOTO]. I will ask the same question three ways and want your honest critique each time: Round 1 · text-only, describing the situation as precisely as I can. Round 2 · attach the artifact with minimal text ('what do you see here?'). Round 3 · attach the artifact AND write a sharp text frame around it (here is what I expect, here is what I see, here is what I want to know). For each round, tell me what helped, what was missing, and what I should have included. End with a one-line rule of thumb for when to reach for each mode in tasks of this shape.

I want to practice multimodal prompting on a real question I had this week. The question is: [PASTE OR DESCRIBE THE REAL QUESTION]. The relevant artifact (if any) is [SCREENSHOT / DIAGRAM / RECORDING / PHOTO]. I will ask the same question three ways and want your honest critique each time: Round 1 · text-only, describing the situation as precisely as I can. Round 2 · attach the artifact with minimal text ('what do you see here?'). Round 3 · attach the artifact AND write a sharp text frame around it (here is what I expect, here is what I see, here is what I want to know). For each round, tell me what helped, what was missing, and what I should have included. End with a one-line rule of thumb for when to reach for each mode in tasks of this shape.

::or open one in a new tab — then paste

Claude↗ChatGPT↗Gemini↗

::steps

01Find one real question you had recently that involved a visual or audio artifact.
02Run round 1 · text-only version.
03Run round 2 · artifact with minimal framing.
04Run round 3 · artifact plus tight text frame around it.
05Rank the three answers and note where each fell short.
06Write down your one-line rule of thumb · 'for X-shaped questions, default to Y.'

::outcome · what should be true

You have three responses to compare on the same real question.
You can name which round won and articulate why.
You wrote down a rule of thumb for that question-shape.
You noticed at least one modality that did not pull its weight.

::trap · the most common failure

Operators dump every available artifact into every prompt because 'more context is better' · then watch the model get lost trying to reconcile three loosely-related inputs. Pick the modality the question needs. Skip the rest.

::end of the curriculum

You're at Pilot level. There's no Level 6.

The next move is doing the work, not another lesson. If you want operator-grade infrastructure, that's /orangebox. If you want the lab's working journal, /founders-view. If you want to collaborate on the curriculum itself, the source is public on GitHub.

::other lessons at Operator level

L10~30 min

← back to /learn full lesson library →

Multimodal prompting: combining text, image, audio

You're at Pilot level. There's no Level 6.

Local AI · Ollama — privacy, offline, and the limit of free

Model routing — switching between Claude, GPT, Gemini mid-task

MCP servers — the plug socket that turned AI into a real tool

Agent mode — when AI takes action, not just answers

Computer use — when AI takes the mouse and keyboard

What AI cannot replace — taste, judgment, relationships

Agents 101: model plus tools plus loop

MCP: structured tools for AI

Skill primers: teach a session your context in 30 seconds

Local models with Ollama

Vision models: when to use them

Audio and Whisper transcription

RAG vs long context: when to retrieve, when to dump

Embeddings: meaning as numbers

Fine-tuning vs prompt engineering

AI safety in personal use

Chain-of-thought: making the model show its work

Tool use and structured output

Cost optimization: tokens, caching, model selection