A black sphere, a black cube, and a black cylinder on dark slate — three modalities, one form factor.

Multimodal model atlas

Vision, audio, video, and cross-modal systems · architectures, trade-offs, and recommended use

"Multimodal" is the working term for systems that read or write more than one kind of signal. A model that takes text in and produces an image is multimodal. A model that watches a video clip and answers a question about it is multimodal. A model that listens to a microphone and emits a transcript is multimodal. The label is loose because the field is moving fast, but the underlying architectural choices are not infinite, and that's what this atlas is for.

The map below covers the systems that mattered most between 2021 and mid-2026: contrastive vision-language models that learned the joint embedding trick (CLIP), the diffusion-based image generators that grew out of that work (DALL-E, Stable Diffusion), the cross-attention vision-language models that gave LLMs eyes (Flamingo, BLIP, LLaVA), the production-grade frontier systems with native multimodality (Gemini 1.5, GPT-4o, Claude 3.5 vision), and the parallel branch that handles audio and video (Whisper, MusicGen, Sora, Veo, ElevenLabs, Suno, Udio).

For each system we mark: what modalities it reads and writes, the architectural pattern it belongs to (joint embedding · cross-attention · late fusion · diffusion), recommended use, and honest notes on cost and quality. Where we don't have current numbers (and pricing in this market shifts month to month), we say so and point you at the provider's docs.

As of June 2026, the boundary between "language model" and "multimodal model" is mostly gone at the frontier: Gemini, GPT-4o, and Claude are all natively multimodal at the input layer. The interesting choices have moved downstream: which architectural pattern fits which problem, when joint embedding wins over cross-attention, when diffusion beats autoregressive generation, and when a smaller specialist beats a larger generalist.

No marketing speak. No invented benchmarks. Real arXiv IDs and product URLs throughout.

The four architectural patterns

Every system in this atlas fits one of four patterns. Knowing which pattern a model uses tells you most of what you need to know about its strengths, failure modes, and cost.

Joint embedding

CLIP · ALIGN · SigLIP

Train two encoders (e.g. image + text) so that paired inputs land near each other in a shared vector space. CLIP is the canonical example. Cheap at inference, great for retrieval and zero-shot classification, but the model doesn't generate — it scores. The shared space is the product.
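To make "it scores" concrete, here is a minimal sketch of zero-shot classification over a shared embedding space. The three-dimensional vectors are made-up stand-ins for real encoder outputs, and the function names and the temperature value are assumptions chosen for illustration, not any library's API.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_scores(image_vec, text_vecs):
    """Cosine similarity between one image embedding and each text embedding."""
    img = normalize(image_vec)
    return [sum(a * b for a, b in zip(img, normalize(t))) for t in text_vecs]

def softmax(scores, temperature=0.07):
    """Turn similarities into probabilities. The temperature is a hypothetical value."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_vec, label_vecs, labels):
    """Pick the label whose text embedding sits closest to the image embedding."""
    probs = softmax(cosine_scores(image_vec, label_vecs))
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy embeddings standing in for the outputs of a text encoder and an image encoder.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
label_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_vec = [0.8, 0.3, 0.1]

predicted, probs = zero_shot_classify(image_vec, label_vecs, labels)
print(predicted)
```

Retrieval is the same computation run the other way: embed the query once, then rank a precomputed index of vectors by similarity.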

Cross-attention fusion

Flamingo · BLIP-2 · LLaVA

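A frozen or partially frozen language model receives visual information through a small trained bridge. Flamingo introduced the pattern: a Perceiver-style resampler compresses an image into a small, fixed set of queryable tokens, and new cross-attention layers inside the LLM attend to them. BLIP-2 refined the bridge with the Q-Former, and LLaVA simplified it further to a projection of CLIP features straight into the LLM's input sequence. All three let you bolt vision onto a strong LLM without retraining the LLM.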
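The compression step can be sketched as single-head cross-attention in which a handful of latent queries attend over many visual features. This is a simplified illustration: it omits the learned query, key, and value projections a real resampler or Q-Former would have, and the feature values and latent vectors are invented for the example.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query returns a weighted average of `values`, weighted by query-key similarity."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(logits)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Six toy patch features (hypothetical), compressed into two latent tokens.
visual_features = [
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
    [0.0, 1.0], [0.1, 0.9], [0.0, 0.8],
]
latent_queries = [[4.0, 0.0], [0.0, 4.0]]  # stand-ins for learned latents

resampled = cross_attention(latent_queries, visual_features, visual_features)
print(len(visual_features), "features ->", len(resampled), "tokens")
```

The output length depends only on the number of latent queries, which is why the language model sees a fixed visual token budget regardless of how many patches the image produced.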

Native multimodal (early fusion)

Gemini 1.5 · GPT-4o · Chameleon

Tokens from all modalities flow through the same transformer from the start. Gemini 1.5 and GPT-4o are the production examples. More expensive to train but eliminates the modality bottleneck — the model can reason across modalities at every layer instead of only at the join.
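One way to picture "the same transformer from the start" is a single token sequence drawn from one shared vocabulary, in the style of discrete-token models. The sketch below is an assumption-laden illustration: the vocabulary sizes are invented, and it does not describe how Gemini 1.5 or GPT-4o tokenize inputs, which is not public.

```python
# Hypothetical per-modality vocabulary sizes; real systems differ.
VOCAB_SIZES = {"text": 32000, "image": 8192, "audio": 1024}

def build_offsets(vocab_sizes):
    """Assign each modality a disjoint ID range inside one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in vocab_sizes.items():
        offsets[name] = start
        start += size
    return offsets, start

def fuse(segments, vocab_sizes=VOCAB_SIZES):
    """Flatten interleaved (modality, local_ids) segments into one ID sequence."""
    offsets, total = build_offsets(vocab_sizes)
    sequence = []
    for modality, local_ids in segments:
        size = vocab_sizes[modality]
        for token in local_ids:
            if not 0 <= token < size:
                raise ValueError(f"{modality} token {token} is out of range")
            sequence.append(offsets[modality] + token)
    return sequence, total

# Interleaved input: text, then image codes, then more text, then audio codes.
segments = [("text", [5, 17]), ("image", [0, 8191]), ("text", [42]), ("audio", [3])]
sequence, total_vocab = fuse(segments)
print(sequence, total_vocab)
```

Because every position is just an ID in one vocabulary, every layer of the model can attend across modalities, which is the property the paragraph above describes.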

Diffusion / iterative refinement

Stable Diffusion · DALL-E 2/3 · Sora · Veo

Start from noise, denoise toward the target distribution conditioned on a prompt. Stable Diffusion, DALL-E 2/3, Sora, and Veo all use variants of this. Slow at inference (many steps) but the quality ceiling is currently higher than autoregressive image generation. Sora and Veo extend the pattern to video by adding a temporal dimension.
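The "many steps" loop can be sketched with a deterministic DDIM-style update. In this toy version the trained noise-prediction network is replaced by an oracle that already knows the target, so it only shows the shape of the sampling loop, not a working generator; the linear schedule and all names are assumptions for illustration.

```python
import math
import random

def alpha_bar_schedule(num_steps):
    """Cumulative signal fraction: 1.0 at step 0 (clean) down to 0.01 at the last step."""
    return [1.0 - 0.99 * t / num_steps for t in range(num_steps + 1)]

def oracle_noise(x_t, target, alpha_bar_t):
    """Stand-in for a trained network: the exact noise that maps `target` to `x_t`."""
    signal = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - signal * y) / noise_scale for x, y in zip(x_t, target)]

def sample(target, num_steps=50, seed=0):
    """Start from Gaussian noise and walk the schedule back to step 0."""
    rng = random.Random(seed)
    alpha_bar = alpha_bar_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in range(num_steps, 0, -1):
        eps = oracle_noise(x, target, alpha_bar[t])
        # Estimate the clean sample implied by the current noise prediction.
        x0_pred = [
            (xi - math.sqrt(1.0 - alpha_bar[t]) * ei) / math.sqrt(alpha_bar[t])
            for xi, ei in zip(x, eps)
        ]
        # Re-noise that estimate to the (lower) noise level of step t - 1.
        x = [
            math.sqrt(alpha_bar[t - 1]) * p + math.sqrt(1.0 - alpha_bar[t - 1]) * ei
            for p, ei in zip(x0_pred, eps)
        ]
    return x

target = [0.5, -1.0, 2.0]  # toy "image" of three values
result = sample(target)
print([round(v, 6) for v in result])
```

With a real network each step costs a full forward pass, which is where the inference latency comes from; reducing the step count is the usual speed lever.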

A chronological pass through the field

The order matters because each system was a reaction to what came before. CLIP made joint embedding cheap; DALL-E showed it could generate; Stable Diffusion made it open; Flamingo gave LLMs eyes; GPT-4o made native multimodality production-grade.

Jan 2021
CLIP (OpenAI)
Radford et al. release 'Learning Transferable Visual Models From Natural Language Supervision' (arXiv:2103.00020). Contrastive training on 400M image-text pairs from the web. Zero-shot ImageNet competitive with supervised ResNet-50. The joint embedding space became the substrate for almost everything that followed in image generation and retrieval.
Jan 2021
DALL-E (OpenAI)
First public text-to-image system from OpenAI, autoregressive transformer over discrete VQ-VAE image tokens. Demonstrated the capability but the quality bar was set higher by DALL-E 2 (April 2022, diffusion-based, used CLIP latents) and DALL-E 3 (Sep 2023, integrated into ChatGPT).
Apr 2022
Flamingo (DeepMind)
Alayrac et al., 'Flamingo: a Visual Language Model for Few-Shot Learning' (arXiv:2204.14198). Cross-attention from a frozen LLM to a Perceiver resampler over visual features. Strong few-shot vision-language performance without fine-tuning. Set the template that BLIP-2 and LLaVA later followed in open form.
Aug 2022
Stable Diffusion (CompVis · Stability AI · RunwayML)
Latent diffusion model released with open weights under the CreativeML OpenRAIL-M license, which allows commercial use subject to use restrictions. Rombach et al., 'High-Resolution Image Synthesis with Latent Diffusion Models' (arXiv:2112.10752). The first frontier-quality image generator with open weights; it fundamentally reshaped the open-source ecosystem and downstream tooling.
Sep 2022
Whisper (OpenAI)
Radford et al., 'Robust Speech Recognition via Large-Scale Weak Supervision' (arXiv:2212.04356). 680k hours of multilingual audio. Open weights. Became the default open-source speech-to-text engine almost overnight; still competitive in 2026 for general-purpose transcription.
Jan–Jun 2023
BLIP-2 + LLaVA
BLIP-2 (Li et al., arXiv:2301.12597) introduced the Q-Former bridging module. LLaVA (Liu et al., arXiv:2304.08485) showed you could get strong visual instruction following with a simple linear projection from CLIP features into a Vicuna/LLaMA token stream. LLaVA made open multimodal LLMs practical for hobbyists.
Mar 2023
GPT-4 with vision (GPT-4V)
Multimodal input added to GPT-4. Rollout was staged; full availability came in late 2023. System card published by OpenAI documents safety mitigations and capability scope.
Feb 2024
Gemini 1.5 (Google DeepMind)
Gemini Team, 'Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context' (arXiv:2403.05530). Native multimodality (text, image, audio, video) with a 1M-token context window. The long-context multimodal capability was the headline.
Feb 2024
Sora (OpenAI text-to-video preview)
OpenAI announced Sora in February 2024 with a technical report describing a diffusion transformer (DiT) over spacetime patches. Public access expanded through late 2024. Google DeepMind's Veo and Veo 2 followed as the main competition.
May 2024
GPT-4o (OpenAI)
End-to-end multimodal model with native speech-in and speech-out and a low-latency voice mode. The 'omni' refers to text, audio, and image being processed by one model rather than a pipeline of specialists.
Jun 2024
Claude 3.5 Sonnet with vision (Anthropic)
Claude 3.5 Sonnet shipped with image input. Strong on document understanding, charts, and screenshots — a noted strength relative to other vision-capable LLMs at release. Anthropic publishes the model card and capability scope on their site.

Vision-language models · the working index

The systems below either take images as input, produce images as output, or both. Pattern column maps to the four architectural categories above. Pricing notes are best-effort as of June 2026 — provider docs are the source of truth.

System	Inputs → outputs	Pattern	Recommended use	Notes
CLIP (OpenAI, 2021)	Image + text → joint embedding	Joint embedding	Zero-shot classification, retrieval, embedding feature for downstream models	Open weights · arXiv:2103.00020 · the foundation of many later systems
DALL-E 3 (OpenAI)	Text → image	Diffusion (autoregressive prior + diffusion decoder lineage)	High-quality prompt-following image generation	Available via ChatGPT and the OpenAI API · check OpenAI pricing page
Stable Diffusion (CompVis · Stability AI)	Text → image · image+text → image	Latent diffusion	Open-weights image generation, fine-tuning, ControlNet, custom workflows	arXiv:2112.10752 · run locally with reasonable GPU
Flamingo (DeepMind)	Image+text interleaved → text	Cross-attention fusion	Few-shot vision-language tasks (research)	arXiv:2204.14198 · not released publicly · template followed by open models
BLIP-2 (Salesforce)	Image+text → text	Cross-attention fusion (Q-Former)	Captioning, VQA, image-grounded chat with smaller compute budget	arXiv:2301.12597 · open weights · efficient bridge from vision encoder to LLM
LLaVA (Liu et al.)	Image+text → text	Projection into the LLM input (adapter variant of cross-attention fusion)	Open visual instruction following	arXiv:2304.08485 · LLaVA-1.5 and LLaVA-NeXT improved quality substantially
CogVLM (Zhipu AI / THUDM)	Image+text → text	Cross-attention (deep visual expert)	Open vision-language with strong benchmark performance	arXiv:2311.03079 · weights on Hugging Face
Gemini 1.5 Pro (Google)	Text, image, audio, video → text	Native multimodal · long context	Long-document and long-video understanding, multimodal reasoning at scale	arXiv:2403.05530 · 1M+ token context · check Google AI Studio pricing
GPT-4o (OpenAI)	Text, image, audio → text, audio, image	Native multimodal (early fusion)	Real-time voice interaction, general-purpose multimodal assistant	Released May 2024 · check OpenAI pricing page
Claude 3.5 Sonnet vision (Anthropic)	Text, image → text	Native multimodal	Document/chart/screenshot understanding, code-from-screenshot workflows	Check Anthropic console for current pricing · noted strength on visual-document tasks

Audio and music · transcription, generation, and voice

The audio branch evolved more slowly than vision but has its own canon. Whisper became the default for transcription; MusicGen and AudioLM were the research milestones for music; Suno, Udio, and ElevenLabs took the production crown for music and voice.

System	Inputs → outputs	Pattern	Recommended use	Notes
Whisper (OpenAI)	Audio → text	Transformer encoder-decoder	General-purpose speech-to-text, multilingual transcription, translation	arXiv:2212.04356 · open weights · 99 languages
AudioLM (Google)	Audio → audio (continuation)	Hierarchical autoregressive over semantic + acoustic tokens	Research baseline for high-fidelity audio generation	arXiv:2209.03143 · seeded much of the audio-token work
MusicGen (Meta)	Text or melody → music	Autoregressive over EnCodec tokens	Open-weights music generation for prototyping and research	arXiv:2306.05284 · part of Meta AudioCraft · permissive license for research, see model card for commercial scope
Suno	Text → song (vocals + instrumental)	Proprietary · multi-stage generative	End-user song generation with lyrics and vocals	Suno.ai · check suno.com for current plans
Udio	Text → song	Proprietary · multi-stage generative	Similar surface to Suno · different aesthetic	udio.com · check site for current plans and legal scope
ElevenLabs	Text + reference voice → speech	Proprietary · neural TTS with cloning	Voice cloning, audiobooks, narration, multilingual dubbing	elevenlabs.io · check pricing for current tier structure; consent and likeness rules matter

Video · the newest and least settled branch

Text-to-video lagged behind text-to-image by roughly two years because video requires modeling time as well as space and the compute cost scales steeply. As of June 2026, the production-grade systems are still in limited or staged release, and clip lengths remain short by film standards. Check provider docs for current limits.
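A back-of-envelope calculation shows why the cost scales steeply. The sketch assumes a patch-based model with full self-attention, where pairwise attention work grows with the square of the token count. The resolution, frame rate, and patch sizes are hypothetical round numbers, not figures from any provider, and real systems typically work in a compressed latent space.

```python
def patch_tokens(frames, height, width, patch_t, patch_hw):
    """Number of spacetime patches for a clip (an image is a clip with one frame)."""
    return (frames // patch_t) * (height // patch_hw) * (width // patch_hw)

def relative_attention_cost(n_tokens):
    """Full self-attention compares every token with every other token."""
    return n_tokens ** 2

# Hypothetical settings: 512x512 pixels, 16x16 spatial patches.
image_tokens = patch_tokens(frames=1, height=512, width=512, patch_t=1, patch_hw=16)

# Five seconds at 24 fps = 120 frames, grouped 4 frames per temporal patch.
video_tokens = patch_tokens(frames=120, height=512, width=512, patch_t=4, patch_hw=16)

ratio = relative_attention_cost(video_tokens) / relative_attention_cost(image_tokens)
print(image_tokens, video_tokens, ratio)
```

Under these assumptions a five-second clip has 30 times the tokens of a single image but roughly 900 times the attention work, before counting the many denoising steps.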

Sora (OpenAI)

Diffusion transformer · OpenAI

Diffusion transformer (DiT) over spacetime patches. Announced February 2024, expanded public access through late 2024. Strengths: prompt following, world simulation. Limits: physical-consistency failures still common; check OpenAI documentation for current clip-length and access tier. Read the technical report at openai.com/research/video-generation-models-as-world-simulators.

Veo (Google DeepMind)

Diffusion · Google DeepMind

Google's text-to-video family. Successive generations have added longer clips and improved physical realism. Available via Vertex AI and select consumer surfaces, with integration into other Google products rolling out over time. Check deepmind.google for the current Veo generation and capability scope.

Runway Gen-3 / Gen-4

Diffusion · Runway

Runway has shipped successive video-generation models with strong creator-tool integration (motion brushes, camera controls, image-to-video). Independent of the OpenAI / Google duopoly. Check runwayml.com for current model availability.

Open-weights video models

Mixed · open ecosystem

The open-weights side has moved more slowly than image generation but real progress exists (Stable Video Diffusion from Stability AI, and follow-on community models). Quality gap to the closed frontier is larger here than for images. As of June 2026 the practical recommendation for production-grade video is still a hosted API.

Which pattern for which problem

A heuristic guide to picking a pattern when you know your problem. These are not laws — the field moves and exceptions exist — but they hold up as defaults.

Retrieval, classification, or scoring with no generation needed → joint embedding (CLIP-family or successor like SigLIP). Cheap to run, fast at scale, and you keep control of the index.
Add vision to an existing strong LLM with limited compute → cross-attention fusion (BLIP-2 / LLaVA-style adapter). You don't retrain the LLM; you train a small projection or Q-Former.
Need state-of-the-art multimodal reasoning at production quality → native multimodal API (Gemini 1.5 Pro, GPT-4o, Claude 3.5 with vision). You pay per token but you don't operate the infrastructure.
High-quality image generation with controllable style and reproducibility → diffusion (Stable Diffusion for open-weights and tuning; DALL-E 3 for hosted prompt-following).
Audio transcription with predictable cost → Whisper, run locally if you have a GPU or via the OpenAI API for low volume.
Voice cloning or natural narration → hosted neural TTS (ElevenLabs is the obvious option; consent and likeness law matter here).
Music generation as an end product → Suno or Udio. As a research baseline or for open tuning → MusicGen.
Text-to-video as of June 2026 → hosted API (Sora, Veo, Runway). Open-weights options exist but the quality gap is real.
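The defaults above can be encoded as a small lookup so that a project records its starting assumption explicitly. The task keys below are hypothetical labels invented for this sketch; the pattern and system names are paraphrased from the list.

```python
# Keys are hypothetical task labels; values are (pattern, example systems).
DEFAULTS = {
    "retrieval": ("joint embedding", "CLIP-family or SigLIP"),
    "classification": ("joint embedding", "CLIP-family or SigLIP"),
    "add_vision_to_llm": ("cross-attention fusion", "BLIP-2 / LLaVA-style adapter"),
    "multimodal_reasoning": ("native multimodal API", "Gemini 1.5 Pro, GPT-4o, Claude 3.5 with vision"),
    "image_generation": ("diffusion", "Stable Diffusion or DALL-E 3"),
    "transcription": ("speech-to-text specialist", "Whisper"),
    "voice": ("hosted neural TTS", "ElevenLabs"),
    "music": ("hosted music generation", "Suno or Udio; MusicGen for research"),
    "video": ("hosted video API", "Sora, Veo, Runway"),
}

def recommend(task):
    """Return the default (pattern, examples) pair for a task label."""
    try:
        return DEFAULTS[task]
    except KeyError:
        known = ", ".join(sorted(DEFAULTS))
        raise KeyError(f"unknown task {task!r}; known tasks: {known}") from None

pattern, examples = recommend("retrieval")
print(pattern, "->", examples)
```

Treat the table as a default to argue against, not a rule: the point of writing it down is that exceptions then have to be justified.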

Honest notes on cost and quality

Pricing in this market is not stable. Public APIs are repriced quarterly or faster, free tiers expand and contract, and per-modality costs (per image, per minute of audio, per second of video) shift as the underlying compute economics shift. Anything we quote here would be wrong in three months. Check the provider's current pricing page before you build a cost model.

Quality is also not stable. A model that was best-in-class on a benchmark in 2024 is rarely best-in-class in 2026, and benchmark contamination is a real problem: public test sets leak into training data. If quality matters for your use case, build a small private eval set (50-200 examples representative of your actual workload) and rerun it whenever you consider switching models. Anthropic, OpenAI, and Google all publish system cards or technical reports that describe known limitations; read them before you ship.

One durable observation: native multimodal systems (Gemini 1.5, GPT-4o, Claude 3.5) tend to be more expensive per request but cheaper per outcome than chains of specialist models, because the latency and integration cost of a multi-step pipeline often dominates. Native multimodal also avoids modality-bottleneck failures where information that should cross from vision to language gets compressed at the join. This is not always true (for high-volume transcription, dedicated Whisper is still cheaper than GPT-4o audio per minute), but it is true often enough to be the right default starting point.
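A private eval set and the "per request versus per outcome" distinction can both be sketched in a few lines. Everything concrete here is a placeholder: the two "models" are stub functions standing in for real API calls, the four examples stand in for a real 50-200 item set, and the prices are invented purely to show the arithmetic.

```python
def run_eval(model_fn, examples):
    """Fraction of examples where the model's answer matches the expected one."""
    correct = sum(
        1 for prompt, expected in examples
        if model_fn(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)

def cost_per_success(cost_per_request, success_rate):
    """Expected spend per correct outcome; infinite if the model never succeeds."""
    if success_rate <= 0:
        return float("inf")
    return cost_per_request / success_rate

# Placeholder eval set: (prompt, expected answer).
examples = [("q1", "cat"), ("q2", "dog"), ("q3", "car"), ("q4", "tram")]

# Stub models standing in for real calls to two candidate systems.
answers_a = {"q1": "cat", "q2": "dog", "q3": "car", "q4": "bus"}
answers_b = {"q1": "cat", "q2": "dog", "q3": "bike", "q4": "bus"}

def model_a(prompt):
    return answers_a[prompt]

def model_b(prompt):
    return answers_b[prompt]

accuracy_a = run_eval(model_a, examples)
accuracy_b = run_eval(model_b, examples)

# Invented prices: model A costs more per request than model B.
cost_a = cost_per_success(0.015, accuracy_a)
cost_b = cost_per_success(0.012, accuracy_b)
print(accuracy_a, accuracy_b, round(cost_a, 4), round(cost_b, 4))
```

With these made-up numbers the pricier model comes out cheaper per correct answer, which is the pattern to check for with your own workload and current prices.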

Common failure modes

Things that look fine in a demo and fail in production. None of these are unique to any vendor; they're properties of the architectural patterns.

OCR drift in vision-language models. Even strong systems misread small text, numerical values in dense tables, and handwritten characters. Verify any structured extraction with rules or a second pass.
Hallucinated objects. Vision-language models will sometimes confidently describe objects that are not in the image, especially under ambiguous prompts. Lower temperature and explicit instructions help; they don't eliminate it.
Physical-consistency failure in video. Sora, Veo, and Runway can all produce clips where physics, object permanence, or geometry break partway through. Cherry-picked demos hide this; A/B-tested user studies surface it.
Speaker confusion in long audio. Multi-speaker transcription degrades on overlapping speech and accent diversity. Whisper handles single-speaker well; multi-speaker may need a separate diarization pass.
License surface for generated media. Output ownership, training-data provenance, and consent rules differ by vendor and jurisdiction. Read the provider's terms before commercial use.
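Embedding-space staleness. CLIP and successors were trained on a fixed snapshot of the web; concepts that emerged after training are not represented well. Fixing this means rotating to a newer encoder, and because vectors from different models are not comparable, the whole index has to be re-embedded when you switch.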
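For the OCR-drift item, "verify with rules" can be as simple as checking that extracted numbers parse and that line items add up to the stated total. The sketch below assumes a hypothetical extraction result shaped as a dictionary with `line_items` and `total` fields; those names are illustrative, not any vendor's output schema.

```python
from decimal import Decimal, InvalidOperation

def parse_amount(text):
    """Parse a money string exactly; return None if it is not a valid number."""
    try:
        return Decimal(text.replace(",", "").strip())
    except InvalidOperation:
        return None

def check_invoice(extracted):
    """Return a list of problems found in a model's structured extraction."""
    problems = []
    amounts = []
    for index, item in enumerate(extracted["line_items"]):
        value = parse_amount(item["amount"])
        if value is None:
            problems.append(f"line {index}: amount {item['amount']!r} is not numeric")
        else:
            amounts.append(value)
    total = parse_amount(extracted["total"])
    if total is None:
        problems.append(f"total {extracted['total']!r} is not numeric")
    elif not problems and sum(amounts) != total:
        # Only compare sums when every line parsed, to avoid a misleading second error.
        problems.append(f"line items sum to {sum(amounts)}, but total reads {total}")
    return problems

good = {"line_items": [{"amount": "12.50"}, {"amount": "7.25"}], "total": "19.75"}
misread_total = {"line_items": [{"amount": "12.50"}, {"amount": "7.25"}], "total": "19.15"}
misread_char = {"line_items": [{"amount": "1O.50"}, {"amount": "7.25"}], "total": "17.75"}

print(check_invoice(good))
print(check_invoice(misread_total))
print(check_invoice(misread_char))
```

A failed check is a signal to route the document to a second extraction pass or a human, not proof of which field is wrong.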

A short reading list

If you're going to read primary sources rather than secondary summaries (and you should), these are the papers and reports that ground the field. All are linked in the Sources section below.

Radford et al., 'Learning Transferable Visual Models From Natural Language Supervision' (CLIP, 2021).
Rombach et al., 'High-Resolution Image Synthesis with Latent Diffusion Models' (Stable Diffusion, 2022).
Alayrac et al., 'Flamingo: a Visual Language Model for Few-Shot Learning' (2022).
Li et al., 'BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models' (2023).
Liu et al., 'Visual Instruction Tuning' (LLaVA, 2023).
Gemini Team, 'Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context' (2024).
Radford et al., 'Robust Speech Recognition via Large-Scale Weak Supervision' (Whisper, 2022).
Copet et al., 'Simple and Controllable Music Generation' (MusicGen, 2023).
OpenAI Sora technical report, 'Video generation models as world simulators' (2024).

Sources

[01]
CLIP — Radford et al., 'Learning Transferable Visual Models From Natural Language Supervision', the canonical joint-embedding vision-language paper.
arxiv.org/abs/2103.00020
[02]
Stable Diffusion / Latent Diffusion — Rombach et al., 'High-Resolution Image Synthesis with Latent Diffusion Models'.
arxiv.org/abs/2112.10752
[03]
Flamingo — Alayrac et al., DeepMind, cross-attention vision-language model with Perceiver resampler.
arxiv.org/abs/2204.14198
[04]
BLIP-2 — Li et al., Salesforce, introducing the Q-Former bridge between vision encoder and LLM.
arxiv.org/abs/2301.12597
[05]
LLaVA — Liu et al., 'Visual Instruction Tuning', open visual instruction-following model.
arxiv.org/abs/2304.08485
[06]
CogVLM — Wang et al., open vision-language model from Zhipu AI with deep visual expert.
arxiv.org/abs/2311.03079
[07]
Gemini 1.5 technical report, Google DeepMind, describing long-context native multimodal model.
arxiv.org/abs/2403.05530
[08]
Whisper — Radford et al., OpenAI, 'Robust Speech Recognition via Large-Scale Weak Supervision'.
arxiv.org/abs/2212.04356
[09]
AudioLM — Borsos et al., Google, hierarchical language modeling approach to audio generation.
arxiv.org/abs/2209.03143
[10]
MusicGen — Copet et al., Meta, 'Simple and Controllable Music Generation' over EnCodec tokens.
arxiv.org/abs/2306.05284
[11]
Sora technical report, OpenAI, February 2024, describing diffusion transformer over spacetime patches.
openai.com/research/video-generation-models-as-world-simulators
[12]
DALL-E 3 product page, OpenAI, integrated into ChatGPT.
openai.com/index/dall-e-3
[13]
GPT-4o announcement, OpenAI, May 2024, native multimodal model with audio in/out.
openai.com/index/hello-gpt-4o
[14]
Claude 3.5 Sonnet announcement, Anthropic, including vision capability.
anthropic.com/news/claude-3-5-sonnet
[15]
Veo product page, Google DeepMind, text-to-video model family.
deepmind.google/technologies/veo/
[16]
Stable Diffusion public release announcement, Stability AI, August 2022.
stability.ai/news/stable-diffusion-public-release
[17]
Whisper open-source repository and model weights, OpenAI.
github.com/openai/whisper
[18]
Meta AudioCraft repository hosting MusicGen and related audio models.
github.com/facebookresearch/audiocraft
[19]
Suno music generation product page.
suno.com
[20]
Udio music generation product page.
udio.com
[21]
ElevenLabs voice synthesis and cloning product page.
elevenlabs.io
[22]
Runway research page describing Gen-3 and Gen-4 video generation models.
runwayml.com/research
[23]
GPT-4 technical report, OpenAI, the basis for GPT-4V multimodal capability.
arxiv.org/abs/2303.08774
