
Multimodal model atlas
Vision, audio, video, and cross-modal systems · architectures, trade-offs, and recommended use
The four architectural patterns
Every system in this atlas fits one of four patterns. Knowing which pattern a model uses tells you most of what you need to know about its strengths, failure modes, and cost.
Joint embedding
CLIP · ALIGN · SigLIP
Train two encoders (e.g. image + text) so that paired inputs land near each other in a shared vector space. CLIP is the canonical example. Cheap at inference, great for retrieval and zero-shot classification, but the model doesn't generate — it scores. The shared space is the product.
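To make "it scores" concrete, here is a minimal sketch of zero-shot classification in a shared embedding space. The three-dimensional vectors, the label prompts, and the temperature value are illustrative assumptions standing in for real encoder outputs; nothing here calls an actual CLIP implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_scores(image_vec, text_vecs):
    """Cosine similarity between one image embedding and each text embedding."""
    img = l2_normalize(image_vec)
    return [sum(a * b for a, b in zip(img, l2_normalize(t))) for t in text_vecs]

def softmax(scores, temperature=0.01):
    """Turn similarities into probabilities; the temperature is an assumed value."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_vec, label_vecs, labels):
    """Pick the label whose text embedding sits closest to the image embedding."""
    probs = softmax(cosine_scores(image_vec, label_vecs))
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical embeddings: in practice these come from the image and text encoders.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.2, 1.0]]
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

label, probs = zero_shot_classify(image_embedding, label_embeddings, labels)
print(label)
```

The same scoring function serves retrieval: embed the query once, score it against a precomputed index of embeddings, and sort.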
Cross-attention fusion
Flamingo · BLIP-2 · LLaVA
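A frozen or partially frozen language model receives visual information through a small trained adapter. In Flamingo, new cross-attention layers inside the LLM attend to the output of a Perceiver-style resampler, which compresses an image into a small fixed set of tokens (64 in Flamingo). BLIP-2 refined the bridge with the Q-Former (32 learned queries), and LLaVA simplified it to a projection of CLIP features into the LLM's input sequence. All three let you bolt vision onto a strong LLM without retraining the LLM from scratch.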
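The resampler idea can be sketched as a fixed set of latent queries cross-attending over however many visual features the image produces. This is a simplified single-head version under stated assumptions: the sizes are hypothetical, the latents are random rather than learned, and the learned query/key/value projections of a real model are omitted.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, features):
    """Single-head cross-attention: each query returns a weighted mix of features."""
    dim = len(features[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        out.append([sum(w * f[j] for w, f in zip(weights, features)) for j in range(dim)])
    return out

rng = random.Random(0)
DIM, NUM_LATENTS = 8, 4  # hypothetical sizes, far smaller than a real model

# Random stand-ins for the latent queries, which are learned in practice.
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

def resample(patch_features):
    """Compress any number of patch features into NUM_LATENTS output tokens."""
    return cross_attend(latents, patch_features)

small_image = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]
large_image = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(256)]
print(len(resample(small_image)), len(resample(large_image)))
```

The point of the design is that the language model always sees the same number of visual tokens regardless of input resolution, which keeps its context cost fixed.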
Native multimodal (early fusion)
Gemini 1.5 · GPT-4o · Chameleon
Tokens from all modalities flow through the same transformer from the start. Gemini 1.5 and GPT-4o are the production examples. More expensive to train but eliminates the modality bottleneck — the model can reason across modalities at every layer instead of only at the join.
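One way to picture "the same transformer from the start" is a single token sequence in which text and image tokens share one id space, roughly in the style of Chameleon's discrete image tokens. The vocabulary sizes and the begin/end-of-image markers below are hypothetical, and the internals of Gemini 1.5 and GPT-4o are not public, so treat this as a sketch of the idea rather than a description of any production system.

```python
TEXT_VOCAB = 1000      # hypothetical text vocabulary size
IMAGE_CODEBOOK = 512   # hypothetical number of discrete image codes

# Special markers placed after both vocabularies in the shared id space.
BOI = TEXT_VOCAB + IMAGE_CODEBOOK       # begin-of-image
EOI = TEXT_VOCAB + IMAGE_CODEBOOK + 1   # end-of-image

def image_to_ids(codes):
    """Shift image codes past the text vocabulary and wrap them in markers."""
    return [BOI] + [TEXT_VOCAB + c for c in codes] + [EOI]

def build_sequence(segments):
    """Interleave text and image segments into one flat token sequence."""
    seq = []
    for kind, payload in segments:
        if kind == "text":
            seq.extend(payload)
        elif kind == "image":
            seq.extend(image_to_ids(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return seq

sequence = build_sequence([
    ("text", [5, 17, 42]),
    ("image", [0, 511, 3]),
    ("text", [8]),
])
print(sequence)  # [5, 17, 42, 1512, 1000, 1511, 1003, 1513, 8]
```

Because every position is just a token, every layer can attend across modalities, which is the property the paragraph above contrasts with adapter-style joins.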
Diffusion / iterative refinement
Stable Diffusion · DALL-E 2/3 · Sora · Veo
Start from noise, denoise toward the target distribution conditioned on a prompt. Stable Diffusion, DALL-E 2/3, Sora, and Veo all use variants of this. Slow at inference (many steps) but the quality ceiling is currently higher than autoregressive image generation. Sora and Veo extend the pattern to video by adding a temporal dimension.
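The "start from noise, denoise step by step" loop can be sketched on a one-dimensional toy problem. Everything below is an assumption made for illustration: the target distribution is two point masses, the noise schedule is linear, and the denoiser is the closed-form posterior mean for that toy dataset rather than a trained, prompt-conditioned network. The update rule is a deterministic DDIM-style step.

```python
import math
import random

DATA = [-2.0, 2.0]   # toy "dataset": the target distribution is two point masses
NUM_STEPS = 50

def alpha_bar(t):
    """Fraction of signal kept at step t: 1.0 at t=0, 0.01 at t=NUM_STEPS."""
    return 1.0 - 0.99 * t / NUM_STEPS

def denoise(x_t, t):
    """Closed-form posterior mean E[x_0 | x_t] for the toy dataset.
    A trained network would approximate this quantity."""
    a = alpha_bar(t)
    logits = [-(x_t - math.sqrt(a) * x0) ** 2 / (2.0 * (1.0 - a)) for x0 in DATA]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    return sum(w * x0 for w, x0 in zip(weights, DATA)) / sum(weights)

def sample(x_T):
    """Deterministic DDIM-style sampler: start at noise, step down to t=0."""
    x = x_T
    for t in range(NUM_STEPS, 0, -1):
        a_now, a_prev = alpha_bar(t), alpha_bar(t - 1)
        x0_hat = denoise(x, t)                     # current guess at the clean sample
        eps_hat = (x - math.sqrt(a_now) * x0_hat) / math.sqrt(1.0 - a_now)
        x = math.sqrt(a_prev) * x0_hat + math.sqrt(1.0 - a_prev) * eps_hat
    return x

rng = random.Random(0)
print([round(sample(rng.gauss(0.0, 1.0)), 3) for _ in range(5)])
```

Each sample costs NUM_STEPS denoiser evaluations, which is where the inference cost mentioned above comes from; in a real system each evaluation is a full network forward pass.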
A chronological pass through the field
The order matters because each system was a reaction to what came before. CLIP made joint embedding cheap; DALL-E showed that a text prompt could drive image generation; Stable Diffusion opened the weights; Flamingo gave LLMs eyes; GPT-4o made native multimodality production-grade.
Jan 2021
CLIP (OpenAI)
Radford et al. release 'Learning Transferable Visual Models From Natural Language Supervision' (arXiv:2103.00020). Contrastive training on 400M image-text pairs from the web. Zero-shot ImageNet competitive with supervised ResNet-50. The joint embedding space became the substrate for almost everything that followed in image generation and retrieval.
Jan 2021
DALL-E (OpenAI)
OpenAI's first public text-to-image system: an autoregressive transformer over discrete image tokens produced by a discrete VAE. It demonstrated the capability, but DALL-E 2 (April 2022, diffusion-based, used CLIP latents) and DALL-E 3 (Sep 2023, integrated into ChatGPT) raised the quality bar.
Apr 2022
Flamingo (DeepMind)
Alayrac et al., 'Flamingo: a Visual Language Model for Few-Shot Learning' (arXiv:2204.14198). Cross-attention from a frozen LLM to a Perceiver resampler over visual features. Strong few-shot vision-language performance without fine-tuning. Set the template that BLIP-2 and LLaVA later followed in open form.
Aug 2022
Stable Diffusion (CompVis · Stability AI · RunwayML)
Latent diffusion model released with open weights under the CreativeML OpenRAIL-M license, which allows broad use subject to use-based restrictions. Rombach et al., 'High-Resolution Image Synthesis with Latent Diffusion Models' (arXiv:2112.10752). The first frontier-quality image generator with open weights; it reshaped the open-source ecosystem and downstream tooling.
Sep 2022
Whisper (OpenAI)
Radford et al., 'Robust Speech Recognition via Large-Scale Weak Supervision' (arXiv:2212.04356). 680k hours of multilingual audio. Open weights. Became the default open-source speech-to-text engine almost overnight; still competitive in 2026 for general-purpose transcription.
Jan–Jun 2023
BLIP-2 + LLaVA
BLIP-2 (Li et al., arXiv:2301.12597) introduced the Q-Former bridging module. LLaVA (Liu et al., arXiv:2304.08485) showed you could get strong visual instruction following with a simple linear projection from CLIP features into a Vicuna/LLaMA token stream. LLaVA made open multimodal LLMs practical for hobbyists.
Mar 2023
GPT-4 with vision (GPT-4V)
GPT-4 was announced with image input, but the rollout was staged and full availability came in late 2023. OpenAI's system card documents safety mitigations and capability scope.
Feb 2024
Gemini 1.5 (Google DeepMind)
Gemini Team, 'Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context' (arXiv:2403.05530). Native multimodality (text, image, audio, video) with a 1M-token context window. The long-context multimodal capability was the headline.
Feb 2024
Sora (OpenAI text-to-video preview)
OpenAI announced Sora in February 2024 with a technical report describing a diffusion transformer (DiT) over spacetime patches. Public access expanded through late 2024. Veo and Veo 2 (Google DeepMind) followed as the main competitors.
May 2024
GPT-4o (OpenAI)
End-to-end multimodal model with native speech-in and speech-out and a low-latency voice mode. Announced May 2024. The 'omni' refers to text, audio, and image being processed by one model rather than a pipeline of specialists.
Jun 2024
Claude 3.5 Sonnet with vision (Anthropic)
Claude 3.5 Sonnet shipped with image input. Strong on document understanding, charts, and screenshots — a noted strength relative to other vision-capable LLMs at release. Anthropic publishes the model card and capability scope on their site.
Vision-language models · the working index
The systems below either take images as input, produce images as output, or both. Pattern column maps to the four architectural categories above. Pricing notes are best-effort as of June 2026 — provider docs are the source of truth.
Audio and music · transcription, generation, and voice
The audio branch evolved more slowly than vision but has its own canon. Whisper became the default for transcription; MusicGen and AudioLM were the research milestones for music; Suno, Udio, and ElevenLabs took the production crown for music and voice.
Video · the newest and least settled branch
Text-to-video lagged behind text-to-image by roughly two years because video requires modeling time as well as space and the compute cost scales steeply. As of June 2026, the production-grade systems are still in limited or staged release, and clip lengths remain short by film standards. Check provider docs for current limits.
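A back-of-envelope calculation shows why the cost scales steeply. The sketch below counts spacetime patches for a single image and for a short clip, then compares the number of token pairs under full self-attention. The resolution, frame count, and patch sizes are hypothetical round numbers, and it assumes attention over raw pixels; production systems typically work in a compressed latent space, so only the scaling behavior carries over.

```python
def spacetime_tokens(frames, height, width, patch=16, temporal_patch=1):
    """Number of patches when a clip is cut into patch x patch x temporal_patch blocks."""
    return (frames // temporal_patch) * (height // patch) * (width // patch)

def attention_pairs(tokens):
    """Full self-attention compares every token with every token."""
    return tokens * tokens

image_tokens = spacetime_tokens(frames=1, height=512, width=512)
clip_tokens = spacetime_tokens(frames=120, height=512, width=512)
ratio = attention_pairs(clip_tokens) // attention_pairs(image_tokens)

print(image_tokens, clip_tokens, ratio)  # 1024 122880 14400
```

Under these assumptions a 120-frame clip has 120 times the tokens of one image but 14,400 times the attention pairs, before any increase in model size.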
Sora (OpenAI)
Diffusion transformer · OpenAI
Diffusion transformer (DiT) over spacetime patches. Announced February 2024, expanded public access through late 2024. Strengths: prompt following, world simulation. Limits: physical-consistency failures still common; check OpenAI documentation for current clip-length and access tier. Read the technical report at openai.com/research/video-generation-models-as-world-simulators.
Veo (Google DeepMind)
Diffusion · Google DeepMind
Google's text-to-video family; the latest generation supports longer clips and improved physical realism. Available via Vertex AI and select consumer Google products. Check deepmind.google for the current Veo generation and capability scope.
Runway Gen-3 / Gen-4
Diffusion · Runway
Runway has shipped successive video-generation models with strong creator-tool integration (motion brushes, camera controls, image-to-video). Independent of the OpenAI / Google duopoly. Check runwayml.com for current model availability.
Open-weights video models
Mixed · open ecosystem
The open-weights side has moved more slowly than image generation but real progress exists (Stable Video Diffusion from Stability AI, and follow-on community models). Quality gap to the closed frontier is larger here than for images. As of June 2026 the practical recommendation for production-grade video is still a hosted API.
Which pattern for which problem
A heuristic guide to picking a pattern when you know your problem. These are not laws — the field moves and exceptions exist — but they hold up as defaults.
- Retrieval, classification, or scoring with no generation needed → joint embedding (CLIP-family or successor like SigLIP). Cheap to run, fast at scale, and you keep control of the index.
- Add vision to an existing strong LLM with limited compute → cross-attention fusion (BLIP-2 / LLaVA-style adapter). You don't retrain the LLM; you train a small projection or Q-Former.
- Need state-of-the-art multimodal reasoning at production quality → native multimodal API (Gemini 1.5 Pro, GPT-4o, Claude 3.5 with vision). You pay per token but you don't operate the infrastructure.
- High-quality image generation with controllable style and reproducibility → diffusion (Stable Diffusion for open-weights and tuning; DALL-E 3 for hosted prompt-following).
- Audio transcription with predictable cost → Whisper, run locally if you have a GPU or via the OpenAI API for low volume.
- Voice cloning or natural narration → hosted neural TTS (ElevenLabs is the obvious option; consent and likeness law matter here).
- Music generation as an end product → Suno or Udio. As a research baseline or for open tuning → MusicGen.
- Text-to-video as of June 2026 → hosted API (Sora, Veo, Runway). Open-weights options exist but the quality gap is real.
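If you want these defaults in code, for example as the starting point of a routing layer, the list above reduces to a lookup table. The task keys and the function name are hypothetical labels chosen for this sketch; the patterns and systems are the ones named in the list.

```python
# Hypothetical task labels mapped to (pattern, example systems) from the list above.
DEFAULTS = {
    "retrieval": ("joint embedding", ["CLIP", "SigLIP"]),
    "add_vision_to_llm": ("cross-attention fusion", ["BLIP-2", "LLaVA"]),
    "multimodal_reasoning": ("native multimodal API", ["Gemini 1.5 Pro", "GPT-4o", "Claude 3.5"]),
    "image_generation": ("diffusion", ["Stable Diffusion", "DALL-E 3"]),
    "transcription": ("speech recognition", ["Whisper"]),
    "narration": ("hosted neural TTS", ["ElevenLabs"]),
    "music": ("music generation", ["Suno", "Udio", "MusicGen"]),
    "text_to_video": ("hosted video API", ["Sora", "Veo", "Runway"]),
}

def default_for(task):
    """Return the default (pattern, systems) pair, or fail loudly for unknown tasks."""
    try:
        return DEFAULTS[task]
    except KeyError:
        raise ValueError(f"no default for {task!r}; known tasks: {sorted(DEFAULTS)}") from None

pattern, systems = default_for("retrieval")
print(pattern, systems)
```

Failing loudly on an unknown task is deliberate: a silent fallback would hide exactly the cases where the heuristics do not apply.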
Honest notes on cost and quality
Pricing in this market is not stable. Public APIs are repriced quarterly or faster, free tiers expand and contract, and per-modality costs (per image, per minute of audio, per second of video) shift as the underlying compute economics shift. Anything I quote here would be wrong in three months. Check the provider's current pricing page before you build a cost model.
Quality is also not stable. A model that was best-in-class on a benchmark in 2024 is rarely best-in-class in 2026, and benchmark contamination is a real problem: public test sets leak into training data. If quality matters for your use case, build a small private eval set (50-200 examples representative of your actual workload) and rerun it whenever you consider switching models. Anthropic, OpenAI, and Google all publish system cards or technical reports that describe known limitations; read them before you ship.
One durable observation: native multimodal systems (Gemini 1.5, GPT-4o, Claude 3.5) tend to be more expensive per request but cheaper per outcome than chains of specialist models, because the latency and integration cost of a multi-step pipeline often dominates. Native multimodal also avoids modality-bottleneck failures where information that should cross from vision to language gets compressed at the join. This is not always true (for high-volume transcription, dedicated Whisper is still cheaper than GPT-4o audio per minute), but it is true often enough to be the right default starting point.
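The private eval set suggested above does not need much machinery. A minimal sketch, assuming each model is wrapped as a function from prompt to answer and that exact match after normalization is an acceptable metric for your workload (it often is not; swap in your own scorer). The example prompts and the two stand-in models are hypothetical.

```python
def normalize(text):
    """Loose comparison: ignore case and surrounding whitespace."""
    return text.strip().lower()

def accuracy(model, examples):
    """Fraction of examples where the model's answer matches the expected one."""
    correct = sum(1 for prompt, expected in examples
                  if normalize(model(prompt)) == normalize(expected))
    return correct / len(examples)

def compare(candidates, examples):
    """Score every candidate on the same examples, best first."""
    scored = [(name, accuracy(fn, examples)) for name, fn in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical eval set: (prompt, expected answer) pairs from your own workload.
eval_set = [
    ("invoice_001.png: what is the total?", "19.75"),
    ("chart_007.png: which quarter is highest?", "Q3"),
    ("form_012.png: what is the applicant's country?", "France"),
]

# Stand-ins for real API calls; replace with wrappers around the models under test.
def model_a(prompt):
    answers = {eval_set[0][0]: "19.75", eval_set[1][0]: "q3"}
    return answers.get(prompt, "")

def model_b(prompt):
    return {eval_set[0][0]: "19.75"}.get(prompt, "")

ranking = compare({"model_a": model_a, "model_b": model_b}, eval_set)
print(ranking)
```

Keep the eval set out of any public repository so it does not join the contaminated benchmarks it is meant to replace.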
Common failure modes
Things that look fine in a demo and fail in production. None of these are unique to any vendor; they're properties of the architectural patterns.
- OCR drift in vision-language models. Even strong systems misread small text, numerical values in dense tables, and handwritten characters. Verify any structured extraction with rules or a second pass.
- Hallucinated objects. Vision-language models will sometimes confidently describe objects that are not in the image, especially under ambiguous prompts. Lower temperature and explicit instructions help; they don't eliminate it.
- Physical-consistency failure in video. Sora, Veo, and Runway can all produce clips where physics, object permanence, or geometry break partway through. Cherry-picked demos hide this; A/B-tested user studies surface it.
- Speaker confusion in long audio. Multi-speaker transcription degrades on overlapping speech and accent diversity. Whisper handles single-speaker well; multi-speaker may need a separate diarization pass.
- License surface for generated media. Output ownership, training-data provenance, and consent rules differ by vendor and jurisdiction. Read the provider's terms before commercial use.
- Embedding-space staleness. CLIP and successors were trained on a fixed snapshot of the web; concepts that emerged after training are not represented well. Re-embedding may require model rotation.
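For the OCR-drift item above, "verify with rules" can be as simple as checking that extracted numbers parse and agree with each other. A minimal sketch, assuming a hypothetical invoice schema with `line_items` and `total` fields returned as strings by the vision-language model:

```python
from decimal import Decimal, InvalidOperation

def check_invoice(extracted):
    """Return a list of problems found in an extracted invoice; empty means it passed."""
    try:
        items = [Decimal(s) for s in extracted["line_items"]]
        total = Decimal(extracted["total"])
    except (InvalidOperation, KeyError) as exc:
        # A misread character (e.g. letter O for zero) or a missing field lands here.
        return [f"unparseable or missing field: {exc!r}"]

    problems = []
    if any(value < 0 for value in items):
        problems.append("negative line item")
    if sum(items) != total:
        problems.append(f"line items sum to {sum(items)}, total says {total}")
    return problems

good = check_invoice({"line_items": ["12.50", "7.25"], "total": "19.75"})
misread_total = check_invoice({"line_items": ["12.50", "7.25"], "total": "19.15"})
misread_digit = check_invoice({"line_items": ["1O.50"], "total": "10.50"})
print(good, misread_total, misread_digit)
```

Records that fail the check can be routed to a second extraction pass or to human review instead of flowing silently into downstream systems.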
A short reading list
If you're going to read primary sources rather than secondary summaries — and you should — these are the papers and reports that ground the field. All are linked in the citations panel below.
- Radford et al., 'Learning Transferable Visual Models From Natural Language Supervision' (CLIP, 2021).
- Rombach et al., 'High-Resolution Image Synthesis with Latent Diffusion Models' (Stable Diffusion, 2022).
- Alayrac et al., 'Flamingo: a Visual Language Model for Few-Shot Learning' (2022).
- Li et al., 'BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models' (2023).
- Liu et al., 'Visual Instruction Tuning' (LLaVA, 2023).
- Gemini Team, 'Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context' (2024).
- Radford et al., 'Robust Speech Recognition via Large-Scale Weak Supervision' (Whisper, 2022).
- Copet et al., 'Simple and Controllable Music Generation' (MusicGen, 2023).
- OpenAI Sora technical report, 'Video generation models as world simulators' (2024).
Sources
- [01] CLIP: Radford et al., 'Learning Transferable Visual Models From Natural Language Supervision', the canonical joint-embedding vision-language paper. arxiv.org/abs/2103.00020
- [02] Stable Diffusion / Latent Diffusion: Rombach et al., 'High-Resolution Image Synthesis with Latent Diffusion Models'. arxiv.org/abs/2112.10752
- [03] Flamingo: Alayrac et al., DeepMind, cross-attention vision-language model with Perceiver resampler. arxiv.org/abs/2204.14198
- [04] BLIP-2: Li et al., Salesforce, introducing the Q-Former bridge between vision encoder and LLM. arxiv.org/abs/2301.12597
- [05] LLaVA: Liu et al., 'Visual Instruction Tuning', open visual instruction-following model. arxiv.org/abs/2304.08485
- [06] CogVLM: Wang et al., open vision-language model from Zhipu AI with deep visual expert. arxiv.org/abs/2311.03079
- [07] Gemini 1.5 technical report, Google DeepMind, describing long-context native multimodal model. arxiv.org/abs/2403.05530
- [08] Whisper: Radford et al., OpenAI, 'Robust Speech Recognition via Large-Scale Weak Supervision'. arxiv.org/abs/2212.04356
- [09] AudioLM: Borsos et al., Google, hierarchical language modeling approach to audio generation. arxiv.org/abs/2209.03143
- [10] MusicGen: Copet et al., Meta, 'Simple and Controllable Music Generation' over EnCodec tokens. arxiv.org/abs/2306.05284
- [11] Sora technical report, OpenAI, February 2024, describing diffusion transformer over spacetime patches. openai.com/research/video-generation-models-as-world-simulators
- [12] DALL-E 3 product page, OpenAI, integrated into ChatGPT. openai.com/index/dall-e-3
- [13] GPT-4o announcement, OpenAI, May 2024, native multimodal model with audio in/out. openai.com/index/hello-gpt-4o
- [14] Claude 3.5 Sonnet announcement, Anthropic, including vision capability. anthropic.com/news/claude-3-5-sonnet
- [15] Veo product page, Google DeepMind, text-to-video model family. deepmind.google/technologies/veo/
- [16] Stable Diffusion public release announcement, Stability AI, August 2022. stability.ai/news/stable-diffusion-public-release
- [17] Whisper open-source repository and model weights, OpenAI. github.com/openai/whisper
- [18] Meta AudioCraft repository hosting MusicGen and related audio models. github.com/facebookresearch/audiocraft
- [19] Suno music generation product page. suno.com
- [20] Udio music generation product page. udio.com
- [21] ElevenLabs voice synthesis and cloning product page. elevenlabs.io
- [22] Runway research page describing Gen-3 and Gen-4 video generation models. runwayml.com/research
- [23] GPT-4 technical report, OpenAI, the basis for GPT-4V multimodal capability. arxiv.org/abs/2303.08774