::deep-dive

Multimodal Models

Vision-language models, audio, video, and the architectures that bridge modalities

Language is one modality among many. The frontier of AI is increasingly multimodal: models that see, hear, and (less commonly so far) act in continuous spaces. The doctorate-grade curriculum here begins with CLIP (Contrastive Language-Image Pretraining, Radford et al., 2021), which established the dual-encoder paradigm of joint vision-language embedding spaces and remains a load-bearing component of downstream multimodal work, from text-to-image generation to retrieval.

From CLIP you move to the modern vision-language model (VLM) architectures: Flamingo (Alayrac et al., DeepMind 2022) for the cross-attention-into-frozen-LM pattern; BLIP-2 (Li et al., Salesforce 2023) for the Q-Former projection approach; and LLaVA (Liu et al., 2023) and the family of open VLMs that followed, which showed that a simple linear projection from a vision encoder into an LLM's embedding space can be competitive with much more complex architectures.

For diffusion-based generation, the canonical path is the original DDPM paper (Ho, Jain, Abbeel, 2020), then the latent diffusion paper that underlies Stable Diffusion (Rombach et al., 2022), then classifier-free guidance (Ho and Salimans, 2022). For text-to-image alignment and quality, DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022) define the canonical architectures, even if their successors have iterated significantly.

GPT-4V's system card (OpenAI, 2023) is the canonical industry document for understanding the safety and capability evaluation framing of frontier multimodal models. Audio models round out the modality picture: Whisper for speech recognition (Radford et al., 2022), and MusicLM and AudioLM for generation.

The unifying theme: most successful multimodal architectures freeze one modality's encoder, project into a shared space, and let cross-attention or a connecting MLP do the bridging work. A doctorate-grade learner should understand both the architectural choices and the data-curation choices that make these models work; multimodal data is much harder to curate than text-only data, and the data side is often where the field-defining work happens.
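
The bridging pattern in the unifying theme above can be made concrete with shapes alone. What follows is a minimal NumPy sketch, assuming a LLaVA-style linear connector. The dimensions, the random stand-in features, and the helper name `project_visual_tokens` are hypothetical choices for illustration, not values from any of the papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_PATCHES, D_VISION, D_LM = 16, 32, 64
NUM_TEXT_TOKENS = 5


def project_visual_tokens(patch_features, weight, bias):
    """Map vision-encoder patch features into the LM embedding space."""
    return patch_features @ weight + bias


# Random stand-ins for a frozen vision encoder's output and for the
# LM's token embeddings (assumptions; no real model is loaded here).
patch_features = rng.normal(size=(NUM_PATCHES, D_VISION))
text_embeddings = rng.normal(size=(NUM_TEXT_TOKENS, D_LM))

# In this sketch the projection is the only trainable piece.
weight = rng.normal(scale=D_VISION ** -0.5, size=(D_VISION, D_LM))
bias = np.zeros(D_LM)

visual_tokens = project_visual_tokens(patch_features, weight, bias)

# Projected patches are prepended to the text embeddings, so the LM
# sees one sequence of vectors that all live in its own embedding space.
lm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(lm_input.shape)  # (21, 64)
```

The design point is that nothing downstream of the concatenation needs to know which rows came from an image. Cross-attention bridging (the Flamingo pattern) differs in that the visual features never enter the token sequence and are instead attended to from inside the LM's layers.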

::reading path · in order

::01 · paper
~6h
Learning Transferable Visual Models From Natural Language Supervision — Radford et al. (CLIP paper, OpenAI 2021)
The contrastive vision-language paradigm. Foundational for everything multimodal that came after.
::02 · paper
~8h
Flamingo: a Visual Language Model for Few-Shot Learning — Alayrac et al. (DeepMind 2022)
The frozen-LM-with-cross-attention pattern. Still influential in modern VLM design.
::03 · paper
~6h
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models — Li et al. (Salesforce 2023)
The Q-Former approach. A different bridging architecture worth comparing to Flamingo's cross-attention.
::04 · paper
~5h
Visual Instruction Tuning — Liu, Li, Wu, Lee (LLaVA paper, 2023)
Showed that a simple linear projection from CLIP's visual encoder into the embedding space of Vicuna (a LLaMA-based LLM), plus visual instruction tuning, gives strong results; LLaVA-1.5 later replaced the linear layer with a two-layer MLP. The minimalist template.
::05 · paper
~4h
GPT-4V(ision) System Card — OpenAI (2023)
The frontier-lab framing of multimodal model evaluation and safety. Read alongside the GPT-4 technical report.
::06 · paper
~8h
Denoising Diffusion Probabilistic Models — Ho, Jain, Abbeel (2020)
The DDPM paper. Foundational for the entire diffusion-models-for-generation line.
::07 · paper
~6h
High-Resolution Image Synthesis with Latent Diffusion Models — Rombach, Blattmann, Lorenz, Esser, Ommer (Stable Diffusion paper, 2022)
Latent diffusion. The architecture that made high-resolution image generation tractable on consumer hardware.
::08 · paper
~3h
Classifier-Free Diffusion Guidance — Ho and Salimans (2022)
Classifier-free guidance is used in essentially every text-to-image diffusion model. Read the short paper.
::09 · paper
~5h
Robust Speech Recognition via Large-Scale Weak Supervision — Radford et al. (Whisper paper, OpenAI 2022)
The canonical modern speech recognition model. Demonstrates the data-scaling approach for audio: 680,000 hours of weakly supervised multilingual audio.
::10 · paper
~5h
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale — Dosovitskiy et al. (Vision Transformer paper, 2020)
The ViT paper. The visual encoder architecture behind CLIP's strongest variants and most modern VLMs.
::11 · blog
~6h
Lilian Weng — What are Diffusion Models? (blog post)
The best diffusion-models tutorial blog post on the internet. Read alongside DDPM.
::12 · code
~15h
OpenCLIP (github.com/mlfoundations/open_clip)
Open implementation and replication of CLIP and its successors. Read the code and reproduce a small training run.
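
Before reading the OpenCLIP training code, it may help to see the contrastive objective from the CLIP entry in isolation. This is a minimal NumPy sketch of a symmetric cross-entropy loss over an image-text similarity matrix, not OpenCLIP's implementation. It assumes row i of each embedding matrix forms a matched pair and uses a fixed temperature of 0.07, whereas CLIP learns the temperature during training.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the (batch, batch) similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched
    pair; every other row in the batch acts as a negative.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(logits.shape[0])
    # Image-to-text: each image must pick its caption among all captions.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each caption must pick its image among all images.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Synthetic check: nearly aligned pairs should score a lower loss than
# deliberately mismatched pairs.
rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))
aligned = clip_contrastive_loss(text + 0.01 * rng.normal(size=(8, 16)), text)
mismatched = clip_contrastive_loss(text[::-1], text)
print(aligned < mismatched)  # True
```

Zero-shot classification reuses the same similarity matrix at inference time: embed one caption per class, embed the image, and take the argmax over the image's row of logits.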

::exercises · build · derive · reproduce

01 Implement CLIP from scratch (vision encoder + text encoder + contrastive loss) on a small image-text dataset. Verify the zero-shot classification mechanism.
02 Implement a minimal DDPM training loop on MNIST or CIFAR-10. Visualize the forward and reverse processes.
03 Train a small LLaVA-style VLM by projecting features from a frozen CLIP visual encoder into the embedding space of a frozen small LM. Train only the projection on a small visual instruction dataset.
04 Reproduce classifier-free guidance on top of your DDPM implementation. Compare sample quality with and without guidance.
05 Fine-tune Whisper on a non-English language (or accented speech) and measure the word error rate (WER) improvement over the unmodified checkpoint.
06 Compare Flamingo-style cross-attention bridging vs LLaVA-style MLP projection on the same task. Document the tradeoffs.
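
For the DDPM and classifier-free guidance exercises above, two small formulas carry most of the weight: the closed-form forward process and the guided noise combination. The sketch below is a NumPy illustration under stated assumptions. It uses the linear schedule endpoints reported in the DDPM paper (1e-4 to 0.02 over 1000 steps), a random array as a stand-in for a normalized image, and random arrays in place of a model's conditional and unconditional noise predictions. The function names are hypothetical.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule with the endpoints used in the DDPM paper."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alpha_bar, noise):
    """Closed-form forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance in the common 'scale' form.

    A scale of 1 returns the conditional prediction and 0 returns the
    unconditional one. In the paper's (1 + w) * eps_cond - w * eps_uncond
    form, guidance_scale corresponds to 1 + w.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention per step

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(28, 28))  # stand-in for a normalized image
noise = rng.normal(size=x0.shape)

# The signal coefficient shrinks toward zero as t grows, which is the
# quantity to plot when visualizing the forward process.
for t in (0, 499, 999):
    xt = q_sample(x0, t, alpha_bar, noise)
    print(t, float(np.sqrt(alpha_bar[t])), float(xt.std()))

# Stand-ins for the two predictions a conditional model would make when
# run once with the prompt and once with the prompt dropped.
eps_cond = rng.normal(size=x0.shape)
eps_uncond = rng.normal(size=x0.shape)
eps_guided = guided_noise(eps_cond, eps_uncond, guidance_scale=3.0)
```

In a real training loop the same network produces both predictions, with the conditioning randomly dropped during training so that the unconditional branch is learned alongside the conditional one.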

::milestones · observable

▲ You can explain CLIP's contrastive loss in one paragraph.
▲ You have trained a (small) diffusion model end-to-end.
▲ You can read a new VLM paper and identify the bridging architecture immediately.
▲ You understand why latent diffusion is faster than pixel-space diffusion.
▲ You can debug a multimodal training run that is collapsing to text-only behavior.
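
The latent-diffusion milestone above largely comes down to tensor sizes. Here is a back-of-the-envelope sketch, assuming a 512x512 RGB image and an autoencoder with downsampling factor 8 and 4 latent channels. That matches the commonly used Stable Diffusion v1 configuration; the latent diffusion paper studies several factors. The helper names are hypothetical.

```python
def tensor_elements(height, width, channels):
    """Number of scalar values in an image or latent tensor."""
    return height * width * channels


def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Shape of the autoencoder latent for a given image size (assumed config)."""
    return height // downsample_factor, width // downsample_factor, latent_channels


pixel = tensor_elements(512, 512, 3)
latent = tensor_elements(*latent_shape(512, 512))
print(pixel, latent, pixel // latent)  # 786432 16384 48

# Spatial positions matter separately: self-attention cost grows with
# the square of the number of positions the denoiser attends over.
pixel_positions = 512 * 512
latent_positions = 64 * 64
print(pixel_positions // latent_positions)  # 64
```

Every denoising step runs on the smaller tensor, so the saving compounds over the full sampling trajectory; the autoencoder's decoder runs only once, at the end.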