

::deep-dive

Multimodal Models

Vision-language models, audio, video, and the architectures that bridge modalities

Language is one modality among many. The frontier of AI is increasingly multimodal: models that see, hear, and (less commonly so far) act in continuous spaces. The doctorate-grade curriculum here begins with CLIP (Contrastive Language-Image Pretraining, Radford et al., 2021), which established the dual-encoder paradigm of joint vision-language embedding spaces and remains a load-bearing component of downstream multimodal work, from text-to-image generation to retrieval.

From CLIP you move to the modern vision-language model (VLM) architectures: Flamingo (Alayrac et al., DeepMind 2022) for the cross-attention-into-frozen-LM pattern; BLIP-2 (Li et al., Salesforce 2023) for the Q-Former projection approach; and LLaVA (Liu et al., 2023) and the family of open VLMs that followed, which showed that a simple learned projection from a vision encoder into an LLM's embedding space (a single linear layer in the original LLaVA, a small MLP in later versions) can be competitive with much more complex architectures.

For diffusion-based generation, the canonical path is the original DDPM paper (Ho, Jain, Abbeel, 2020), then the latent diffusion paper that underlies Stable Diffusion (Rombach et al., 2022), then classifier-free guidance (Ho and Salimans, 2022). For text-to-image alignment and quality, DALL-E 2 (Ramesh et al., 2022) and Imagen (Saharia et al., 2022) define the canonical architectures, even if their successors have iterated significantly. GPT-4V's system card (OpenAI, 2023) is the canonical industry document for understanding how frontier labs frame safety and capability evaluation of multimodal models. Audio models round out the modality picture: Whisper for speech recognition (Radford et al., 2022), and MusicLM and AudioLM for generation.

The unifying theme: many successful VLM architectures keep a pretrained encoder (and often the language model) frozen, project its outputs into a shared space, and let cross-attention or a connecting MLP do the bridging work. A doctorate-grade learner should understand both the architectural choices and the data-curation choices that make these models work; multimodal data is much harder to curate than text-only data, and the data side is often where the field-defining work happens.
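
The bridging pattern described above can be sketched in a few lines. This is a minimal illustration of a LLaVA-style linear projection, not any paper's implementation: the dimensions, the random stand-ins for encoder outputs, and the helper name `project_visual_tokens` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_PATCHES, D_VISION, D_LM, NUM_TEXT_TOKENS = 16, 32, 64, 5


def project_visual_tokens(patch_features, weight, bias):
    """Map frozen vision-encoder features into the LM embedding space."""
    return patch_features @ weight + bias


# Stand-in for the output of a frozen vision encoder (one vector per patch).
patch_features = rng.normal(size=(NUM_PATCHES, D_VISION))
# Stand-in for the LM's embeddings of the text prompt.
text_embeddings = rng.normal(size=(NUM_TEXT_TOKENS, D_LM))

# The only trainable parameters in this sketch: a single linear layer.
weight = rng.normal(scale=D_VISION ** -0.5, size=(D_VISION, D_LM))
bias = np.zeros(D_LM)

visual_tokens = project_visual_tokens(patch_features, weight, bias)
# The LM then sees the projected patches as a prefix to the text tokens.
lm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(lm_input.shape)  # (21, 64)
```

In a real system the projection would be trained by backpropagating the LM's next-token loss through it while the encoder stays frozen; a Flamingo-style design would instead leave the text sequence unchanged and feed the visual features in through added cross-attention layers.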

::reading path · in order

  1. ::01 · paper

    ~6h

    Learning Transferable Visual Models From Natural Language Supervision — Radford et al. (CLIP paper, OpenAI 2021)

    The contrastive vision-language paradigm. Foundational for everything multimodal that came after.

  2. ::02 · paper

    ~8h

    Flamingo: a Visual Language Model for Few-Shot Learning — Alayrac et al. (DeepMind 2022)

    The frozen-LM-with-cross-attention pattern. Still influential in modern VLM design.

  3. ::03 · paper

    ~6h

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models — Li et al. (Salesforce 2023)

    The Q-Former approach. A different bridging architecture worth comparing to Flamingo's cross-attention.

  4. ::04 · paper

    ~5h

    Visual Instruction Tuning — Liu, Li, Wu, Lee (LLaVA paper, 2023)

    Showed that a simple linear projection from CLIP's visual encoder into the embedding space of a LLaMA-based LLM (Vicuna), plus visual instruction tuning, gives strong results; LLaVA-1.5 later replaced the linear layer with a two-layer MLP. The minimalist template.

  5. ::05 · paper

    ~4h

    GPT-4V(ision) System Card — OpenAI (2023)

    The frontier-lab framing of multimodal model evaluation and safety. Read alongside the GPT-4 technical report.

  6. ::06 · paper

    ~8h

    Denoising Diffusion Probabilistic Models — Ho, Jain, Abbeel (2020)

    The DDPM paper. Foundational for the entire diffusion-models-for-generation line.

  7. ::07 · paper

    ~6h

    High-Resolution Image Synthesis with Latent Diffusion Models — Rombach, Blattmann, Lorenz, Esser, Ommer (Stable Diffusion paper, 2022)

    Latent diffusion. The architecture that made high-resolution image generation tractable on consumer hardware.

  8. ::08 · paper

    ~3h

    Classifier-Free Diffusion Guidance — Ho and Salimans (2022)

    Classifier-free guidance is used in essentially every text-to-image diffusion model. Read the short paper.

  9. ::09 · paper

    ~5h

    Robust Speech Recognition via Large-Scale Weak Supervision — Radford et al. (Whisper paper, OpenAI 2022)

    The canonical modern speech recognition model. Demonstrates the data-scaling approach for audio.

  10. ::10 · paper

    ~5h

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale — Dosovitskiy et al. (Vision Transformer paper, 2020)

    The ViT paper. The visual encoder architecture used in CLIP and most modern VLMs.

  11. ::11 · blog

    ~6h

    Lilian Weng — What are Diffusion Models? (blog post)

    The best diffusion-models tutorial blog post on the internet. Read alongside DDPM.

  12. ::12 · code

    ~15h

    OpenCLIP (github.com/mlfoundations/open_clip)

    Open implementation and replication of CLIP and its successors. Read the code and reproduce a small training run.
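
As a companion to the CLIP and OpenCLIP entries in the reading path, here is a hedged sketch of the symmetric contrastive objective. It assumes a fixed temperature of 0.07 (CLIP learns this value during training) and uses random vectors in place of real encoder outputs; the function names are illustrative and are not OpenCLIP's API.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities of a batch of pairs.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair;
    every other row in the batch serves as a negative.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape [batch, batch]
    diag = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = clip_contrastive_loss(emb, emb)
# Shifting the text rows by one breaks every image-text pairing.
shuffled_loss = clip_contrastive_loss(emb, np.roll(emb, 1, axis=0))
print(aligned_loss, shuffled_loss)
```

Matched pairs should score a much lower loss than mismatched ones. Zero-shot classification reuses the same similarity matrix: embed one text prompt per class and pick the column with the highest logit for each image.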

::exercises · build · derive · reproduce

  1. 01 · Implement CLIP from scratch (vision encoder + text encoder + contrastive loss) on a small image-text dataset. Verify the zero-shot classification mechanism.
  2. 02 · Implement a minimal DDPM training loop on MNIST or CIFAR-10. Visualize the forward and reverse processes.
  3. 03 · Train a small LLaVA-style VLM by projecting the outputs of a frozen CLIP visual encoder into the embedding space of a frozen small LM. Train only the projection on a small visual instruction dataset.
  4. 04 · Reproduce classifier-free guidance on top of your DDPM implementation. Compare sample quality with and without guidance.
  5. 05 · Fine-tune Whisper on a non-English language (or accented speech) and measure WER improvement.
  6. 06 · Compare Flamingo-style cross-attention bridging vs LLaVA-style MLP projection on the same task. Document the tradeoffs.
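
For exercises 02 and 04, the two formulas worth having in hand before writing a training loop are the closed-form forward process and the guidance combination. The sketch below assumes the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps) and uses random arrays in place of a trained noise-prediction network; `guided_noise` and its `guidance_scale` argument are illustrative names. Ho and Salimans write the combination as (1 + w) * eps_cond - w * eps_uncond, which is the same expression with guidance_scale = 1 + w.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly."""
    return np.linspace(beta_start, beta_end, num_steps)


def q_sample(x0, t, alpha_bar, noise):
    """Forward process in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: push the unconditional prediction
    toward the conditional one. A scale of 1.0 is plain conditional sampling."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


NUM_STEPS = 1000
betas = linear_beta_schedule(NUM_STEPS)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))  # stand-in for a normalized image
noise = rng.normal(size=x0.shape)

x_early = q_sample(x0, 0, alpha_bar, noise)             # almost the clean image
x_late = q_sample(x0, NUM_STEPS - 1, alpha_bar, noise)  # almost pure noise

# Stand-ins for the two outputs of a noise-prediction network.
eps_cond = rng.normal(size=x0.shape)
eps_uncond = rng.normal(size=x0.shape)
eps_guided = guided_noise(eps_cond, eps_uncond, guidance_scale=3.0)
```

During training the same network produces both predictions: the conditioning is dropped at random for a fraction of examples so that the model also learns the unconditional case.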

::milestones · observable

  • You can explain CLIP's contrastive loss in one paragraph.
  • You have trained a (small) diffusion model end-to-end.
  • You can read a new VLM paper and identify the bridging architecture immediately.
  • You understand why latent diffusion is faster than pixel-space diffusion.
  • You can debug a multimodal training run that is collapsing to text-only behavior.
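
A quick back-of-envelope check for the latent diffusion milestone. The numbers assume the configuration commonly cited for Stable Diffusion v1 (a 512x512 RGB image, an autoencoder with 8x spatial downsampling, and 4 latent channels); treat them as illustrative, since denoiser cost does not scale exactly with element count.

```python
def tensor_elements(height, width, channels):
    """Number of values the denoiser must process at each step."""
    return height * width * channels


IMAGE_SIZE = 512        # assumed image resolution
DOWNSAMPLE_FACTOR = 8   # assumed autoencoder downsampling factor
LATENT_CHANNELS = 4     # assumed latent channel count

pixel_elements = tensor_elements(IMAGE_SIZE, IMAGE_SIZE, 3)
latent_side = IMAGE_SIZE // DOWNSAMPLE_FACTOR
latent_elements = tensor_elements(latent_side, latent_side, LATENT_CHANNELS)
ratio = pixel_elements / latent_elements

print(pixel_elements, latent_elements, ratio)  # 786432 16384 48.0
```

Every denoising step runs on the smaller tensor, and the autoencoder's decoder is applied only once at the end, which is where most of the saving comes from.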
LAB · ATOMEONS · MARCO ISLAND FLÆONS RESEARCH · 12 PAPERS · CC-BY 4.0 | ORANGEBOX v1.0.0-beta · TURBO-OPTIMIZE CLAUDE · SHIPPED 2026-05-30 | B00KMAKR v3.2.0 · AI PUBLISHING COCKPIT · MAC + WINDOWS | FREE LAUNCH WEEK · ENDS JUNE 6 · §4A NO-SAAS LOCK | FOUNDER'S VIEW · NEXT BROADCAST IN ... | CITE THE WORK · FORWARD THE LINK · NO ALGORITHM