::deep-dive

Transformers from Scratch

Attention, the architecture, and the implementation that underlies every frontier model

If you read only one paper in modern AI, read Attention Is All You Need (Vaswani et al., 2017). If you implement only one architecture, implement the transformer. Every frontier language model (GPT-4, Claude, Llama, Gemini) is a transformer at its core, and a doctorate-grade researcher must be able to derive every component on a whiteboard, implement it in PyTorch, and explain every architectural choice.

The pieces are: token embeddings and positional information (sinusoidal or learned embeddings, RoPE rotations, ALiBi attention biases); multi-head attention (query, key, and value projections; scaled dot-product attention; the causal mask; concatenation of the heads followed by an output projection); the position-wise feed-forward network (inner dimension typically 4x the model dimension); layer normalization (pre-norm vs post-norm, and why pre-norm won); residual connections; and the stack itself. Beyond the original encoder-decoder formulation, you need to understand the decoder-only variant that powers GPT-style models, the encoder-only variant that powers BERT, the efficiency-oriented variants (FlashAttention, sliding-window attention, grouped-query attention, multi-query attention), and the long-context extensions (position interpolation, NTK-aware scaling, YaRN).

Karpathy's nanoGPT is the canonical pedagogical implementation: a few hundred lines of PyTorch that train a real (small) GPT on Shakespeare. The Annotated Transformer is the line-by-line walk through the original paper. The Illustrated Transformer (Jay Alammar) is the diagrammatic intuition builder.

By the end of this path you should be able to implement a transformer from scratch in under an hour, debug attention layers by inspecting attention patterns, and read any modern transformer variant paper (Mixtral, Llama 3, DeepSeek V3) and identify what is novel vs standard. You should also have trained a small transformer on real data and seen firsthand how the loss curve behaves and how a well-tuned setup differs from a poorly tuned one.
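
As a minimal sketch of the attention pieces named above (scaled dot-product scores, the causal mask, and the softmax-weighted sum over values), the following NumPy code implements a single head. The function names `softmax` and `causal_attention`, the shapes, and the random inputs are illustrative assumptions, not taken from any of the referenced implementations. The learned projections and the multi-head split are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    Returns (output of shape (T, d_v), attention weights of shape (T, T)).
    """
    T, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    # True above the diagonal: position i must not attend to positions j > i.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Illustrative inputs (assumed sizes, random values).
rng = np.random.default_rng(0)
T, d = 5, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out, weights = causal_attention(q, k, v)
```

The first output row equals the first value vector, because position 0 can attend only to itself. Checks like this are a cheap way to confirm that the mask is oriented correctly.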

::reading path · in order

::01 · paper
~6h
Attention Is All You Need — Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (2017)
The foundational paper. Read it three times: skim, deep read, then annotate. Every modern LLM descends from this architecture.
::02 · lecture
~3h
Andrej Karpathy — Let's build GPT: from scratch, in code, spelled out (YouTube)
Karpathy implements a transformer from scratch in one sitting. Watch and code along; do not skip ahead.
::03 · code
~15h
nanoGPT — Andrej Karpathy (github.com/karpathy/nanoGPT)
Read every line of model.py and train.py. Then reproduce the Shakespeare result, and after that the GPT-2 124M training run.
::04 · blog
~6h
The Annotated Transformer — Sasha Rush et al. (Harvard NLP)
Line-by-line walk through the original paper with executable code. The complement to Karpathy's video.
::05 · blog
~2h
The Illustrated Transformer — Jay Alammar
Diagrammatic intuition. Read after one technical pass for the visual grounding.
::06 · paper
~5h
Language Models are Few-Shot Learners — Brown et al. (GPT-3 paper, 2020)
The decoder-only scaling paper. Read for the experimental methodology and the in-context learning results.
::07 · paper
~3h
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin, Chang, Lee, Toutanova (2018)
The encoder-only variant. Read to understand masked language modeling and the encoder formulation.
::08 · paper
~3h
RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al. (2021)
RoPE is the modern positional encoding used by Llama, Mistral, and most open frontier models. Read it.
::09 · paper
~4h
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao, Fu, Ermon, Rudra, Ré (2022)
The IO-aware attention reformulation that made long context tractable. Required reading for understanding modern inference.
::10 · paper
~6h
Llama 2: Open Foundation and Fine-Tuned Chat Models — Touvron et al. (2023)
A complete modern open transformer paper. Read for grouped-query attention, RMSNorm, SwiGLU, and the full training recipe.
::11 · textbook
~30h
Sebastian Raschka — Build a Large Language Model (From Scratch)
Modern book-length treatment of implementing an LLM from tokenization through fine-tuning. Excellent companion to Karpathy.
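
For item 08 in the list above, a minimal sketch of the rotary idea may help before reading the paper: each consecutive pair of dimensions is rotated by an angle proportional to the token position, so the dot product between a rotated query and a rotated key depends only on their relative offset. The function name `rope`, the default `base=10000.0`, and the interleaved pairing of dimensions are assumptions made for this illustration. Real implementations differ in layout (some split the vector into halves instead of interleaving pairs).

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x (even last dimension d) as if it sat at position pos."""
    d = x.shape[-1]
    # One rotation frequency per pair of dimensions: base^(-2i/d), i = 0..d/2-1.
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # Standard 2D rotation applied to each (even, odd) pair.
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Illustrative check with random vectors (assumed size d = 8).
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative offset (4) at two different absolute positions.
score_near = rope(q, 3) @ rope(k, 7)
score_far = rope(q, 103) @ rope(k, 107)
```

In this sketch `score_near` and `score_far` agree up to floating-point error, which is the relative-position property the paper derives. The rotation also preserves vector norms.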

::exercises · build · derive · reproduce

01 Implement scaled dot-product attention from scratch in NumPy. Verify against PyTorch's F.scaled_dot_product_attention.
02 Implement multi-head attention as a single PyTorch module. Train a tiny transformer on Karpathy's Shakespeare dataset.
03 Reproduce nanoGPT's GPT-2 124M training run on your own hardware (or document why your hardware cannot).
04 Implement RoPE positional encoding and replace the learned positional embeddings in your transformer. Compare convergence.
05 Implement KV-caching for inference. Benchmark the speedup against the naive recompute version.
06 Read a recent open model paper (Llama 3, Mistral, DeepSeek) and produce a one-page diff against the original Attention Is All You Need architecture.
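
As a starting point for the KV-caching exercise, here is a hedged single-head, single-layer sketch in NumPy. It shows the correctness property to test before benchmarking anything: decoding one token at a time with cached keys and values should give the same outputs as recomputing causal attention over the whole prefix. The class `KVCache`, its `step` method, the weight names `Wq`, `Wk`, `Wv`, and all sizes are hypothetical names chosen for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def full_causal_attention(x, Wq, Wk, Wv):
    """Naive version: project and attend over the entire sequence at once."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

class KVCache:
    """Hypothetical cache holding the keys and values of tokens seen so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t, Wq, Wk, Wv):
        # Project only the new token; earlier keys and values are reused.
        q_t = x_t @ Wq
        self.keys.append(x_t @ Wk)
        self.values.append(x_t @ Wv)
        K, V = np.stack(self.keys), np.stack(self.values)
        # No mask is needed: the cache holds only current and past positions.
        scores = K @ q_t / np.sqrt(q_t.shape[-1])
        return softmax(scores) @ V

# Illustrative setup (assumed sizes, random weights).
rng = np.random.default_rng(0)
T, d = 6, 16
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

cache = KVCache()
incremental = np.stack([cache.step(x[t], Wq, Wk, Wv) for t in range(T)])
reference = full_causal_attention(x, Wq, Wk, Wv)
```

Here `incremental` and `reference` should match up to floating-point error. In a multi-layer model each layer and head keeps its own cache, and a practical implementation would preallocate the buffers instead of growing Python lists.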

::milestones · observable

▲ You can implement a transformer from scratch in PyTorch in under one hour, from memory.
▲ You can explain why we divide by sqrt(d_k) in scaled dot-product attention.
▲ You have trained a transformer to a real loss number on real data.
▲ You can read any modern transformer paper and immediately identify the architectural variations.
▲ You can debug an attention layer by visualizing attention patterns.
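
For the sqrt(d_k) milestone, the usual argument runs as follows. If the components of a query and a key are independent with zero mean and unit variance, their dot product is a sum of d_k such terms and so has variance d_k. Dividing by sqrt(d_k) brings the variance back to about 1, which keeps the softmax out of its saturated, small-gradient regime. The following numerical check is a sketch under exactly that independence assumption. The sample count and `d_k = 64` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 20000

# Assumption: query and key components are i.i.d. standard normal.
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

raw = np.sum(q * k, axis=-1)      # n unscaled dot products
scaled = raw / np.sqrt(d_k)       # the scaling used in attention

raw_var = raw.var()               # expected to be close to d_k
scaled_var = scaled.var()         # expected to be close to 1

print(f"unscaled variance: {raw_var:.2f}, scaled variance: {scaled_var:.3f}")
```

Trained queries and keys are not i.i.d. Gaussian, so this is an argument about scale at initialization, not a guarantee about a trained model.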