::deep-dive

Deep Learning Fundamentals

From the perceptron to ResNets — the canon before transformers

Deep learning is not the whole of modern AI, but it is the substrate of every frontier model. Before approaching transformers, RLHF, or interpretability, you need a working command of the fundamentals: forward and backward propagation; activation functions and their properties; weight initialization (Xavier, He); normalization layers (BatchNorm, LayerNorm, GroupNorm); convolutional architectures; residual connections; the vanishing/exploding gradient problem and its mitigations; dropout and the other forms of regularization; the standard optimizers (SGD, momentum, Adam, AdamW, Lion) and their convergence properties; learning rate schedules; gradient clipping; the distinct roles of the training, validation, and test sets; the practicalities of GPU training; and the empirical scaling phenomena that distinguish deep learning from classical ML.

The Goodfellow-Bengio-Courville textbook is the canonical reference and remains the best single source for the theory through 2016; later developments live mostly in papers. Pair it with Andrew Ng's Deep Learning Specialization for pedagogical structure and fast.ai's deep learning course as the hands-on counterweight. Karpathy's 'Zero to Hero' YouTube series is the modern essential: building micrograd, then makemore, then a transformer from scratch develops intuition in a way no textbook can.

By the end of this path you should be able to implement and train a ResNet on CIFAR-10 from scratch without consulting external code, debug a training run that fails to converge, recognize the standard architectures from their PyTorch summaries, and explain why a particular architectural choice (residual connections, layer normalization, attention) was introduced. You should also have hands-on experience with at least one full training run. Feeling the difference between epoch 1 and epoch 50, or between a good learning rate and a bad one, is irreplaceable.
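
As a small illustration of the vanishing gradient problem and of why residual connections help, here is a toy sketch using a chain of scalar tanh layers. The function name `chain_gradient`, the weight 0.5, the input 0.5, and the depth of 50 are all assumptions chosen for the demo. Real networks use matrix-valued layers, so this only shows the effect of the identity term in each layer's derivative.

```python
import math

def chain_gradient(depth, weight, residual):
    """Return d(output)/d(input) for a chain of scalar tanh layers.

    Plain layer:    h_next = tanh(weight * h)
    Residual layer: h_next = h + tanh(weight * h)
    """
    h = 0.5      # arbitrary input value (assumption for this demo)
    grad = 1.0   # running product of per-layer derivatives (chain rule)
    for _ in range(depth):
        pre = weight * h
        # Derivative of tanh(weight * h) with respect to h.
        local = weight * (1.0 - math.tanh(pre) ** 2)
        if residual:
            grad *= 1.0 + local   # the skip path contributes the "1"
            h = h + math.tanh(pre)
        else:
            grad *= local         # every factor is below 0.5 here
            h = math.tanh(pre)
    return grad

plain_grad = chain_gradient(depth=50, weight=0.5, residual=False)
residual_grad = chain_gradient(depth=50, weight=0.5, residual=True)
print(f"plain chain gradient:    {plain_grad:.3e}")
print(f"residual chain gradient: {residual_grad:.3e}")
```

In the plain chain every factor is below 0.5, so the product shrinks geometrically with depth. In the residual chain every factor is 1 plus a non-negative term, so the gradient cannot fall below 1 in this toy setting.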

::reading path · in order

::01 · textbook
~120h
Deep Learning — Goodfellow, Bengio, Courville (free online)
The canonical reference for deep learning theory up to 2016. Part I (math) and Part II (modern practical deep networks) are mandatory.
::02 · lecture
~25h
Andrej Karpathy — Neural Networks: Zero to Hero (YouTube series)
Watch every video. The series builds micrograd from scratch, then the makemore character-level models, then a transformer, and is the single best modern deep learning curriculum.
::03 · course
~60h
Andrew Ng — Deep Learning Specialization (Coursera, deeplearning.ai)
Five courses covering basics through sequence models. Good structure if you need a paced curriculum.
::04 · course
~50h
fast.ai — Practical Deep Learning for Coders Part 2 (From Deep Learning Foundations to Stable Diffusion)
Builds a deep learning library from scratch over a series of lessons. Bottom-up complement to the top-down Part 1.
::05 · paper
~3h
Deep Residual Learning for Image Recognition — He, Zhang, Ren, Sun (ResNet paper, 2015)
Read it. The residual connection is one of the three or four most important architectural ideas in deep learning.
::06 · paper
~2h
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift — Ioffe and Szegedy (2015)
BatchNorm made deep networks markedly easier to train, permitting higher learning rates and less careful initialization. Read it alongside the LayerNorm paper (Ba, Kiros, Hinton, 2016).
::07 · paper
~2h
Adam: A Method for Stochastic Optimization — Kingma and Ba (2014)
The optimizer you will use most often. Understand the bias correction and the momentum/RMSprop hybrid.
::08 · textbook
~80h
Dive into Deep Learning — Zhang, Lipton, Li, Smola (d2l.ai, free interactive book)
Interactive textbook with PyTorch, MXNet, and TensorFlow code for every chapter. Excellent supplementary reference.
::09 · code
~20h
PyTorch official tutorials (pytorch.org/tutorials)
Start with the 60-minute blitz, then work through the more advanced tutorials. PyTorch is the de facto standard framework for deep learning research.
::10 · blog
~6h
The Annotated Transformer — Sasha Rush et al. (Harvard NLP)
Bridge to the transformers page. Walks through Attention Is All You Need with executable code interleaved.
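
Item 07 asks you to understand Adam's bias correction and its momentum/RMSprop hybrid. One way to see both is the following minimal sketch of the update rule in plain Python, applied to a one-dimensional toy objective f(x) = (x - 3)^2. The function name `adam_minimize`, the toy objective, the learning rate of 0.1, and the step count are assumptions for illustration. The defaults for `beta1`, `beta2`, and `eps` follow the values suggested in the paper.

```python
import math

def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Run Adam on a scalar parameter and return the list of iterates."""
    x = x0
    m = 0.0  # first moment: exponential moving average of gradients (momentum part)
    v = 0.0  # second moment: moving average of squared gradients (RMSprop part)
    trajectory = [x]
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Both averages start at zero, so they are biased toward zero early on.
        # Dividing by (1 - beta^t) removes that bias.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        trajectory.append(x)
    return trajectory

# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
trajectory = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(f"after 1 step:   {trajectory[1]:.6f}")
print(f"after all steps: {trajectory[-1]:.6f}")
```

Thanks to the bias correction, the very first step has magnitude close to `lr` regardless of the gradient's scale, because `m_hat / sqrt(v_hat)` reduces to the sign of the gradient at t = 1. Removing the two correction lines and rerunning is a quick way to see what the correction buys.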

::exercises · build · derive · reproduce

01 Implement micrograd from Karpathy's video, then extend it to support vector operations and a working two-layer MLP.
02 Implement a ResNet-18 from scratch in PyTorch and train it on CIFAR-10 to above 90% test accuracy.
03 Reproduce a learning rate sweep: train the same model with 10 different learning rates and plot loss curves.
04 Implement batch normalization from scratch as a custom torch.autograd.Function with a hand-written backward pass. Verify its gradients against PyTorch's nn.BatchNorm1d or nn.BatchNorm2d.
05 Train a model with intentional bugs (wrong initialization, no normalization, no residuals) and document what goes wrong.
06 Profile a training run with PyTorch profiler. Identify the bottleneck (GPU compute, data loading, host-device transfer).
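
For exercise 04, the core of the work is the backward pass and a gradient check. The sketch below shows both for a single feature over a small batch, in plain Python with no dependencies. It uses a central finite-difference check as a stand-in for the comparison against PyTorch that the exercise asks for. The function names, the sample values, and the weighted-sum loss are assumptions made for the demo.

```python
import math

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a batch of scalars (one feature) using batch statistics."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n  # biased (population) variance
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(xi - mean) * inv_std for xi in x]
    y = [gamma * xh + beta for xh in x_hat]
    return y, (x_hat, inv_std)

def batchnorm_backward(dy, gamma, cache):
    """Return (dx, dgamma, dbeta) given upstream gradients dy."""
    x_hat, inv_std = cache
    n = len(dy)
    sum_dy = sum(dy)
    sum_dy_xhat = sum(d * xh for d, xh in zip(dy, x_hat))
    # Compact form: accounts for x's effect through the mean and the variance.
    dx = [gamma * inv_std / n * (n * d - sum_dy - xh * sum_dy_xhat)
          for d, xh in zip(dy, x_hat)]
    return dx, sum_dy_xhat, sum_dy

def numerical_dx(x, gamma, beta, upstream, h=1e-5):
    """Central differences for the loss L = sum(upstream[i] * y[i])."""
    def loss(values):
        y, _ = batchnorm_forward(values, gamma, beta)
        return sum(u * yi for u, yi in zip(upstream, y))
    grads = []
    for i in range(len(x)):
        plus, minus = list(x), list(x)
        plus[i] += h
        minus[i] -= h
        grads.append((loss(plus) - loss(minus)) / (2.0 * h))
    return grads

x = [0.5, -1.2, 2.0, 0.3]          # sample batch (assumption)
upstream = [1.0, -2.0, 0.5, 3.0]   # dL/dy for the weighted-sum loss
gamma, beta = 1.5, 0.1
_, cache = batchnorm_forward(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward(upstream, gamma, cache)
dx_num = numerical_dx(x, gamma, beta, upstream)
max_err = max(abs(a - b) for a, b in zip(dx, dx_num))
print(f"max |analytic - numerical| = {max_err:.2e}")
```

A useful sanity check beyond the finite-difference comparison: the entries of `dx` sum to roughly zero, because shifting every input by the same constant leaves the normalized output unchanged.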

::milestones · observable

▲ You can implement backprop from scratch with no autograd.
▲ You have trained a ResNet to a real accuracy number on a real dataset.
▲ You can debug a non-converging training run.
▲ You can explain why residual connections enable depth.
▲ You can read a PyTorch model summary and predict its memory and compute footprint.
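
The last milestone comes down to arithmetic you can do by hand for each layer. As a rough sketch, the helper below estimates parameters, multiply-accumulate operations, and output activation memory for one 2D convolution. The function names are hypothetical. The example layer is an ImageNet-style ResNet stem (7x7 kernel, 64 channels, stride 2) on a 224x224 input, with padding 3 and no bias assumed here. The estimate ignores framework overhead and workspace memory.

```python
def conv2d_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def conv2d_footprint(c_in, c_out, kernel, in_size, stride=1, padding=0,
                     bias=False, batch=1, bytes_per_value=4):
    """Estimate the cost of one square conv layer on a square input."""
    out_size = conv2d_output_size(in_size, kernel, stride, padding)
    weights = kernel * kernel * c_in * c_out
    params = weights + (c_out if bias else 0)
    # One multiply-accumulate per weight per output position per sample.
    macs = weights * out_size * out_size * batch
    # Output feature map only; fp32 by default (4 bytes per value).
    activation_bytes = batch * c_out * out_size * out_size * bytes_per_value
    return {
        "out_size": out_size,
        "params": params,
        "macs": macs,
        "activation_bytes": activation_bytes,
    }

stem = conv2d_footprint(c_in=3, c_out=64, kernel=7, in_size=224,
                        stride=2, padding=3)

# Rough fp32 training state under Adam: weights + gradients + two moment
# buffers, i.e. about 4 values per parameter (activations not included).
adam_state_bytes = stem["params"] * 4 * 4

for name, value in stem.items():
    print(f"{name}: {value:,}")
print(f"adam_state_bytes: {adam_state_bytes:,}")
```

For this layer the helper gives a 112x112 output, 9,408 parameters, about 118 million multiply-accumulates per image, and about 3.2 MB of fp32 output activations. Summing such estimates over the layers of a model summary gives a first-order prediction to compare against what the profiler reports.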