::deep-dive
Mathematical Foundations for AI Research
Linear algebra, probability, calculus, and optimization — the load-bearing math under every frontier paper
Nearly every modern AI paper rests on four mathematical pillars: linear algebra (the language of representations), probability (the language of uncertainty), multivariable calculus (the language of optimization), and convex and non-convex optimization theory (the language of training). Skipping them is the single most common mistake of self-taught practitioners: they can fine-tune a model, but they cannot read the proofs in a scaling-laws paper, derive their own loss function, or debug a training run that diverges for mathematical reasons. A doctorate-grade learner must reach the point where matrix calculus on a whiteboard, expectation-over-distribution manipulations, and Lagrangian optimization feel as natural as writing Python.
The goal of this page is not to make you a mathematician. It is to make you fluent enough that when Goodfellow writes 'we minimize the KL divergence between p_data and p_model parameterized by theta' you do not pause: you see the geometry of distributions, the gradient flowing backward, the Jacobian of the parameterization. You should be able to derive gradient descent yourself, derive the normal equations for linear regression, derive backpropagation through a two-layer network, prove that softmax followed by cross-entropy gives the clean gradient p - y with respect to the logits, and compute the Fisher information matrix for a simple model.
Linear algebra carries the most weight: eigendecompositions, SVD, matrix calculus, the four fundamental subspaces, and the geometry of projections appear everywhere from attention mechanisms to LoRA to RLHF. Probability comes second: you need measure-theoretic intuition (without the full machinery), comfort with multivariate distributions, the exponential family, the KL, JS, and TV divergences, and concentration inequalities. Calculus and optimization are the connective tissue. Treat this page as a year of full-time study compressed into a structured path.
Most practitioners try to skip ahead and fail; doctorate-grade work requires you to actually do the problems.
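To make the "minimize the KL divergence between p_data and p_model" sentence concrete, here is a minimal sketch on a toy three-outcome distribution. Everything in it is an illustrative assumption: the numbers in `p_data`, the softmax parameterization of `p_model`, the step size, and the helper names. It checks the identity KL(p || q) = H(p, q) - H(p), which is why minimizing KL over theta is the same as minimizing cross-entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -E_p[log q] in nats."""
    return -np.sum(p * np.log(q))

def kl(p, q):
    """KL(p || q) = E_p[log p - log q]; assumes strictly positive entries."""
    return np.sum(p * (np.log(p) - np.log(q)))

def softmax(theta):
    """Map logits to a probability vector (shifted for numerical stability)."""
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

p_data = np.array([0.7, 0.2, 0.1])   # hypothetical data distribution
theta = np.zeros(3)                  # logits of p_model, starting at uniform

# Gradient descent on KL(p_data || softmax(theta)).
# The gradient with respect to the logits is p_model - p_data.
for _ in range(500):
    p_model = softmax(theta)
    theta -= 0.5 * (p_model - p_data)

p_model = softmax(theta)
final_kl = kl(p_data, p_model)
identity_gap = final_kl - (cross_entropy(p_data, p_model) - entropy(p_data))
```

Because H(p_data) does not depend on theta, the two objectives share the same gradient and the same minimizer; only the constant offset differs.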
::reading path · in order
::01 · lecture
~8h
3Blue1Brown — Essence of Linear Algebra (YouTube series by Grant Sanderson)
The geometric intuition for vectors, transformations, determinants, eigenvectors, and change of basis. Watch this first; everything else makes more sense after.
::02 · lecture
~6h
3Blue1Brown — Essence of Calculus (YouTube series)
Same treatment for derivatives, integrals, and the chain rule. The chain rule episode is the prerequisite for understanding backpropagation.
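The link between the chain rule and backpropagation can be seen on a single composite function. The sketch below is an illustration only, assuming the hypothetical function f(x) = sin(x^2) and made-up helper names: it multiplies local derivatives from the outside in, then compares the result against a finite-difference estimate.

```python
import math

def f(x):
    """Composite function f(x) = sin(x**2)."""
    return math.sin(x ** 2)

def df_chain_rule(x):
    """Derivative via the chain rule, organized like a tiny backward pass."""
    u = x ** 2             # forward pass: inner function
    dy_du = math.cos(u)    # local derivative of the outer function at u
    du_dx = 2 * x          # local derivative of the inner function at x
    return dy_du * du_dx   # backward pass: product of local derivatives

def df_numeric(x, h=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.3
analytic = df_chain_rule(x0)
numeric = df_numeric(x0)
```

Backpropagation applies the same pattern to vectors and matrices: each layer contributes a local Jacobian, and the backward pass multiplies them in reverse order.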
::03 · course
~60h
MIT OCW 18.06 — Linear Algebra (Gilbert Strang)
The canonical undergraduate linear algebra course. Strang's four-fundamental-subspaces framing is a mental model that working ML researchers reach for constantly.
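One way to make the four fundamental subspaces tangible is to read them off a full SVD. The sketch below assumes NumPy and a hypothetical rank-2 matrix chosen for illustration; the variable names are not from the course. It extracts orthonormal bases for the column space, left null space, row space, and null space.

```python
import numpy as np

# Hypothetical 3x4 matrix of rank 2 (row 3 = row 1 + row 2).
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 2.0],
              [1.0, 3.0, 1.0, 3.0]])

U, s, Vt = np.linalg.svd(A)   # full SVD: U is 3x3, Vt is 4x4
tol = max(A.shape) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))      # numerical rank

col_space = U[:, :r]      # basis for C(A), a subspace of R^3
left_null = U[:, r:]      # basis for N(A^T), a subspace of R^3
row_space = Vt[:r, :].T   # basis for C(A^T), a subspace of R^4
null_space = Vt[r:, :].T  # basis for N(A), a subspace of R^4
```

The dimensions add up as the theory says: `r` plus the null-space dimension equals the number of columns, `r` plus the left-null-space dimension equals the number of rows, and each pair of subspaces living in the same ambient space is orthogonal.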
::04 · textbook
~40h
Introduction to Linear Algebra — Gilbert Strang (textbook, 5th or 6th edition)
The companion text to 18.06. Do the problem sets — the exam-style problems force you to compute rather than recognize.
::05 · course
~50h
MIT OCW 18.05 — Introduction to Probability and Statistics (Orloff and Bloom)
Bayesian-flavored probability course with clean problem sets. Builds the conditional-probability intuition you need for graphical models and variational inference.
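The conditional-probability reasoning this course drills can be sketched as a single Bayes update. The numbers below (a 1% base rate, 95% sensitivity, 5% false-positive rate) and the function name are hypothetical, chosen only to show how far the posterior can sit from intuition.

```python
def posterior(prior, likelihood_pos, likelihood_neg):
    """P(H | E) by Bayes' rule for a binary hypothesis H and evidence E.

    prior          = P(H)
    likelihood_pos = P(E | H)
    likelihood_neg = P(E | not H)
    """
    joint_h = prior * likelihood_pos
    joint_not_h = (1.0 - prior) * likelihood_neg
    evidence = joint_h + joint_not_h   # P(E) by the law of total probability
    return joint_h / evidence

# One positive test result under the hypothetical rates above.
p_one = posterior(prior=0.01, likelihood_pos=0.95, likelihood_neg=0.05)

# A second positive result, assuming the two tests are conditionally
# independent given H: the first posterior becomes the new prior.
p_two = posterior(prior=p_one, likelihood_pos=0.95, likelihood_neg=0.05)
```

With these assumed rates the first posterior is about 0.16 and the second about 0.78. The sequential update is only valid under the stated conditional-independence assumption.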
::06 · course
~80h
MIT OCW 18.01 / 18.02 — Single and Multivariable Calculus
Gradients, Jacobians, Hessians, and Lagrange multipliers. The multivariable course is where vector calculus becomes the language of optimization.
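A habit worth building alongside the multivariable material is checking every hand-derived gradient and Hessian numerically. The sketch below assumes NumPy and a hypothetical quadratic test function f(x) = x^T A x + b^T x with random `A` and `b`; the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))   # hypothetical, deliberately non-symmetric
b = rng.standard_normal(n)

def f(x):
    """Scalar test function f(x) = x^T A x + b^T x."""
    return x @ A @ x + b @ x

def grad_f(x):
    """Analytic gradient: (A + A^T) x + b."""
    return (A + A.T) @ x + b

def hess_f(x):
    """Analytic Hessian: A + A^T (constant for a quadratic)."""
    return A + A.T

def numeric_grad(fun, x, h=1e-5):
    """Central-difference gradient, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (fun(x + e) - fun(x - e)) / (2 * h)
    return g

def numeric_hess(grad, x, h=1e-5):
    """Central-difference Jacobian of the gradient, which is the Hessian."""
    H = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        H[:, j] = (grad(x + e) - grad(x - e)) / (2 * h)
    return H

x0 = rng.standard_normal(n)
grad_err = np.max(np.abs(numeric_grad(f, x0) - grad_f(x0)))
hess_err = np.max(np.abs(numeric_hess(grad_f, x0) - hess_f(x0)))
```

Using a non-symmetric `A` is a deliberate choice: it catches the common slip of writing the gradient as 2Ax, which is only correct when A is symmetric.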
::07 · textbook
~6h
The Matrix Cookbook — Petersen and Pedersen
Free reference PDF. Memorize the pages of matrix-calculus identities; they are the bread and butter of derivation work.
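Identities stick better once you have verified one yourself. As a minimal sketch, assuming NumPy and a hypothetical well-conditioned test matrix, the block below checks the standard identity that the derivative of log|det X| with respect to X is the inverse transpose of X, by perturbing each entry independently.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
# Hypothetical test matrix, built to be well conditioned.
X = rng.standard_normal((n, n))
X = X @ X.T + n * np.eye(n)

def logdet(M):
    """log|det M| via slogdet, which avoids overflow in det."""
    sign, value = np.linalg.slogdet(M)
    return value

analytic = np.linalg.inv(X).T   # claimed gradient of log|det X| w.r.t. X

# Perturb each entry X[i, j] on its own (no symmetry constraint imposed).
h = 1e-6
numeric = np.zeros_like(X)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = h
        numeric[i, j] = (logdet(X + E) - logdet(X - E)) / (2 * h)

max_err = np.max(np.abs(numeric - analytic))
```

The same perturb-one-entry loop works for any scalar function of a matrix, so it doubles as a template for checking other identities from the reference.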
::08 · textbook
~80h
Convex Optimization — Boyd and Vandenberghe (free PDF)
Even though deep learning is non-convex, the geometric intuitions (duality, KKT conditions, gradient descent convergence) carry over. The Stanford EE364A lectures are the companion course.
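The gradient descent convergence intuition is easiest to see on a strongly convex quadratic, where the rate can be checked step by step. The sketch below assumes NumPy and a hypothetical random positive definite matrix `Q`; it runs gradient descent on f(x) = 0.5 x^T Q x with step size 1/L and compares the distance to the minimizer against the bound (1 - mu/L)^k times the initial distance.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)        # hypothetical symmetric positive definite matrix

eigs = np.linalg.eigvalsh(Q)   # eigenvalues in ascending order
mu, L = eigs[0], eigs[-1]      # strong-convexity and smoothness constants

def grad(x):
    """Gradient of f(x) = 0.5 * x^T Q x, whose minimizer is x* = 0."""
    return Q @ x

x = rng.standard_normal(n)
x0_norm = np.linalg.norm(x)
rate = 1.0 - mu / L            # per-step contraction factor for step size 1/L

distances, bounds = [], []
for k in range(1, 51):
    x = x - (1.0 / L) * grad(x)
    distances.append(np.linalg.norm(x))
    bounds.append(rate ** k * x0_norm)

# Small relative slack absorbs floating-point error in mu and L.
bound_holds = all(d <= b * (1 + 1e-9) for d, b in zip(distances, bounds))
```

The ratio L/mu is the condition number: as it grows, `rate` approaches 1 and convergence slows, which is the same geometry behind ill-conditioned training runs.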
::09 · textbook
~60h
Probability Theory: The Logic of Science — E.T. Jaynes
Read selectively. The first 4 chapters reframe probability as extended logic and will permanently improve how you reason about uncertainty in ML systems.
::10 · textbook
~50h
Mathematics for Machine Learning — Deisenroth, Faisal, Ong (free PDF)
Bridges the gap between pure math and ML applications. Useful integrator after the foundational courses.
::exercises · build · derive · reproduce
- 01 Derive the closed-form solution to ordinary least squares from scratch using only matrix calculus. Verify against numpy.linalg.lstsq.
- 02 Implement SVD by hand on a 3x3 matrix using the eigendecomposition of A^T A. Compare to numpy.linalg.svd, keeping in mind that singular vectors are only determined up to sign.
- 03 Derive backpropagation for a two-layer MLP with cross-entropy loss on paper, then implement it in pure numpy with no autograd.
- 04 Prove that softmax composed with cross-entropy yields the gradient (p - y) with respect to the logits. Show every step.
- 05 Compute the KL divergence between two multivariate Gaussians in closed form. Verify against a Monte Carlo estimate.
- 06 Implement gradient descent and Newton's method for logistic regression. Plot convergence on the same loss landscape.
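For exercise 01, the verification step can be set up before you finish the derivation. The sketch below is one possible harness, assuming NumPy and synthetic data; the sizes, `w_true`, and the noise level are hypothetical. It solves the normal equations X^T X w = X^T y and compares against numpy.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_features = 200, 4
X = rng.standard_normal((n_samples, n_features))   # hypothetical design matrix
w_true = np.array([1.5, -2.0, 0.5, 3.0])           # hypothetical ground truth
y = X @ w_true + 0.1 * rng.standard_normal(n_samples)

# Normal equations: X^T X w = X^T y. Solve the linear system rather than
# forming an explicit inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)

# At the minimizer the residual is orthogonal to every column of X.
orthogonality = X.T @ (y - X @ w_normal)
```

The orthogonality check is the geometric content of the derivation: the fitted values are the projection of y onto the column space of X. The normal equations are fine for a well-conditioned toy problem like this one, but they square the condition number, which is one reason library routines take a different route.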
::milestones · observable
- ▲ You can derive backpropagation on a whiteboard with no notes.
- ▲ You can read a paper that says 'minimize the KL between q(z|x) and p(z|x)' and immediately picture the geometry.
- ▲ You can derive the gradient of a custom loss without consulting external references.
- ▲ You can implement linear regression, logistic regression, and a two-layer MLP in pure numpy.
- ▲ You recognize when a paper's claim depends on a convexity, smoothness, or concentration assumption.
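The backpropagation and pure-numpy milestones can be self-checked with a gradient check. The sketch below is one possible layout, not a reference solution: the layer sizes, the tanh activation (chosen because it is smooth, which keeps finite differences well behaved), and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_in, d_hid, n_cls = 8, 5, 6, 3   # hypothetical sizes
X = rng.standard_normal((n, d_in))
labels = rng.integers(0, n_cls, size=n)
Y = np.eye(n_cls)[labels]            # one-hot targets

params = {
    "W1": 0.5 * rng.standard_normal((d_in, d_hid)),
    "b1": np.zeros(d_hid),
    "W2": 0.5 * rng.standard_normal((d_hid, n_cls)),
    "b2": np.zeros(n_cls),
}

def forward(p, X):
    """Return hidden activations and softmax probabilities."""
    H = np.tanh(X @ p["W1"] + p["b1"])
    logits = H @ p["W2"] + p["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    return H, P

def loss(p, X, Y):
    """Mean cross-entropy over the batch."""
    _, P = forward(p, X)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

def backward(p, X, Y):
    """Manual backpropagation; returns gradients keyed like p."""
    H, P = forward(p, X)
    d_logits = (P - Y) / X.shape[0]   # softmax + cross-entropy gradient
    d_H = d_logits @ p["W2"].T
    d_pre = d_H * (1.0 - H ** 2)      # derivative of tanh
    return {
        "W1": X.T @ d_pre, "b1": d_pre.sum(axis=0),
        "W2": H.T @ d_logits, "b2": d_logits.sum(axis=0),
    }

def numeric_grad(p, X, Y, h=1e-5):
    """Central differences over every parameter entry (slow; checking only)."""
    grads = {}
    for name, value in p.items():
        g = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            old = value[idx]
            value[idx] = old + h
            up = loss(p, X, Y)
            value[idx] = old - h
            down = loss(p, X, Y)
            value[idx] = old          # restore the parameter
            g[idx] = (up - down) / (2 * h)
        grads[name] = g
    return grads

analytic = backward(params, X, Y)
numeric = numeric_grad(params, X, Y)
max_err = max(np.max(np.abs(analytic[k] - numeric[k])) for k in params)
```

If `max_err` is not tiny, the usual culprits are a missing transpose, a forgotten 1/n from the batch mean, or a bias gradient that was not summed over the batch.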