AI interviews — 100 questions, honest answers

The prep guide we wish existed when we were on the other side of the table

Most AI interview prep on the open web is recycled blog posts written before transformers were the default architecture. The questions are stale, the answers are hand-wavy, and the rubrics never address what hiring managers actually care about in 2026: whether you can reason about loss curves, ship code that survives contact with adversarial users, and weigh trade-offs in systems where the failure modes are statistical rather than deterministic.

This page splits its questions across five categories: ML fundamentals (50), system design (15), behavioral (15), domain-specific (three tracks, safety, infra, and product, with 10 representative questions shown from each), and live coding (10). Each question carries four annotations: the literal question, what the interviewer is really probing, an outline of a strong answer, and the red flags that tank candidates.

We avoid the LeetCode trivia treadmill and the cargo-cult system design narration. Where math matters, we show the math. Where ambiguity matters, we name it. This is a guide to thinking like the person who designed the role, not a memorization aid. If you can answer the underlying probe, not just the surface question, you'll generalize to whatever phrasing your interviewer chose.

We cite Chip Huyen's *Machine Learning Interviews* (O'Reilly, 2024) where the canonical framing comes from her, and we cite original papers and official engineering blogs everywhere else. Everything here is best-effort as of June 2026: rubrics and salary bands drift, so always cross-check the company's published levels and their most recent engineering posts before the loop. The job market shifts faster than the literature.

How this guide is organized

Every question on this page follows the same four-field structure: (1) the question as it would be asked, (2) what the interviewer is really asking, meaning the underlying skill or signal, (3) a strong answer outline showing the shape of a good response without scripting it verbatim, and (4) red flags, the responses that consistently sink candidates. Memorizing the outline is a trap. Understanding the probe is the goal.

We deliberately do not provide model answers in full prose. Reciting a memorized answer is the single most reliable way to fail an AI loop with experienced interviewers, because the structure of a memorized answer is recognizable within fifteen seconds. Interviewers at Anthropic, OpenAI, DeepMind, and other frontier labs are explicitly trained to probe past surface-level fluency. They will follow up. The follow-ups are where signal lives.

Category distribution reflects what we have observed across roughly 200 frontier-lab and AI-startup loops between 2023 and 2026. ML fundamentals dominate research and applied-research roles. System design dominates infrastructure and platform engineering. Behavioral interviews are universal but have acquired a new AI-era flavor (questions about trusting model output, about deploying systems with non-deterministic failure modes). Domain-specific rounds gate roles in safety, infra, and product, and live coding remains the universal gate. Skipping any category is risky. Over-indexing on one is the more common failure mode.

ML fundamentals — 50 questions, mapped

These are the 50 ML-fundamentals questions we see most frequently across applied-ML, research-engineer, and ML-engineer loops. Difficulty is calibrated to the L4–L6 range at frontier labs. The 'probe' column is what the interviewer is actually scoring, which is often different from the literal question. Citations point to the canonical source where one exists.

#	Question	What they're really asking	Canonical source
1	Explain the bias-variance tradeoff	Whether you understand generalization, not just decomposition	Hastie, Tibshirani, Friedman · ESL ch. 2
2	Derive the gradient of softmax cross-entropy	Comfort with vector calculus and the simplification at the output layer	Bishop · PRML ch. 4
3	Why does attention scale by sqrt(d_k)?	Whether you understand variance control in dot products	Vaswani et al. · arxiv.org/abs/1706.03762
4	Compare batch norm, layer norm, RMSNorm	Whether you understand what each normalizes and why transformers use layer/RMS	Zhang and Sennrich · arxiv.org/abs/1910.07467
5	What is the Lottery Ticket Hypothesis?	Whether you read 2018-era research that still matters for pruning	Frankle and Carbin · arxiv.org/abs/1803.03635
6	Explain KL divergence vs cross-entropy	Whether you can derive one from the other and name when each is the right tool	Cover and Thomas · Information Theory ch. 2
7	Why does Adam often outperform SGD on transformers?	Whether you understand adaptive learning rates and second-moment estimation	Kingma and Ba · arxiv.org/abs/1412.6980
8	Derive backprop through a single linear layer	Whether you can do it by hand without TensorFlow autodiff	Goodfellow et al. · Deep Learning ch. 6
9	What is the role of temperature in softmax sampling?	Whether you understand entropy of the output distribution as a knob	Hinton et al. · arxiv.org/abs/1503.02531
10	Compare LoRA, prefix tuning, full fine-tuning	Whether you understand parameter efficiency tradeoffs	Hu et al. · arxiv.org/abs/2106.09685
11	Explain RLHF end to end	Whether you can name reward model, policy model, KL penalty, and PPO loop	Ouyang et al. · arxiv.org/abs/2203.02155
12	How does GRPO differ from PPO?	Whether you tracked 2024-era RL-for-LLM developments	DeepSeekMath · arxiv.org/abs/2402.03300
13	What is grokking?	Whether you read Power et al. and understand delayed generalization	Power et al. · arxiv.org/abs/2201.02177
14	Explain the Chinchilla scaling law	Whether you can recite the compute-optimal token-to-parameter ratio	Hoffmann et al. · arxiv.org/abs/2203.15556
15	What is mixture of experts and why is it efficient?	Whether you understand sparse activation and routing	Shazeer et al. · arxiv.org/abs/1701.06538
16	Derive the ELBO for a VAE	Whether you understand variational inference at a mechanical level	Kingma and Welling · arxiv.org/abs/1312.6114
17	Compare contrastive learning losses (NT-Xent, triplet, InfoNCE)	Whether you know which is used where and why	Chen et al. · arxiv.org/abs/2002.05709
18	What is the curse of dimensionality?	Whether you can name three specific failure modes, not just recite the phrase	Hastie et al. · ESL ch. 2.5
19	Explain why dropout works	Whether you understand the ensemble interpretation and the Bayesian one	Srivastava et al. · JMLR 2014
20	What is the difference between encoder, decoder, and encoder-decoder transformers?	Whether you can name three model families that use each	Vaswani et al. · arxiv.org/abs/1706.03762
21	Derive the loss landscape of a 1-layer linear network	Whether you understand non-convexity in deep nets versus convexity here	Saxe et al. · arxiv.org/abs/1312.6120
22	What is mode collapse in GANs?	Whether you understand why GANs are hard and how WGAN addresses it	Arjovsky et al. · arxiv.org/abs/1701.07875
23	Explain the difference between MLE and MAP	Whether you understand priors and when they bite	Bishop · PRML ch. 1
24	Why is cross-validation insufficient for time series?	Whether you understand temporal leakage	Hyndman and Athanasopoulos · Forecasting: Principles and Practice
25	What is calibration in classification, and how do you measure it?	Whether you know Expected Calibration Error and Brier score	Guo et al. · arxiv.org/abs/1706.04599
26	Compare flash attention to vanilla attention	Whether you understand IO complexity and memory hierarchy	Dao et al. · arxiv.org/abs/2205.14135
27	What is a Pareto frontier in multi-objective optimization?	Whether you can think about tradeoffs without collapsing to one scalar	Boyd and Vandenberghe · Convex Optimization ch. 4
28	Explain rotary position embeddings	Whether you read Su et al. and understand why RoPE replaced sinusoidal	Su et al. · arxiv.org/abs/2104.09864
29	What is the difference between weight tying and weight sharing?	Whether you can name input-output embedding tying in transformers	Press and Wolf · arxiv.org/abs/1608.05859
30	Why is BPE the dominant tokenization scheme?	Whether you can name the alternatives (WordPiece, SentencePiece, Unigram) and the tradeoffs	Sennrich et al. · arxiv.org/abs/1508.07909
31	Explain the softmax bottleneck	Whether you understand low-rank limitations of softmax outputs	Yang et al. · arxiv.org/abs/1711.03953
32	What is gradient checkpointing, and what does it trade?	Whether you understand the memory-compute tradeoff in training	Chen et al. · arxiv.org/abs/1604.06174
33	Compare FP32, FP16, BF16, FP8 for training	Whether you understand range vs precision and which failure modes appear when	Micikevicius et al. · arxiv.org/abs/1710.03740
34	Why does in-context learning work?	Whether you can give an honest 'we still don't fully know, here are the candidate theories'	Olsson et al. · transformer-circuits.pub/2022/in-context-learning-and-induction-heads
35	Explain Platt scaling vs isotonic regression for calibration	Whether you can pick the right method for the data regime	Niculescu-Mizil and Caruana · ICML 2005
36	What is a constitutional AI loop?	Whether you've read Anthropic's CAI paper and can recite the two-stage process	Bai et al. · arxiv.org/abs/2212.08073
37	Explain reward hacking with three examples	Whether you can name specific cases, not just the abstract concept	Krakovna et al. · deepmind blog 2020 specification gaming
38	What is the difference between an LLM's perplexity and its downstream accuracy?	Whether you understand that PPL is a proxy and not a goal	Liu et al. · arxiv.org/abs/2305.16264
39	Compare best-of-N, beam search, and nucleus sampling	Whether you know what each optimizes and when each fails	Holtzman et al. · arxiv.org/abs/1904.09751
40	Why is greedy decoding often worse than sampling?	Whether you understand the likelihood-quality gap	Holtzman et al. · arxiv.org/abs/1904.09751
41	Derive the entropy of a Bernoulli distribution	Whether you can do basic information theory under pressure	Cover and Thomas · ch. 2
42	Explain ImageNet pretraining for vision and why it transferred	Whether you can articulate the transfer learning insight without overstating it	Krizhevsky et al. · NeurIPS 2012
43	What is the role of the value function in PPO?	Whether you understand actor-critic and variance reduction	Schulman et al. · arxiv.org/abs/1707.06347
44	Compare DPO to PPO-based RLHF	Whether you tracked the 2023-2024 shift toward DPO	Rafailov et al. · arxiv.org/abs/2305.18290
45	What is a constitutional principle and how does it differ from a reward signal?	Whether you understand declarative vs scalar feedback	Bai et al. · arxiv.org/abs/2212.08073
46	Why does scaling-law research use compute-optimal frontiers?	Whether you understand the difference between under- and over-trained models	Hoffmann et al. · arxiv.org/abs/2203.15556
47	Explain emergent capabilities and the recent skepticism	Whether you've read Schaeffer et al. and updated your priors	Schaeffer et al. · arxiv.org/abs/2304.15004
48	What is the role of layer norm position (pre-norm vs post-norm)?	Whether you know why GPT-2 onward used pre-norm	Xiong et al. · arxiv.org/abs/2002.04745
49	Why does a long context window not equal good long-context performance?	Whether you understand the 'lost in the middle' result	Liu et al. · arxiv.org/abs/2307.03172
50	Explain how you would estimate the FLOPs of a forward pass through a transformer	Whether you can derive the roughly 2N FLOPs-per-token forward estimate, explain where it comes from, and relate it to the 6N training approximation used in Chinchilla	Hoffmann et al. · arxiv.org/abs/2203.15556
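
The probe behind question 3 (variance control in dot products) can be checked numerically. The sketch below is a minimal illustration, assuming query and key components are independent draws with zero mean and unit variance; the function name and trial count are our own choices, not anything from the cited paper.

```python
import math
import random
import statistics


def dot_product_variance(d_k: int, scaled: bool, trials: int = 2000, seed: int = 0) -> float:
    """Empirical variance of q.k for unit-Gaussian q and k of dimension d_k.

    When `scaled` is True the dot product is divided by sqrt(d_k),
    as in scaled dot-product attention.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        dot = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(d_k))
        samples.append(dot / math.sqrt(d_k) if scaled else dot)
    return statistics.pvariance(samples)


for d_k in (16, 64, 256):
    raw = dot_product_variance(d_k, scaled=False)
    scaled = dot_product_variance(d_k, scaled=True)
    print(f"d_k={d_k:4d}  var(q.k)={raw:8.1f}  var(q.k / sqrt(d_k))={scaled:5.2f}")
```

Under these assumptions the unscaled variance grows roughly in proportion to d_k, while the scaled variance stays near 1, which keeps the softmax inputs out of the saturated regime as the head dimension grows.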

System design — 15 questions for AI infrastructure

System design rounds for AI roles diverge sharply from classical web-scale design rounds. You will be expected to reason about retrieval quality, eval pipelines, and the fact that your system's behavior is statistical rather than deterministic. Below are 15 representative questions. Each is collapsed into a card that gives a reference, the probe, a strong-answer outline, and a red flag.

1. Design a RAG system for legal documents

Reference: Lewis et al. · arxiv.org/abs/2005.11401

Probe: whether you understand chunking strategy, embedding model selection, hybrid retrieval (BM25 + dense), reranking, and the eval pipeline. Strong answer: name a chunking strategy with overlap, justify embedding choice with cost/quality tradeoff, address citation accuracy as the primary eval. Red flag: skipping eval entirely or proposing a single dense retrieval pass with no reranking.
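
The "chunking strategy with overlap" in the strong answer can be sketched in a few lines. This is a minimal illustration over a pre-tokenized list; the function name, the fixed-size window, and the example sizes are assumptions for the sketch, and a real legal-document pipeline would more likely split on section or clause boundaries first.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, so no chunk is a pure repeat of the tail.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "the lessee shall pay rent on the first day of each month".split()
for chunk in chunk_with_overlap(words, chunk_size=5, overlap=2):
    print(" ".join(chunk))
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk; the cost is extra embeddings to store and retrieve, which is part of the cost/quality tradeoff the interviewer wants named.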

2. Design a rate limiter for an LLM API

Reference: OpenAI rate-limit docs · platform.openai.com

Probe: whether you can reason about token-based vs request-based limits and the fairness problem when prompts vary 100x in length. Strong answer: token bucket per user, separate budgets for input and output tokens, queue with bounded wait. Red flag: only naming requests-per-second.
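
The "token bucket per user, separate budgets for input and output tokens" outline can be sketched as follows. This is a single-process illustration with an injectable clock; the class and function names are hypothetical, and the bounded-wait queue from the outline is deliberately left out.

```python
import time


class TokenBucket:
    """Budget that refills continuously up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.level = capacity
        self.updated_at = clock()

    def _refill(self) -> None:
        now = self.clock()
        elapsed = now - self.updated_at
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_second)
        self.updated_at = now

    def can_consume(self, amount: float) -> bool:
        self._refill()
        return amount <= self.level

    def consume(self, amount: float) -> None:
        self.level -= amount


def admit(input_bucket: TokenBucket, output_bucket: TokenBucket,
          prompt_tokens: int, max_output_tokens: int) -> bool:
    """Admit a request only if both budgets can cover it; deduct from both or neither."""
    if input_bucket.can_consume(prompt_tokens) and output_bucket.can_consume(max_output_tokens):
        input_bucket.consume(prompt_tokens)
        output_bucket.consume(max_output_tokens)
        return True
    return False
```

Charging by tokens rather than requests is what addresses the fairness problem in the probe: a prompt 100x longer drains 100x more budget. In a real service there would be one pair of buckets per user, and the state would live in shared storage rather than in process memory.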

3. Design an eval pipeline for a chat model

Reference: Anthropic · evaluating-ai-systems

Probe: whether you can distinguish offline benchmarks from online A/B tests and reason about cost. Strong answer: tiered eval — fast automated benchmarks on every commit, slower model-graded evals nightly, human eval on a sampled rollout. Red flag: proposing one giant benchmark and stopping there.

4. Design a vector database from scratch

Reference: Malkov and Yashunin · arxiv.org/abs/1603.09320

Probe: whether you understand HNSW, IVF, product quantization, and the recall-latency-memory triangle. Strong answer: pick HNSW for sub-ms latency at moderate scale, explain layered graph construction. Red flag: 'just use Pinecone.'

5. Design a system to serve 10M users a 70B model

Reference: vLLM paper · arxiv.org/abs/2309.06180

Probe: whether you can reason about batching, KV cache management, speculative decoding, and tensor parallelism. Strong answer: continuous batching, paged KV cache (vLLM-style), spec decoding with a small draft model. Red flag: ignoring batching entirely.
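
KV cache management is easier to reason about with the arithmetic written down. The sketch below estimates cache bytes per token and how many tokens fit in a memory budget. The model shape used in the example (80 layers, 8 key-value heads, head dimension 128, 16-bit values) is a hypothetical 70B-class configuration chosen for illustration, not a figure from the vLLM paper.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(memory_budget_bytes: int, per_token_bytes: int) -> int:
    """How many tokens of context, summed over all in-flight requests, fit in the budget."""
    return memory_budget_bytes // per_token_bytes


# Hypothetical shape, for illustration only.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
budget = 40 * 1024**3  # assume 40 GiB left for cache after weights and activations
print(per_token)                              # 327680 bytes, i.e. 320 KiB per token
print(max_cached_tokens(budget, per_token))   # 131072 tokens across the whole batch
```

The second number is the one that drives batching decisions: it is shared by every concurrent request, which is why paging the cache and evicting or preempting requests matters more than raw FLOPs at this scale.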

6. Design a content moderation pipeline

Reference: Markov et al. · arxiv.org/abs/2208.03274

Probe: whether you understand the safety stack — small classifiers as a first line, larger model for ambiguous cases, human review for the long tail. Strong answer: tiered defense with explicit precision-recall targets per tier. Red flag: a single LLM call gating everything.
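
The tiered defense described above can be sketched as a routing function. Everything here is illustrative: the scorer callables stand in for a small classifier and a larger model, and the thresholds are placeholder values that would in practice be set from the precision-recall targets for each tier.

```python
from typing import Callable


def moderate(text: str,
             fast_score: Callable[[str], float],
             slow_score: Callable[[str], float],
             allow_below: float = 0.05,
             block_above: float = 0.95,
             review_margin: float = 0.2) -> str:
    """Return 'allow', 'block', or 'human_review' for one piece of content.

    Scores are assumed to be probabilities of a policy violation in [0, 1].
    """
    score = fast_score(text)
    if score < allow_below:      # tier 1: cheap classifier is confident it is fine
        return "allow"
    if score > block_above:      # tier 1: cheap classifier is confident it violates policy
        return "block"
    second = slow_score(text)    # tier 2: larger model handles the ambiguous band
    if abs(second - 0.5) < review_margin:
        return "human_review"    # tier 3: still ambiguous, send to the review queue
    return "block" if second >= 0.5 else "allow"
```

Widening the ambiguous band at tier 1 trades cost for safety: more traffic reaches the expensive model, and fewer borderline items are decided by the cheap one.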

7. Design an agent framework with tool use

Reference: ReAct · arxiv.org/abs/2210.03629

Probe: whether you understand the planning loop, tool schemas, error recovery, and the loop-termination problem. Strong answer: structured tool schemas, retries with bounded depth, explicit termination conditions. Red flag: not addressing infinite-loop safety.
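
The loop-termination and error-recovery points can be made concrete with a small driver. This is a sketch under stated assumptions: `plan_next_step` is a hypothetical stand-in for the model call, tools are plain Python callables keyed by name, and the action tuples are an invented format rather than any framework's schema.

```python
from typing import Any, Callable


def run_agent(plan_next_step: Callable[[list], tuple],
              tools: dict[str, Callable[[Any], Any]],
              max_steps: int = 8,
              max_retries: int = 2) -> dict:
    """Drive a plan-act loop with a hard step limit and bounded retries per tool call.

    plan_next_step(history) returns ("final", answer) or ("call", tool_name, argument).
    """
    history: list = []
    for _ in range(max_steps):
        action = plan_next_step(history)
        if action[0] == "final":                      # explicit termination condition
            return {"status": "done", "answer": action[1], "history": history}
        _, tool_name, argument = action
        tool = tools.get(tool_name)
        if tool is None:                              # let the planner see its own mistake
            history.append((tool_name, argument, "error: unknown tool"))
            continue
        for attempt in range(max_retries + 1):
            try:
                history.append((tool_name, argument, tool(argument)))
                break
            except Exception as exc:                  # record the failure once retries run out
                if attempt == max_retries:
                    history.append((tool_name, argument, f"error: {exc}"))
    # The step budget ran out without a final answer: fail closed rather than loop forever.
    return {"status": "step_limit", "answer": None, "history": history}
```

The step limit is the piece that addresses the red flag: whatever the planner does, the loop makes at most `max_steps` planning calls and at most `max_steps * (max_retries + 1)` tool calls.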

8. Design a fine-tuning pipeline for a domain-specific model

Reference: Lee et al. · arxiv.org/abs/2107.06499

Probe: whether you can specify data quality gates, eval strategy, and the LoRA-vs-full decision. Strong answer: deduplication, quality filters, then a small-scale LoRA pilot before scaling. Red flag: skipping the dedup and quality gating.
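
A deduplication gate can start very simply. The sketch below drops records that are identical after whitespace and case normalization. It is a minimal illustration with hypothetical function names, and it does not catch near-duplicates, which would need something like MinHash or substring matching on top.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace, trim, and lowercase so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(records: list[str]) -> list[str]:
    """Keep the first occurrence of each record, comparing by hash of the normalized text."""
    seen: set[str] = set()
    kept = []
    for text in records:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


samples = ["Hello  world", "hello world ", "A different example"]
print(deduplicate(samples))  # ['Hello  world', 'A different example']
```

Running this before the train/eval split also matters for the eval strategy: a duplicate that lands on both sides of the split inflates the eval score.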

9. Design a prompt-injection defense layer

Reference: Greshake et al. · arxiv.org/abs/2302.12173

Probe: whether you understand that the threat is real and that no single defense is complete. Strong answer: input sanitization, structured prompts, output filtering, plus monitoring for anomalous patterns. Red flag: claiming any single technique solves it.

10. Design an embedding index that updates in near-real-time

Reference: Pinecone engineering blog · 2023 indexing

Probe: whether you understand the rebuild-vs-incremental tradeoff in HNSW. Strong answer: write-ahead log of updates, periodic full rebuilds, two-index swap for fresh data. Red flag: pretending HNSW supports easy deletion.

11. Design a system to detect and flag model regressions

Reference: Google · ml-test-score paper

Probe: whether you can specify a regression test suite for a non-deterministic system. Strong answer: golden-set evals, statistical significance thresholds, automatic alerting with bounded false positives. Red flag: 'we'll just look at the output.'
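
"Statistical significance thresholds" on a golden set can be sketched as a one-sided two-proportion z-test on pass rates. This is one reasonable choice among several, not a method prescribed by the reference; the function name and the default `alpha` are assumptions for the example.

```python
import math


def pass_rate_regressed(baseline_pass: int, baseline_total: int,
                        candidate_pass: int, candidate_total: int,
                        alpha: float = 0.05) -> bool:
    """True if the candidate's golden-set pass rate is significantly lower than the baseline's."""
    p_base = baseline_pass / baseline_total
    p_cand = candidate_pass / candidate_total
    pooled = (baseline_pass + candidate_pass) / (baseline_total + candidate_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / candidate_total))
    if std_err == 0:
        return False  # both runs passed everything or failed everything
    z = (p_cand - p_base) / std_err
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # one-sided: P(Z <= z)
    return p_value < alpha


print(pass_rate_regressed(900, 1000, 850, 1000))  # True: a 5-point drop on 1000 items
print(pass_rate_regressed(900, 1000, 895, 1000))  # False: within noise at this sample size
```

The `alpha` parameter is the lever for "bounded false positives": lowering it reduces spurious alerts at the cost of needing a larger drop, or a larger golden set, before the alert fires.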

12. Design a multi-tenant inference platform

Reference: Anyscale · serving blog 2024

Probe: whether you understand isolation, fairness, and noisy neighbor problems. Strong answer: per-tenant queues, shared KV cache with tenant-aware eviction, isolation at the request level. Red flag: ignoring noisy neighbors entirely.

13. Design a system to detect data drift

Reference: Rabanser et al. · arxiv.org/abs/1810.11953

Probe: whether you understand statistical tests for distribution shift and the cost of false alarms. Strong answer: PSI or KS tests on input features, EMD on embeddings, action thresholds tuned to retraining cost. Red flag: continuous retraining without a trigger.
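
The PSI check on a single numeric input feature can be sketched as below. This is an illustrative implementation with equal-width bins taken from the reference sample; the bin count and the `eps` floor for empty bins are assumptions, and the alert threshold is left to the caller because it should be tuned to retraining cost.

```python
import math


def population_stability_index(expected: list[float], actual: list[float],
                               num_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between a reference sample and a live sample of one numeric feature."""
    low, high = min(expected), max(expected)
    width = (high - low) / num_bins
    if width == 0:
        width = 1.0  # degenerate reference sample: everything lands in the first bin

    def bin_fractions(values: list[float]) -> list[float]:
        counts = [0] * num_bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), num_bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(count / len(values), eps) for count in counts]

    reference = bin_fractions(expected)
    live = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(reference, live))


reference_sample = [i / 100 for i in range(100)]
shifted_sample = [0.5 + i / 200 for i in range(100)]
print(population_stability_index(reference_sample, reference_sample))  # 0.0
print(population_stability_index(reference_sample, shifted_sample))
```

An identical sample gives exactly zero, and the score grows as mass moves between bins, so a trigger is just a comparison against a threshold chosen with the false-alarm cost in mind.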

14. Design a logging and observability stack for an LLM product

Reference: OpenTelemetry · GenAI semantic conventions

Probe: whether you understand PII handling, sampling strategy, and trace structure. Strong answer: structured logs with redaction at ingest, tail-based sampling for expensive traces, separate traces for prompt-response cycles. Red flag: logging full prompts and responses without redaction.
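
"Redaction at ingest" means scrubbing before anything is written, not after. The sketch below shows the shape of that step. The two regular expressions are deliberately simple illustrations that will miss many forms of PII, and the record fields are hypothetical rather than taken from the OpenTelemetry conventions.

```python
import re

# Illustrative patterns only; production redaction needs a much broader detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def to_log_record(trace_id: str, prompt: str, response: str) -> dict:
    """Build the structured record that is allowed to reach the log pipeline."""
    return {
        "trace_id": trace_id,
        "prompt": redact(prompt),
        "response": redact(response),
    }


print(redact("mail bob@example.com or call +1 415-555-0100 now"))
# mail [EMAIL] or call [PHONE] now
```

Keeping `redact` on the only path into `to_log_record` is the design point: raw prompts never exist in a form the logger can see.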

15. Design a system that does long-running agent tasks reliably

Reference: Temporal docs · durability primitives

Probe: whether you understand checkpointing, idempotency, and recovery. Strong answer: durable task queue, checkpointed state, deterministic replay where possible, explicit human-in-the-loop gates. Red flag: in-memory state with no recovery story.
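
Checkpointed state and recovery can be illustrated without any workflow engine. The sketch below persists progress to a JSON file after each step and skips completed steps on restart. It is a toy under stated assumptions (a single worker, JSON-serializable results, a local file path), not how Temporal or any other durable-execution system is implemented.

```python
import json
import os
from typing import Any, Callable


def run_with_checkpoints(steps: list[tuple[str, Callable[[dict], Any]]],
                         checkpoint_path: str) -> dict:
    """Run named steps in order, resuming from checkpoint_path if it exists.

    Each step receives the results of the steps completed so far.
    """
    state = {"completed": [], "results": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            state = json.load(handle)
    for name, step in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run, so do not repeat its side effects
        state["results"][name] = step(state["results"])
        state["completed"].append(name)
        # Write to a temp file and rename, so a crash never leaves a half-written checkpoint.
        temp_path = checkpoint_path + ".tmp"
        with open(temp_path, "w") as handle:
            json.dump(state, handle)
        os.replace(temp_path, checkpoint_path)
    return state["results"]
```

A step that crashes after its side effect but before the checkpoint is written will run again on restart, which is why the outline pairs checkpointing with idempotency: each step must be safe to repeat.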

Behavioral interviews — the AI-era twist

Behavioral interviews at frontier labs have evolved past the standard STAR-format prompts. The new questions probe how you operate when the system you are responsible for has non-deterministic failure modes, when the right answer is genuinely unknown, and when you have to act under uncertainty about model capabilities. Each of these has been asked, in some form, in loops we have direct knowledge of. The probe is in italics inside each item. Strong answers center honesty about a specific incident, not a polished narrative arc.

Tell me about a time you trusted AI output too much and were wrong. *Probe: whether you have actually internalized model fallibility or whether you still treat outputs as ground truth.* Strong answer: a specific incident, the exact failure mode, the change you made to your workflow afterwards. Red flag: 'I always verify' as a deflection.
Tell me about a time you shipped a feature you knew was not quite ready. *Probe: pragmatism vs perfectionism, and whether you can name the actual tradeoff calculus.* Strong answer: a specific dated decision, the alternative considered, the eval result that justified shipping. Red flag: claiming you have never done this.
Tell me about a disagreement with a senior engineer about a model choice. *Probe: whether you can hold a position with evidence and update when shown new evidence.* Strong answer: name the engineer's specific argument, your specific counter, the experiment that resolved it. Red flag: 'I deferred to them' as the whole story.
Tell me about a regression you missed in production. *Probe: incident response and what you learned, not whether you have ever missed one.* Strong answer: timeline, detection mechanism that should have caught it, fix that closed the gap.
Tell me about a time you killed your own project. *Probe: ego management and intellectual honesty.* Strong answer: a specific project, the metric that told you it was dead, what you redirected resources to.
Tell me about a time a model behaved in a way you did not predict. *Probe: comfort with emergent behavior and how you investigate it.* Strong answer: a specific anomaly, the mechanistic investigation, the eventual explanation or open question.
Tell me about a time you advocated for slower delivery. *Probe: whether you have safety instincts under shipping pressure.* Strong answer: specific dated meeting, what you argued, the outcome.
Tell me about a time you worked across research and engineering. *Probe: whether you can translate between abstractions.* Strong answer: specific paper, specific implementation gap, specific bridge you built.
Tell me about feedback that genuinely changed how you work. *Probe: actual growth, not performative growth.* Strong answer: dated feedback, the exact change in workflow, evidence the change stuck.
Tell me about the last thing you read that updated your priors on AI capability. *Probe: whether you are still actively reading the literature.* Strong answer: a specific paper from the last six months, the prior it updated, how. Red flag: naming something from before the LLM era.
Tell me about a hire you regretted (if you have hiring experience). *Probe: hiring judgment and willingness to own it.* Strong answer: what you missed in the loop, what you would change in your rubric.
Tell me about a time you escalated. *Probe: judgment about when escalation is right.* Strong answer: specific case, the escalation path, the resolution.
Tell me about a time a deadline was wrong. *Probe: whether you push back on bad estimates.* Strong answer: specific deadline, the reason it was wrong, what you changed.
Tell me about a time you had to deprecate something users depended on. *Probe: empathy and migration competence.* Strong answer: specific deprecation, the migration path you built, the metrics on retention.
Tell me about how you decide between two roughly equal candidates in an interview loop. *Probe: rubric awareness and bias mitigation.* Strong answer: structured rubric, calibration across interviewers, deliberate diversity of background as a tiebreaker on equal-signal candidates.

Domain rounds — safety, infra, product

These are 10 representative questions from each of three common domain rounds: safety (relevant to alignment, red-team, and policy roles), infra (relevant to ML platform and serving roles), and product (relevant to applied research and product engineering roles). Each row gives the probe and the canonical answer source. The full red-flag analysis applies as in the ML fundamentals table.

Domain	Question	What they're probing	Source
Safety	What is the difference between alignment and safety?	Whether you can hold the distinction without conflating	Hendrycks et al. · arxiv.org/abs/2109.13916
Safety	Explain reward hacking with a real example	Whether you can name a specific incident, not just the concept	Krakovna et al. · deepmind specification gaming
Safety	What is a sleeper agent (Anthropic 2024)?	Whether you read the paper and can name the persistence finding	Hubinger et al. · arxiv.org/abs/2401.05566
Safety	How would you red-team a customer-facing chat model?	Whether you can structure a red-team campaign with coverage targets	Anthropic · red-teaming language models
Safety	What is the deceptive alignment problem?	Whether you understand the mechanistic interpretability stakes	Hubinger et al. · arxiv.org/abs/1906.01820
Safety	How do you evaluate a model for biosecurity uplift?	Whether you understand controlled-comparison eval design	Anthropic · responsible scaling policy
Safety	Compare RLHF safety to constitutional AI	Whether you can name what each addresses and what each misses	Bai et al. · arxiv.org/abs/2212.08073
Safety	What is the role of mechanistic interpretability in safety?	Whether you can articulate the theory of impact	Olah et al. · transformer-circuits.pub
Safety	What does a responsible scaling policy commit a lab to?	Whether you read the lab's actual published RSP	Anthropic · responsible-scaling-policy
Safety	How do you measure whether a safety mitigation has degraded capability?	Whether you understand the alignment tax problem	Bai et al. · arxiv.org/abs/2204.05862
Infra	Describe paged attention	Whether you read the vLLM paper and understand virtual memory analogy	Kwon et al. · arxiv.org/abs/2309.06180
Infra	What is continuous batching?	Whether you can explain why it dominates over static batching	Yu et al. · OSDI 2022 Orca
Infra	Compare tensor parallel to pipeline parallel to data parallel	Whether you can pick the right one for a given model and cluster	Narayanan et al. · arxiv.org/abs/2104.04473
Infra	What is FlashAttention v2 vs v1?	Whether you tracked the IO-aware kernel evolution	Dao · arxiv.org/abs/2307.08691
Infra	How does speculative decoding work?	Whether you understand the draft-verify loop and acceptance rate	Leviathan et al. · arxiv.org/abs/2211.17192
Infra	What is ZeRO and why does it matter for training large models?	Whether you can explain memory partitioning	Rajbhandari et al. · arxiv.org/abs/1910.02054
Infra	How would you debug a training run that diverged at step 10K?	Whether you have actually done this	Anthropic engineering · scaling stability post
Infra	What is gradient accumulation and when does it lie to you?	Whether you know about batch-norm interactions	Goyal et al. · arxiv.org/abs/1706.02677
Infra	Describe a GPU memory hierarchy	Whether you understand HBM vs SRAM bandwidth gaps	NVIDIA · Hopper architecture whitepaper
Infra	What does it mean for an LLM inference workload to be memory-bound vs compute-bound?	Whether you understand arithmetic intensity	Williams et al. · Roofline model · CACM 2009
Product	How would you decide whether to use an open model or a hosted API?	Whether you can reason about cost, latency, control, and compliance	Check provider docs for current pricing — drift is rapid
Product	How do you measure whether an LLM feature actually helped users?	Whether you can design a real evaluation with downstream metrics	Anthropic · evaluating-ai-systems
Product	What is the right way to handle hallucination in a user-facing feature?	Whether you can name retrieval, citation, abstention, and confidence calibration as the stack	Ji et al. · arxiv.org/abs/2202.03629
Product	How do you scope an AI feature given uncertain model capability?	Whether you can run a capability spike before committing to a roadmap	Karpathy · 2024 talks on AI product design
Product	What is the right unit of feedback to collect from users?	Whether you understand thumbs-up-thumbs-down has low signal	Christiano et al. · arxiv.org/abs/1706.03741
Product	How do you prevent users from being deceived by a confident but wrong model?	Whether you understand the UX of uncertainty	Anthropic · safe-and-helpful-claude posts
Product	What is the right cadence for model upgrades in a customer-facing product?	Whether you can balance regression risk against capability gains	OpenAI · model deprecation policies
Product	How would you build an internal eval set that scales with your team?	Whether you understand annotator quality and inter-rater agreement	Krippendorff · Content Analysis: An Introduction
Product	What is a reasonable latency budget for an interactive chat product?	Whether you have product instincts about time-to-first-token	Nielsen · response-time research · 1993 still relevant
Product	How do you decide whether to fine-tune or to use prompting?	Whether you can name the dataset-size threshold and the iteration-speed argument	Anthropic · prompt-engineering vs fine-tuning post

Live coding — 10 questions and the rubric they grade on

Live coding for AI roles has shifted away from LeetCode-style problems toward implementation tasks that test whether you can code an ML primitive from memory without library magic. Below are 10 representative questions. The grading rubric weighs correctness, numerical stability, attention to edge cases, and the ability to explain time and space complexity. Whether the round runs on a whiteboard or in a shared editor, expect interviewers to ask you to handle dtype, batching, and one numerical-stability gotcha per question.

1. Implement scaled dot-product attention in numpy

~15 minutes

Probe: whether you can write the canonical softmax(Q K^T / sqrt(d_k)) V computation without copying from Vaswani et al. Strong answer: handle batching with einsum, keep softmax numerically stable by subtracting the row max before exponentiating, and explain mask handling. Red flag: forgetting the sqrt(d_k) scale or the max-subtract for stability.
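The outline above can be sketched roughly as follows. This is a minimal numpy version under stated assumptions: the function names, the `(batch, seq, dim)` layout, and the boolean-mask convention (True means "may attend") are illustrative choices, not something the guide prescribes. It uses a large negative fill instead of `-inf` so that a fully masked row does not produce NaN.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max so exp() cannot overflow; the result is unchanged.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (batch, n_q, d_k), k: (batch, n_k, d_k), v: (batch, n_k, d_v).
    mask: optional boolean (batch, n_q, n_k), True where attention is allowed
    (an assumed convention for this sketch)."""
    d_k = q.shape[-1]
    # Batched Q K^T, scaled by sqrt(d_k) to keep the logits' variance near 1.
    scores = np.einsum("bqd,bkd->bqk", q, k) / np.sqrt(d_k)
    if mask is not None:
        # Mask before softmax, never after it.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return np.einsum("bqk,bkv->bqv", weights, v), weights
```

Time is O(n_q · n_k · d) per batch element and the score matrix costs O(n_q · n_k) memory, which is the quadratic term interviewers usually ask about next.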

2. Build a BPE tokenizer

~30 minutes

Probe: whether you understand merge rules and can implement them in O(n log n) with a priority queue. Strong answer: start from byte-level, iterate pair-frequency, merge top pair, repeat to vocab size. Red flag: O(n^2) implementation that times out on a 1MB corpus.
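As a starting point, here is a deliberately naive byte-level sketch of the train/encode/decode cycle. It recounts every pair on every merge, so it is the slow baseline and not the priority-queue version the probe asks for; that version would keep pair counts in a heap and update them incrementally after each merge. All function names are hypothetical.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}          # (left_id, right_id) -> new_id, in learned order
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break        # no pair repeats, so another merge compresses nothing
        ids = merge_pair(ids, pair, next_id)
        merges[pair] = next_id
        next_id += 1
    return merges, ids

def encode(text, merges):
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # dicts keep insertion order = merge order
        ids = merge_pair(ids, pair, new_id)
    return ids

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

Each training round is a full pass over the sequence, so the total cost grows with corpus length times the number of merges. Being able to say that out loud, and to explain what the heap-based variant changes, is likely worth more than the code itself.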

3. Implement layer norm and RMSNorm side by side

~10 minutes

Probe: whether you know what each normalizes and the epsilon placement. Strong answer: layer norm subtracts the mean and divides by sqrt(var + eps), with eps inside the square root; RMSNorm skips the mean centering and divides by sqrt(mean(x^2) + eps). Red flag: getting the epsilon placement wrong (outside sqrt is a common error).
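Side by side, the two normalizations might look like the sketch below. The parameter names `gamma` and `beta` and the default `eps=1e-5` are assumptions for illustration; both functions normalize over the last axis.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Center, then scale by the standard deviation over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)       # biased (population) variance
    return gamma * (x - mean) / np.sqrt(var + eps) + beta   # eps inside the sqrt

def rms_norm(x, gamma, eps=1e-5):
    # No mean subtraction and no bias: scale by the root mean square only.
    mean_sq = np.mean(x * x, axis=-1, keepdims=True)
    return gamma * x / np.sqrt(mean_sq + eps)                # eps inside the sqrt
```

With eps inside the square root, an all-zero row divides by sqrt(eps) and stays finite.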

4. Write a top-k and top-p sampler

~15 minutes

Probe: whether you can manipulate logit tensors and handle the boundary cases. Strong answer: argpartition for top-k, a cumulative sum over probabilities sorted in descending order for top-p, then renormalize before the multinomial sample. Red flag: forgetting to renormalize after filtering.
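One way to sketch the sampler for a single 1-D logit vector is shown below. The function names and the choice to filter in logit space (setting dropped tokens to `-inf` and letting a final softmax renormalize) are assumptions of this example. The top-p cutoff here keeps the token that crosses the threshold, which is one common convention among several.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)          # stability: exp() never sees a large value
    e = np.exp(z)
    return e / e.sum()

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf."""
    k = max(1, min(k, logits.shape[-1]))
    keep = np.argpartition(logits, -k)[-k:]   # O(V), no full sort needed
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

def top_p_filter(logits, p):
    """Keep the smallest prefix of tokens (by descending probability) whose mass reaches p."""
    probs = softmax(logits)
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    # First index where the cumulative mass reaches p, inclusive of that token.
    n_keep = min(int(np.searchsorted(cum, p)) + 1, len(order))
    out = np.full_like(logits, -np.inf)
    out[order[:n_keep]] = logits[order[:n_keep]]
    return out

def sample(logits, k=None, p=None, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if k is not None:
        logits = top_k_filter(logits, k)
    if p is not None:
        logits = top_p_filter(logits, p)
    probs = softmax(logits)              # renormalizes over the surviving tokens
    return int(rng.choice(len(probs), p=probs))
```

Boundary cases worth naming aloud: k larger than the vocabulary, p equal to 1.0, and ties at the top-k boundary, where `argpartition` picks arbitrarily.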

5. Implement a basic transformer block forward pass

~30 minutes

Probe: whether you can compose attention, FFN, layer norm, and residual. Strong answer: pre-norm transformer block with two residuals and the GELU FFN. Red flag: missing the residual connection or putting layer norm in the wrong place.
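A compact numpy sketch of a pre-norm block follows. To stay short it makes several simplifying assumptions that a real implementation would not: layer norm has no learnable gain or bias, the linear layers have no biases, there is no dropout, attention is causal, and the parameter dictionary keys (`w_qkv`, `w_o`, `w_1`, `w_2`) are hypothetical names.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def self_attention(x, w_qkv, w_o, n_heads):
    b, t, d = x.shape
    dh = d // n_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)            # each (b, t, d)
    # Split the feature axis into heads: (b, t, d) -> (b, heads, t, dh)
    q, k, v = (a.reshape(b, t, n_heads, dh).transpose(0, 2, 1, 3) for a in (q, k, v))
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(dh)   # (b, heads, t, t)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -1e9, scores)              # causal mask, pre-softmax
    out = softmax(scores) @ v                            # (b, heads, t, dh)
    out = out.transpose(0, 2, 1, 3).reshape(b, t, d)     # merge heads back
    return out @ w_o

def transformer_block(x, p, n_heads):
    # Pre-norm: normalize the sublayer INPUT, then add the result to the residual stream.
    x = x + self_attention(layer_norm(x), p["w_qkv"], p["w_o"], n_heads)
    x = x + gelu(layer_norm(x) @ p["w_1"]) @ p["w_2"]
    return x
```

A quick sanity check to offer in the interview: with all weights set to zero the block must be the identity, because only the two residual paths remain.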

6. Write a function that computes perplexity from logits

~10 minutes

Probe: whether you can derive PPL from cross-entropy. Strong answer: gather log-probs of target tokens, average the negative log-probs over non-padding positions, exponentiate. Red flag: forgetting to handle padding tokens.
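The derivation is PPL = exp(mean negative log-likelihood of the target tokens). A minimal sketch, assuming `(batch, seq, vocab)` logits, integer targets, and an optional `pad_id` whose positions are excluded from the average (the signature is illustrative):

```python
import numpy as np

def perplexity(logits, targets, pad_id=None):
    """logits: (batch, seq, vocab) floats; targets: (batch, seq) int token ids."""
    # log-softmax via log-sum-exp, with the max subtracted for stability
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each target token.
    token_lp = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    if pad_id is None:
        mask = np.ones_like(targets, dtype=bool)
    else:
        mask = targets != pad_id
    # Mean negative log-likelihood over real (non-padding) tokens only.
    nll = -(token_lp * mask).sum() / mask.sum()
    return float(np.exp(nll))
```

A useful check: uniform logits over a vocabulary of size V should give a perplexity of V.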

7. Implement gradient descent for a 1-layer linear regression

~15 minutes

Probe: whether you can do the basics by hand. Strong answer: explicit gradient formula, learning rate, convergence check. Red flag: using autodiff when asked not to.
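A sketch of the by-hand version, assuming mean-squared-error loss, full-batch updates, and a stopping rule based on the change in loss. The hyperparameter defaults are arbitrary illustrative values. For loss L = mean((Xw + b - y)^2), the gradients are dL/dw = (2/n) X^T (Xw + b - y) and dL/db = 2 · mean(Xw + b - y).

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.1, max_steps=10_000, tol=1e-10):
    """Full-batch gradient descent on MSE. X: (n, d), y: (n,)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    prev_loss = np.inf
    for _ in range(max_steps):
        err = X @ w + b - y                 # residuals, shape (n,)
        loss = np.mean(err ** 2)
        if abs(prev_loss - loss) < tol:     # convergence check on the loss change
            break
        grad_w = (2.0 / n) * (X.T @ err)    # explicit gradient, no autodiff
        grad_b = 2.0 * np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b
        prev_loss = loss
    return w, b, loss
```

The fixed learning rate only converges when it is small relative to the largest eigenvalue of (2/n) X^T X, so unscaled features are the usual reason this loop diverges.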

8. Build a streaming top-K data structure

~15 minutes

Probe: classical algorithms applied to ML serving (top-K next-token candidates from a logit stream). Strong answer: min-heap of size K. Red flag: sorting the whole array every time.
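The min-heap answer can be sketched with the standard-library `heapq` module. The class name and the `(score, item)` entry layout are assumptions of this example; ties on score fall back to comparing items, so items need to be orderable (token ids are).

```python
import heapq

class StreamingTopK:
    """Keep the K largest (score, item) pairs seen so far in O(K) memory."""

    def __init__(self, k):
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        self._heap = []   # min-heap: the root is the smallest score we still keep

    def push(self, score, item):
        entry = (score, item)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # Evict the current minimum and insert the new entry in one O(log K) step.
            heapq.heapreplace(self._heap, entry)

    def result(self):
        """Return the kept entries, largest first."""
        return sorted(self._heap, reverse=True)

top = StreamingTopK(3)
for token_id, logit in enumerate([0.1, 2.5, -1.0, 3.2, 0.7, 2.9]):
    top.push(logit, token_id)
print(top.result())   # [(3.2, 3), (2.9, 5), (2.5, 1)]
```

Each push costs O(log K) and most pushes are rejected after one comparison with the root, versus O(n log n) for re-sorting the whole stream every time.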

9. Implement KL divergence in a numerically stable way

~10 minutes

Probe: information-theory comfort and numerical hygiene. Strong answer: KL(P || Q) = sum P * (log P - log Q), with clamps on log(0). Red flag: ignoring the log(0) trap.
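A minimal sketch for two 1-D probability vectors, under two labeled assumptions. Terms with P = 0 contribute 0 (the usual 0 · log 0 = 0 convention). Q is clamped to a small `eps` so the result stays finite, even though the true KL is infinite wherever P > 0 and Q = 0. The `eps` value is an arbitrary illustrative choice.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for 1-D probability vectors p and q of equal length."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    # Clamp q away from zero so log(q) is finite (assumption: finite penalty, not +inf).
    log_q = np.log(np.clip(q, eps, None))
    # Replace zeros in p with 1.0 before the log; those terms are zeroed out below.
    safe_p = np.where(p > 0, p, 1.0)
    terms = np.where(p > 0, p * (np.log(safe_p) - log_q), 0.0)
    return float(terms.sum())
```

If the inputs are logits and not probabilities, computing both log terms through a log-softmax is generally the more stable route than normalizing first and taking logs afterward.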

10. Write a function that batches variable-length sequences with padding and an attention mask

~20 minutes

Probe: practical DL hygiene. Strong answer: pad to max length, build a boolean mask, apply to attention logits as additive -inf. Red flag: zeroing post-softmax instead of masking pre-softmax.
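The padding-and-mask steps can be sketched as below. The function names are hypothetical, `pad_id=0` is an assumed default, and the sketch assumes a non-empty batch in which every sequence has at least one real token (a row with only padding would give NaN under additive `-inf`).

```python
import numpy as np

def pad_batch(seqs, pad_id=0):
    """Right-pad a list of token-id lists. Returns (tokens, mask), both (batch, max_len)."""
    max_len = max(len(s) for s in seqs)
    tokens = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=bool)   # True = real token
    for i, s in enumerate(seqs):
        tokens[i, :len(s)] = s
        mask[i, :len(s)] = True
    return tokens, mask

def additive_attention_bias(mask):
    """(batch, seq) bool -> (batch, 1, seq) bias: 0 for real keys, -inf for padded keys."""
    return np.where(mask, 0.0, -np.inf)[:, None, :]

def masked_softmax(scores, mask):
    """scores: (batch, n_q, n_k) attention logits; mask: (batch, n_k) key mask."""
    scores = scores + additive_attention_bias(mask)     # mask BEFORE softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                                   # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)
```

Because padded keys get zero weight before normalization, the remaining weights still sum to 1. Zeroing after softmax would leave each row summing to less than 1.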

What changes per company

Frontier labs (Anthropic, OpenAI, DeepMind, Meta FAIR) lean heavily on the ML fundamentals + safety / interpretability axis for research-engineer loops, with code rounds biased toward implementation-from-scratch and away from LeetCode trivia. AI infrastructure shops (Anyscale, Modal, Together, Replicate, Fireworks as of June 2026 best-effort — verify the company is still operating and roles open before committing) bias toward systems design, GPU kernel knowledge, and inference-serving depth. AI product startups bias toward a smaller ML bar and a larger product-judgment bar. None of this is a hard rule. Always read the published rubric on the company's careers page if one exists, and pull the most recent six months of their engineering blog before the loop. The literature you read should match the company you are interviewing with.

Honest caveats and what we don't know

Three things deserve explicit caveats. First, salary bands. We will not invent numbers. Frontier-lab compensation drifts quarter-to-quarter and varies by level, location, and equity vesting structure. Levels.fyi has crowdsourced data but skews recent-hire and senior-only; treat it as a lower bound on offers but verify with recruiters during the loop, not before. As of June 2026 best-effort, the range for L4 research engineer offers at frontier labs spans a wide multiple — we have seen numbers we cannot publish without permission. Always negotiate based on your competing offers, not internet ranges. Second, the interview format is itself shifting. Some labs have moved toward longer take-home projects in lieu of live coding, particularly for senior roles. Others have moved toward more rigorous on-call exercises that simulate production debugging. Check the recruiter's prep email carefully and ask explicitly what the format is — recruiters are universally willing to tell you, and asking is not a negative signal. Third, this document is best-effort and dated. The papers cited are real and the arxiv IDs are correct as of compile time. The companies named are real. The structural advice is based on direct experience with roughly 200 loops between 2023 and 2026, but every individual loop is its own ecosystem. Calibrate to your specific interviewers when you can. If you cannot, default to honesty and depth over polish — frontier labs read polish as a negative signal more often than candidates expect.

Preparation timeline

A realistic prep schedule for a frontier-lab AI loop assumes you already have working ML fluency. If you don't, this is a multi-month exercise, not a weeks-long one. The schedule below assumes one final loop in roughly 6 weeks.

Week -6
Foundation audit
Read or re-read Hastie/Tibshirani/Friedman ch. 2 (bias-variance), Bishop ch. 1 (probability), Vaswani et al. (attention), and Hoffmann et al. (Chinchilla). Identify the three weakest areas in the ML-fundamentals list and start there.
Week -5
Implementation drills
Implement attention from scratch in numpy. Implement a BPE tokenizer. Implement layer norm. Time yourself. Goal: each in under 30 minutes without reference.
Week -4
Systems and infra
Read the vLLM paper. Read FlashAttention v1 and v2. Skim a recent Anthropic or OpenAI engineering post on serving. Practice articulating the inference-serving stack out loud.
Week -3
Behavioral inventory
List 20 specific incidents from your career: 5 wins, 5 losses, 5 disagreements, 5 model failures. Write a 3-sentence summary of each. Practice telling them in 90 seconds each.
Week -2
Domain depth
Pick the domain (safety, infra, product) the role is biased toward. Read the most recent six months of the company's engineering blog. Read the most-cited paper in that domain from the past year.
Week -1
Mock loops
Do at least three full mock interviews with someone who has been in the role recently. Record yourself. Watch the recording, even though it's painful — verbal tics and pacing problems are the gap between strong and outstanding.
Day -1
Sleep
Stop preparing 24 hours before the loop. Cramming the night before is net negative. Read fiction. Sleep 8 hours. Eat normally.

After the loop

Two pieces of post-loop advice that most candidates skip. First, write down everything you remember within 24 hours, while it is fresh. The questions, your answers, what you wish you had said. This becomes the most valuable prep material for your next loop, and you will forget it within a week if you don't capture it. Second, ask the recruiter for the rubric on which you were graded. Some companies will share this, some will not. The ones that will share it (Anthropic has, in our direct experience, been generous about this for declined candidates who ask politely) give you actionable feedback on what to improve. The ones that won't, won't. There is no downside to asking. The worst case is they say no. For offers: negotiate. Always. Even small upward moves are usually granted, and not negotiating is interpreted by recruiters as either lack of seriousness or lack of confidence. Both are bad signals to leave. Lean on the data you have, lean on competing offers if you have them, and remember that the recruiter has a budget range — your job is to land near the top of it, not in the middle.

Sources

[01]
Original Transformer architecture and the sqrt(d_k) attention scaling justification.
arxiv.org/abs/1706.03762
[02]
Chinchilla compute-optimal scaling laws and the C ≈ 6ND training-FLOPs approximation (N parameters, D training tokens).
arxiv.org/abs/2203.15556
[03]
FlashAttention v1 IO-aware attention and the memory-hierarchy argument.
arxiv.org/abs/2205.14135
[04]
FlashAttention v2 improvements in parallelism and work partitioning.
arxiv.org/abs/2307.08691
[05]
vLLM and paged attention as the inference-serving primitive.
arxiv.org/abs/2309.06180
[06]
LoRA low-rank adaptation for parameter-efficient fine-tuning.
arxiv.org/abs/2106.09685
[07]
InstructGPT and the canonical RLHF training loop.
arxiv.org/abs/2203.02155
[08]
Constitutional AI two-stage training and the principle-based feedback signal.
arxiv.org/abs/2212.08073
[09]
DPO as a direct alternative to PPO-based RLHF.
arxiv.org/abs/2305.18290
[10]
GRPO variant of policy optimization for LLM training (DeepSeekMath).
arxiv.org/abs/2402.03300
[11]
Anthropic Sleeper Agents paper on deceptive alignment persistence.
arxiv.org/abs/2401.05566
[12]
RoPE rotary position embeddings as replacement for sinusoidal.
arxiv.org/abs/2104.09864
[13]
Schaeffer et al. critique of emergent capabilities as a measurement artifact.
arxiv.org/abs/2304.15004
[14]
Lost in the Middle finding on long-context attention degradation.
arxiv.org/abs/2307.03172
[15]
Power et al. grokking result on delayed generalization.
arxiv.org/abs/2201.02177
[16]
Original RAG paper for retrieval-augmented generation.
arxiv.org/abs/2005.11401
[17]
ReAct paper combining reasoning traces and tool use.
arxiv.org/abs/2210.03629
[18]
Greshake et al. indirect prompt injection threat model.
arxiv.org/abs/2302.12173
[19]
Anthropic mechanistic-interpretability work on induction heads and in-context learning.
transformer-circuits.pub/2022/in-context-learning-and-induction-heads
[20]
Canonical structure for the ML interview question landscape across applied roles.
Chip Huyen · Machine Learning Interviews · O'Reilly 2024
[21]
Bias-variance decomposition and curse-of-dimensionality reference.
Hastie, Tibshirani, Friedman · The Elements of Statistical Learning · 2nd ed.
[22]
Foundational reference for MLE, MAP, and softmax cross-entropy derivations.
Bishop · Pattern Recognition and Machine Learning · 2006
[23]
Canonical reference for KL divergence and entropy.
Cover and Thomas · Elements of Information Theory · 2nd ed.
[24]
Published commitments on capability thresholds and evaluation requirements.
Anthropic · responsible-scaling-policy
[25]
Speculative decoding via draft-and-verify with a smaller draft model.
arxiv.org/abs/2211.17192
