
AI interviews — 100 questions, honest answers
The prep guide we wish existed when we were on the other side of the table
How this guide is organized
ML fundamentals — 50 questions, mapped
These are the 50 ML-fundamentals questions we see most frequently across applied-ML, research-engineer, and ML-engineer loops. Difficulty is calibrated to the L4–L6 range at frontier labs. The 'probe' column is what the interviewer is actually scoring, which is often different from the literal question. Citations point to the canonical source where one exists.
System design — 15 questions for AI infrastructure
System design rounds for AI roles diverge sharply from classical web-scale design rounds. You will be expected to reason about retrieval quality, eval pipelines, and the fact that your system's behavior is statistical rather than deterministic. Below are 15 representative questions with the same four-field annotation pattern. We have collapsed each into a card; the full probe-and-red-flag breakdown is the same structure as the ML fundamentals table.
1. Design a RAG system for legal documents
Reference: Lewis et al. · arxiv.org/abs/2005.11401
Probe: whether you understand chunking strategy, embedding model selection, hybrid retrieval (BM25 + dense), reranking, and the eval pipeline. Strong answer: name a chunking strategy with overlap, justify embedding choice with cost/quality tradeoff, address citation accuracy as the primary eval. Red flag: skipping eval entirely or proposing a single dense retrieval pass with no reranking.
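To make "a chunking strategy with overlap" concrete, here is a minimal sketch of fixed-size windows that share a configurable number of tokens. The function name `chunk_with_overlap` and the default sizes are our own illustrative assumptions, not values from the RAG paper. A production system for legal text would more likely split on section or clause boundaries first.

```python
def chunk_with_overlap(tokens, chunk_size=200, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens.

    Hypothetical helper for illustration; sizes are placeholders, not tuned values.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

The overlap exists so that a sentence straddling a boundary appears whole in at least one chunk; the cost is a larger index, which is part of the cost/quality tradeoff the interviewer wants you to name.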
2. Design a rate limiter for an LLM API
Reference: OpenAI rate-limit docs · platform.openai.com
Probe: whether you can reason about token-based vs request-based limits and the fairness problem when prompts vary 100x in length. Strong answer: token bucket per user, separate budgets for input and output tokens, queue with bounded wait. Red flag: only naming requests-per-second.
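The strong answer above can be sketched as two token buckets per user, one charged for prompt tokens and one for requested output tokens. This is a single-process illustration under our own assumptions: the class names `TokenBucket` and `UserLimiter` are hypothetical, the capacities are placeholders, and the bounded-wait queue is left out.

```python
import time


class TokenBucket:
    """Bucket that refills continuously at `rate` tokens per second."""

    def __init__(self, capacity, rate, now=None):
        self.capacity = float(capacity)
        self.rate = float(rate)
        self.level = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def available(self, now):
        # Lazily add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.level = min(self.capacity, self.level + elapsed * self.rate)
        self.updated = now
        return self.level

    def take(self, cost):
        self.level -= cost


class UserLimiter:
    """Separate input and output budgets for one user (hypothetical design)."""

    def __init__(self, input_bucket, output_bucket):
        self.input_bucket = input_bucket
        self.output_bucket = output_bucket

    def admit(self, prompt_tokens, max_output_tokens, now=None):
        now = time.monotonic() if now is None else now
        input_ok = prompt_tokens <= self.input_bucket.available(now)
        output_ok = max_output_tokens <= self.output_bucket.available(now)
        if not (input_ok and output_ok):
            return False  # check both first so a rejection burns no budget
        self.input_bucket.take(prompt_tokens)
        self.output_bucket.take(max_output_tokens)
        return True
```

Charging by tokens rather than by request is what addresses the fairness problem: a 100x longer prompt drains the bucket 100x faster.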
3. Design an eval pipeline for a chat model
Reference: Anthropic · evaluating-ai-systems
Probe: whether you can distinguish offline benchmarks from online A/B tests and reason about cost. Strong answer: tiered eval — fast automated benchmarks on every commit, slower model-graded evals nightly, human eval on a sampled rollout. Red flag: proposing one giant benchmark and stopping there.
4. Design a vector database from scratch
Reference: Malkov and Yashunin · arxiv.org/abs/1603.09320
Probe: whether you understand HNSW, IVF, product quantization, and the recall-latency-memory triangle. Strong answer: pick HNSW for sub-ms latency at moderate scale, explain layered graph construction. Red flag: 'just use Pinecone.'
5. Design a system to serve 10M users a 70B model
Reference: vLLM paper · arxiv.org/abs/2309.06180
Probe: whether you can reason about batching, KV cache management, speculative decoding, and tensor parallelism. Strong answer: continuous batching, paged KV cache (vLLM-style), spec decoding with a small draft model. Red flag: ignoring batching entirely.
6. Design a content moderation pipeline
Reference: Markov et al. · arxiv.org/abs/2208.03274
Probe: whether you understand the safety stack — small classifiers as a first line, larger model for ambiguous cases, human review for the long tail. Strong answer: tiered defense with explicit precision-recall targets per tier. Red flag: a single LLM call gating everything.
7. Design an agent framework with tool use
Reference: ReAct · arxiv.org/abs/2210.03629
Probe: whether you understand the planning loop, tool schemas, error recovery, and the loop-termination problem. Strong answer: structured tool schemas, retries with bounded depth, explicit termination conditions. Red flag: not addressing infinite-loop safety.
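A rough sketch of what "bounded depth" and "explicit termination conditions" can look like in code. Everything here is a hypothetical interface of our own: `policy` stands in for the model call and returns either `("final", answer)` or `("call", tool_name, args)`, and `tools` is a plain dict of Python callables. It is not the ReAct implementation.

```python
def run_agent(policy, tools, task, max_steps=8, max_retries=2):
    """Run a tool-use loop that always terminates.

    policy(history) -> ("final", answer) or ("call", tool_name, args_dict).
    Both the step budget and the retry budget are hard caps.
    """
    history = [("task", task)]
    for _ in range(max_steps):
        decision = policy(history)
        if decision[0] == "final":
            return {"status": "done", "answer": decision[1], "history": history}
        _, name, args = decision
        if name not in tools:
            # Feed the error back so the policy can recover on the next step.
            history.append(("error", f"unknown tool {name!r}"))
            continue
        for attempt in range(max_retries + 1):
            try:
                result = tools[name](**args)
                history.append(("observation", name, result))
                break
            except Exception as exc:
                history.append(("error", f"{name} failed on attempt {attempt + 1}: {exc}"))
    # Explicit termination: the step budget ran out without a final answer.
    return {"status": "step_limit", "answer": None, "history": history}
```

The point to make out loud is that the loop cannot run forever: the caller always gets either an answer or a `step_limit` status it can handle.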
8. Design a fine-tuning pipeline for a domain-specific model
Reference: Lee et al. · arxiv.org/abs/2107.06499
Probe: whether you can specify data quality gates, eval strategy, and the LoRA-vs-full decision. Strong answer: deduplication, quality filters, then a small-scale LoRA pilot before scaling. Red flag: skipping the dedup and quality gating.
9. Design a prompt-injection defense layer
Reference: Greshake et al. · arxiv.org/abs/2302.12173
Probe: whether you understand that the threat is real and that no single defense is complete. Strong answer: input sanitization, structured prompts, output filtering, plus monitoring for anomalous patterns. Red flag: claiming any single technique solves it.
10. Design an embedding index that updates in near-real-time
Reference: Pinecone engineering blog · 2023 indexing
Probe: whether you understand the rebuild-vs-incremental tradeoff in HNSW. Strong answer: write-ahead log of updates, periodic full rebuilds, two-index swap for fresh data. Red flag: pretending HNSW supports easy deletion.
11. Design a system to detect and flag model regressions
Reference: Google · ml-test-score paper
Probe: whether you can specify a regression test suite for a non-deterministic system. Strong answer: golden-set evals, statistical significance thresholds, automatic alerting with bounded false positives. Red flag: 'we'll just look at the output.'
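One way to turn "statistical significance thresholds" into something runnable is a one-sided two-proportion z-test on golden-set pass rates. This is a sketch under our own assumptions: the function name is hypothetical, each golden-set item is treated as an independent pass/fail trial, and the default threshold of 1.645 corresponds to a one-sided 5% level.

```python
import math


def pass_rate_regressed(base_pass, base_n, cand_pass, cand_n, z_threshold=1.645):
    """One-sided two-proportion z-test: is the candidate pass rate lower?

    Returns (regressed, z). Assumes independent pass/fail items per model.
    """
    p_base = base_pass / base_n
    p_cand = cand_pass / cand_n
    pooled = (base_pass + cand_pass) / (base_n + cand_n)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / base_n + 1.0 / cand_n))
    if se == 0.0:
        return False, 0.0  # everything passed or everything failed in both runs
    z = (p_base - p_cand) / se
    return z > z_threshold, z
```

Raising `z_threshold` is the knob for "bounded false positives": fewer spurious alerts, at the cost of needing a larger drop or a larger golden set before an alert fires.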
12. Design a multi-tenant inference platform
Reference: Anyscale · serving blog 2024
Probe: whether you understand isolation, fairness, and noisy neighbor problems. Strong answer: per-tenant queues, shared KV cache with tenant-aware eviction, isolation at the request level. Red flag: ignoring noisy neighbors entirely.
13. Design a system to detect data drift
Reference: Rabanser et al. · arxiv.org/abs/1810.11953
Probe: whether you understand statistical tests for distribution shift and the cost of false alarms. Strong answer: PSI (population stability index) or a KS test on input features, EMD (earth mover's distance) on embeddings, action thresholds tuned to retraining cost. Red flag: continuous retraining without a trigger.
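For a single numeric feature, the population stability index can be sketched in a few lines. The binning scheme (equal-width bins over the reference range, out-of-range values clamped to the edge bins) and the `eps` floor are our own simplifying assumptions; quantile bins are also common.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between a reference and a live sample.

    Sums (a - e) * ln(a / e) over bins, where e and a are bin fractions.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back if the reference is constant

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp into edge bins
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Every term is non-negative, so the index is zero only when the binned distributions match. Where to set the action threshold is the part tied to retraining cost; a value around 0.25 is a widely quoted rule of thumb rather than a derived constant.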
14. Design a logging and observability stack for an LLM product
Reference: OpenTelemetry · GenAI semantic conventions
Probe: whether you understand PII handling, sampling strategy, and trace structure. Strong answer: structured logs with redaction at ingest, tail-based sampling for expensive traces, separate traces for prompt-response cycles. Red flag: logging full prompts and responses without redaction.
15. Design a system that does long-running agent tasks reliably
Reference: Temporal docs · durability primitives
Probe: whether you understand checkpointing, idempotency, and recovery. Strong answer: durable task queue, checkpointed state, deterministic replay where possible, explicit human-in-the-loop gates. Red flag: in-memory state with no recovery story.
Behavioral interviews — the AI-era twist
Behavioral interviews at frontier labs have evolved past the standard STAR-format prompts. The new questions probe how you operate when the system you are responsible for has non-deterministic failure modes, when the right answer is genuinely unknown, and when you have to act under uncertainty about model capabilities. Each of these has been asked, in some form, in loops we have direct knowledge of. The probe is in italics inside each item. Strong answers center honesty about a specific incident, not a polished narrative arc.
- Tell me about a time you trusted AI output too much and were wrong. *Probe: whether you have actually internalized model fallibility or whether you still treat outputs as ground truth.* Strong answer: a specific incident, the exact failure mode, the change you made to your workflow afterwards. Red flag: 'I always verify' as a deflection.
- Tell me about a time you shipped a feature you knew was not quite ready. *Probe: pragmatism vs perfectionism, and whether you can name the actual tradeoff calculus.* Strong answer: a specific dated decision, the alternative considered, the eval result that justified shipping. Red flag: claiming you have never done this.
- Tell me about a disagreement with a senior engineer about a model choice. *Probe: whether you can hold a position with evidence and update when shown new evidence.* Strong answer: name the engineer's specific argument, your specific counter, the experiment that resolved it. Red flag: 'I deferred to them' as the whole story.
- Tell me about a regression you missed in production. *Probe: incident response and what you learned, not whether you have ever missed one.* Strong answer: timeline, detection mechanism that should have caught it, fix that closed the gap.
- Tell me about a time you killed your own project. *Probe: ego management and intellectual honesty.* Strong answer: a specific project, the metric that told you it was dead, what you redirected resources to.
- Tell me about a time a model behaved in a way you did not predict. *Probe: comfort with emergent behavior and how you investigate it.* Strong answer: a specific anomaly, the mechanistic investigation, the eventual explanation or open question.
- Tell me about a time you advocated for slower delivery. *Probe: whether you have safety instincts under shipping pressure.* Strong answer: specific dated meeting, what you argued, the outcome.
- Tell me about a time you worked across research and engineering. *Probe: whether you can translate between abstractions.* Strong answer: specific paper, specific implementation gap, specific bridge you built.
- Tell me about feedback that genuinely changed how you work. *Probe: actual growth, not performative growth.* Strong answer: dated feedback, the exact change in workflow, evidence the change stuck.
- Tell me about the last thing you read that updated your priors on AI capability. *Probe: whether you are still actively reading the literature.* Strong answer: a specific paper from the last six months, the prior it updated, how. Red flag: naming something from before the LLM era.
- Tell me about a hire you regretted (if you have hiring experience). *Probe: hiring judgment and willingness to own it.* Strong answer: what you missed in the loop, what you would change in your rubric.
- Tell me about a time you escalated. *Probe: judgment about when escalation is right.* Strong answer: specific case, the escalation path, the resolution.
- Tell me about a time a deadline was wrong. *Probe: whether you push back on bad estimates.* Strong answer: specific deadline, the reason it was wrong, what you changed.
- Tell me about a time you had to deprecate something users depended on. *Probe: empathy and migration competence.* Strong answer: specific deprecation, the migration path you built, the metrics on retention.
- Tell me about how you decide between two roughly equal candidates in an interview loop. *Probe: rubric awareness and bias mitigation.* Strong answer: structured rubric, calibration across interviewers, deliberate diversity of background as a tiebreaker on equal-signal candidates.
Domain rounds — safety, infra, product
These are 10 representative questions from each of three common domain rounds: safety (relevant to alignment, red-team, and policy roles), infra (relevant to ML platform and serving roles), and product (relevant to applied research and product engineering roles). Each row gives the probe and the canonical answer source. The full red-flag analysis applies as in the ML fundamentals table.
Live coding — 10 questions and the rubric they grade on
Live coding for AI roles has shifted away from LeetCode-style problems toward implementation tasks that test whether you can code an ML primitive from memory without library magic. Below are 10 representative questions. The grading rubric weighs correctness, numerical stability, attention to edge cases, and the ability to explain time and space complexity. Whether the round runs on a whiteboard or in a shared editor, expect interviewers to ask you to handle dtype, batching, and one numerical-stability gotcha per question.
1. Implement scaled dot-product attention in numpy
~15 minutes
Probe: whether you can write the canonical softmax(Q K^T / sqrt(d_k)) V computation without copying from Vaswani et al. Strong answer: handle batching with einsum, numerical stability in softmax (subtract max), explain mask handling. Red flag: forgetting the sqrt(d_k) scale or the max-subtract for stability.
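A minimal numpy sketch of the answer described above. The function names and the mask convention (boolean, `True` means "may attend") are our own assumptions; masked positions get a large finite negative value rather than `-inf` so that a fully masked row does not turn into NaN.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max before exponentiating to avoid overflow.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (..., Tq, d_k), k: (..., Tk, d_k), v: (..., Tk, d_v).

    mask: optional bool array broadcastable to (..., Tq, Tk), True = attend.
    Returns (output, attention_weights).
    """
    d_k = q.shape[-1]
    scores = np.einsum("...qd,...kd->...qk", q, k) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # mask before softmax, not after
    weights = softmax(scores, axis=-1)
    return np.einsum("...qk,...kv->...qv", weights, v), weights
```

The leading `...` in the einsum strings is what handles batching: the same code works for `(T, d)`, `(B, T, d)`, and `(B, H, T, d)` inputs.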
2. Build a BPE tokenizer
~30 minutes
Probe: whether you understand merge rules and can implement them in O(n log n) with a priority queue. Strong answer: start from byte-level, iterate pair-frequency, merge top pair, repeat to vocab size. Red flag: O(n^2) implementation that times out on a 1MB corpus.
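The merge loop described above, as a deliberately simple byte-level sketch. Note plainly what this is not: it recounts every pair on every merge, so it is the slow baseline rather than the priority-queue version the probe asks about. All function names are our own.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace each left-to-right occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = []  # ordered list of ((left_id, right_id), new_id)
    next_id = 256  # ids 0..255 are reserved for single bytes
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges compress nothing
        ids = merge_pair(ids, pair, next_id)
        merges.append((pair, next_id))
        next_id += 1
    return merges


def encode(text, merges):
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:  # apply merges in the order they were learned
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    table = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        table[new_id] = table[left] + table[right]
    return b"".join(table[i] for i in ids).decode("utf-8", errors="replace")
```

The optimized version keeps pair counts in a heap and updates only the neighbors of each merged position instead of recounting; being able to describe that upgrade path is most of what the interviewer is listening for.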
3. Implement layer norm and RMSNorm side by side
~10 minutes
Probe: whether you know what each normalizes and the epsilon placement. Strong answer: layer norm subtracts the mean and divides by sqrt(var + eps), with eps inside the square root; RMSNorm skips the mean centering and divides by the root mean square. Red flag: getting the epsilon placement wrong (outside sqrt is a common error).
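Side by side, the two normalizations differ by one line. A minimal numpy sketch, assuming normalization over the last axis, a learned scale `gamma`, and a learned shift `beta` for layer norm only; the default `eps` is a placeholder.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Center and scale over the last axis; eps sits inside the square root."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # biased (population) variance
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def rms_norm(x, gamma, eps=1e-5):
    """Scale by the root mean square only; no mean subtraction, no shift."""
    mean_sq = np.mean(x * x, axis=-1, keepdims=True)
    return gamma * x / np.sqrt(mean_sq + eps)
```

With eps inside the root, an all-zero input divides by `sqrt(eps)` and returns zeros instead of NaN, which is the edge case interviewers tend to poke at.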
4. Write a top-k and top-p sampler
~15 minutes
Probe: whether you can manipulate logit tensors and handle the boundary cases. Strong answer: argpartition for top-k, cumulative sum for top-p, renormalize before multinomial sample. Red flag: forgetting to renormalize after filtering.
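A sketch of the three steps in the strong answer for a single 1-D logit vector. Function names are ours; the choice to keep the token that crosses the `p` threshold is one common convention, stated here as an assumption.

```python
import numpy as np


def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf."""
    logits = np.asarray(logits, dtype=np.float64)
    k = max(1, min(k, logits.shape[-1]))
    keep = np.argpartition(logits, -k)[-k:]  # O(V), no full sort needed
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out


def top_p_filter(logits, p):
    """Keep the smallest prefix of tokens whose cumulative probability >= p."""
    logits = np.asarray(logits, dtype=np.float64)
    order = np.argsort(-logits)  # descending by logit
    cumulative = np.cumsum(softmax(logits[order]))
    cutoff = min(int(np.searchsorted(cumulative, p)) + 1, len(logits))
    out = np.full_like(logits, -np.inf)
    out[order[:cutoff]] = logits[order[:cutoff]]
    return out


def sample(logits, k=None, p=None, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64)
    if k is not None:
        logits = top_k_filter(logits, k)
    if p is not None:
        logits = top_p_filter(logits, p)
    probs = softmax(logits)  # renormalize over the surviving tokens
    return int(rng.choice(len(probs), p=probs))
```

Filtering in logit space with `-inf` means the final softmax does the renormalization for free, which is an easy way to avoid the red flag.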
5. Implement a basic transformer block forward pass
~30 minutes
Probe: whether you can compose attention, FFN, layer norm, and residual. Strong answer: pre-norm transformer block with two residuals and the GeLU FFN. Red flag: missing the residual connection or putting layer norm in the wrong place.
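A compact numpy sketch of a pre-norm block with two residuals. To stay short it makes several simplifications that we want to flag as assumptions: single-head attention, no causal mask, layer norm without learned scale and shift, and the tanh approximation of GELU. The `params` dictionary keys are hypothetical names.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def self_attention(x, wq, wk, wv, wo):
    """Single-head, unmasked self-attention over x of shape (B, T, D)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ wo


def transformer_block(x, params):
    """Pre-norm: normalize the input of each sublayer, then add the residual."""
    attn_out = self_attention(
        layer_norm(x), params["wq"], params["wk"], params["wv"], params["wo"]
    )
    x = x + attn_out  # residual 1
    hidden = gelu(layer_norm(x) @ params["w1"] + params["b1"])
    x = x + hidden @ params["w2"] + params["b2"]  # residual 2
    return x
```

A quick sanity check worth mentioning in the interview: if the output projections `wo` and `w2` are zero, the block must return its input unchanged, because only the residual path remains.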
6. Write a function that computes perplexity from logits
~10 minutes
Probe: whether you can derive PPL from cross-entropy. Strong answer: gather log-probs of target tokens, average, exponentiate. Red flag: forgetting to handle padding tokens.
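The gather-average-exponentiate recipe, sketched in numpy with padding handled explicitly. The `pad_id=-100` default is an assumption borrowed from a common ignore-index convention; use whatever marker your batches actually carry.

```python
import numpy as np


def perplexity(logits, targets, pad_id=-100):
    """logits: (B, T, V) floats; targets: (B, T) ints, pad_id = ignore."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Stable log-softmax: subtract the max before the log-sum-exp.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    mask = targets != pad_id
    safe_targets = np.where(mask, targets, 0)  # any valid index; masked out below
    token_lp = np.take_along_axis(log_probs, safe_targets[..., None], axis=-1)[..., 0]
    n_tokens = mask.sum()
    if n_tokens == 0:
        raise ValueError("no non-padding tokens to score")
    mean_nll = -(token_lp * mask).sum() / n_tokens  # average over real tokens only
    return float(np.exp(mean_nll))
```

A useful check: uniform logits over a vocabulary of size V should give a perplexity of exactly V, with or without padded positions.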
7. Implement gradient descent for a 1-layer linear regression
~15 minutes
Probe: whether you can do the basics by hand. Strong answer: explicit gradient formula, learning rate, convergence check. Red flag: using autodiff when asked not to.
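A sketch of the by-hand version: mean squared error, the explicit gradient, a fixed learning rate, and a convergence check on the change in loss. The function name, default learning rate, and tolerance are illustrative assumptions, and the fixed step size assumes roughly unit-scale features.

```python
import numpy as np


def fit_linear_regression(x, y, lr=0.1, max_iters=10_000, tol=1e-10):
    """Batch gradient descent on mean squared error for y ~ x @ w + b."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    n, d = x.shape
    w, b = np.zeros(d), 0.0
    prev_loss = np.inf
    for _ in range(max_iters):
        err = x @ w + b - y
        loss = float(np.mean(err ** 2))
        if abs(prev_loss - loss) < tol:
            break  # converged: the loss stopped moving
        prev_loss = loss
        # d/dw mean(err^2) = (2/n) X^T err ; d/db mean(err^2) = (2/n) sum(err)
        w = w - lr * (2.0 / n) * (x.T @ err)
        b = b - lr * (2.0 / n) * err.sum()
    return w, b, loss
```

If the interviewer pushes, the follow-up is usually why the learning rate must stay below a bound set by the largest eigenvalue of the Hessian, or how the closed-form least-squares solution compares.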
8. Build a streaming top-K data structure
~15 minutes
Probe: classical algorithms applied to ML serving (top-K next-token candidates from a logit stream). Strong answer: min-heap of size K. Red flag: sorting the whole array every time.
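The size-K min-heap answer, sketched with the standard-library `heapq`. The class name is ours. Entries are `(score, item)` tuples, so ties on score fall back to comparing items; that assumes items are comparable (token ids are).

```python
import heapq


class StreamingTopK:
    """Keep the K highest-scoring items seen so far in O(log K) per push."""

    def __init__(self, k):
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        self._heap = []  # min-heap: the root is the weakest of the current top K

    def push(self, score, item):
        entry = (score, item)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            heapq.heapreplace(self._heap, entry)  # evict the current weakest

    def result(self):
        return sorted(self._heap, reverse=True)  # best first
```

Total cost over a stream of n scores is O(n log K) time and O(K) memory, versus O(n log n) per query for the re-sort approach.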
9. Implement KL divergence in a numerically stable way
~10 minutes
Probe: information-theory comfort and numerical hygiene. Strong answer: KL(P || Q) = sum P * (log P - log Q), with clamps on log(0). Red flag: ignoring the log(0) trap.
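A plain-Python sketch of the formula above for discrete distributions given as lists of probabilities. Two conventions are baked in and should be said out loud: terms with P = 0 contribute zero, and Q is clamped at a small `eps`, which means a case that is mathematically infinite (P > 0 where Q = 0) comes back as a large finite number instead. The `eps` value is an assumption.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_i P_i * (log P_i - log Q_i), in nats."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        total += pi * (math.log(pi) - math.log(max(qi, eps)))  # clamp avoids log(0)
    return total
```

When the inputs are logits rather than probabilities, the cleaner route is to compute both log-softmaxes directly and never materialize a zero probability in the first place.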
10. Write a function that batches variable-length sequences with padding and an attention mask
~20 minutes
Probe: practical DL hygiene. Strong answer: pad to max length, build a boolean mask, apply to attention logits as additive -inf. Red flag: zeroing post-softmax instead of masking pre-softmax.
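A numpy sketch of both halves: building the padded batch with its boolean mask, then applying the mask additively before the softmax. Names are ours, `pad_id=0` is a placeholder, and the softmax helper assumes every sequence has at least one real token (a fully masked row would produce NaN with a true `-inf`).

```python
import numpy as np


def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest length; mask is True on real tokens."""
    max_len = max(len(s) for s in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
        mask[i, :len(seq)] = True
    return batch, mask


def masked_softmax(scores, key_mask):
    """scores: (B, Tq, Tk) attention logits; key_mask: (B, Tk) bool."""
    additive = np.where(key_mask[:, None, :], 0.0, -np.inf)  # broadcast over queries
    scores = scores + additive  # mask before softmax so padding gets zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```

Masking pre-softmax keeps each row summing to one over the real tokens; zeroing after the softmax leaves rows that sum to less than one, which is exactly the red flag.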
What changes per company
Frontier labs (Anthropic, OpenAI, DeepMind, Meta FAIR) lean heavily on the ML fundamentals + safety / interpretability axis for research-engineer loops, with code rounds biased toward implementation-from-scratch and away from LeetCode trivia. AI infrastructure shops (Anyscale, Modal, Together, Replicate, Fireworks; this list is best-effort as of June 2026, so verify that the company is still operating and has open roles before committing) bias toward systems design, GPU kernel knowledge, and inference-serving depth. AI product startups bias toward a smaller ML bar and a larger product-judgment bar. None of this is a hard rule. Always read the published rubric on the company's careers page if one exists, and pull the most recent six months of their engineering blog before the loop. The literature you read should match the company you are interviewing with.
Honest caveats and what we don't know
Preparation timeline
A realistic prep schedule for a frontier-lab AI loop assumes you already have working ML fluency. If you don't, this is a multi-month exercise, not a weeks-long one. The schedule below assumes one final loop in roughly 6 weeks.
Week -6
Foundation audit
Read or re-read Hastie/Tibshirani/Friedman ch. 2 (bias-variance), Bishop ch. 1 (probability), Vaswani et al. (attention), and Hoffmann et al. (Chinchilla). Identify the three weakest areas in the ML-fundamentals list and start there.
Week -5
Implementation drills
Implement attention from scratch in numpy. Implement a BPE tokenizer. Implement layer norm. Time yourself. Goal: each in under 30 minutes without reference.
Week -4
Systems and infra
Read the vLLM paper. Read FlashAttention v1 and v2. Skim a recent Anthropic or OpenAI engineering post on serving. Practice articulating the inference-serving stack out loud.
Week -3
Behavioral inventory
List 20 specific incidents from your career: 5 wins, 5 losses, 5 disagreements, 5 model failures. Write a 3-sentence summary of each. Practice telling them in 90 seconds each.
Week -2
Domain depth
Pick the domain (safety, infra, product) the role is biased toward. Read the most recent six months of the company's engineering blog. Read the most-cited paper in that domain from the past year.
Week -1
Mock loops
Do at least three full mock interviews with someone who has been in the role recently. Record yourself. Watch the recording, even though it's painful — verbal tics and pacing problems are the gap between strong and outstanding.
Day -1
Sleep
Stop preparing 24 hours before the loop. Cramming the night before is net negative. Read fiction. Sleep 8 hours. Eat normally.
After the loop
Sources
- [01] Original Transformer architecture and the sqrt(d_k) attention scaling justification. arxiv.org/abs/1706.03762
- [02] Chinchilla compute-optimal scaling laws and the 6N FLOPs approximation. arxiv.org/abs/2203.15556
- [03] FlashAttention v1 IO-aware attention and the memory-hierarchy argument. arxiv.org/abs/2205.14135
- [04] FlashAttention v2 improvements in parallelism and work partitioning. arxiv.org/abs/2307.08691
- [05] vLLM and paged attention as the inference-serving primitive. arxiv.org/abs/2309.06180
- [06] LoRA low-rank adaptation for parameter-efficient fine-tuning. arxiv.org/abs/2106.09685
- [07] InstructGPT and the canonical RLHF training loop. arxiv.org/abs/2203.02155
- [08] Constitutional AI two-stage training and the principle-based feedback signal. arxiv.org/abs/2212.08073
- [09] DPO as a direct alternative to PPO-based RLHF. arxiv.org/abs/2305.18290
- [10] GRPO variant of policy optimization for LLM training (DeepSeekMath). arxiv.org/abs/2402.03300
- [11] Anthropic Sleeper Agents paper on deceptive alignment persistence. arxiv.org/abs/2401.05566
- [12] RoPE rotary position embeddings as replacement for sinusoidal. arxiv.org/abs/2104.09864
- [13] Schaeffer et al. critique of emergent capabilities as a measurement artifact. arxiv.org/abs/2304.15004
- [14] Lost in the Middle finding on long-context attention degradation. arxiv.org/abs/2307.03172
- [15] Power et al. grokking result on delayed generalization. arxiv.org/abs/2201.02177
- [16] Original RAG paper for retrieval-augmented generation. arxiv.org/abs/2005.11401
- [17] ReAct paper combining reasoning traces and tool use. arxiv.org/abs/2210.03629
- [18] Greshake et al. indirect prompt injection threat model. arxiv.org/abs/2302.12173
- [19] Anthropic mechanistic-interpretability work on induction heads and in-context learning. transformer-circuits.pub/2022/in-context-learning-and-induction-heads
- [20] Canonical structure for the ML interview question landscape across applied roles. Chip Huyen · Machine Learning Interviews · O'Reilly 2024
- [21] Bias-variance decomposition and curse-of-dimensionality reference. Hastie, Tibshirani, Friedman · The Elements of Statistical Learning · 2nd ed.
- [22] Foundational reference for MLE, MAP, and softmax cross-entropy derivations. Bishop · Pattern Recognition and Machine Learning · 2006
- [23] Canonical reference for KL divergence and entropy. Cover and Thomas · Elements of Information Theory · 2nd ed.
- [24] Published commitments on capability thresholds and evaluation requirements. Anthropic · responsible-scaling-policy
- [25] Speculative decoding via draft-and-verify with a smaller draft model. arxiv.org/abs/2211.17192