AI interviews — 100 questions, honest answers

The prep guide we wish existed when we were on the other side of the table

Most AI interview prep on the open web is recycled blog posts written before transformers were the default architecture. The questions are stale, the answers are hand-wavy, and the rubrics never address what hiring managers actually care about in 2026: whether you can think about loss curves, ship code that survives contact with adversarial users, and reason about trade-offs in systems where the failure modes are statistical rather than deterministic.

This page is 100 questions split across five categories: ML fundamentals (50), system design (15), behavioral (15), domain-specific (10 representative items each from the safety, infra, and product tracks, 30 in total), and live coding (10). Each question carries four annotations: the literal question, what the interviewer is really probing, an outline of a strong answer, and the red flags that tank candidates.

We avoid the LeetCode trivia treadmill and the cargo-cult system design narration. Where math matters, we show the math. Where ambiguity matters, we name it. This is a guide to thinking like the person who designed the role, not a memorization aid. If you can answer the underlying probe, and not just the surface question, you'll generalize to whatever phrasing your interviewer chose.

We cite Chip Huyen's *Machine Learning Interviews* (O'Reilly, 2024) where the canonical framing comes from her, and we cite original papers and official engineering blogs everywhere else. This page is a best-effort snapshot as of June 2026: rubrics and salary bands drift, so always cross-check the company's published levels and their most recent engineering posts before the loop. The job market shifts faster than the literature.

How this guide is organized

Every question on this page follows the same four-field structure: (1) the question as it would be asked, (2) what the interviewer is really asking, meaning the underlying skill or signal, (3) a strong answer outline showing the shape of a good response without scripting it verbatim, and (4) red flags, the responses that consistently sink candidates. Memorizing the outline is a trap. Understanding the probe is the goal.

We deliberately do not provide model answers in full prose. Reciting a memorized answer is the single most reliable way to fail an AI loop with experienced interviewers, because the structure of a memorized answer is recognizable within fifteen seconds. Interviewers at Anthropic, OpenAI, DeepMind, and other frontier labs are explicitly trained to probe past surface-level fluency. They will follow up. The follow-ups are where signal lives.

Category distribution reflects what we have observed across roughly 200 frontier-lab and AI-startup loops between 2023 and 2026: ML fundamentals dominate research and applied-research roles, system design dominates infrastructure and platform engineering, behavioral interviews are universal but have acquired a new AI-era flavor (questions about trusting model output, about deploying systems with non-deterministic failure modes), domain-specific rounds gate roles in safety, infra, and product, and live coding remains the universal gate. Skipping any category is risky. Over-indexing on one is the more common failure mode.

ML fundamentals — 50 questions, mapped

These are the 50 ML-fundamentals questions we see most frequently across applied-ML, research-engineer, and ML-engineer loops. Difficulty is calibrated to the L4–L6 range at frontier labs. The 'probe' column is what the interviewer is actually scoring, which is often different from the literal question. Citations point to the canonical source where one exists.

| # | Question | What they're really asking | Canonical source |
|---|---|---|---|
| 1 | Explain the bias-variance tradeoff | Whether you understand generalization, not just decomposition | Hastie, Tibshirani, Friedman · ESL ch. 2 |
| 2 | Derive the gradient of softmax cross-entropy | Comfort with vector calculus and the simplification at the output layer | Bishop · PRML ch. 4 |
| 3 | Why does attention scale by sqrt(d_k)? | Whether you understand variance control in dot products | Vaswani et al. · arxiv.org/abs/1706.03762 |
| 4 | Compare batch norm, layer norm, RMSNorm | Whether you understand what each normalizes and why transformers use layer/RMS | Zhang and Sennrich · arxiv.org/abs/1910.07467 |
| 5 | What is the Lottery Ticket Hypothesis? | Whether you read 2018-era research that still matters for pruning | Frankle and Carbin · arxiv.org/abs/1803.03635 |
| 6 | Explain KL divergence vs cross-entropy | Whether you can derive one from the other and name when each is the right tool | Cover and Thomas · Information Theory ch. 2 |
| 7 | Why does Adam often outperform SGD on transformers? | Whether you understand adaptive learning rates and second-moment estimation | Kingma and Ba · arxiv.org/abs/1412.6980 |
| 8 | Derive backprop through a single linear layer | Whether you can do it by hand without relying on framework autodiff | Goodfellow et al. · Deep Learning ch. 6 |
| 9 | What is the role of temperature in softmax sampling? | Whether you understand entropy of the output distribution as a knob | Hinton et al. · arxiv.org/abs/1503.02531 |
| 10 | Compare LoRA, prefix tuning, full fine-tuning | Whether you understand parameter efficiency tradeoffs | Hu et al. · arxiv.org/abs/2106.09685 |
| 11 | Explain RLHF end to end | Whether you can name reward model, policy model, KL penalty, and PPO loop | Ouyang et al. · arxiv.org/abs/2203.02155 |
| 12 | How does GRPO differ from PPO? | Whether you tracked 2024-era RL-for-LLM developments | DeepSeekMath · arxiv.org/abs/2402.03300 |
| 13 | What is grokking? | Whether you read Power et al. and understand delayed generalization | Power et al. · arxiv.org/abs/2201.02177 |
| 14 | Explain the Chinchilla scaling law | Whether you can recite the compute-optimal token-to-parameter ratio | Hoffmann et al. · arxiv.org/abs/2203.15556 |
| 15 | What is mixture of experts and why is it efficient? | Whether you understand sparse activation and routing | Shazeer et al. · arxiv.org/abs/1701.06538 |
| 16 | Derive the ELBO for a VAE | Whether you understand variational inference at a mechanical level | Kingma and Welling · arxiv.org/abs/1312.6114 |
| 17 | Compare contrastive learning losses (NT-Xent, triplet, InfoNCE) | Whether you know which is used where and why | Chen et al. · arxiv.org/abs/2002.05709 |
| 18 | What is the curse of dimensionality? | Whether you can name three specific failure modes, not just recite the phrase | Hastie et al. · ESL ch. 2.5 |
| 19 | Explain why dropout works | Whether you understand the ensemble interpretation and the Bayesian one | Srivastava et al. · JMLR 2014 |
| 20 | What is the difference between encoder, decoder, and encoder-decoder transformers? | Whether you can name three model families that use each | Vaswani et al. · arxiv.org/abs/1706.03762 |
| 21 | Derive the loss landscape of a 1-layer linear network | Whether you understand non-convexity in deep nets versus convexity here | Saxe et al. · arxiv.org/abs/1312.6120 |
| 22 | What is mode collapse in GANs? | Whether you understand why GANs are hard and how WGAN addresses it | Arjovsky et al. · arxiv.org/abs/1701.07875 |
| 23 | Explain the difference between MLE and MAP | Whether you understand priors and when they bite | Bishop · PRML ch. 1 |
| 24 | Why is cross-validation insufficient for time series? | Whether you understand temporal leakage | Hyndman and Athanasopoulos · Forecasting: Principles and Practice |
| 25 | What is calibration in classification, and how do you measure it? | Whether you know Expected Calibration Error and Brier score | Guo et al. · arxiv.org/abs/1706.04599 |
| 26 | Compare flash attention to vanilla attention | Whether you understand IO complexity and memory hierarchy | Dao et al. · arxiv.org/abs/2205.14135 |
| 27 | What is a Pareto frontier in multi-objective optimization? | Whether you can think about tradeoffs without collapsing to one scalar | Boyd and Vandenberghe · Convex Optimization ch. 4 |
| 28 | Explain rotary position embeddings | Whether you read Su et al. and understand why RoPE replaced sinusoidal | Su et al. · arxiv.org/abs/2104.09864 |
| 29 | What is the difference between weight tying and weight sharing? | Whether you can name input-output embedding tying in transformers | Press and Wolf · arxiv.org/abs/1608.05859 |
| 30 | Why is BPE the dominant tokenization scheme? | Whether you can name the alternatives (WordPiece, SentencePiece, Unigram) and the tradeoffs | Sennrich et al. · arxiv.org/abs/1508.07909 |
| 31 | Explain the softmax bottleneck | Whether you understand low-rank limitations of softmax outputs | Yang et al. · arxiv.org/abs/1711.03953 |
| 32 | What is gradient checkpointing, and what does it trade? | Whether you understand the memory-compute tradeoff in training | Chen et al. · arxiv.org/abs/1604.06174 |
| 33 | Compare FP32, FP16, BF16, FP8 for training | Whether you understand range vs precision and which failure modes appear when | Micikevicius et al. · arxiv.org/abs/1710.03740 |
| 34 | Why does in-context learning work? | An honest 'we still don't fully know, here are the candidate theories' | Olsson et al. · transformer-circuits.pub/2022/in-context-learning-and-induction-heads |
| 35 | Explain Platt scaling vs isotonic regression for calibration | Whether you can pick the right method for the data regime | Niculescu-Mizil and Caruana · ICML 2005 |
| 36 | What is a constitutional AI loop? | Whether you've read Anthropic's CAI paper and can recite the two-stage process | Bai et al. · arxiv.org/abs/2212.08073 |
| 37 | Explain reward hacking with three examples | Whether you can name specific cases, not just the abstract concept | Krakovna et al. · deepmind blog 2020 specification gaming |
| 38 | What is the difference between an LLM's perplexity and its downstream accuracy? | Whether you understand that PPL is a proxy and not a goal | Liu et al. · arxiv.org/abs/2305.16264 |
| 39 | Compare best-of-N, beam search, and nucleus sampling | Whether you know what each optimizes and when each fails | Holtzman et al. · arxiv.org/abs/1904.09751 |
| 40 | Why is greedy decoding often worse than sampling? | Whether you understand the likelihood-quality gap | Holtzman et al. · arxiv.org/abs/1904.09751 |
| 41 | Derive the entropy of a Bernoulli distribution | Whether you can do basic information theory under pressure | Cover and Thomas · ch. 2 |
| 42 | Explain ImageNet pretraining for vision and why it transferred | Whether you can articulate the transfer learning insight without overstating it | Krizhevsky et al. · NeurIPS 2012 |
| 43 | What is the role of the value function in PPO? | Whether you understand actor-critic and variance reduction | Schulman et al. · arxiv.org/abs/1707.06347 |
| 44 | Compare DPO to PPO-based RLHF | Whether you tracked the 2023-2024 shift toward DPO | Rafailov et al. · arxiv.org/abs/2305.18290 |
| 45 | What is a constitutional principle and how does it differ from a reward signal? | Whether you understand declarative vs scalar feedback | Bai et al. · arxiv.org/abs/2212.08073 |
| 46 | Why does scaling-law research use compute-optimal frontiers? | Whether you understand the difference between under- and over-trained models | Hoffmann et al. · arxiv.org/abs/2203.15556 |
| 47 | Explain emergent capabilities and the recent skepticism | Whether you've read Schaeffer et al. and updated your priors | Schaeffer et al. · arxiv.org/abs/2304.15004 |
| 48 | What is the role of layer norm position (pre-norm vs post-norm)? | Whether you know why GPT-2 onward used pre-norm | Xiong et al. · arxiv.org/abs/2002.04745 |
| 49 | Why does a long context window not equal good long-context performance? | Whether you understand the 'lost in the middle' result | Liu et al. · arxiv.org/abs/2307.03172 |
| 50 | Explain how you would estimate the FLOPs of a forward pass through a transformer | Whether you can derive the roughly 2N FLOPs-per-token forward estimate, relate it to the 6N forward-plus-backward training approximation used in Chinchilla, and explain where both come from | Hoffmann et al. · arxiv.org/abs/2203.15556 |
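
For question #50, the arithmetic is short enough to write out. The sketch below assumes N counts non-embedding parameters and D counts training tokens (both symbols are our notation, following the convention in Kaplan et al., arxiv.org/abs/2001.08361), and it ignores the attention-score term, which grows with context length.

```latex
\begin{aligned}
C_{\text{fwd}}   &\approx 2N   && \text{FLOPs per token: one multiply and one add per weight} \\
C_{\text{bwd}}   &\approx 2\,C_{\text{fwd}} = 4N && \text{gradients w.r.t. activations and w.r.t. weights} \\
C_{\text{train}} &\approx C_{\text{fwd}} + C_{\text{bwd}} = 6N && \text{FLOPs per training token} \\
C_{\text{total}} &\approx 6ND  && \text{for a run over } D \text{ tokens}
\end{aligned}
```

As a sanity check under these assumptions, a 70B-parameter model costs roughly 2 × 70e9 = 1.4e11 FLOPs per generated token in the forward pass. A strong answer also says when the approximation breaks: at long context the attention term is no longer negligible.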

System design — 15 questions for AI infrastructure

System design rounds for AI roles diverge sharply from classical web-scale design rounds. You will be expected to reason about retrieval quality, eval pipelines, and the fact that your system's behavior is statistical rather than deterministic. Below are 15 representative questions with the same four-field annotation pattern. We have collapsed each into a card; the full probe-and-red-flag breakdown is the same structure as the ML fundamentals table.

1. Design a RAG system for legal documents

Reference: Lewis et al. · arxiv.org/abs/2005.11401

Probe: whether you understand chunking strategy, embedding model selection, hybrid retrieval (BM25 + dense), reranking, and the eval pipeline. Strong answer: name a chunking strategy with overlap, justify embedding choice with cost/quality tradeoff, address citation accuracy as the primary eval. Red flag: skipping eval entirely or proposing a single dense retrieval pass with no reranking.

2. Design a rate limiter for an LLM API

Reference: OpenAI rate-limit docs · platform.openai.com

Probe: whether you can reason about token-based vs request-based limits and the fairness problem when prompts vary 100x in length. Strong answer: token bucket per user, separate budgets for input and output tokens, queue with bounded wait. Red flag: only naming requests-per-second.
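
The strong answer above can be sketched in a few lines. This is a minimal, single-process illustration, not a production design: the class names `TokenBucket` and `PerUserLimiter`, the capacities, and the choice to charge `max_output_tokens` up front are all assumptions made for the example, and the bounded-wait queue is left out.

```python
import time


class TokenBucket:
    """Classic token bucket: refills continuously up to a fixed capacity."""

    def __init__(self, capacity, refill_per_sec, now):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.tokens = float(capacity)  # start full
        self.last = now

    def refill(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now


class PerUserLimiter:
    """Hypothetical limiter with separate input and output budgets per user."""

    def __init__(self, input_capacity, input_rate, output_capacity, output_rate):
        self._cfg = (input_capacity, input_rate, output_capacity, output_rate)
        self._buckets = {}  # user_id -> (input bucket, output bucket)

    def admit(self, user_id, input_tokens, max_output_tokens, now=None):
        now = time.monotonic() if now is None else now
        if user_id not in self._buckets:
            ic, ir, oc, orate = self._cfg
            self._buckets[user_id] = (TokenBucket(ic, ir, now), TokenBucket(oc, orate, now))
        inp, out = self._buckets[user_id]
        inp.refill(now)
        out.refill(now)
        # Check both budgets before charging either, so a rejected request
        # never partially drains one bucket.
        if input_tokens > inp.tokens or max_output_tokens > out.tokens:
            return False
        inp.tokens -= input_tokens
        out.tokens -= max_output_tokens
        return True
```

Limiting on tokens rather than requests is what addresses the fairness problem: a user sending 100x longer prompts drains their own budget 100x faster instead of crowding out everyone else.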

3. Design an eval pipeline for a chat model

Reference: Anthropic · evaluating-ai-systems

Probe: whether you can distinguish offline benchmarks from online A/B tests and reason about cost. Strong answer: tiered eval — fast automated benchmarks on every commit, slower model-graded evals nightly, human eval on a sampled rollout. Red flag: proposing one giant benchmark and stopping there.

4. Design a vector database from scratch

Reference: Malkov and Yashunin · arxiv.org/abs/1603.09320

Probe: whether you understand HNSW, IVF, product quantization, and the recall-latency-memory triangle. Strong answer: pick HNSW for sub-ms latency at moderate scale, explain layered graph construction. Red flag: 'just use Pinecone.'

5. Design a system to serve 10M users a 70B model

Reference: vLLM paper · arxiv.org/abs/2309.06180

Probe: whether you can reason about batching, KV cache management, speculative decoding, and tensor parallelism. Strong answer: continuous batching, paged KV cache (vLLM-style), spec decoding with a small draft model. Red flag: ignoring batching entirely.

6. Design a content moderation pipeline

Reference: Markov et al. · arxiv.org/abs/2208.03274

Probe: whether you understand the safety stack — small classifiers as a first line, larger model for ambiguous cases, human review for the long tail. Strong answer: tiered defense with explicit precision-recall targets per tier. Red flag: a single LLM call gating everything.

7. Design an agent framework with tool use

Reference: ReAct · arxiv.org/abs/2210.03629

Probe: whether you understand the planning loop, tool schemas, error recovery, and the loop-termination problem. Strong answer: structured tool schemas, retries with bounded depth, explicit termination conditions. Red flag: not addressing infinite-loop safety.

8. Design a fine-tuning pipeline for a domain-specific model

Reference: Lee et al. · arxiv.org/abs/2107.06499

Probe: whether you can specify data quality gates, eval strategy, and the LoRA-vs-full decision. Strong answer: deduplication, quality filters, then a small-scale LoRA pilot before scaling. Red flag: skipping the dedup and quality gating.

9. Design a prompt-injection defense layer

Reference: Greshake et al. · arxiv.org/abs/2302.12173

Probe: whether you understand that the threat is real and that no single defense is complete. Strong answer: input sanitization, structured prompts, output filtering, plus monitoring for anomalous patterns. Red flag: claiming any single technique solves it.

10. Design an embedding index that updates in near-real-time

Reference: Pinecone engineering blog · 2023 indexing

Probe: whether you understand the rebuild-vs-incremental tradeoff in HNSW. Strong answer: write-ahead log of updates, periodic full rebuilds, two-index swap for fresh data. Red flag: pretending HNSW supports easy deletion.

11. Design a system to detect and flag model regressions

Reference: Google · ml-test-score paper

Probe: whether you can specify a regression test suite for a non-deterministic system. Strong answer: golden-set evals, statistical significance thresholds, automatic alerting with bounded false positives. Red flag: 'we'll just look at the output.'

12. Design a multi-tenant inference platform

Reference: Anyscale · serving blog 2024

Probe: whether you understand isolation, fairness, and noisy neighbor problems. Strong answer: per-tenant queues, shared KV cache with tenant-aware eviction, isolation at the request level. Red flag: ignoring noisy neighbors entirely.

13. Design a system to detect data drift

Reference: Rabanser et al. · arxiv.org/abs/1810.11953

Probe: whether you understand statistical tests for distribution shift and the cost of false alarms. Strong answer: PSI or KS tests on input features, EMD on embeddings, action thresholds tuned to retraining cost. Red flag: continuous retraining without a trigger.
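
To make the PSI part of that answer concrete, here is a minimal sketch for one numeric feature. The function name `psi`, the ten quantile bins, and the `eps` floor are illustrative assumptions; any alert threshold should be tuned against retraining cost rather than copied from a rule of thumb.

```python
import numpy as np


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample.

    Bins are quantiles of the reference sample, so each reference bin holds
    roughly the same mass. Values outside the reference range fall into the
    first or last bin.
    """
    reference = np.asarray(reference, dtype=np.float64)
    current = np.asarray(current, dtype=np.float64)
    # Interior bin edges only; the outer bins are open-ended.
    inner = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    ref_idx = np.searchsorted(inner, reference, side="right")
    cur_idx = np.searchsorted(inner, current, side="right")
    ref_frac = np.bincount(ref_idx, minlength=n_bins) / len(reference)
    cur_frac = np.bincount(cur_idx, minlength=n_bins) / len(current)
    # Floor the fractions so empty bins do not produce log(0).
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```

A score near zero means the two samples fill the bins similarly; a mean shift of one standard deviation pushes it well above the noise floor. Running this per feature on a schedule, with an explicit threshold, is the "trigger" the red flag says is missing.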

14. Design a logging and observability stack for an LLM product

Reference: OpenTelemetry · GenAI semantic conventions

Probe: whether you understand PII handling, sampling strategy, and trace structure. Strong answer: structured logs with redaction at ingest, tail-based sampling for expensive traces, separate traces for prompt-response cycles. Red flag: logging full prompts and responses without redaction.

15. Design a system that does long-running agent tasks reliably

Reference: Temporal docs · durability primitives

Probe: whether you understand checkpointing, idempotency, and recovery. Strong answer: durable task queue, checkpointed state, deterministic replay where possible, explicit human-in-the-loop gates. Red flag: in-memory state with no recovery story.

Behavioral interviews — the AI-era twist

Behavioral interviews at frontier labs have evolved past the standard STAR-format prompts. The new questions probe how you operate when the system you are responsible for has non-deterministic failure modes, when the right answer is genuinely unknown, and when you have to act under uncertainty about model capabilities. Each of these has been asked, in some form, in loops we have direct knowledge of. The probe is in italics inside each item. Strong answers center honesty about a specific incident, not a polished narrative arc.

  • Tell me about a time you trusted AI output too much and were wrong. *Probe: whether you have actually internalized model fallibility or whether you still treat outputs as ground truth.* Strong answer: a specific incident, the exact failure mode, the change you made to your workflow afterwards. Red flag: 'I always verify' as a deflection.
  • Tell me about a time you shipped a feature you knew was not quite ready. *Probe: pragmatism vs perfectionism, and whether you can name the actual tradeoff calculus.* Strong answer: a specific dated decision, the alternative considered, the eval result that justified shipping. Red flag: claiming you have never done this.
  • Tell me about a disagreement with a senior engineer about a model choice. *Probe: whether you can hold a position with evidence and update when shown new evidence.* Strong answer: name the engineer's specific argument, your specific counter, the experiment that resolved it. Red flag: 'I deferred to them' as the whole story.
  • Tell me about a regression you missed in production. *Probe: incident response and what you learned, not whether you have ever missed one.* Strong answer: timeline, detection mechanism that should have caught it, fix that closed the gap.
  • Tell me about a time you killed your own project. *Probe: ego management and intellectual honesty.* Strong answer: a specific project, the metric that told you it was dead, what you redirected resources to.
  • Tell me about a time a model behaved in a way you did not predict. *Probe: comfort with emergent behavior and how you investigate it.* Strong answer: a specific anomaly, the mechanistic investigation, the eventual explanation or open question.
  • Tell me about a time you advocated for slower delivery. *Probe: whether you have safety instincts under shipping pressure.* Strong answer: specific dated meeting, what you argued, the outcome.
  • Tell me about a time you worked across research and engineering. *Probe: whether you can translate between abstractions.* Strong answer: specific paper, specific implementation gap, specific bridge you built.
  • Tell me about feedback that genuinely changed how you work. *Probe: actual growth, not performative growth.* Strong answer: dated feedback, the exact change in workflow, evidence the change stuck.
  • Tell me about the last thing you read that updated your priors on AI capability. *Probe: whether you are still actively reading the literature.* Strong answer: a specific paper from the last six months, the prior it updated, how. Red flag: naming something from before the LLM era.
  • Tell me about a hire you regretted (if you have hiring experience). *Probe: hiring judgment and willingness to own it.* Strong answer: what you missed in the loop, what you would change in your rubric.
  • Tell me about a time you escalated. *Probe: judgment about when escalation is right.* Strong answer: specific case, the escalation path, the resolution.
  • Tell me about a time a deadline was wrong. *Probe: whether you push back on bad estimates.* Strong answer: specific deadline, the reason it was wrong, what you changed.
  • Tell me about a time you had to deprecate something users depended on. *Probe: empathy and migration competence.* Strong answer: specific deprecation, the migration path you built, the metrics on retention.
  • Tell me about how you decide between two roughly equal candidates in an interview loop. *Probe: rubric awareness and bias mitigation.* Strong answer: structured rubric, calibration across interviewers, deliberate diversity of background as a tiebreaker on equal-signal candidates.

Domain rounds — safety, infra, product

These are 10 representative questions from each of three common domain rounds: safety (relevant to alignment, red-team, and policy roles), infra (relevant to ML platform and serving roles), and product (relevant to applied research and product engineering roles). Each row gives the probe and the canonical answer source. The full red-flag analysis applies as in the ML fundamentals table.

| Domain | Question | What they're probing | Source |
|---|---|---|---|
| Safety | What is the difference between alignment and safety? | Whether you can hold the distinction without conflating the two | Hendrycks et al. · arxiv.org/abs/2109.13916 |
| Safety | Explain reward hacking with a real example | Whether you can name a specific incident, not just the concept | Krakovna et al. · deepmind specification gaming |
| Safety | What is a sleeper agent (Anthropic 2024)? | Whether you read the paper and can name the persistence finding | Hubinger et al. · arxiv.org/abs/2401.05566 |
| Safety | How would you red-team a customer-facing chat model? | Whether you can structure a red-team campaign with coverage targets | Anthropic · red-teaming language models |
| Safety | What is the deceptive alignment problem? | Whether you understand the mechanistic interpretability stakes | Hubinger et al. · arxiv.org/abs/1906.01820 |
| Safety | How do you evaluate a model for biosecurity uplift? | Whether you understand controlled-comparison eval design | Anthropic · responsible scaling policy |
| Safety | Compare RLHF safety to constitutional AI | Whether you can name what each addresses and what each misses | Bai et al. · arxiv.org/abs/2212.08073 |
| Safety | What is the role of mechanistic interpretability in safety? | Whether you can articulate the theory of impact | Olah et al. · transformer-circuits.pub |
| Safety | What does a responsible scaling policy commit a lab to? | Whether you read the lab's actual published RSP | Anthropic · responsible-scaling-policy |
| Safety | How do you measure whether a safety mitigation has degraded capability? | Whether you understand the alignment tax problem | Bai et al. · arxiv.org/abs/2204.05862 |
| Infra | Describe paged attention | Whether you read the vLLM paper and understand the virtual-memory analogy | Kwon et al. · arxiv.org/abs/2309.06180 |
| Infra | What is continuous batching? | Whether you can explain why it dominates over static batching | Yu et al. · OSDI 2022 Orca |
| Infra | Compare tensor parallel to pipeline parallel to data parallel | Whether you can pick the right one for a given model and cluster | Narayanan et al. · arxiv.org/abs/2104.04473 |
| Infra | How does FlashAttention v2 differ from v1? | Whether you tracked the IO-aware kernel evolution | Dao · arxiv.org/abs/2307.08691 |
| Infra | How does speculative decoding work? | Whether you understand the draft-verify loop and acceptance rate | Leviathan et al. · arxiv.org/abs/2211.17192 |
| Infra | What is ZeRO and why does it matter for training large models? | Whether you can explain memory partitioning | Rajbhandari et al. · arxiv.org/abs/1910.02054 |
| Infra | How would you debug a training run that diverged at step 10K? | Whether you have actually done this | Anthropic engineering · scaling stability post |
| Infra | What is gradient accumulation and when does it lie to you? | Whether you know about batch-norm interactions | Goyal et al. · arxiv.org/abs/1706.02677 |
| Infra | Describe a GPU memory hierarchy | Whether you understand HBM vs SRAM bandwidth gaps | NVIDIA · Hopper architecture whitepaper |
| Infra | What does it mean for an LLM inference workload to be memory-bound vs compute-bound? | Whether you understand arithmetic intensity | Williams et al. · Roofline model · CACM 2009 |
| Product | How would you decide whether to use an open model or a hosted API? | Whether you can reason about cost, latency, control, and compliance | Check provider docs for current pricing; drift is rapid |
| Product | How do you measure whether an LLM feature actually helped users? | Whether you can design a real evaluation with downstream metrics | Anthropic · evaluating-ai-systems |
| Product | What is the right way to handle hallucination in a user-facing feature? | Whether you can name retrieval, citation, abstention, and confidence calibration as the stack | Ji et al. · arxiv.org/abs/2202.03629 |
| Product | How do you scope an AI feature given uncertain model capability? | Whether you can run a capability spike before committing to a roadmap | Karpathy · 2024 talks on AI product design |
| Product | What is the right unit of feedback to collect from users? | Whether you understand that thumbs-up/thumbs-down has low signal | Christiano et al. · arxiv.org/abs/1706.03741 |
| Product | How do you prevent users from being deceived by a confident but wrong model? | Whether you understand the UX of uncertainty | Anthropic · safe-and-helpful-claude posts |
| Product | What is the right cadence for model upgrades in a customer-facing product? | Whether you can balance regression risk against capability gains | OpenAI · model deprecation policies |
| Product | How would you build an internal eval set that scales with your team? | Whether you understand annotator quality and inter-rater agreement | Krippendorff · Content Analysis: An Introduction |
| Product | What is a reasonable latency budget for an interactive chat product? | Whether you have product instincts about time-to-first-token | Nielsen · response-time research (1993, still relevant) |
| Product | How do you decide whether to fine-tune or to use prompting? | Whether you can name the dataset-size threshold and the iteration-speed argument | Anthropic · prompt-engineering vs fine-tuning post |

Live coding — 10 questions and the rubric they grade on

Live coding for AI roles has shifted away from LeetCode-style problems toward implementation tasks that test whether you can actually code an ML primitive from memory without library magic. Below are 10 representative questions. The grading rubric weighs correctness, numerical stability, attention to edge cases, and ability to explain time-and-space complexity. Whether on a whiteboard or in a shared editor, interviewers will ask you to handle dtype, batching, and one numerical-stability gotcha per question.

1. Implement scaled dot-product attention in numpy

~15 minutes

Probe: whether you can write the canonical softmax(Q K^T / sqrt(d_k)) V computation without copying from Vaswani. Strong answer: handle batching with einsum, numerical stability in softmax (subtract max), explain mask handling. Red flag: forgetting the sqrt(d_k) scale or the max-subtract for stability.
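
One way the strong answer might look in NumPy is sketched below. The function names, the boolean-mask convention (True means "may attend"), and the use of a large negative constant instead of `-inf` are choices made for this example, not a reference implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max along the axis so exp() cannot overflow.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (..., Tq, d_k), k: (..., Tk, d_k), v: (..., Tk, d_v).

    mask: optional boolean array broadcastable to (..., Tq, Tk);
    True marks positions that may be attended to.
    Returns (output, attention_weights).
    """
    d_k = q.shape[-1]
    # Leading "..." axes carry batch and head dimensions unchanged.
    scores = np.einsum("...qd,...kd->...qk", q, k) / np.sqrt(d_k)
    if mask is not None:
        # A large finite negative avoids NaNs if a row were fully masked.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return np.einsum("...qk,...kv->...qv", weights, v), weights
```

With a causal mask such as `np.tril(np.ones((T, T), dtype=bool))`, the first query can only see the first key, so its output row equals the first row of `v`. That is a quick correctness check worth saying out loud in the interview.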

2. Build a BPE tokenizer

~30 minutes

Probe: whether you understand merge rules and can implement them in O(n log n) with a priority queue. Strong answer: start from byte-level, iterate pair-frequency, merge top pair, repeat to vocab size. Red flag: O(n^2) implementation that times out on a 1MB corpus.
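For reference, here is the merge rule itself as a short byte-level sketch. It is deliberately the naive version that recounts every pair after each merge (roughly O(n · merges)), so it illustrates correctness only; the priority-queue variant named above keeps the same merge rule and changes the bookkeeping. All function names (`train_bpe`, `merge_pair`, `encode`, `decode`) are illustrative, and stopping when the best pair occurs fewer than twice is our assumption.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`, left to right.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Return (merges, ids): ordered merge table and the encoded training text."""
    ids = list(text.encode("utf-8"))  # byte-level start: ids 0..255
    merges = {}                       # (left_id, right_id) -> new_id, in merge order
    next_id = 256
    while next_id < vocab_size and len(ids) >= 2:
        counts = Counter(zip(ids, ids[1:]))
        pair, freq = counts.most_common(1)[0]
        if freq < 2:                  # assumption: nothing left worth merging
            break
        ids = merge_pair(ids, pair, next_id)
        merges[pair] = next_id
        next_id += 1
    return merges, ids

def encode(text, merges):
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts keep insertion order = training order
        ids = merge_pair(ids, pair, new_id)
    return ids

def decode(ids, merges):
    inverse = {new_id: pair for pair, new_id in merges.items()}
    def expand(token):
        if token < 256:
            return [token]
        left, right = inverse[token]
        return expand(left) + expand(right)
    return bytes(b for token in ids for b in expand(token)).decode("utf-8")
```

A round trip such as `decode(encode(s, merges), merges) == s` is the first property worth checking out loud in the interview.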

3. Implement layer norm and RMSNorm side by side

~10 minutes

Probe: whether you know what each normalizes and the epsilon placement. Strong answer: layer norm subtracts mean and divides by std with eps inside sqrt, RMSNorm skips the mean centering. Red flag: getting the epsilon placement wrong (outside sqrt is a common error).
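Side by side, the two normalizers might look like the sketch below. It assumes normalization over the last axis, the biased (population) variance, and a learnable scale `gamma` (plus shift `beta` for layer norm); the names and the default `eps` are our choices for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Center, then divide by sqrt(var + eps): eps sits inside the square root.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # biased variance (divides by N)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-5):
    # No mean centering and no shift: scale by the root mean square only.
    mean_sq = np.mean(x * x, axis=-1, keepdims=True)
    return x / np.sqrt(mean_sq + eps) * gamma
```

On inputs that already have zero mean the two agree (with `beta = 0`), which is a quick sanity check to mention while explaining the difference.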

4. Write a top-k and top-p sampler

~15 minutes

Probe: whether you can manipulate logit tensors and handle the boundary cases. Strong answer: argpartition for top-k; for top-p, sort by descending probability, take the cumulative sum, and keep the smallest prefix whose mass reaches p; renormalize before the multinomial sample. Red flag: forgetting to renormalize after filtering.
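One way to sketch this in numpy for a single 1-D logit vector is shown below. The helper names (`filter_top_k`, `filter_top_p`, `sample`) are hypothetical, filtered tokens are marked with -inf, and the final softmax does the renormalization. The top-p rule used here keeps every token whose preceding cumulative mass is below p, which always keeps at least the most likely token.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)  # stability: largest logit becomes 0
    e = np.exp(x)      # exp(-inf) is exactly 0, so filtered tokens drop out
    return e / e.sum()

def filter_top_k(logits, k):
    """Keep exactly the k largest logits (float array); others become -inf."""
    if k >= logits.size:
        return logits.copy()
    keep = np.argpartition(logits, -k)[-k:]  # O(V), no full sort needed
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

def filter_top_p(logits, p):
    """Keep the smallest high-probability prefix whose mass reaches p."""
    order = np.argsort(logits)[::-1]         # token ids, most likely first
    probs = softmax(logits[order])
    mass_before = np.cumsum(probs) - probs   # cumulative mass excluding the token
    keep = order[mass_before < p]
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

def sample(logits, k=None, p=None, temperature=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if k is not None:
        logits = filter_top_k(logits, k)
    if p is not None:
        logits = filter_top_p(logits, p)
    probs = softmax(logits)                  # renormalize over the survivors
    return int(rng.choice(len(probs), p=probs))
```

Boundary cases worth naming while you write it: `k` larger than the vocabulary, `p` small enough that only one token survives, and ties at the k-th logit.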

5. Implement a basic transformer block forward pass

~30 minutes

Probe: whether you can compose attention, FFN, layer norm, and residual. Strong answer: pre-norm transformer block with two residuals and the GeLU FFN. Red flag: missing the residual connection or putting layer norm in the wrong place.
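The composition can be sketched in numpy as follows. Several choices here are our assumptions rather than anything the question fixes: a parameter dict with hypothetical keys (`wq`, `wk`, `wv`, `wo`, `w1`, `b1`, `w2`, `b2`, `ln1_*`, `ln2_*`), causal self-attention without attention biases, the tanh approximation of GELU, and a toy `init_params` helper that exists only to make the example runnable.

```python
import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * g + b

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_attention(x, p, n_heads, causal=True):
    B, T, D = x.shape
    hd = D // n_heads
    def split(t):  # (B, T, D) -> (B, H, T, hd)
        return t.reshape(B, T, n_heads, hd).transpose(0, 2, 1, 3)
    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)
    if causal:
        scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -1e9)
    out = softmax(scores) @ v                        # (B, H, T, hd)
    out = out.transpose(0, 2, 1, 3).reshape(B, T, D)  # merge heads back
    return out @ p["wo"]

def transformer_block(x, p, n_heads, causal=True):
    # Pre-norm: normalize the sublayer input, add the result to the residual stream.
    x = x + multi_head_attention(layer_norm(x, p["ln1_g"], p["ln1_b"]), p, n_heads, causal)
    h = layer_norm(x, p["ln2_g"], p["ln2_b"])
    x = x + gelu(h @ p["w1"] + p["b1"]) @ p["w2"] + p["b2"]
    return x

def init_params(d_model, d_ff, rng, scale=0.02):
    # Hypothetical toy initializer, only here so the sketch runs end to end.
    w = lambda *shape: rng.normal(0.0, scale, shape)
    return {
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
        "ln1_g": np.ones(d_model), "ln1_b": np.zeros(d_model),
        "ln2_g": np.ones(d_model), "ln2_b": np.zeros(d_model),
    }
```

Two checks make good talking points: with the output projections zeroed the block reduces to the identity (the residual path), and under the causal mask, changing the last token never changes earlier positions.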

6. Write a function that computes perplexity from logits

~10 minutes

Probe: whether you can derive PPL from cross-entropy. Strong answer: gather log-probs of target tokens, average the negative log-probs over non-padding positions (the cross-entropy), exponentiate. Red flag: forgetting to handle padding tokens.
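A compact numpy sketch of that derivation, assuming logits of shape (B, T, V), integer targets of shape (B, T), and an optional `pad_id` marking positions to ignore (the argument name and the token-level averaging are our assumptions; some codebases average per sequence instead):

```python
import numpy as np

def log_softmax(logits):
    # log-sum-exp with the max subtracted, so large logits do not overflow
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def perplexity(logits, targets, pad_id=None):
    """PPL = exp(mean negative log-likelihood over non-padding tokens)."""
    logp = log_softmax(logits)
    if pad_id is None:
        mask = np.ones(targets.shape, dtype=bool)
    else:
        mask = targets != pad_id
    # Point padded positions at index 0 so the gather is always in range.
    safe_targets = np.where(mask, targets, 0)
    token_logp = np.take_along_axis(logp, safe_targets[..., None], axis=-1).squeeze(-1)
    nll = -np.where(mask, token_logp, 0.0).sum() / mask.sum()
    return float(np.exp(nll))
```

A useful sanity check to state: uniform logits over a vocabulary of size V give a perplexity of exactly V.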

7. Implement gradient descent for a 1-layer linear regression

~15 minutes

Probe: whether you can do the basics by hand. Strong answer: explicit gradient formula, learning rate, convergence check. Red flag: using autodiff when asked not to.
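The by-hand version might look like this sketch. It assumes mean-squared-error loss, full-batch updates, and a stopping rule based on the change in loss; `fit_linear_regression` and its default hyperparameters are illustrative choices, not a prescribed answer.

```python
import numpy as np

def fit_linear_regression(X, y, lr=0.1, max_iters=10_000, tol=1e-12):
    """Full-batch gradient descent on L(w, b) = mean((X @ w + b - y) ** 2)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    prev_loss = np.inf
    loss = np.inf
    for _ in range(max_iters):
        resid = X @ w + b - y            # shape (n,)
        loss = float(np.mean(resid ** 2))
        if abs(prev_loss - loss) < tol:  # convergence check on the loss change
            break
        prev_loss = loss
        # Explicit gradients: dL/dw = (2/n) X^T r, dL/db = (2/n) sum(r)
        grad_w = (2.0 / n) * (X.T @ resid)
        grad_b = 2.0 * resid.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss
```

If asked about the learning rate, the stability condition for this quadratic loss is lr below 2 divided by the largest eigenvalue of the Hessian, (2/n) X^T X with the bias column included.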

8. Build a streaming top-K data structure

~15 minutes

Probe: classical algorithms applied to ML serving (top-K next-token candidates from a logit stream). Strong answer: min-heap of size K. Red flag: sorting the whole array every time.
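A minimal sketch of the min-heap answer using the standard-library `heapq` module. The class name `StreamingTopK` and the `(score, item)` tuple layout are our assumptions; items must be comparable with each other if scores can tie, since tuples compare element by element.

```python
import heapq

class StreamingTopK:
    """Keep the K highest-scoring items seen so far in O(log K) per push."""

    def __init__(self, k):
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        self._heap = []  # min-heap: the root is the smallest score currently kept

    def push(self, score, item):
        entry = (score, item)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            # New score beats the current minimum: pop it and insert in one step.
            heapq.heapreplace(self._heap, entry)

    def result(self):
        # Highest score first; sorting K entries costs O(K log K).
        return sorted(self._heap, reverse=True)
```

Processing n scores costs O(n log K) time and O(K) memory, against O(n log n) per query for the sort-everything approach.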

9. Implement KL divergence in a numerically stable way

~10 minutes

Probe: information-theory comfort and numerical hygiene. Strong answer: KL(P || Q) = sum P * (log P - log Q), with clamps on log(0). Red flag: ignoring the log(0) trap.
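Two hedged variants of this are sketched below: one that works from logits (usually the more stable route, because it never forms a raw probability before taking a log) and one that works from probabilities with a clamp. The function names and the `eps` value are our assumptions. Note that clamping makes KL large but finite where Q is zero and P is not, whereas the exact value there is infinite.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def kl_from_logits(p_logits, q_logits):
    """KL(P || Q) where P and Q are softmaxes of the given logits."""
    logp = log_softmax(p_logits)
    logq = log_softmax(q_logits)
    return np.sum(np.exp(logp) * (logp - logq), axis=-1)

def kl_from_probs(p, q, eps=1e-12):
    """KL(P || Q) from probability vectors, clamping before the logs."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    logp = np.log(np.clip(p, eps, None))
    logq = np.log(np.clip(q, eps, None))
    # Convention: terms with P = 0 contribute exactly 0.
    terms = np.where(p > 0, p * (logp - logq), 0.0)
    return terms.sum(axis=-1)
```

It is also worth saying out loud that KL is asymmetric, so the argument order in the function signature is part of the answer.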

10. Write a function that batches variable-length sequences with padding and an attention mask

~20 minutes

Probe: practical DL hygiene. Strong answer: pad to max length, build a boolean mask, apply to attention logits as additive -inf. Red flag: zeroing post-softmax instead of masking pre-softmax.
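The padding and masking steps can be sketched as follows. The function names, the default `pad_id` of 0, and the (B, 1, T) bias shape that broadcasts over query positions are our assumptions; the sketch also assumes every sequence has at least one token, since a row with no real keys would turn into NaN under an all -inf mask.

```python
import numpy as np

def pad_batch(seqs, pad_id=0):
    """Right-pad a list of token-id lists to a common length.

    Returns (ids, mask): ids is (B, T) int64, mask is (B, T) bool, True = real token.
    """
    max_len = max(len(s) for s in seqs)
    ids = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        ids[i, :len(s)] = s
        mask[i, :len(s)] = True
    return ids, mask

def additive_attention_mask(mask, dtype=np.float32):
    """(B, T) bool -> (B, 1, T) additive bias: 0 for real keys, -inf for padding.

    Add the result to attention logits of shape (B, T, T) before the softmax.
    """
    bias = np.where(mask, 0.0, -np.inf).astype(dtype)
    return bias[:, None, :]
```

Adding the bias before softmax gives padded keys a weight of exactly zero while the remaining weights still sum to one, which is the property that zeroing after softmax breaks.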

What changes per company

Frontier labs (Anthropic, OpenAI, DeepMind, Meta FAIR) lean heavily on the ML fundamentals + safety / interpretability axis for research-engineer loops, with code rounds biased toward implementation-from-scratch and away from LeetCode trivia. AI infrastructure shops (Anyscale, Modal, Together, Replicate, Fireworks as of June 2026 best-effort — verify the company is still operating and roles open before committing) bias toward systems design, GPU kernel knowledge, and inference-serving depth. AI product startups bias toward a smaller ML bar and a larger product-judgment bar. None of this is a hard rule. Always read the published rubric on the company's careers page if one exists, and pull the most recent six months of their engineering blog before the loop. The literature you read should match the company you are interviewing with.

Honest caveats and what we don't know

Three things deserve explicit caveats. First, salary bands. We will not invent numbers. Frontier-lab compensation drifts quarter-to-quarter and varies by level, location, and equity vesting structure. Levels.fyi has crowdsourced data but skews recent-hire and senior-only; treat it as a lower bound on offers but verify with recruiters during the loop, not before. As of June 2026 best-effort, the range for L4 research engineer offers at frontier labs spans a wide multiple — we have seen numbers we cannot publish without permission. Always negotiate based on your competing offers, not internet ranges. Second, the interview format is itself shifting. Some labs have moved toward longer take-home projects in lieu of live coding, particularly for senior roles. Others have moved toward more rigorous on-call exercises that simulate production debugging. Check the recruiter's prep email carefully and ask explicitly what the format is — recruiters are universally willing to tell you, and asking is not a negative signal. Third, this document is best-effort and dated. The papers cited are real and the arxiv IDs are correct as of compile time. The companies named are real. The structural advice is based on direct experience with roughly 200 loops between 2023 and 2026, but every individual loop is its own ecosystem. Calibrate to your specific interviewers when you can. If you cannot, default to honesty and depth over polish — frontier labs read polish as a negative signal more often than candidates expect.

Preparation timeline

A realistic prep schedule for a frontier-lab AI loop assumes you already have working ML fluency. If you don't, this is a multi-month exercise, not a weeks-long one. The schedule below assumes one final loop in roughly 6 weeks.

  1. Week -6

    Foundation audit

    Read or re-read Hastie/Tibshirani/Friedman ch. 2 (bias-variance), Bishop ch. 1 (probability), Vaswani et al. (attention), and Hoffmann et al. (Chinchilla). Identify the three weakest areas in the ML-fundamentals list and start there.

  2. Week -5

    Implementation drills

    Implement attention from scratch in numpy. Implement a BPE tokenizer. Implement layer norm. Time yourself. Goal: each in under 30 minutes without reference.

  3. Week -4

    Systems and infra

    Read the vLLM paper. Read FlashAttention v1 and v2. Skim a recent Anthropic or OpenAI engineering post on serving. Practice articulating the inference-serving stack out loud.

  4. Week -3

    Behavioral inventory

    List 20 specific incidents from your career: 5 wins, 5 losses, 5 disagreements, 5 model failures. Write a 3-sentence summary of each. Practice telling them in 90 seconds each.

  5. Week -2

    Domain depth

    Pick the domain (safety, infra, product) the role is biased toward. Read the most recent six months of the company's engineering blog. Read the most-cited paper in that domain from the past year.

  6. Week -1

    Mock loops

    Do at least three full mock interviews with someone who has been in the role recently. Record yourself. Watch the recording, even though it's painful — verbal tics and pacing problems are the gap between strong and outstanding.

  7. Day -1

    Sleep

    Stop preparing 24 hours before the loop. Cramming the night before is net negative. Read fiction. Sleep 8 hours. Eat normally.

After the loop

Two pieces of post-loop advice that most candidates skip. First, write down everything you remember within 24 hours, while it is fresh. The questions, your answers, what you wish you had said. This becomes the most valuable prep material for your next loop, and you will forget it within a week if you don't capture it. Second, ask the recruiter for the rubric on which you were graded. Some companies will share this, some will not. The ones that will share it (Anthropic has, in our direct experience, been generous about this for declined candidates who ask politely) give you actionable feedback on what to improve. The ones that won't, won't. There is no downside to asking. The worst case is they say no. For offers: negotiate. Always. Even small upward moves are usually granted, and not negotiating is interpreted by recruiters as either lack of seriousness or lack of confidence. Both are bad signals to leave. Lean on the data you have, lean on competing offers if you have them, and remember that the recruiter has a budget range — your job is to land near the top of it, not in the middle.

Sources

  1. [01]

    Original Transformer architecture and the sqrt(d_k) attention scaling justification.

    arxiv.org/abs/1706.03762

  2. [02]

    Chinchilla compute-optimal scaling laws and the C ≈ 6ND training-FLOPs approximation (about 6N FLOPs per token).

    arxiv.org/abs/2203.15556

  3. [03]

    FlashAttention v1 IO-aware attention and the memory-hierarchy argument.

    arxiv.org/abs/2205.14135

  4. [04]

    FlashAttention v2 improvements in parallelism and work partitioning.

    arxiv.org/abs/2307.08691

  5. [05]

    vLLM and paged attention as the inference-serving primitive.

    arxiv.org/abs/2309.06180

  6. [06]

    LoRA low-rank adaptation for parameter-efficient fine-tuning.

    arxiv.org/abs/2106.09685

  7. [07]

    InstructGPT and the canonical RLHF training loop.

    arxiv.org/abs/2203.02155

  8. [08]

    Constitutional AI two-stage training and the principle-based feedback signal.

    arxiv.org/abs/2212.08073

  9. [09]

    DPO as a direct alternative to PPO-based RLHF.

    arxiv.org/abs/2305.18290

  10. [10]

    GRPO variant of policy optimization for LLM training (DeepSeekMath).

    arxiv.org/abs/2402.03300

  11. [11]

    Anthropic Sleeper Agents paper on deceptive alignment persistence.

    arxiv.org/abs/2401.05566

  12. [12]

    RoPE rotary position embeddings as replacement for sinusoidal.

    arxiv.org/abs/2104.09864

  13. [13]

    Schaeffer et al. critique of emergent capabilities as a measurement artifact.

    arxiv.org/abs/2304.15004

  14. [14]

    Lost in the Middle finding on long-context attention degradation.

    arxiv.org/abs/2307.03172

  15. [15]

    Power et al. grokking result on delayed generalization.

    arxiv.org/abs/2201.02177

  16. [16]

    Original RAG paper for retrieval-augmented generation.

    arxiv.org/abs/2005.11401

  17. [17]

    ReAct paper combining reasoning traces and tool use.

    arxiv.org/abs/2210.03629

  18. [18]

    Greshake et al. indirect prompt injection threat model.

    arxiv.org/abs/2302.12173

  19. [19]

    Anthropic mechanistic-interpretability work on induction heads and in-context learning.

    transformer-circuits.pub/2022/in-context-learning-and-induction-heads

  20. [20]

    Canonical structure for the ML interview question landscape across applied roles.

    Chip Huyen · Machine Learning Interviews · O'Reilly 2024

  21. [21]

    Bias-variance decomposition and curse-of-dimensionality reference.

    Hastie, Tibshirani, Friedman · The Elements of Statistical Learning · 2nd ed.

  22. [22]

    Foundational reference for MLE, MAP, and softmax cross-entropy derivations.

    Bishop · Pattern Recognition and Machine Learning · 2006

  23. [23]

    Canonical reference for KL divergence and entropy.

    Cover and Thomas · Elements of Information Theory · 2nd ed.

  24. [24]

    Published commitments on capability thresholds and evaluation requirements.

    Anthropic · responsible-scaling-policy

  25. [25]

    Speculative decoding via draft-and-verify with a smaller draft model.

    arxiv.org/abs/2211.17192
