
::deep-dive

Capability Evaluation

Benchmarks, evals, and the science of measuring what frontier models can actually do

Evaluation is the bottleneck of frontier AI research. Without good evaluations you cannot tell whether a model is improving, compare alternative approaches, identify safety-relevant capabilities, or make grounded claims about progress.

The field has accumulated dozens of standard benchmarks: MMLU (Hendrycks et al., Measuring Massive Multitask Language Understanding, 2020) for broad knowledge, BIG-bench (BIG-bench Collaboration, 2022) for diverse reasoning tasks, HumanEval (Chen et al., 2021) for code generation, MATH (Hendrycks et al., 2021) for mathematical reasoning, GPQA (Rein et al., 2023) for graduate-level science questions, ARC-AGI (Chollet, 2019) for abstract reasoning, and SWE-bench (Jimenez et al., 2023) for real-world software engineering, alongside the HELM holistic evaluation framework (Liang et al., Stanford 2022). On the safety and capability side, METR's (formerly ARC Evals) work on autonomous task evaluation, together with model-organism-of-misalignment experiments and dangerous-capability evaluations (cyber, bio, autonomous replication), defines the modern frontier-lab evaluation methodology. METR's autonomy and uplift evaluations are the canonical modern dangerous-capability evals, and the Inspect framework (UK AISI) and the OpenAI Evals framework are the canonical evaluation infrastructures.

A doctorate-grade learner needs to understand:

  • what each major benchmark actually measures, and what it does not;
  • the methodology of constructing a benchmark: inter-annotator agreement, contamination concerns, and distribution-mismatch concerns;
  • the difference between reference-based metrics (BLEU, ROUGE, F1) and model-graded metrics (LLM-as-judge);
  • the saturation problem: every major benchmark eventually saturates and stops differentiating models;
  • the contamination problem: training data leaks into evaluation sets, and must be detected and mitigated;
  • the elicitation problem: a model's capability is not what it does by default, but what can be elicited from it with the best prompting, scaffolding, and fine-tuning.

By the end of this path you should be able to read a benchmark paper critically, design an evaluation for a novel capability, and recognize the difference between a saturated benchmark and a benchmark that still has signal.
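
The contamination problem described above is often probed by looking for long word n-grams that an evaluation item shares with training text. What follows is a minimal sketch of that idea, not a definitive detector: the function names `ngrams` and `contamination_rate` are hypothetical, the tokenizer is a crude lowercase word split, and the default window of 8 words is an arbitrary assumption that a real audit would tune.

```python
import re


def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Texts shorter than n words yield an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(eval_items, corpus_docs, n=8):
    """Fraction of eval items that share at least one n-gram with the corpus.

    Assumption: the corpus is small enough to index in memory. A real
    pretraining corpus would need hashing or a Bloom filter instead.
    """
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    if not eval_items:
        return 0.0
    flagged = sum(1 for item in eval_items if ngrams(item, n) & corpus_grams)
    return flagged / len(eval_items)


if __name__ == "__main__":
    items = [
        "What is the capital of France and why is it famous today",
        "Compute the derivative of x squared",
    ]
    corpus = ["Quiz dump: what is the capital of France and why is it famous today? Paris."]
    print(contamination_rate(items, corpus))
```

Exact n-gram matching misses paraphrased or translated leaks, so a low rate from a check like this is weak evidence of a clean benchmark rather than proof.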

::reading path · in order

  1. ::01 · paper

    ~4h

    Measuring Massive Multitask Language Understanding — Hendrycks, Burns, Basart, Zou, Mazeika, Song, Steinhardt (MMLU paper, 2020)

    The canonical broad-knowledge benchmark. Read for the construction methodology, and for understanding what it does not measure.

  2. ::02 · paper

    ~8h

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models — BIG-bench Collaboration (2022)

    Two hundred and four diverse tasks. Read the paper and inspect the task breakdown to develop taste for what makes a good eval.

  3. ::03 · paper

    ~3h

    Evaluating Large Language Models Trained on Code — Chen et al. (HumanEval paper, OpenAI 2021)

    An early and still widely cited functional-correctness benchmark for code generation. Foundational even though frontier models have largely saturated it.

  4. ::04 · paper

    ~3h

    Measuring Mathematical Problem Solving With the MATH Dataset — Hendrycks et al. (2021)

    Mathematical reasoning benchmark. Still useful pedagogically even as frontier models exceed strong-human performance.

  5. ::05 · paper

    ~4h

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark — Rein, Hou, Stickland, Petty, Pang, Dirani, Michael, Bowman (2023)

    Graduate-level science questions written by domain experts and designed to be "Google-proof": hard for skilled non-experts even with unrestricted web access, and intended to resist saturation.

  6. ::06 · paper

    ~12h

    Holistic Evaluation of Language Models — Liang et al. (HELM paper, Stanford CRFM 2022)

    The Stanford holistic evaluation framework. Long paper; read for the methodology and the multi-axis framing of evaluation.

  7. ::07 · paper

    ~5h

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan (2023)

    Real-world software engineering benchmark built from actual GitHub issues. A widely used reference point for agentic capability evaluation.

  8. ::08 · paper

    ~6h

    ARC-AGI — On the Measure of Intelligence and the ARC challenge (Chollet, 2019)

    Chollet's framework for evaluating general intelligence. Read both the paper and the ARC website.

  9. ::09 · blog

    ~8h

    METR — Evaluating Frontier AI R&D Capabilities of Language Model Agents (metr.org research)

    Modern frontier-lab dangerous-capability evaluation methodology. Read the public reports.

  10. ::10 · blog

    ~3h

    Anthropic — Responsible Scaling Policy (anthropic.com)

    How a frontier lab formally ties evaluations to deployment decisions. Useful as policy-eval interface.

  11. ::11 · code

    ~15h

    Inspect framework — UK AI Safety Institute (inspect.ai-safety-institute.org.uk)

    A modern evaluation framework. Read the docs, then build a small custom eval.

  12. ::12 · code

    ~10h

    OpenAI Evals (github.com/openai/evals)

    Open-source eval framework with many example evals. Useful both as tool and as corpus of evaluation patterns.
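
The HumanEval paper in the reading path above scores models with pass@k: the probability that at least one of k sampled completions passes the unit tests. The paper estimates it from n samples per problem, of which c pass, as 1 - C(n-c, k) / C(n, k). The sketch below implements that formula with the standard library; the function names and the `(n, c)` input layout are assumptions made for illustration, not the interface of any particular harness.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    if not results:
        raise ValueError("no problems to score")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Three hypothetical problems, 10 samples each, with 3, 0 and 10 passes.
    results = [(10, 3), (10, 0), (10, 10)]
    print(benchmark_pass_at_k(results, k=1))
    print(benchmark_pass_at_k(results, k=5))
```

Sampling n larger than k and using this estimator gives a lower-variance number than drawing exactly k samples and checking whether any of them passes.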

::exercises · build · derive · reproduce

  1. Reproduce an MMLU score for a small open model. Then introduce a small amount of test contamination and observe the score shift.
  2. Build a custom eval in Inspect (or OpenAI Evals) for a capability you care about. Include adversarial and out-of-distribution test cases.
  3. Implement LLM-as-judge for a subjective task and validate against human ratings on at least 50 examples.
  4. Read the GPQA paper and inspect the diamond subset. Attempt several questions yourself before reading the answers.
  5. Audit a benchmark you care about for training-data contamination. Document your methodology and findings.
  6. Design a dangerous-capability evaluation for one specific capability (e.g., autonomous replication of a software project). Justify the threshold.
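
For the LLM-as-judge exercise above, raw percent agreement with human raters can look high simply because one label dominates. Cohen's kappa corrects for agreement expected by chance. The sketch below is one possible way to compute it, assuming categorical labels and a single human rating per example; `cohens_kappa` is a hypothetical helper name, and returning 1.0 in the degenerate single-label case is a convention chosen here, not a standard.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two equal-length lists of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: fraction of examples where the labels match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if both raters labeled independently at their
    # own marginal label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        # Both raters used the same single label everywhere; kappa is
        # undefined, so report perfect agreement by convention (assumption).
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    human = ["good", "good", "bad", "bad"]
    judge = ["good", "bad", "bad", "bad"]
    print(cohens_kappa(human, judge))
```

With only 50 examples the estimate is noisy, so it is worth reporting the number of rated items alongside kappa and reading the disagreements by hand.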

::milestones · observable

  • You can construct a benchmark from scratch.
  • You can identify when a benchmark has saturated.
  • You can detect training data contamination in an eval.
  • You can defend or critique an evaluation methodology.
  • You have actually run an eval against a real model and reproduced a published number.
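
One rough way to make the saturation milestone concrete is to ask whether the gap between two models' accuracies exceeds sampling noise for the benchmark's size. The sketch below is a simplification under stated assumptions: items are treated as independent, the binomial normal approximation is used, and the two models are compared unpaired (a paired test on per-item results would be more sensitive). The function names and the 1.96 threshold are illustrative choices.

```python
from math import sqrt


def accuracy_stderr(acc, n_items):
    """Binomial standard error of an accuracy measured on n_items questions."""
    if n_items <= 0 or not 0.0 <= acc <= 1.0:
        raise ValueError("need n_items > 0 and 0 <= acc <= 1")
    return sqrt(acc * (1.0 - acc) / n_items)


def is_distinguishable(acc_a, acc_b, n_items, z_threshold=1.96):
    """True if the accuracy gap exceeds z_threshold combined standard errors."""
    se = sqrt(accuracy_stderr(acc_a, n_items) ** 2
              + accuracy_stderr(acc_b, n_items) ** 2)
    if se == 0.0:
        # Both scores are exactly 0 or 1, so there is no sampling noise
        # under this approximation.
        return acc_a != acc_b
    return abs(acc_a - acc_b) / se > z_threshold


if __name__ == "__main__":
    # Hypothetical 200-item benchmark.
    print(is_distinguishable(0.97, 0.98, 200))  # near the ceiling
    print(is_distinguishable(0.50, 0.70, 200))  # mid-range
```

When the top models all sit within noise of each other near the ceiling, the benchmark has probably stopped differentiating them, whatever the leaderboard order says.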

LAB · ATOMEONS · MARCO ISLAND FLÆONS RESEARCH · 12 PAPERS · CC-BY 4.0