
How models learned to teach the next generation, and why textbooks beat the open web

Synthetic data — when AI trains AI

The internet ran out. Labelers got expensive. So researchers asked the obvious question — can models generate their own training data?

Self-Instruct — the spark

The seminal paper was Yizhong Wang et al.'s **Self-Instruct** (University of Washington, late 2022). The pipeline was almost embarrassingly simple. Start with ~175 seed instructions written by humans. Prompt GPT-3 to generate similar instructions. Filter for diversity and quality. Then prompt GPT-3 again to generate inputs and outputs for each instruction. You end up with ~52,000 instruction-following examples for almost no human labor.

The kicker: fine-tuning the base GPT-3 on this synthetic corpus produced a model (called GPT-3 Self-Instruct) that performed within 5 percentage points of InstructGPT, which had been trained on hand-written instructions and expensive human feedback. The cost difference was on the order of 1000x.

Stanford's **Alpaca** project (March 2023) popularized the recipe. The team took LLaMA-7B, fine-tuned it on 52K instruction-following examples generated by `text-davinci-003` using the Self-Instruct method, and produced a model that hobbyists could run locally and that behaved suspiciously like ChatGPT. The total cost was under $600: roughly $500 in API calls to generate the data and under $100 of compute for the fine-tune. That was the moment everyone realized the labeling economy was about to invert.
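The Self-Instruct loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `propose` and `answer` are hypothetical callables standing in for the two language-model prompts, and the novelty filter uses a simple ROUGE-L overlap score with a threshold of 0.7, which should be treated as a tunable assumption.

```python
import random
from typing import Callable, Dict, List


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(a: str, b: str) -> float:
    """ROUGE-L F1 between two strings, on lowercased whitespace tokens."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_len(ta, tb)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(ta), lcs / len(tb)
    return 2 * p * r / (p + r)


def self_instruct(
    seeds: List[str],
    propose: Callable[[List[str]], List[str]],  # hypothetical: model call that writes new instructions
    answer: Callable[[str], str],               # hypothetical: model call that writes an output
    target: int,
    max_sim: float = 0.7,   # assumed novelty threshold
    n_demos: int = 3,
    max_rounds: int = 1000,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Grow an instruction pool from human seeds, then answer each new instruction."""
    rng = random.Random(seed)
    pool = list(seeds)
    rounds = 0
    while len(pool) < target and rounds < max_rounds:
        rounds += 1
        # Show the generator a few in-context examples drawn from the current pool.
        demos = rng.sample(pool, min(n_demos, len(pool)))
        for cand in propose(demos):
            cand = cand.strip()
            # Keep a candidate only if it is sufficiently different from everything kept so far.
            if cand and all(rouge_l(cand, old) < max_sim for old in pool):
                pool.append(cand)
    new_instructions = pool[len(seeds):]
    return [{"instruction": i, "output": answer(i)} for i in new_instructions]
```

In practice `propose` and `answer` would wrap calls to a teacher model; plugging in plain Python stubs is enough to exercise the filtering logic.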

Phi — the textbook-quality bet

Microsoft's **Phi** family (Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, and team) took the synthetic-data idea to its logical extreme. Their thesis, captured in the 2023 paper *"Textbooks Are All You Need,"* was that web data is the wrong substrate. A model trained on a small amount of textbook-quality, pedagogically structured synthetic text would beat a much larger model trained on the open internet.

**Phi-1** was 1.3 billion parameters trained on 7 billion tokens of filtered code and synthetic textbook content. It beat models 10-25x its size on HumanEval coding benchmarks. **Phi-2** (2.7B parameters) matched 13B-parameter Llama-2 on most reasoning tasks. With **Phi-3** and **Phi-4** (2024), synthetic text became an ever larger share of the pretraining corpus, and by Phi-4 it made up the bulk of it: GPT-4 wrote graduate-level explanations of physics, law, biology, and code, and Phi was trained on those.

The Phi line crystallized a finding that is now widely accepted across the field: **token for token, high-quality synthetic data outperforms scraped web data**, often by a large margin. The internet has breadth. Synthetic has signal density.

Open pipelines that ship the recipe

Three groups of open efforts deserve naming.

**OpenWebMath** (Keiran Paster et al., 2023) demonstrated that even when staying with real web data, aggressive filtering and reformatting for mathematical content yielded a corpus that punched far above its weight on math reasoning benchmarks. It established the template: find the dense pockets in the open web, extract them, clean them, treat them as gold.

**Tulu** (Allen Institute for AI, Hamish Ivison and team, 2023-2024) is the open instruction-tuning pipeline. Tulu 2 and Tulu 3 published not just the model weights but the full data mixtures: what synthetic instructions came from where, what human-written data was blended in, what the deduplication and contamination filters looked like. Tulu 3 in late 2024 was competitive with closed-source instruction-tuned models on most benchmarks using a fully transparent synthetic-plus-curated recipe.

The third group is the set of teacher-student projects in the same spirit as Phi. **Orca** (Microsoft Research, Mukherjee et al., 2023) generated synthetic step-by-step explanations from GPT-4 and used them to teach smaller models to reason. **WizardLM** (Xu et al., 2023) introduced "Evol-Instruct," which iteratively rewrites instructions to make them harder and more diverse. **Nemotron-4 340B Instruct** (NVIDIA, 2024) was released specifically as a synthetic-data-generation engine, with permissive licensing meant to let other labs use it to make training corpora.
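A contamination filter of the kind mentioned above can be as simple as dropping any training example that shares a long word n-gram with a benchmark item. The sketch below is a generic illustration, not Tulu's actual filter: the function names and the 8-token window are assumptions chosen for the example.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """All word n-grams of a text, on lowercased whitespace tokens."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def decontaminate(train: Iterable[str], benchmark: Iterable[str], n: int = 8) -> List[str]:
    """Return the training examples that share no n-gram with any benchmark item."""
    banned: Set[Tuple[str, ...]] = set()
    for item in benchmark:
        banned |= ngrams(item, n)
    # An example is dropped as soon as one of its n-grams appears in the benchmark.
    return [ex for ex in train if not (ngrams(ex, n) & banned)]
```

A smaller `n` catches more paraphrased leaks at the cost of more false positives; texts shorter than `n` tokens produce no n-grams and are never flagged.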

The catch — model collapse

The 2023 preprint *"The Curse of Recursion"* (Shumailov et al.), published in revised form in Nature in 2024 as *"AI models collapse when trained on recursively generated data,"* made one thing crisp. If you train a model on data generated by the previous model, and then train the next model on data from that one, and so on, quality degrades. The tails of the distribution disappear. Rare events get forgotten. The model converges to a confident, narrow version of itself. This is **model collapse**.

The working solution as of 2026 is straightforward and labor-intensive: synthetic data must be **anchored to ground truth**. That means either (a) the strong teacher model is checked against verifiable signals (code that runs, math that proves, citations that exist), or (b) a hard percentage of the training mix stays human-authored or directly extracted from primary sources. Typical mixes run from roughly 30% to 70% synthetic, with the remainder real, depending on domain.
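The mechanism is easy to reproduce in a toy setting. The sketch below is an illustration under strong assumptions, not the paper's experiment: the "model" is a one-dimensional Gaussian refit each generation to a small sample drawn from the previous generation's fit, and `real_fraction` (a hypothetical parameter) controls how much of each generation's data comes from the true distribution instead.

```python
import random
import statistics
from typing import List


def run_generations(
    n_samples: int = 10,
    generations: int = 300,
    real_fraction: float = 0.0,
    seed: int = 0,
) -> List[float]:
    """Refit a Gaussian generation after generation; return the fitted sigma at each step."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the true distribution N(0, 1)
    history = [sigma]
    n_real = round(n_samples * real_fraction)
    for _ in range(generations):
        # Synthetic samples come from the previous generation's fitted model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]
        # Anchor samples come from the true distribution, which never changes.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        mu, sigma = statistics.fmean(data), statistics.pstdev(data)
        history.append(sigma)
    return history


if __name__ == "__main__":
    collapsed = run_generations(real_fraction=0.0)
    anchored = run_generations(real_fraction=0.5)
    print(f"final sigma, synthetic only: {collapsed[-1]:.3g}")
    print(f"final sigma, half real data: {anchored[-1]:.3g}")
```

With purely synthetic data the fitted spread drifts toward zero, which is the disappearing-tails effect in miniature; keeping a fixed share of real samples in every generation holds the spread near its true scale.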

Where it stands in 2026

The economics inverted exactly as predicted. A frontier lab can spend $50 million on GPU-hours generating synthetic curricula and get more capability lift than it would from another $200 million spent on human labelers. Anthropic, OpenAI, Google DeepMind, Meta, Microsoft, and xAI all run hybrid pipelines now: human-written seeds, model-generated expansions, model-as-judge filtering, and ground-truth verifiers wherever the domain allows them (code, math, scientific facts).

The next frontier is **active synthetic data**: models that don't just generate training corpora but identify their own weaknesses and generate targeted curricula to fix them. DeepMind's work on AlphaProof and AlphaGeometry, OpenAI's reasoning-data pipeline behind o1 and o3, and Anthropic's constitutional-AI data generation all point in this direction. The internet trained the first generation of large models. The second generation is training itself.
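The ground-truth verifier step in such a hybrid pipeline can be illustrated with arithmetic, where correctness is cheap to check. This is a minimal sketch under assumptions: the sample format (`expression` and `answer` keys) and the function names are hypothetical, and a real pipeline would use unit tests for code or a proof checker for formal math.

```python
import ast
import operator
from typing import Dict, List

# Only these binary operators are accepted by the checker.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))


def verified_only(samples: List[Dict], tol: float = 1e-9) -> List[Dict]:
    """Keep only the synthetic samples whose stated answer matches the recomputed one."""
    kept = []
    for s in samples:
        try:
            truth = evaluate(s["expression"])
        except (ValueError, SyntaxError, ZeroDivisionError):
            continue  # anything that cannot be verified is dropped, not trusted
        if abs(truth - s["answer"]) <= tol:
            kept.append(s)
    return kept
```

The design choice worth noting is the default: a sample that cannot be checked is discarded, so the verifier can only shrink the synthetic set, never admit an unverified answer.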
