

Post-training: surgery on a model you didn't pretrain

A field guide to LoRA, QLoRA, DPO, distillation, merging, and the question almost nobody asks first — should I be fine-tuning at all?

Pretraining a frontier model costs tens to hundreds of millions of dollars. Post-training is what almost everyone else means when they say "AI work." It is the layer of techniques used to bend an already-trained model toward a specific task, voice, domain, or behavior without retraining it from scratch and, in most cases, without updating more than a small fraction of its parameters. Most of what you read on the internet about "training an AI" is actually post-training.

This page is a working map of the surgical instruments. LoRA and QLoRA, which let you adapt a 65-billion-parameter model on a single 48GB GPU. DPO, which replaced the RLHF pipeline for most preference-tuning work between 2023 and 2025. The PEFT family (adapters, prompt tuning, prefix tuning, P-tuning), most of them now exposed through a few lines of configuration in Hugging Face's PEFT library. Knowledge distillation, which Hinton, Vinyals, and Dean formalized in 2015 and which still underwrites nearly every "small model that punches above its weight." Speculative decoding, which is technically inference rather than training but lives in the same toolkit. And the merging methods (TIES, DARE, model soups) that let you combine several fine-tuned checkpoints without retraining.

The honest part of the page is the last section, which says: for an individual builder in 2026, the answer to "should I fine-tune?" is almost always no. Retrieval-augmented generation is cheaper. Prompt engineering is faster. A good system prompt plus a small evaluation set will outperform an undisciplined LoRA run on a Tuesday afternoon. Fine-tuning earns its place when you have a real distribution shift, a real evaluation harness, and a real reason the base model cannot already do the job. We will name the conditions.

Voice: plain English, real citations, no marketing. If something is uncertain as of June 2026, the prose will say so.

What post-training actually means

Pretraining is the multi-month, multi-million-dollar pass where a model learns next-token prediction on a corpus the size of the indexable internet. Post-training is everything you do after that. It includes supervised fine-tuning on curated examples, preference optimization (RLHF, DPO, and friends), parameter-efficient adaptation methods like LoRA, knowledge distillation into smaller students, quantization, model merging, and inference-time tricks like speculative decoding. The defining feature is that you are not relearning the world. You are nudging a model that already knows the world toward a specific shape of behavior. The practical consequence is that the cost surface collapses by three to five orders of magnitude. A LoRA run that adapts a 7B model to your writing voice can finish overnight on a single rented A100. A QLoRA run can fit a 65B model on a 48GB card. A distillation run that produces a small student capable of impersonating a much larger teacher is a weekend project for someone with discipline and a clean dataset. Pretraining was for Google and Meta and Anthropic. Post-training is for the rest of us, when it is needed at all.

The minimum-effective-dose ladder

Before you reach for any technique on this page, climb this ladder from the bottom. Every rung above zero costs money, time, and the ongoing maintenance of a custom artifact. Stay on the lowest rung that actually solves the problem.

Rung 0 — better prompt. Most behavioral problems are prompt problems. Rewrite the system prompt with explicit constraints, worked examples, and a clear output schema. Free, takes an hour, no maintenance.
Rung 1 — few-shot examples in context. Drop three to ten high-quality examples into the prompt. This is the closest thing to gradient descent that doesn't require gradient descent. Still free, still no maintenance.
Rung 2 — retrieval-augmented generation (RAG). For domain knowledge, freshness, and proprietary data, RAG beats fine-tuning on cost and traceability. The model stays general; the facts live in a searchable index you control.
Rung 3 — tool use and function calling. If the model needs to compute, query, or act, give it tools. Don't try to fine-tune arithmetic or API knowledge into the weights.
Rung 4 — LoRA or QLoRA on a small open-weights model. The first rung that involves training. Used for style, voice, domain-specific output formats, or when you need a self-hosted model that behaves a specific way.
Rung 5 — full fine-tuning, distillation, or preference optimization (DPO). Reserved for serious work — production systems with measurable evaluations, real distribution shift, and a team that can maintain the artifact.

The PEFT family at a glance

Parameter-efficient fine-tuning methods all share the same idea — freeze the original weights, add a small number of new trainable parameters, and let the model adapt through the new parameters only. They differ in where the new parameters live and what they look like. Numbers below are typical ranges from the cited papers; your mileage will vary by model size and task.

Method	Year	Where the new parameters live	Typical trainable %	Primary citation
Adapters (Houlsby)	2019	Bottleneck modules inserted inside each transformer layer, after the attention and feed-forward sublayers	~3.6% per task	arxiv.org/abs/1902.00751
Prefix tuning	2021	Continuous task-specific vectors prepended in attention	~0.1%	arxiv.org/abs/2101.00190
Prompt tuning	2021	Soft prompt embeddings at the input layer only	<0.1%	arxiv.org/abs/2104.08691
P-tuning	2021	Trainable continuous prompts combined with discrete prompts	<0.1%	arxiv.org/abs/2103.10385
LoRA	2021	Low-rank decomposition added in parallel to attention weights	~0.1-1%	arxiv.org/abs/2106.09685
QLoRA	2023	LoRA on top of a 4-bit quantized frozen base model	~0.1-1% with ~4x less memory	arxiv.org/abs/2305.14314
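
The percentages and memory figures in the table follow from simple counting. Here is a rough sketch of that arithmetic, assuming a hypothetical 7B-class model with hidden size 4096 and 32 layers, and LoRA at rank 16 on all four attention projections. These configuration values are illustrative assumptions, not numbers from the cited papers or any model card.

```python
def lora_trainable_params(d_in, d_out, rank, matrices_per_layer, num_layers):
    """Parameters LoRA adds: each adapted matrix gets A (rank x d_in) and B (d_out x rank)."""
    per_matrix = rank * d_in + d_out * rank
    return per_matrix * matrices_per_layer * num_layers


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes). Ignores activations and optimizer state."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed, illustrative configuration (hypothetical 7B-class model).
HIDDEN = 4096
LAYERS = 32
TOTAL_PARAMS = 7_000_000_000

added = lora_trainable_params(HIDDEN, HIDDEN, rank=16, matrices_per_layer=4, num_layers=LAYERS)
fraction = added / TOTAL_PARAMS      # share of the model that is trainable
adapter_mb = added * 2 / 1e6         # adapter size if stored in 16-bit floats

# Weight memory for a 65B model: 16-bit versus 4-bit storage.
full_precision_gb = weight_memory_gb(65e9, 16)
four_bit_gb = weight_memory_gb(65e9, 4)

if __name__ == "__main__":
    print(f"LoRA adds {added:,} parameters ({fraction:.2%} of the model, ~{adapter_mb:.0f} MB)")
    print(f"65B weights: {full_precision_gb:.1f} GB at 16-bit, {four_bit_gb:.1f} GB at 4-bit")
```

Under these assumptions the adapter is about a quarter of a percent of the model and a few tens of megabytes on disk. The 4-bit weights of a 65B model come to about 32.5GB, which is why a 48GB card becomes workable once activations and adapter state are added on top.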

LoRA and QLoRA — the workhorses

Hu et al. (2021) at Microsoft proposed Low-Rank Adaptation on the observation that the update matrices produced during fine-tuning have low intrinsic rank. That is, the change you need to make to a weight matrix W is well approximated by the product of two much smaller matrices, B and A, whose shared inner dimension is the rank r, typically 4, 8, 16, or 32. Instead of updating W directly, LoRA freezes it and learns BA in parallel. At inference you can either fold BA back into W (zero added latency) or keep them separate (so you can swap adapters per request). A LoRA adapter for a 7B model is typically a few tens of megabytes, not the roughly 14GB of the full 16-bit weights. You can train one on a single consumer GPU in hours and serve dozens against the same base model in production. The Hugging Face PEFT library exposes LoRA as a roughly five-line wrapper around any Transformers model, which is why it became the default for community fine-tuning between 2022 and 2026.

Dettmers et al. (2023) at the University of Washington pushed LoRA further with three memory-saving techniques. First, 4-bit NormalFloat (NF4), a data type optimized for the approximately Gaussian distribution of trained weights. Second, double quantization, where the quantization constants themselves are quantized. Third, paged optimizers, which use NVIDIA unified memory to absorb the memory spikes that would otherwise blow out VRAM. Together these let you fine-tune a 65B-parameter model on a single 48GB GPU while preserving the task performance of 16-bit fine-tuning. That sentence would have been science fiction in 2022. QLoRA is what made open-weights fine-tuning a hobbyist activity, and as of June 2026 it remains the default starting point for anyone fine-tuning a 30B-plus model on consumer hardware.

The failure modes are real and worth naming. LoRA tends to underperform full fine-tuning when the task requires the model to acquire genuinely new capabilities rather than redirect existing ones. Rank selection is a free hyperparameter most people pick by superstition. And, the one nobody talks about, a LoRA trained on a base model is brittle to base-model upgrades: the adapter is a delta against one specific set of weights, so when the upstream model is retrained, your adapter may need to be retrained too.
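
The two serving options for a LoRA adapter (keep BA separate, or fold it into W) give the same output, and a toy example makes that concrete. This is a minimal pure-Python sketch with made-up 3x4 weights and rank 1. It omits the alpha/r scaling factor the LoRA paper applies to BA, and it is not how the PEFT library implements the operation.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(M, N):
    """Element-wise sum of two matrices of the same shape."""
    return [[m + n for m, n in zip(r1, r2)] for r1, r2 in zip(M, N)]


# Frozen pretrained weight W (d_out x d_in = 3 x 4). Values are made up.
W = [[1.0, 0.0, 2.0, -1.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, -2.0, 1.0, 3.0]]

# Trainable low-rank factors with rank r = 1: B is d_out x r, A is r x d_in.
B = [[0.5], [-1.0], [2.0]]
A = [[1.0, 0.0, -1.0, 2.0]]

x = [1.0, 2.0, 3.0, 4.0]

# Option 1: keep the adapter separate, h = W x + B (A x). Adapters can be swapped per request.
separate = [w + d for w, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Option 2: fold the update into the weights once, W' = W + B A, then h = W' x.
W_merged = add(W, matmul(B, A))
merged = matvec(W_merged, x)

if __name__ == "__main__":
    print(separate, merged)
```

Here W has 12 entries while B and A together have 7. At realistic sizes (thousands of rows and columns, rank in the tens) the gap is what makes adapters small.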

DPO replaced RLHF for most teams

Reinforcement learning from human feedback was the post-training secret of GPT-3.5 and the original ChatGPT: a three-stage pipeline of supervised fine-tuning, reward model training, and PPO. It worked, but it was unstable, expensive, and required a separate reward model. Rafailov et al. (2023) at Stanford proposed Direct Preference Optimization: a single-stage loss function that optimizes the policy directly on a dataset of (preferred, rejected) pairs, with no reward model and no on-policy rollouts. The only extra ingredient is a frozen reference copy of the starting model, which anchors how far the policy is allowed to drift. The paper's title, 'Your Language Model is Secretly a Reward Model', captures the trick. DPO is a closed-form rewriting of the RLHF objective.

Between 2023 and 2026, DPO and its descendants (IPO, KTO, ORPO, and others; check the literature for the current state of the art) became the default preference-optimization method for open-weights teams. The original RLHF pipeline is still used at frontier labs where the budget and engineering capacity exist to do it well, but for everyone else, DPO is the answer. If you are an individual trying to align a 7B model to your preferences, DPO on a few thousand preference pairs is the technique to reach for, not classical RLHF.
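
The DPO objective for a single preference pair is short enough to write out. This sketch assumes you already have the summed log-probability of each response under the policy being trained and under a frozen reference model. The function name, argument names, and the default beta of 0.1 are illustrative choices, not any library's API.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) pair, given sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), computed in a form that avoids overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0 and the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the preferred response: the loss drops.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the preferred response and lowers the rejected one relative to the reference, which is the whole training signal. No reward model appears anywhere in the computation.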

Knowledge distillation — the old reliable

Hinton, Vinyals, and Dean (2015) formalized knowledge distillation as training a small 'student' model to match the soft probability distribution produced by a large 'teacher' model, rather than the hard labels in the original dataset. The soft distribution carries more information: it tells the student not just which class is correct but how confident the teacher is and which alternatives the teacher considered. A temperature parameter softens the teacher's distribution further to expose this structure.

Distillation has aged remarkably well. Almost every 'small model that punches above its weight' announcement since 2019, from DistilBERT through the Phi family through the various Llama-derived small models, has distillation in its lineage somewhere. The honest take in 2026 is that distillation is the underlying technique whenever a lab ships a smaller model that 'matches' a larger one on some benchmark: they did not get lucky. They trained the small one on the outputs of the big one, often using rejection sampling or chain-of-thought traces from the teacher. The original Hinton paper is the conceptual root; the modern practice is far richer.
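
The soft-target idea can be sketched in a few lines. The example below softens both teacher and student logits with a temperature and measures their mismatch with KL divergence, scaled by T squared as the Hinton paper suggests so that gradient magnitudes stay comparable across temperatures. The logits are made up, and KL is used here in place of the paper's cross-entropy with soft targets (the two differ only by a term that does not depend on the student).

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [5.0, 2.0, 0.5]  # made-up logits for a 3-class problem
    print(softmax(teacher, 1.0))  # sharp: almost all mass on class 0
    print(softmax(teacher, 4.0))  # softened: the runner-up classes become visible
    print(distillation_loss([0.0, 0.0, 0.0], teacher))
```

At temperature 1 the teacher looks almost like a hard label. Raising the temperature exposes how it ranks the alternatives, which is the extra information the student learns from.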

Speculative decoding — inference, not training

Leviathan, Kalman, and Matias (Google, ICML 2023) introduced speculative decoding as a way to make autoregressive generation faster without changing the output distribution at all. The idea: run a small fast 'draft' model to propose the next K tokens, then run the large 'target' model once in parallel to score them. Tokens are checked left to right: a drafted token is always kept when the target assigns it at least as much probability as the draft did, and otherwise kept with probability equal to the ratio of the two. At the first rejection the remaining drafted tokens are discarded and the target supplies a corrected token from an adjusted distribution, which is what keeps the output distribution identical to the target's. Because the target's forward pass is dominated by memory bandwidth rather than arithmetic, verifying K tokens costs roughly the same as generating one, so the speedup is real. The paper reports 2x to 3x wall-clock speedups on T5-XXL with no quality loss.

This belongs in a post-training page even though it is technically an inference technique, because the draft model is usually distilled from or trained alongside the target model, and because in production it is one of the highest-leverage knobs for serving open-weights models cheaply. If you self-host a 70B model and you are not using speculative decoding (or one of its descendants, such as Medusa, EAGLE, or lookahead decoding, each of which reports gains over the original in its own paper), you are probably leaving money on the table. As of June 2026, most serious open-weights serving stacks (vLLM, TGI) ship with speculative decoding support; check provider docs for current details.
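
In the paper's sampling formulation, "agreement" between draft and target is probabilistic, and that is what preserves the output distribution. Below is a minimal sketch of the accept-or-correct rule for a single drafted token, assuming both models' next-token probabilities are available as plain lists. The function name and the toy distributions are illustrative; a real implementation applies this rule across K drafted tokens using one batched target pass.

```python
import random


def speculative_step(target_probs, draft_probs, rng):
    """Propose one token from the draft distribution, then accept or correct it.

    Returns (token, accepted). The returned token is distributed according to
    target_probs, regardless of how good or bad the draft distribution is.
    """
    vocab = list(range(len(target_probs)))
    proposal = rng.choices(vocab, weights=draft_probs)[0]
    p, q = target_probs[proposal], draft_probs[proposal]
    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return proposal, True
    # Rejected: resample from the residual max(0, p - q); choices() normalizes the weights.
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    return rng.choices(vocab, weights=residual)[0], False


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.6, 0.3, 0.1]  # made-up target model distribution
    draft = [0.2, 0.5, 0.3]   # made-up, deliberately poor draft distribution
    counts = [0, 0, 0]
    for _ in range(20000):
        token, _ = speculative_step(target, draft, rng)
        counts[token] += 1
    print([c / 20000 for c in counts])  # empirical frequencies, close to the target
```

A better draft model raises the acceptance rate and therefore the speedup, but even a poor one cannot change what the target would have produced.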

Model merging — frankenmodels that sometimes work

When you have several fine-tuned models that each do something well, merging combines them into a single checkpoint with no additional training. The good news: it often works. The bad news: the field is empirical, the failure modes are not well understood, and the literature changes quarterly. The three methods below are the most-cited as of June 2026.

Model Soups (Wortsman et al., ICML 2022)

arxiv.org/abs/2203.05482

Simple uniform or greedy averaging of the weights of multiple models fine-tuned from the same pretrained base. The 'greedy soup' variant adds models one at a time only if the average improves on a held-out set. The paper's striking claim: averaging produces models that exceed the best single fine-tune at the same inference cost.
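
Both soup variants fit in a few lines when checkpoints are treated as dictionaries of flat weight lists. This is a sketch under that simplifying assumption; `evaluate` is a hypothetical callable that returns a held-out score (higher is better) for a checkpoint.

```python
def average_weights(checkpoints):
    """Uniform soup: element-wise mean of checkpoints that share parameter names."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


def greedy_soup(checkpoints, evaluate):
    """Greedy soup: try checkpoints best-first, keep one only if the held-out score does not drop."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    kept = [ranked[0]]
    best = evaluate(average_weights(kept))
    for candidate in ranked[1:]:
        score = evaluate(average_weights(kept + [candidate]))
        if score >= best:
            kept.append(candidate)
            best = score
    return average_weights(kept)


if __name__ == "__main__":
    # Toy one-parameter "models"; the made-up evaluation prefers w close to 2.0.
    models = [{"w": [1.0]}, {"w": [3.0]}, {"w": [10.0]}]
    score = lambda ckpt: -abs(ckpt["w"][0] - 2.0)
    print(average_weights(models))     # the outlier drags the uniform soup away
    print(greedy_soup(models, score))  # the greedy soup leaves the outlier out
```

The toy case shows why the greedy variant exists: one poor ingredient can spoil a uniform average, while the held-out check filters it out.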

TIES-Merging (Yadav et al., NeurIPS 2023)

arxiv.org/abs/2306.01708

Three steps: trim small-magnitude changes from each task vector, resolve sign conflicts by picking the dominant direction, and merge only the parameters that agree. Designed to address the 'interference' problem where naive averaging cancels useful updates. Works better than soups when the constituent models were fine-tuned for genuinely different tasks.
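
The three steps can be sketched on flat weight lists. The version below is a simplified illustration, not the reference implementation: `keep_fraction` is an assumed name for the share of entries kept per task vector, the values are made up, and the paper's final scaling coefficient on the merged task vector is left out.

```python
def ties_merge(base, finetuned_models, keep_fraction=0.2):
    """Merge flat weight lists with a trim / elect-sign / disjoint-mean procedure."""
    # Task vectors: what each fine-tune changed relative to the shared base.
    task_vectors = [[f - b for f, b in zip(model, base)] for model in finetuned_models]

    # Step 1, trim: keep only the largest-magnitude entries of each task vector.
    trimmed = []
    for tv in task_vectors:
        k = max(1, int(len(tv) * keep_fraction))
        threshold = sorted((abs(v) for v in tv), reverse=True)[k - 1]
        trimmed.append([v if abs(v) >= threshold else 0.0 for v in tv])

    merged = []
    for i, b in enumerate(base):
        column = [tv[i] for tv in trimmed]
        # Step 2, elect sign: the direction with the larger total magnitude wins.
        elected_positive = sum(column) >= 0
        # Step 3, disjoint merge: average only the nonzero entries that agree with it.
        agreeing = [v for v in column if v != 0.0 and (v > 0) == elected_positive]
        delta = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + delta)
    return merged


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0, 0.0, 0.0]
    model_a = [1.0, 0.1, -0.5, 0.0, 0.05]
    model_b = [-0.2, 0.05, 0.8, 0.0, 0.01]
    print(ties_merge(base, [model_a, model_b], keep_fraction=0.5))
```

In the toy case the two models disagree on the sign of the first and third parameters. Plain averaging would shrink both updates toward zero, while the sign election keeps the dominant one intact.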

DARE (Yu et al., 2023; ICML 2024)

arxiv.org/abs/2311.03099

Drop a large fraction (sometimes 90%+) of the delta parameters randomly and rescale the rest. The result merges cleanly with other DARE'd models. The catchy title — 'Language Models are Super Mario' — undersells a method that quietly became a default preprocessing step before TIES or soup merging in 2024-2025.
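
Drop-and-rescale is a one-loop operation on the delta between a fine-tuned model and its base. The sketch below assumes flat weight lists and uses illustrative names. The survivors are rescaled by 1 / (1 - drop_rate) so that the expected value of each delta is unchanged.

```python
import random


def dare(base, finetuned, drop_rate=0.9, rng=None):
    """Randomly drop delta parameters and rescale the survivors by 1 / (1 - drop_rate)."""
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - drop_rate)
    result = []
    for b, f in zip(base, finetuned):
        delta = f - b
        # Each delta is zeroed with probability drop_rate, otherwise scaled up.
        kept = 0.0 if rng.random() < drop_rate else delta * scale
        result.append(b + kept)
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    base = [0.0] * 10000
    finetuned = [1.0] * 10000  # made-up fine-tune: every delta equals 1.0
    sparse = dare(base, finetuned, drop_rate=0.9, rng=rng)
    nonzero = sum(1 for v in sparse if v != 0.0)
    print(nonzero / len(sparse))      # fraction of deltas that survived
    print(sum(sparse) / len(sparse))  # average delta after rescaling
```

Because most deltas are now zero, two models processed this way collide on far fewer parameters when merged, which is the intuition behind using it as a preprocessing step.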

When should an individual actually fine-tune?

The honest answer is: almost never. Fine-tuning is a real engineering commitment with real ongoing costs — dataset curation, evaluation infrastructure, base-model version pinning, and the risk that you are encoding biases you didn't measure. Below are the conditions under which it earns its place. If none of these apply, climb back down the ladder.

You have a measurable evaluation set with at least a few hundred labeled examples and a metric you actually trust. Without this, you cannot tell whether fine-tuning helped or hurt.
You have tried prompting, few-shot, and RAG, and you have a written explanation of why each was insufficient — not just a vibe.
The task involves a distribution that is genuinely under-represented in pretraining data — a specific technical domain, a non-English language with limited web presence, a structured output format the base model gets subtly wrong.
You can self-host or you have a fine-tuning API contract with a frontier provider, and you have budgeted for the artifact's full lifecycle (training, serving, monitoring, refresh on base-model updates).
You are not trying to teach the model facts. Facts belong in RAG. Fine-tuning teaches behavior, style, and format — it is a bad way to encode knowledge that should be retrievable, auditable, and updatable.
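
The first condition, a trusted evaluation set, is the easiest to make concrete. Here is a minimal sketch of a before-and-after comparison, assuming each system is a callable that maps a prompt string to an output string and that exact match is a sensible metric for the task. Both assumptions are illustrative; many tasks need a softer metric.

```python
def exact_match_accuracy(model, examples):
    """Fraction of examples where the model output equals the label, ignoring outer whitespace."""
    hits = sum(1 for prompt, label in examples if model(prompt).strip() == label.strip())
    return hits / len(examples)


def compare(baseline, candidate, examples):
    """Score both systems on the same held-out set and report the difference."""
    base_acc = exact_match_accuracy(baseline, examples)
    cand_acc = exact_match_accuracy(candidate, examples)
    return {"baseline": base_acc, "candidate": cand_acc, "delta": cand_acc - base_acc}


if __name__ == "__main__":
    # Hypothetical stand-ins for a prompted base model and a fine-tuned model.
    held_out = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9"), ("opposite of hot", "cold")]
    prompted = lambda p: {"2+2": "4", "capital of France": "Paris"}.get(p, "unknown")
    tuned = lambda p: {"2+2": "4", "capital of France": "Paris", "3*3": "9 "}.get(p, "unknown")
    print(compare(prompted, tuned, held_out))
```

Run the same comparison for the prompt-only and RAG baselines before training anything. If the delta is within noise for a set of a few hundred examples, the fine-tune has not earned its maintenance cost.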

RAG vs fine-tuning vs prompt engineering

A decision table for the most common cases. None of these techniques are mutually exclusive — production systems often use all three — but the table answers 'which one first.'

Problem shape	First tool to reach for	Why
Model doesn't know your company's internal docs	RAG	Knowledge changes; RAG is updatable and traceable. Fine-tuning facts in is brittle and unauditable.
Model's output format is inconsistent	Prompt + schema enforcement	JSON schema, function calling, or strict prompting solves this without training.
Model's tone is wrong for your product	Prompt + few-shot	Style transfer is one of the things prompts do well. Try this before LoRA.
Tone is wrong AND you have 1000+ correct examples	LoRA / QLoRA	Past a threshold of examples, training catches up to and beats long prompts.
Need a small fast model that behaves like a big one	Distillation	Hinton 2015 is the foundation; modern variants extend it.
Need to align model to subjective preferences	DPO on preference pairs	Replaced RLHF for most teams; one-stage, stable, no reward model.
Need lower latency on a self-hosted model	Speculative decoding	Inference-time fix that leaves the target model's weights untouched; the original paper reports 2-3x speedups.
You have multiple specialist fine-tunes	Merging (TIES / DARE / soup)	Often produces a stronger single model than any constituent.

Practical paths — three honest scenarios

Concrete cases that map to the rungs above. The point is to show what 'enough' looks like at each level so you can recognize when you've reached it.

Scenario: solo founder, customer-support assistant

You want a chatbot that answers questions about your product. Right answer in 2026: a frontier model (Claude, GPT-class, or a strong open-weights model) with RAG over your docs and a tight system prompt. Do not fine-tune. The knowledge changes monthly; RAG keeps it fresh and traceable. Fine-tuning would lock yesterday's docs into the weights.

Scenario: small team, specialized writing voice

You want a model that drafts in a specific brand voice across thousands of pieces. Right answer: collect 500-2000 high-quality examples of the voice, run QLoRA on an open-weights 7B-30B model, and serve the adapter. Voice is a behavior, not a fact — this is what LoRA is for. Budget the time for an evaluation harness before you start training.

Scenario: research group, smaller fast student model

You have a large teacher model that works well but is too slow or expensive to serve at scale. Right answer: distill it. Generate teacher outputs (including chain-of-thought traces if the task warrants) on a curated input distribution, then train a small student on those outputs. This is what Hinton-style distillation is for; modern variants like rejection-sampling fine-tuning are refinements of the same idea.

What this page deliberately doesn't cover

Reinforcement learning from AI feedback (RLAIF), constitutional AI, process reward models, online DPO, the full menu of preference-optimization variants (IPO, KTO, ORPO, SimPO, and whatever shipped last month), MoE-specific fine-tuning, multimodal post-training, and the entire on-device quantization stack (GPTQ, AWQ, GGUF) deserve their own pages and will get them. This page is the orientation map. The rule of thumb across all of them remains the same — minimum effective dose, real evaluation, honest reporting of what worked and what didn't. The techniques change every quarter; the discipline doesn't.

Sources

[01]
LoRA: Low-Rank Adaptation of Large Language Models — Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen (Microsoft, 2021). The original LoRA paper.
arxiv.org/abs/2106.09685
[02]
QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers, Pagnoni, Holtzman, Zettlemoyer (UW, 2023). Introduces NF4, double quantization, paged optimizers; enables 65B fine-tune on a single 48GB GPU.
arxiv.org/abs/2305.14314
[03]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov, Sharma, Mitchell, Ermon, Manning, Finn (Stanford, 2023). The DPO paper.
arxiv.org/abs/2305.18290
[04]
Parameter-Efficient Transfer Learning for NLP — Houlsby et al. (2019). The original adapters paper; ~3.6% added parameters per task.
arxiv.org/abs/1902.00751
[05]
Prefix-Tuning: Optimizing Continuous Prompts for Generation — Li and Liang (Stanford, 2021).
arxiv.org/abs/2101.00190
[06]
The Power of Scale for Parameter-Efficient Prompt Tuning — Lester, Al-Rfou, Constant (Google, 2021).
arxiv.org/abs/2104.08691
[07]
GPT Understands, Too — Liu et al. (2021). The P-tuning paper, combining trainable continuous prompts with discrete prompts.
arxiv.org/abs/2103.10385
[08]
Distilling the Knowledge in a Neural Network — Hinton, Vinyals, Dean (2015). The foundational knowledge distillation paper.
arxiv.org/abs/1503.02531
[09]
Fast Inference from Transformers via Speculative Decoding — Leviathan, Kalman, Matias (Google, ICML 2023). Reports 2x-3x speedup on T5-XXL.
arxiv.org/abs/2211.17192
[10]
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time — Wortsman et al. (ICML 2022).
arxiv.org/abs/2203.05482
[11]
TIES-Merging: Resolving Interference When Merging Models — Yadav, Tam, Choshen, Raffel, Bansal (NeurIPS 2023).
arxiv.org/abs/2306.01708
[12]
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch — Yu, Yu, Yu, Huang, Li (2023; ICML 2024). The DARE method for delta-parameter drop-and-rescale before merging.
arxiv.org/abs/2311.03099
[13]
Hugging Face PEFT library documentation — official entry point covering LoRA, prefix tuning, prompt tuning, P-tuning, and adapter methods integrated with Transformers, Diffusers, and Accelerate.
huggingface.co/docs/peft/index
[14]
Hugging Face Transformers PEFT integration docs — covers loading and training PEFT adapters within the Transformers library.
huggingface.co/docs/transformers/main/en/peft
