
Post-training: surgery on a model you didn't pretrain
A field guide to LoRA, QLoRA, DPO, distillation, merging, and the question almost nobody asks first — should I be fine-tuning at all?
What post-training actually means
The minimum-effective-dose ladder
Before you reach for any technique on this page, climb this ladder from the bottom. Every rung above zero costs money, time, and the ongoing maintenance of a custom artifact. Stay on the lowest rung that actually solves the problem.
- Rung 0 — better prompt. Most behavioral problems are prompt problems. Rewrite the system prompt with explicit constraints, worked examples, and a clear output schema. Free, takes an hour, no maintenance.
- Rung 1 — few-shot examples in context. Drop three to ten high-quality examples into the prompt. This is the closest thing to gradient descent that doesn't require gradient descent. Still free, still no maintenance.
- Rung 2 — retrieval-augmented generation (RAG). For domain knowledge, freshness, and proprietary data, RAG beats fine-tuning on cost and traceability. The model stays general; the facts live in a searchable index you control.
- Rung 3 — tool use and function calling. If the model needs to compute, query, or act, give it tools. Don't try to fine-tune arithmetic or API knowledge into the weights.
- Rung 4 — LoRA or QLoRA on a small open-weights model. The first rung that involves training. Used for style, voice, domain-specific output formats, or when you need a self-hosted model that behaves a specific way.
- Rung 5 — full fine-tuning, distillation, or preference optimization (DPO). Reserved for serious work — production systems with measurable evaluations, real distribution shift, and a team that can maintain the artifact.
The PEFT family at a glance
Parameter-efficient fine-tuning methods all share the same idea — freeze the original weights, add a small number of new trainable parameters, and let the model adapt through the new parameters only. They differ in where the new parameters live and what they look like. Numbers below are typical ranges from the cited papers; your mileage will vary by model size and task.
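As one concrete instance of the freeze-and-add idea, LoRA [01] leaves a weight matrix W untouched and learns a low-rank update, so the adapted layer computes W x + (alpha / r) * B (A x). The sketch below is plain Python on toy lists, not how any library implements it; the helper names (`matvec`, `lora_forward`, `lora_param_count`) and the toy values are assumptions made for illustration.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Frozen path W @ x plus the scaled low-rank path (alpha / r) * B @ (A @ x)."""
    r = len(A)                      # rank: A is r x k, B is d x r
    base = matvec(W, x)             # W is frozen and never updated
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d, k, r):
    """Trainable parameters added by one adapter: A (r x k) plus B (d x r)."""
    return r * (d + k)


# Toy 2 x 3 layer with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
B_init = [[0.0], [0.0]]             # LoRA starts B at zero, so training begins at the base model
x = [1.0, 2.0, 3.0]

y_start = lora_forward(W, A, B_init, x, alpha=2.0)
y_tuned = lora_forward(W, A, [[0.5], [-1.0]], x, alpha=2.0)

full_params = 4096 * 4096
lora_params = lora_param_count(4096, 4096, r=8)
print(y_start, y_tuned, lora_params / full_params)
```

With B at zero the output equals the frozen layer's output, and for a 4096 x 4096 matrix a rank-8 adapter trains well under 1% of the parameters a full update would touch.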
LoRA and QLoRA — the workhorses
DPO replaced RLHF for most teams
Reinforcement learning from human feedback was the post-training secret of GPT-3.5 and the original ChatGPT: a three-stage pipeline of supervised fine-tuning, reward model training, and PPO. It worked, but it was unstable, expensive, and required a separate reward model. Rafailov et al. (2023) at Stanford proposed Direct Preference Optimization: a single-stage loss function that optimizes the policy directly on a dataset of (preferred, rejected) pairs, with no reward model and no on-policy rollouts. The paper's title, 'Your Language Model is Secretly a Reward Model', captures the trick. The KL-constrained RLHF objective has a closed-form optimal policy, so the reward can be rewritten as a log-ratio between the policy and a frozen reference model, and the preference loss can then be minimized on the policy directly.
Between 2023 and 2026, DPO and its descendants (IPO, KTO, ORPO, and others; check the literature for the current state of the art) became the default preference-optimization method for open-weights teams. The original RLHF pipeline is still used at frontier labs where the budget and engineering capacity exist to do it well, but for everyone else, DPO or one of its descendants is the place to start. If you are an individual trying to align a 7B model to your preferences, DPO on a few thousand pair examples is the technique to reach for, not classical RLHF.
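For a single (preferred, rejected) pair, the DPO loss from Rafailov et al. is the negative log-sigmoid of a scaled margin between two log-ratios: how much the policy has raised the preferred answer relative to the reference, minus how much it has raised the rejected one. A minimal sketch follows, assuming you already have summed sequence log-probabilities from both models; the function name, argument names, and example numbers are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) computed as softplus(-margin) in a numerically stable way
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the preferred answer and lowered the rejected one: loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
# Policy moved the wrong way: loss rises.
regressed = dpo_loss(-14.0, -13.0, -12.0, -15.0)
print(at_init, improved, regressed)
```

Only the policy receives gradients; the reference log-probabilities are constants, which is why no separate reward model or rollout loop appears anywhere in the computation. In practice the loss is averaged over a batch of pairs.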
Knowledge distillation — the old reliable
Speculative decoding — inference, not training
Model merging — frankenmodels that sometimes work
When you have several fine-tuned models that each do something well, merging combines them into a single checkpoint with no additional training. The good news: it often works. The bad news: the field is empirical, the failure modes are not well understood, and the literature changes quarterly. The three methods below are the most-cited as of June 2026.
Model Soups (Wortsman et al., ICML 2022)
arxiv.org/abs/2203.05482
Simple uniform or greedy averaging of the weights of multiple models fine-tuned from the same pretrained base. The 'greedy soup' variant considers models in order of held-out accuracy and adds each one only if the averaged soup does not get worse on the held-out set. The paper's striking claim: the averaged model is often more accurate than the best single fine-tune, at the inference cost of one model.
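Both soup variants fit in a few lines. The sketch below treats a checkpoint as a dict of flat lists and uses a made-up scoring function (negative squared distance to a hypothetical `TARGET`) as a stand-in for held-out accuracy; the names and numbers are assumptions, not the paper's code.

```python
def average(state_dicts):
    """Uniform soup: element-wise mean of every parameter."""
    n = len(state_dicts)
    return {
        name: [sum(vals) / n for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }


def greedy_soup(state_dicts, evaluate):
    """Greedy soup: rank by held-out score, keep a model only if the soup does not get worse."""
    ranked = sorted(state_dicts, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best = evaluate(ranked[0])
    for candidate in ranked[1:]:
        score = evaluate(average(ingredients + [candidate]))
        if score >= best:
            ingredients.append(candidate)
            best = score
    return average(ingredients)


TARGET = [1.0, 1.0]  # hypothetical optimum, used only to fake a held-out metric


def held_out_score(sd):
    return -sum((w - t) ** 2 for w, t in zip(sd["w"], TARGET))


model_a = {"w": [0.8, 1.2]}
model_b = {"w": [1.2, 0.8]}
model_c = {"w": [3.0, 3.0]}   # a poor fine-tune that should be left out

uniform = average([model_a, model_b, model_c])
greedy = greedy_soup([model_a, model_b, model_c], held_out_score)
print(uniform, greedy)
```

In this toy case the uniform soup is dragged off target by the poor fine-tune, while the greedy soup keeps the two good ingredients and scores better than either of them alone.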
TIES-Merging (Yadav et al., NeurIPS 2023)
arxiv.org/abs/2306.01708
Three steps: trim small-magnitude changes from each task vector (the fine-tuned weights minus the base weights), resolve sign conflicts by electing, per parameter, the direction with the larger total magnitude, and average only the values whose sign agrees with the elected direction. Designed to address the 'interference' problem where naive averaging cancels useful updates. Works better than soups when the constituent models were fine-tuned for genuinely different tasks.
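The three steps can be sketched on flat lists of weights. This is a toy illustration under simplifying assumptions (one flat parameter vector per model, a per-model top-fraction trim, hypothetical names and values), not the reference implementation.

```python
def trim(task_vector, keep_fraction):
    """Keep the largest-magnitude entries and zero the rest (magnitude ties may keep a few extra)."""
    k = max(1, round(len(task_vector) * keep_fraction))
    threshold = sorted((abs(v) for v in task_vector), reverse=True)[k - 1]
    return [v if abs(v) >= threshold else 0.0 for v in task_vector]


def ties_merge(base, finetuned_models, keep_fraction=0.5, scale=1.0):
    # Task vector = fine-tuned weights minus base weights.
    task_vectors = [[f - b for f, b in zip(model, base)] for model in finetuned_models]
    trimmed = [trim(tv, keep_fraction) for tv in task_vectors]
    merged = []
    for values in zip(*trimmed):
        # Elect the sign carrying more total magnitude: that is the sign of the sum.
        sign = 1.0 if sum(values) >= 0 else -1.0
        # Average only the non-zero values that agree with the elected sign.
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return [b + scale * m for b, m in zip(base, merged)]


base = [1.0, 1.0, 1.0, 1.0]
model_a = [3.0, 1.1, 0.0, 1.0]     # task vector about [ 2.0,  0.1, -1.0, 0.0 ]
model_b = [0.0, 1.2, 1.0, 4.0]     # task vector about [-1.0,  0.2,  0.0, 3.0 ]
model_c = [2.0, 0.9, -1.0, 1.05]   # task vector about [ 1.0, -0.1, -2.0, 0.05]

merged = ties_merge(base, [model_a, model_b, model_c], keep_fraction=0.5)
naive = [sum(ws) / 3 for ws in zip(model_a, model_b, model_c)]
print(merged, naive)
```

On the first parameter, model_b disagrees in sign with the other two. Naive averaging lets it cancel most of their update; the TIES-style merge drops the dissenting value and averages the two that agree.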
DARE (Yu et al., 2023; ICML 2024)
arxiv.org/abs/2311.03099
Randomly drop a large fraction p (sometimes 90%+) of the delta parameters, meaning the differences between the fine-tuned and base weights, and rescale the survivors by 1/(1-p) so the expected update is unchanged. The sparsified deltas interfere less when merged with those of other models. The catchy title, 'Language Models are Super Mario', undersells a method that quietly became a default preprocessing step before TIES or soup merging in 2024-2025.
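Drop-and-rescale is a one-liner per parameter. The sketch below applies it to a made-up delta vector with a seeded generator; the function name and the toy data are assumptions for illustration.

```python
import random


def dare(delta, drop_rate, rng):
    """Zero each delta with probability drop_rate; rescale survivors by 1 / (1 - drop_rate)."""
    keep_scale = 1.0 / (1.0 - drop_rate)
    return [0.0 if rng.random() < drop_rate else d * keep_scale for d in delta]


rng = random.Random(0)              # seeded so the run is repeatable
delta = [0.01] * 10_000             # toy delta: fine-tuned weights minus base weights
sparse = dare(delta, drop_rate=0.9, rng=rng)

kept = sum(1 for v in sparse if v != 0.0)
print(kept, sum(delta), sum(sparse))
```

About one in ten entries survives, each roughly ten times larger, so the total update stays close to the original while most coordinates are freed up for other models' deltas.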
When should an individual actually fine-tune?
The honest answer is: almost never. Fine-tuning is a real engineering commitment with real ongoing costs — dataset curation, evaluation infrastructure, base-model version pinning, and the risk that you are encoding biases you didn't measure. Below are the conditions under which it earns its place. If none of these apply, climb back down the ladder.
- You have a measurable evaluation set with at least a few hundred labeled examples and a metric you actually trust. Without this, you cannot tell whether fine-tuning helped or hurt.
- You have tried prompting, few-shot, and RAG, and you have a written explanation of why each was insufficient — not just a vibe.
- The task involves a distribution that is genuinely under-represented in pretraining data — a specific technical domain, a non-English language with limited web presence, a structured output format the base model gets subtly wrong.
- You can self-host or you have a fine-tuning API contract with a frontier provider, and you have budgeted for the artifact's full lifecycle (training, serving, monitoring, refresh on base-model updates).
- You are not trying to teach the model facts. Facts belong in RAG. Fine-tuning teaches behavior, style, and format — it is a bad way to encode knowledge that should be retrievable, auditable, and updatable.
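The first condition is the one that makes the others checkable. One possible way to ask "did fine-tuning help or hurt" on a few hundred labeled examples is a paired bootstrap over per-example correctness, sketched below. The function name, the resample count, and the toy results are assumptions; any paired significance test you trust would serve the same purpose.

```python
import random


def paired_bootstrap(baseline_correct, candidate_correct, n_resamples=2000, seed=0):
    """Fraction of resampled eval sets on which the candidate beats the baseline."""
    assert len(baseline_correct) == len(candidate_correct)
    rng = random.Random(seed)
    n = len(baseline_correct)
    # Per-example difference: +1 candidate fixed it, -1 candidate broke it, 0 no change.
    diffs = [c - b for b, c in zip(baseline_correct, candidate_correct)]
    wins = 0
    for _ in range(n_resamples):
        resampled_total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_total > 0:
            wins += 1
    return wins / n_resamples


# Toy eval set of 300 examples (1 = correct, 0 = wrong); the numbers are made up.
baseline = [1] * 200 + [0] * 100
candidate = [1] * 190 + [0] * 10 + [1] * 40 + [0] * 60   # breaks 10, fixes 40

win_rate = paired_bootstrap(baseline, candidate)
no_change = paired_bootstrap(baseline, baseline)
print(win_rate, no_change)
```

A win rate near 1.0 suggests the gain is not an artifact of which examples happened to be in the set; a value near 0.5 means the eval set cannot separate the two models.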
RAG vs fine-tuning vs prompt engineering
A decision table for the most common cases. None of these techniques are mutually exclusive — production systems often use all three — but the table answers 'which one first.'
Practical paths — three honest scenarios
Concrete cases that map to the rungs above. The point is to show what 'enough' looks like at each level so you can recognize when you've reached it.
Scenario: solo founder, customer-support assistant
You want a chatbot that answers questions about your product. Right answer in 2026: a frontier model (Claude, GPT-class, or a strong open-weights model) with RAG over your docs and a tight system prompt. Do not fine-tune. The knowledge changes monthly; RAG keeps it fresh and traceable. Fine-tuning would lock yesterday's docs into the weights.
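The shape of that setup is small enough to sketch. The example below uses bag-of-words cosine similarity as a stand-in for a real embedding index, and the document ids, texts, and prompt wording are all hypothetical; it only shows where the facts live (in `DOCS`, not in the weights).

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, docs, top_k=1):
    """Return the top_k docs most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d["text"]))), reverse=True)
    return ranked[:top_k]


def build_prompt(question, docs, top_k=1):
    """Put the retrieved sources in front of the question, with ids for traceability."""
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in retrieve(question, docs, top_k))
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


DOCS = [  # hypothetical product docs; editing this list is the whole "update" path
    {"id": "billing", "text": "A refund is issued within 14 days of purchase."},
    {"id": "sso", "text": "Single sign-on is available on the Team plan."},
    {"id": "export", "text": "Data export supports CSV and JSON formats."},
]

question = "How do I get a refund after purchase?"
top = retrieve(question, DOCS)
prompt = build_prompt(question, DOCS)
print(prompt)
```

When the refund policy changes next month, the fix is an edit to one document rather than a new training run, and the cited id shows which source the answer came from.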
Scenario: small team, specialized writing voice
You want a model that drafts in a specific brand voice across thousands of pieces. Right answer: collect 500-2000 high-quality examples of the voice, run QLoRA on an open-weights 7B-30B model, and serve the adapter. Voice is a behavior, not a fact — this is what LoRA is for. Budget the time for an evaluation harness before you start training.
Scenario: research group, smaller fast student model
You have a large teacher model that works well but is too slow or expensive to serve at scale. Right answer: distill it. Generate teacher outputs (including chain-of-thought traces if the task warrants) on a curated input distribution, then train a small student on those outputs. This is the sequence-level form of Hinton-style distillation, which originally matched the student to the teacher's softened output probabilities; modern variants like rejection-sampling fine-tuning are refinements of the same idea.
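When you have access to the teacher's logits rather than only its sampled text, the Hinton-style objective [08] compares temperature-softened distributions. The sketch below computes it for a single position over a toy three-token vocabulary. It uses KL divergence, which differs from the soft-target cross-entropy only by a constant that does not depend on the student; the function names and example logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.0]
matched = distillation_loss([4.0, 1.0, 0.0], teacher)     # student agrees with the teacher
untrained = distillation_loss([1.0, 1.0, 1.0], teacher)   # student is uniform
print(matched, untrained)
```

The temperature-squared factor keeps gradient magnitudes comparable across temperatures, as the original paper notes. In practice this term is usually averaged over positions and mixed with an ordinary cross-entropy loss on ground-truth labels.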
What this page deliberately doesn't cover
Reinforcement learning from AI feedback (RLAIF), constitutional AI, process reward models, online DPO, the full menu of preference-optimization variants (IPO, KTO, ORPO, SimPO, and whatever shipped last month), MoE-specific fine-tuning, multimodal post-training, and the entire on-device quantization stack (GPTQ, AWQ, GGUF) deserve their own pages and will get them. This page is the orientation map. The rule of thumb across all of them remains the same — minimum effective dose, real evaluation, honest reporting of what worked and what didn't. The techniques change every quarter; the discipline doesn't.
Sources
- [01]
LoRA: Low-Rank Adaptation of Large Language Models — Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen (Microsoft, 2021). The original LoRA paper.
arxiv.org/abs/2106.09685
- [02]
QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers, Pagnoni, Holtzman, Zettlemoyer (UW, 2023). Introduces NF4, double quantization, paged optimizers; enables 65B fine-tune on a single 48GB GPU.
arxiv.org/abs/2305.14314
- [03]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov, Sharma, Mitchell, Ermon, Manning, Finn (Stanford, 2023). The DPO paper.
arxiv.org/abs/2305.18290
- [04]
Parameter-Efficient Transfer Learning for NLP — Houlsby et al. (2019). The original adapters paper; ~3.6% added parameters per task.
arxiv.org/abs/1902.00751
- [05]
Prefix-Tuning: Optimizing Continuous Prompts for Generation — Li and Liang (Stanford, 2021).
arxiv.org/abs/2101.00190
- [06]
The Power of Scale for Parameter-Efficient Prompt Tuning — Lester, Al-Rfou, Constant (Google, 2021).
arxiv.org/abs/2104.08691
- [07]
GPT Understands, Too — Liu et al. (2021). The P-tuning paper, combining trainable continuous prompts with discrete prompts.
arxiv.org/abs/2103.10385
- [08]
Distilling the Knowledge in a Neural Network — Hinton, Vinyals, Dean (2015). The foundational knowledge distillation paper.
arxiv.org/abs/1503.02531
- [09]
Fast Inference from Transformers via Speculative Decoding — Leviathan, Kalman, Matias (Google, ICML 2023). Reports 2x-3x speedup on T5-XXL.
arxiv.org/abs/2211.17192
- [10]
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time — Wortsman et al. (ICML 2022).
arxiv.org/abs/2203.05482
- [11]
TIES-Merging: Resolving Interference When Merging Models — Yadav, Tam, Choshen, Raffel, Bansal (NeurIPS 2023).
arxiv.org/abs/2306.01708
- [12]
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch — Yu, Yu, Yu, Huang, Li (2023; ICML 2024). The DARE method for delta-parameter drop-and-rescale before merging.
arxiv.org/abs/2311.03099
- [13]
Hugging Face PEFT library documentation — official entry point covering LoRA, prefix tuning, prompt tuning, P-tuning, and adapter methods integrated with Transformers, Diffusers, and Accelerate.
huggingface.co/docs/peft/index
- [14]
Hugging Face Transformers PEFT integration docs — covers loading and training PEFT adapters within the Transformers library.
huggingface.co/docs/transformers/main/en/peft