What is LoRA fine-tuning?
The short answer
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method introduced by Microsoft Research in 2021 that freezes a pretrained model's weights and injects trainable low-rank decomposition matrices into each Transformer layer. Compared with full fine-tuning of GPT-3 175B, the authors report up to 10,000x fewer trainable parameters and a 3x lower GPU memory requirement, with quality on par with or better than full fine-tuning on the downstream tasks they evaluate.
The longer answer
LoRA was introduced in the paper "LoRA: Low-Rank Adaptation of Large Language Models" by Edward Hu, Yelong Shen, and collaborators at Microsoft Research (arXiv:2106.09685, June 2021). The core hypothesis: when you adapt a large pretrained model to a downstream task, the change in weights has low "intrinsic rank", meaning the update matrix can be approximated by the product of two much smaller matrices. Instead of updating each full weight matrix W, which is prohibitively expensive at the scale of GPT-3 175B, LoRA freezes W and learns two small matrices A and B such that the effective weight becomes W + BA, where BA is the low-rank update.
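The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the sizes d, k, and r are arbitrary, `lora_forward` is a hypothetical helper name, and the "training step" is simulated by assigning random values to B. The initialization (random Gaussian A, zero B) follows what the paper describes, so the adapter contributes nothing before training starts.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 48, 4  # illustrative sizes; r is the adapter rank

W = rng.normal(size=(d, k))               # frozen pretrained weight, never updated
A = rng.normal(scale=0.01, size=(r, k))   # trainable, random Gaussian init
B = np.zeros((d, r))                      # trainable, zero init so BA = 0 at the start


def lora_forward(x, W, A, B):
    """Compute (W + BA) x without forming the d x k matrix BA."""
    return W @ x + B @ (A @ x)


x = rng.normal(size=(k,))

# At initialization the adapted layer behaves exactly like the base layer.
assert np.allclose(lora_forward(x, W, A, B), W @ x)

# Simulate a trained adapter by giving B non-zero values.
B = rng.normal(scale=0.01, size=(d, r))
delta_W = B @ A

print(delta_W.shape)                   # same shape as W
print(np.linalg.matrix_rank(delta_W))  # at most r
```

The update has the same shape as W, yet its rank can never exceed r, which is what makes it cheap to store and train.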
Concretely, if W is a d × k matrix, LoRA decomposes the update ΔW = BA where B is d × r and A is r × k, with rank r much smaller than min(d, k). Typical rank values in practice are 4, 8, 16, or 64. For GPT-3 175B, the authors report reducing trainable parameters by roughly 10,000x (from 175B to about 18M when applying LoRA only to the attention query and value projections at rank 4) and reducing GPU memory use during training by about 3x (from 1.2 TB to 350 GB), all while matching or beating full fine-tuning on WikiSQL, MultiNLI, and SAMSum. The GLUE results in the paper are for the smaller RoBERTa and DeBERTa models.
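The parameter savings follow directly from the shapes: a full update has d × k entries, while B and A together have r × (d + k). The sketch below works through the arithmetic. The GPT-3 dimensions used (96 layers, model width 12288) are assumptions taken from the GPT-3 paper rather than from this article, and the helper names are hypothetical. The result lands at about 18.9M trainable parameters, the same order of magnitude as the figure quoted above.

```python
def full_params(d: int, k: int) -> int:
    """Entries in a full d x k weight update."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Entries in B (d x r) plus A (r x k)."""
    return r * (d + k)


# One square attention projection at an assumed GPT-3 175B width.
d = k = 12288
r = 4
print(full_params(d, k))     # 150994944
print(lora_params(d, k, r))  # 98304

# Whole model: assumed 96 layers, adapting query and value projections only.
num_layers = 96
adapted_per_layer = 2
total_lora = num_layers * adapted_per_layer * lora_params(d, k, r)
print(total_lora)            # 18874368

# Reduction relative to 175B trainable parameters: on the order of 10^4.
reduction = 175e9 / total_lora
print(round(reduction))
```

Doubling r doubles the adapter size, so rank is a direct lever on how many parameters are trained.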
LoRA has become the default fine-tuning method for open-weight LLMs because of three properties. First, no inference latency penalty: at deployment, you can merge BA back into W (W' = W + BA), so the served model has the same shape and the same FLOPs as the base. Second, task-switching is cheap: you keep one frozen base model in GPU memory and swap small LoRA adapters (often 10-200 MB) per task. Third, it composes with quantization. The QLoRA paper (Dettmers et al., arXiv:2305.14314, May 2023) showed you can fine-tune a 65B-parameter model on a single 48GB GPU by quantizing the frozen base to 4-bit NF4 while keeping LoRA adapters in 16-bit, which the authors report preserves the task performance of 16-bit fine-tuning.
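Merging and task-switching can be illustrated with plain matrix arithmetic. The following is a sketch under assumed toy sizes, with hypothetical helper names `merge` and `unmerge` and randomly generated adapters standing in for trained ones. It is not how any particular serving stack implements adapter swapping.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4  # illustrative sizes

W = rng.normal(size=(d, k))  # frozen base weight
adapters = {                 # one (B, A) pair per task, random stand-ins
    "task_a": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
    "task_b": (rng.normal(size=(d, r)), rng.normal(size=(r, k))),
}


def merge(W, B, A):
    """Fold the low-rank update into the base weight: W' = W + BA."""
    return W + B @ A


def unmerge(W_merged, B, A):
    """Recover the base weight by subtracting the same update."""
    return W_merged - B @ A


x = rng.normal(size=(k,))
B_a, A_a = adapters["task_a"]
W_a = merge(W, B_a, A_a)

# The merged weight has the same shape as W, so serving cost is unchanged,
# and it gives the same output as the unmerged two-branch computation.
assert W_a.shape == W.shape
assert np.allclose(W_a @ x, W @ x + B_a @ (A_a @ x))

# Switch tasks: subtract one adapter, add another.
B_b, A_b = adapters["task_b"]
W_b = merge(unmerge(W_a, B_a, A_a), B_b, A_b)
assert np.allclose(W_b @ x, W @ x + B_b @ (A_b @ x))

# Values stored per adapter versus values in the base matrix.
print(B_a.size + A_a.size)  # 448
print(W.size)               # 3072
```

Keeping adapters unmerged costs an extra pair of small matrix products per layer but lets one base model serve many tasks at once; merging removes that overhead for a single task.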
The Hugging Face PEFT library (github.com/huggingface/peft) provides the reference open-source implementation, and the method is supported natively in major training stacks including Hugging Face Transformers, PyTorch Lightning, NVIDIA NeMo, and DeepSpeed. Apple's MLX framework, Meta's torchtune, and Microsoft's DeepSpeed-Chat all ship LoRA as a first-class fine-tuning path. As of 2024-2025, follow-on variants include DoRA (Weight-Decomposed Low-Rank Adaptation, arXiv:2402.09353), LoRA+ (arXiv:2402.12354), which uses different learning rates for A and B, and VeRA (arXiv:2310.11454), which shares frozen random projection matrices across layers and trains only small scaling vectors to shrink adapter size further.
Practical hyperparameters matter. The two main LoRA knobs are rank r (capacity of the adapter) and alpha (scaling factor; the effective update is (alpha/r) × BA). Common configurations: r=8, alpha=16 for general instruction tuning; r=16-64 for domain adaptation; r=4 for style transfer. Target modules typically include attention query and value projections (the original paper's setting); adding key, output, and MLP projections increases capacity at the cost of more parameters. Learning rates are typically 1e-4 to 5e-4, roughly 10x higher than full fine-tuning.
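To make the trade-offs concrete, the sketch below computes the scaling factor alpha/r and the trainable-parameter count for two choices of target modules. Everything model-specific here is an assumption for illustration: the `LoraSettings` class is hypothetical (it is not a library API), and the module names, projection shapes, and layer count describe an imagined 7B-class decoder rather than any specific checkpoint.

```python
from dataclasses import dataclass

# Hypothetical per-layer projection shapes (d_in, d_out) for a 7B-class
# decoder; names and sizes are assumptions for illustration only.
MODULE_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}
NUM_LAYERS = 32  # assumed layer count


@dataclass(frozen=True)
class LoraSettings:
    r: int
    alpha: int
    target_modules: tuple

    @property
    def scaling(self) -> float:
        """Factor applied to BA: the effective update is (alpha / r) * BA."""
        return self.alpha / self.r

    def trainable_params(self) -> int:
        """r * (d_in + d_out) per adapted module, summed over all layers."""
        per_layer = sum(
            self.r * sum(MODULE_SHAPES[name]) for name in self.target_modules
        )
        return per_layer * NUM_LAYERS


qv_only = LoraSettings(r=8, alpha=16, target_modules=("q_proj", "v_proj"))
all_linear = LoraSettings(r=8, alpha=16, target_modules=tuple(MODULE_SHAPES))

print(qv_only.scaling)                # 2.0
print(qv_only.trainable_params())     # 4194304
print(all_linear.trainable_params())  # 19988480
```

Under these assumptions, extending the adapter from query and value projections to every linear projection raises the trainable-parameter count by almost 5x at the same rank, which is the capacity-versus-cost trade described above.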
LoRA is not a free lunch. It can underperform full fine-tuning when the target task requires a substantial behavioral shift from the base model (e.g., teaching new languages from scratch), and the choice of rank, alpha, and target modules requires tuning. For continued pretraining on large new corpora, full fine-tuning or higher-rank LoRA (r=128+) is generally preferred.
Key facts
- LoRA was introduced in arXiv:2106.09685 by Hu et al. at Microsoft Research, June 2021.
- Reduces trainable parameters by up to 10,000x for GPT-3 175B compared to full fine-tuning (arXiv:2106.09685, abstract).
- Reduces GPU memory requirement by 3x for GPT-3 175B fine-tuning (arXiv:2106.09685, abstract).
- Zero inference latency overhead when adapters are merged back into base weights (arXiv:2106.09685, Section 4.1).
- QLoRA (arXiv:2305.14314) enables fine-tuning a 65B model on a single 48GB GPU using 4-bit NF4 quantization plus LoRA adapters.
- Reference open-source implementation: Hugging Face PEFT library at github.com/huggingface/peft (Apache 2.0 license).
- DoRA (arXiv:2402.09353, NVIDIA Research) decomposes the pretrained weights into magnitude and direction components and applies the low-rank update to the direction, for improved LoRA quality.
- Typical LoRA rank values in production: 4, 8, 16, 32, 64 — with alpha commonly set to 2x rank.
- LoRA is natively supported in Hugging Face Transformers, NVIDIA NeMo, Apple MLX, Meta torchtune, and Microsoft DeepSpeed.
- The original LoRA paper was published at ICLR 2022.
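The QLoRA entry in the list above can be sanity-checked with back-of-envelope arithmetic. This is a rough weights-only estimate using a hypothetical helper, `weight_memory_gb`. It ignores quantization constants, activations, the 16-bit adapters, and optimizer state, all of which also need room on the GPU.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 65e9  # 65B-parameter base model

fp16_gb = weight_memory_gb(num_params, 16)
nf4_gb = weight_memory_gb(num_params, 4)

print(fp16_gb)  # 130.0
print(nf4_gb)   # 32.5
```

At 16 bits the base weights alone exceed a 48GB GPU several times over, while at 4 bits they leave headroom for the adapters and training state.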
Sources
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021) — arxiv.org/abs/2106.09685
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023) — arxiv.org/abs/2305.14314
- DoRA: Weight-Decomposed Low-Rank Adaptation (Liu et al., 2024) — arxiv.org/abs/2402.09353
- LoRA+: Efficient Low Rank Adaptation of Large Models (Hayou et al., 2024) — arxiv.org/abs/2402.12354
- VeRA: Vector-based Random Matrix Adaptation (Kopiczko et al., 2023) — arxiv.org/abs/2310.11454
- Hugging Face PEFT library documentation — huggingface.co/docs/peft
- Microsoft LoRA reference implementation — github.com/microsoft/LoRA
- NVIDIA NeMo PEFT documentation — docs.nvidia.com/nemo-framework