Quantization · the atlas

How big models run on small hardware.

Quantization is the reason a Llama 3.3 70B model runs on a few thousand dollars of consumer hardware instead of a rack of $30,000 datacenter cards. It's also the reason your output sometimes quietly degrades. This page walks through the formats, the methods, the real quality tradeoffs, and what to actually run on what hardware.

The number formats

Six precision levels.

FP32

32 bits

Full single-precision floating point. The training format until BF16/FP16 mixed-precision took over ~2018. Almost nobody uses FP32 for inference in 2026.

FP16 / BF16

16 bits

Half-precision. BF16 (bfloat16, from Google Brain) keeps FP32's 8-bit exponent and cuts the mantissa to 7 bits, so it has a much wider dynamic range than FP16 at lower precision. It is the dominant training format in 2026. Inference at FP16/BF16 is the 'no compromise' baseline.
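The range-versus-precision trade can be seen by round-tripping two values through each format. This is a minimal sketch using only the standard library. The helper names `to_bf16` and `to_fp16` are illustrative, and the bfloat16 conversion assumes the usual definition (the top 16 bits of an FP32 encoding, rounded to nearest even) and ignores NaN and infinity handling.

```python
import struct

def to_bf16(x: float) -> float:
    """Round-trip x through bfloat16: keep the top 16 bits of its float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that get dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; values past the FP16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

# Range: 100000 is beyond FP16's largest finite value (65504) but fine for BF16.
print(to_bf16(1e5), to_fp16(1e5))      # 99840.0 inf
# Precision: near 1.0, FP16 steps by 2**-10 while BF16 steps by 2**-7.
print(to_bf16(1.001), to_fp16(1.001))  # 1.0 1.0009765625
```

BF16 keeps the large value but only approximately, while FP16 keeps the small difference but overflows on the large value.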

FP8

8 bits

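Newer, with hardware support on H100 and Blackwell. Two variants: E4M3 (4-bit exponent, 3-bit mantissa), typically used for weights and activations in the forward pass, and E5M2 (5-bit exponent, 2-bit mantissa), typically used for gradients in the backward pass, where range matters more than precision. Frontier-lab training increasingly uses FP8.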
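To make the two layouts concrete, the sketch below decodes every 8-bit pattern under each variant and reports the largest and smallest positive finite values. It assumes the conventions of the OCP FP8 specification: E5M2 reserves the all-ones exponent for infinities and NaN, while E4M3 has no infinities and reserves only the all-ones exponent with all-ones mantissa for NaN. The function names are illustrative.

```python
def decode_fp8(byte, exp_bits, man_bits, ieee_specials):
    """Decode one 8-bit pattern; returns None for inf/NaN encodings."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    all_ones = (1 << exp_bits) - 1
    if exp == all_ones:
        if ieee_specials:
            return None                      # E5M2: inf or NaN
        if man == (1 << man_bits) - 1:
            return None                      # E4M3: the single NaN mantissa
    if exp == 0:                             # subnormal: no implicit leading 1
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def finite_values(exp_bits, man_bits, ieee_specials):
    decoded = (decode_fp8(b, exp_bits, man_bits, ieee_specials) for b in range(256))
    return [v for v in decoded if v is not None]

e4m3 = finite_values(4, 3, ieee_specials=False)
e5m2 = finite_values(5, 2, ieee_specials=True)

print(max(e4m3), min(v for v in e4m3 if v > 0))  # 448.0 0.001953125
print(max(e5m2), min(v for v in e5m2 if v > 0))  # 57344.0 1.52587890625e-05
```

E5M2 reaches much larger and much smaller magnitudes, which suits gradients. E4M3 spends its extra mantissa bit on finer steps, which suits weights and activations.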

INT8

8 bits

8-bit integer quantization. The classic 'cuts model size in half from FP16' format. Good support across hardware. Acceptable quality on most models.
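The simplest INT8 scheme is symmetric "absmax" quantization: one scale per tensor, chosen so the largest weight maps to 127. The sketch below is a plain-Python illustration under that assumption. Production kernels usually work per channel or per group and may also quantize activations.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: returns integer codes in [-127, 127] and the scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.01, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(codes)  # [50, -127, 1, 100]
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each code takes one byte instead of two, and the round-trip error per weight is at most half the scale.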

INT4

4 bits

4-bit integer. The aggressive quantization point. Modern methods (GPTQ, AWQ, AQLM) recover most quality from FP16 → INT4 conversion. Standard for consumer-hardware inference.
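With only 16 levels, a single scale per tensor wastes resolution, so 4-bit formats typically store one scale per small group of weights and pack two codes into each byte. The sketch below assumes symmetric quantization, a group size of 4 for readability (real formats commonly use 32 to 128), and an even group size that divides the weight count. All names are illustrative.

```python
def quantize_int4_groups(weights, group_size=4):
    """Group-wise symmetric INT4: returns packed bytes (two codes each) and per-group scales."""
    packed, scales = bytearray(), []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        # Clamp to [-8, 7], then shift to the unsigned range 0..15 for packing.
        codes = [max(-8, min(7, round(w / scale))) + 8 for w in group]
        for i in range(0, len(codes), 2):
            packed.append(codes[i] | (codes[i + 1] << 4))
    return bytes(packed), scales

def dequantize_int4_groups(packed, scales, group_size=4):
    """Unpack nibbles and rescale each group with its own scale."""
    codes = []
    for byte in packed:
        codes += [(byte & 0x0F) - 8, (byte >> 4) - 8]
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

weights = [0.7, -0.33, 0.1, 0.0, 0.07, -0.02, 0.01, 0.04]
packed, scales = quantize_int4_groups(weights)
restored = dequantize_int4_groups(packed, scales)

print(len(packed), "bytes for", len(weights), "weights")
print([round(v, 3) for v in restored])
```

The second group holds weights ten times smaller than the first. Its own scale keeps them distinguishable, where a single tensor-wide scale would round most of them to zero.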

INT2 / ternary / 1-bit

1-2 bits

Extreme quantization. BitNet (Microsoft, 2024) demonstrated 1.58-bit ternary training preserves most quality. Active research area. Not yet standard for inference.

The methods

Eight quantization techniques.

Post-Training Quantization (PTQ)

Take an already-trained FP16 model and quantize its weights after training, before it is served. No retraining required, though many methods use a small calibration dataset. Fast to deploy. Some quality loss, especially at INT4 and below without sophisticated methods.

When: Default for consumer + edge inference. The path most open-weight models take to your GPU.
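The claim that naive PTQ holds up at 8 bits and struggles at 4 bits and below can be checked on synthetic data. The sketch below applies round-to-nearest quantization at several bit-widths to Gaussian random weights and reports the RMS error. The random weights are a stand-in for a real layer, not measurements from any model.

```python
import random

def rtn(weights, bits):
    """Round-to-nearest symmetric quantization with one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def rms_error(original, approx):
    return (sum((a - b) ** 2 for a, b in zip(original, approx)) / len(original)) ** 0.5

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in for a weight tensor

errors = {bits: rms_error(weights, rtn(weights, bits)) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit RMS error: {err:.4f}")
```

The error grows quickly as bits are removed, which is why the methods below spend effort on where the rounding error lands rather than rounding every weight independently.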

Quantization-Aware Training (QAT)

Train the model while simulating quantization in the forward pass. The model learns to compensate for quantization noise during training. Preserves more quality than PTQ at the same bit-width, but requires the full training pipeline.

When: Frontier labs producing their own quantized variants. Out of reach for most teams without training infrastructure.
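A common way to simulate quantization during training is "fake quantization" with a straight-through estimator: the forward pass uses the rounded weight, and the backward pass treats rounding as the identity so gradients still reach the full-precision copy. The single-weight toy below illustrates that mechanic. The data, grid step, and learning rate are made up for the example, and the source does not say which QAT recipe any particular lab uses.

```python
def fake_quant(w, step):
    """Snap a full-precision weight to the nearest point on the quantization grid."""
    return round(w / step) * step

def train_qat(xs, ys, step=0.25, lr=0.05, epochs=50):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # latent full-precision weight, kept only during training
    for _ in range(epochs):
        wq = fake_quant(w, step)                       # forward: quantized weight
        grad_wq = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad_wq                              # straight-through: d(wq)/d(w) treated as 1
    return w, fake_quant(w, step)

xs = [1.0, 2.0, 3.0]
ys = [0.5 * x for x in xs]          # the best grid point for this data is 0.5

latent, deployed = train_qat(xs, ys)
print(latent, deployed)
```

Training stops moving the latent weight once the quantized weight fits the data, so the deployed value is the grid point the loss prefers rather than a rounded copy of an unconstrained solution.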

GPTQ

(Frantar et al. 2022) One-shot post-training quantization that uses approximate second-order information to minimize per-layer quantization error. Standard 4-bit method for many open-weight models. Used by Hugging Face's transformers + AutoGPTQ.

When: When you want 4-bit inference + your model has GPTQ-quantized variants on the Hub.
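The core idea behind GPTQ can be shown on two weights: quantize one weight, then shift the not-yet-quantized weights to absorb its rounding error, using the inverse of a Hessian built from calibration activations. The sketch below is a simplified illustration of that compensation step, not the blocked, Cholesky-based algorithm from the paper. The 2x2 matrix `H` is a hypothetical stand-in for the input second-moment matrix, and the grid step of 1.0 is chosen for readability.

```python
def invert(m):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(m)
    a = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [v - f * u for v, u in zip(a[r], a[col])]
    return [row[n:] for row in a]

def quantize_with_compensation(weights, H, step):
    """Quantize weights in order, pushing each rounding error onto the remaining weights."""
    w, n = list(weights), len(weights)
    Hinv = invert(H)
    q = [0.0] * n
    for i in range(n):
        q[i] = round(w[i] / step) * step
        err = (w[i] - q[i]) / Hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * Hinv[i][j]
        # Remove weight i from the inverse Hessian before moving on.
        for r in range(i + 1, n):
            for c in range(i + 1, n):
                Hinv[r][c] -= Hinv[r][i] * Hinv[i][c] / Hinv[i][i]
    return q

def layer_error(w, q, H):
    """Output error of the layer on the calibration data: (q - w)^T H (q - w)."""
    d = [b - a for a, b in zip(w, q)]
    return sum(d[i] * H[i][j] * d[j] for i in range(len(d)) for j in range(len(d)))

H = [[1.0, 0.9], [0.9, 1.0]]     # hypothetical: two strongly correlated input channels
w = [0.4, 0.4]
rtn = [round(v / 1.0) * 1.0 for v in w]
compensated = quantize_with_compensation(w, H, step=1.0)

print(rtn, layer_error(w, rtn, H))
print(compensated, layer_error(w, compensated, H))
```

Plain rounding sends both weights to 0. With compensation, the second weight rounds up to cancel the error of the first, and the layer's output error drops from about 0.61 to about 0.09 on this toy input.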

AWQ

(Lin et al. 2023) Activation-aware Weight Quantization. Observes that not all weights are equal: a small fraction of channels, identified by activation magnitude, carry most of the signal. Rather than storing those channels at higher precision, AWQ scales them up before quantization so they lose less to rounding, and keeps every weight at the same bit-width. Often produces better quality than GPTQ at INT4.

When: AWQ-quantized variants on Hugging Face are a strong default for INT4 inference.
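One way to protect an important channel without mixed precision, and the approach the AWQ paper describes, is to multiply its weights by a factor before rounding and divide the matching activation by the same factor, which leaves the exact product unchanged. The sketch below is a toy version of that idea. The fixed scaling factor of 2 is an assumption for illustration (AWQ searches for per-channel scales), and the weights and activations are made up.

```python
def quantize_row(w, bits=4):
    """Round-to-nearest symmetric quantization of one weight row, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in w) / qmax
    return [round(v / scale) * scale for v in w]

def awq_style_quantize(w, act_magnitude, factor=2.0, top_k=1):
    """Scale up the channels with the largest activations, quantize, then fold the scale back."""
    salient = sorted(range(len(w)), key=lambda i: act_magnitude[i], reverse=True)[:top_k]
    s = [factor if i in salient else 1.0 for i in range(len(w))]
    q = quantize_row([v * si for v, si in zip(w, s)])
    # Dividing the weight here is equivalent to dividing the activation at run time.
    return [v / si for v, si in zip(q, s)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w = [0.14, 0.5, -0.7, 0.3]
x = [10.0, 1.0, 1.0, 1.0]        # channel 0 sees much larger activations

exact = dot(w, x)
err_rtn = abs(dot(quantize_row(w), x) - exact)
err_awq = abs(dot(awq_style_quantize(w, [abs(v) for v in x]), x) - exact)
print(err_rtn, err_awq)
```

On this toy row the output error falls from about 0.4 to about 0.1, because the salient weight now lands on a grid that is effectively twice as fine.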

GGUF (llama.cpp ecosystem)

File format for the llama.cpp inference engine. Supports multiple quantization levels: Q4_K_M, Q5_K_M, Q6_K, Q8_0, etc. Per-block quantization with optional importance-weighting. The de facto consumer-laptop and Apple Silicon inference format.

When: Running models on Macs, on CPUs, on low-VRAM consumer GPUs, on Raspberry Pis. The Ollama + LM Studio + Jan apps all use GGUF.
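Per-block quantization carries a small storage overhead, because each block stores its own scale alongside the quantized values. The sketch below does that accounting for the two simplest GGUF block layouts, assuming 32 weights per block and one FP16 scale per block (the layout used by Q8_0 and the legacy Q4_0 type). The K-quants such as Q4_K_M use a nested super-block layout with different overhead that this sketch does not model.

```python
def bits_per_weight(block_size, bits_per_value, scale_bytes=2):
    """Effective bits per weight for a block of quantized values plus one stored scale."""
    block_bytes = scale_bytes + block_size * bits_per_value / 8
    return block_bytes * 8 / block_size

def file_size_gb(n_params, bpw):
    """Approximate size of the quantized weights in gigabytes (10**9 bytes)."""
    return n_params * bpw / 8 / 1e9

q4_0 = bits_per_weight(block_size=32, bits_per_value=4)
q8_0 = bits_per_weight(block_size=32, bits_per_value=8)

print(q4_0, q8_0)                   # 4.5 8.5
print(file_size_gb(8e9, q4_0))      # an 8B-parameter model at 4.5 bits per weight
```

This is why a nominally 4-bit GGUF file is somewhat larger than parameters times 4 bits.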

EXL2 (ExLlamaV2)

Mixed-precision quantization where different layers are quantized to different bit-widths based on measured importance. Can achieve effective bit-widths like 4.65 bits per weight while preserving quality better than uniform 4-bit. Strong for high-end consumer GPUs.

When: Single-GPU enthusiast inference. Aimed at cards with 24GB+ VRAM.
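A fractional bit-width such as 4.65 is an average over layers that each use a whole-number width. The sketch below shows one simple way such an allocation could be made: start every layer at the lowest width, then repeatedly upgrade whichever layer buys the most modeled error reduction per extra bit of storage, until the budget is spent. The layer sizes, sensitivity scores, and error model are all hypothetical, and this is not ExLlamaV2's actual optimizer.

```python
def allocate_bits(layers, target_bpw, choices=(3, 4, 5, 6)):
    """layers: list of (name, n_params, sensitivity). Returns ({name: bits}, achieved bpw)."""
    total = sum(n for _, n, _ in layers)
    budget = round(target_bpw * total)           # total bits available
    bits = {name: choices[0] for name, _, _ in layers}
    used = choices[0] * total

    def modeled_error(sens, n, b):
        # Assumed error model: quantization noise power falls 4x per extra bit.
        return sens * n * 4.0 ** (-b)

    while True:
        best = None
        for name, n, sens in layers:
            idx = choices.index(bits[name])
            if idx + 1 == len(choices):
                continue                          # already at the widest option
            nxt = choices[idx + 1]
            cost = (nxt - bits[name]) * n
            if used + cost > budget:
                continue                          # upgrade would exceed the budget
            gain = modeled_error(sens, n, bits[name]) - modeled_error(sens, n, nxt)
            if best is None or gain / cost > best[0]:
                best = (gain / cost, name, nxt, cost)
        if best is None:
            break
        _, name, nxt, cost = best
        bits[name] = nxt
        used += cost
    return bits, used / total

# Hypothetical layers: (name, parameter count, measured sensitivity).
layers = [("attn", 100, 8.0), ("mlp", 300, 1.0), ("embed", 100, 3.0)]
plan, bpw = allocate_bits(layers, target_bpw=4.2)
print(plan, bpw)
```

The most sensitive layer ends up at 5 bits and the others at 4, for an average of 4.2 bits per weight.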

AQLM

(Egiazarian et al. 2024) Additive Quantization for Language Models. Uses lookup-codebook-based quantization to push effective bit-widths to 2-3 bits per weight while preserving most quality. Hugging Face has AQLM variants of many open-weight models.

When: When you need extreme size compression and your hardware/runtime supports it.
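Codebook quantization stores, for each small group of weights, a few indices into shared lookup tables, and reconstructs the group as the sum of the selected entries. The sketch below illustrates that additive scheme with two tiny hand-written codebooks and a greedy residual search. AQLM itself learns its codebooks and uses a more careful search, so treat every value and name here as illustrative.

```python
import math

def nearest(codebook, target):
    """Index of the codebook entry closest to target in squared distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - t) ** 2 for c, t in zip(codebook[i], target)))

def encode(group, codebooks):
    """Greedy additive encoding: each codebook quantizes what the previous ones left over."""
    residual, codes = list(group), []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def decode(codes, codebooks):
    """Reconstruct the group as the sum of the selected codebook entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Two hypothetical codebooks of 4 entries each (2 bits per index), for groups of 2 weights.
coarse = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
fine = [(-0.1, -0.1), (-0.1, 0.1), (0.1, -0.1), (0.1, 0.1)]
codebooks = [coarse, fine]

group = (0.42, -0.58)
codes = encode(group, codebooks)
restored = decode(codes, codebooks)
bpw = sum(math.log2(len(cb)) for cb in codebooks) / len(group)

print(codes, [round(v, 3) for v in restored], bpw)
```

Two 2-bit indices describe two weights, so the cost is 2 bits per weight plus the shared codebooks, which are amortized over the whole layer.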

BitNet b1.58 (Microsoft 2024)

Ternary quantization at training time. Each weight is -1, 0, or +1 (log2(3) ≈ 1.58 bits effective). Microsoft reported quality on par with FP16 baselines from around 3B parameters, and projected latency and memory savings at sizes up to 70B. Implies a future where inference compute drops dramatically. Not yet a deployment standard but an important research direction.

When: Watch this space. Not a 'use today' option but a 'this might reshape inference economics in 18-24 months' bet.
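The ternary mapping described in the BitNet b1.58 paper is "absmean" quantization: divide the weights by their mean absolute value, round, and clip to the range -1 to +1. The sketch below applies it to a handful of made-up weights. In the real method this runs inside the training loop rather than on a finished model.

```python
import math

def absmean_ternary(weights, eps=1e-8):
    """Map weights to {-1, 0, +1} using the mean absolute value as the scale."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return codes, gamma

weights = [0.9, -0.05, 0.4, -1.2, 0.1, -0.35]   # illustrative values
codes, gamma = absmean_ternary(weights)

print(codes)          # [1, 0, 1, -1, 0, -1]
print(math.log2(3))   # information content of one ternary weight, in bits
```

With every weight in this set, a matrix multiply reduces to additions and subtractions of activations, which is where the projected inference savings come from.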

What you lose

Six honest quality observations.

01
FP16/BF16 → INT8 is essentially lossless on most models. Measured perplexity delta is well under 1%. Default to INT8 unless something specific blocks it.
02
FP16 → INT4 with modern methods (AWQ, GPTQ, AQLM) typically costs 1-3% on benchmark scores. For most consumer + creative use cases, this is invisible.
03
FP16 → INT4 on reasoning-heavy tasks (math, code, multi-step logic) can cost 5-10%. If you're using a model for AIME-level math, INT8 or FP16 is safer.
04
Long-context performance degrades faster under quantization than short-context. A 4-bit model loses more than a 16-bit model when context fills past 64k tokens.
05
Multilingual + low-resource-language performance is more sensitive to quantization than English. Test in your target languages before assuming quantization is free.
06
Tool-use + function-calling reliability can degrade under aggressive quantization. If your application requires precise JSON output, validate at your target bit-width.
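Two of the checks above are easy to automate before committing to a bit-width: the perplexity delta against the FP16 baseline, and the share of tool-call outputs that parse as valid JSON with the expected keys. The sketch below assumes the per-token log-probabilities and generated strings have already been collected from whatever runtime is in use. The function names, sample values, and required keys are hypothetical.

```python
import json
import math

def perplexity(token_logprobs):
    """Perplexity from natural-log probabilities of the reference tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_delta_pct(baseline, quantized):
    """Percentage change of the quantized score relative to the baseline."""
    return 100.0 * (quantized - baseline) / baseline

def json_valid_rate(outputs, required_keys):
    """Fraction of outputs that parse as a JSON object containing every required key."""
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and all(k in obj for k in required_keys):
            ok += 1
    return ok / len(outputs)

# Hypothetical measurements for illustration only.
fp16_ppl = perplexity([math.log(0.5)] * 4)
int4_ppl = perplexity([math.log(0.49)] * 4)
print(relative_delta_pct(fp16_ppl, int4_ppl))

samples = ['{"name": "search", "args": {}}', '{"name": "search"', "not json"]
print(json_valid_rate(samples, required_keys=("name", "args")))
```

Running both on prompts drawn from the real workload, in the target languages and at the target context length, covers most of the failure modes listed above.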

What to run on what hardware.

Apple Silicon (M-series)

GGUF Q4_K_M for 7B-13B models. Q5_K_M / Q6_K for 30B+. Use Ollama or LM Studio. M-series Pro/Max machines with 64GB of unified memory can comfortably run 70B at Q4 (roughly 40GB of weights); 32GB machines are better matched to models around 30B.

Consumer NVIDIA (24GB+ VRAM)

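EXL2 4.0-4.65 bpw for max quality at consumer scale. AWQ INT4 as a more portable alternative. A single 24GB card fits models up to roughly 30B at 4-bit. Llama 3.3 70B at 4-bit needs about 35GB for weights alone, so it takes two 24GB cards, or a single card at around 2.4 bpw with a larger quality cost.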

Consumer NVIDIA (8-16GB VRAM)

Q4_K_M GGUF on llama.cpp or AWQ INT4 on transformers. 7B-13B models comfortable; 30B borderline.

Datacenter (H100/H200/Blackwell)

FP8 native for frontier-quality inference. BF16 for the no-compromise baseline. INT8/INT4 only when throughput or cost per token matters more than per-token quality.

Edge / mobile (phones, Raspberry Pi)

Q4 or Q3 GGUF on llama.cpp. 1-7B models only. On iOS, use Apple's MLX framework or llama.cpp.
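A back-of-envelope fit check combines the weight size with the KV cache and some headroom for the runtime. The sketch below is an estimate under stated assumptions: an FP16 KV cache, a fixed 1.5GB of headroom, roughly 4.8 bits per weight for a Q4_K_M-class quant, and hypothetical layer counts for the two example models. Actual usage varies by runtime and settings.

```python
def weights_gb(n_params, bits_per_weight):
    """Size of the quantized weights in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens / 1e9

def fits(vram_gb, n_params, bits_per_weight, n_tokens, arch, headroom_gb=1.5):
    """Returns (fits?, estimated GB needed). headroom_gb is an assumed allowance."""
    need = weights_gb(n_params, bits_per_weight) + kv_cache_gb(n_tokens, **arch) + headroom_gb
    return need <= vram_gb, need

# Hypothetical architectures for illustration; check the model card for real values.
arch_8b = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}
arch_30b = {"n_layers": 60, "n_kv_heads": 8, "head_dim": 128}

print(fits(8, 8e9, 4.8, 8192, arch_8b))     # 8B-class model on an 8GB card
print(fits(16, 30e9, 4.8, 8192, arch_30b))  # 30B-class model on a 16GB card
```

Under these assumptions the 8B-class model fits in 8GB with a little room to spare, while the 30B-class model needs about 18GB for weights alone and does not fit in 16GB without a lower bit-width or partial CPU offload.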
