
::calculator · Required GPU, RAM, and storage to run open-weight LLMs on your own hardware
Local Model Hardware Calculator
::inputs
Total parameter count of the open-weight model you want to run
Lower bits = smaller footprint, more quality loss. q4 is the sweet spot for most.
Fine-tuning (LoRA/QLoRA) doubles VRAM needs compared with inference-only
::result
VRAM required
7.8 GB
System RAM recommended
15.6 GB
SSD allocation
52 GB
::how this calculates
VRAM in GB = parameters (in billions) × bits per weight ÷ 8 × 1.2 overhead × use-case multiplier (1x for inference, 2x for LoRA fine-tune). System RAM should be ~2x VRAM to handle model loading, OS reserve, and CPU offload for layers that don't fit on the GPU. SSD allocation in GB = parameters (in billions) × 4, enough to hold the weights, multiple quantization variants, and the HuggingFace cache. The 1.2x multiplier accounts for KV cache, activations, and framework overhead at modest context lengths; push past 8K context or batch size 1 and real usage climbs higher.
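The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the names `estimate`, `Estimate`, and the constant names are hypothetical, and the rounding to one decimal is an assumption about how the result panel displays values.

```python
from dataclasses import dataclass

# Multipliers as stated in the text: 1x inference, 2x LoRA/QLoRA fine-tune.
USE_CASE_MULTIPLIER = {"inference": 1.0, "lora": 2.0}
OVERHEAD = 1.2            # KV cache, activations, framework overhead at modest context
RAM_PER_VRAM = 2.0        # system RAM recommended as ~2x VRAM
SSD_GB_PER_BILLION = 4.0  # SSD allocation: 4 GB per billion parameters


@dataclass
class Estimate:
    vram_gb: float
    ram_gb: float
    ssd_gb: float


def estimate(params_b: float, bits: int, use_case: str = "inference") -> Estimate:
    """Apply the calculator's formula; params_b is the parameter count in billions."""
    vram = params_b * bits / 8 * OVERHEAD * USE_CASE_MULTIPLIER[use_case]
    return Estimate(
        vram_gb=round(vram, 1),
        ram_gb=round(vram * RAM_PER_VRAM, 1),
        ssd_gb=round(params_b * SSD_GB_PER_BILLION, 1),
    )


print(estimate(13, 4))
```

Under these assumptions, a 13B model at 4 bits for inference gives 7.8 GB VRAM, 15.6 GB RAM, and 52 GB SSD.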
::worked examples
Hobbyist running Llama 3.1 8B locally for code completion
8B at q4 inference = ~4.8 GB VRAM, fits on a used RTX 3060 12GB ($250) with room for 8K context. Total build under $800. Recommended GPU class: RTX 3060 12GB or better.
Power user running Llama 3.1 70B at q4 on a workstation
70B at q4 inference = ~42 GB VRAM. Needs an RTX A6000 48GB ($4,500 used) or dual RTX 3090s with NVLink (~$1,400 used). This is the realistic floor for Claude-Sonnet-class output quality locally. Recommended GPU class: RTX A6000 48GB or 2x RTX 3090.
Researcher fine-tuning Mixtral 8x7B with QLoRA
30B at q4 with 2x fine-tune multiplier = ~36 GB VRAM. Fits on a single RTX 6000 Ada 48GB ($6,800 new) or rented H100 80GB at $2/hr on Lambda or RunPod. SSD allocation 120 GB handles base weights plus checkpoints. Recommended GPU class: RTX 6000 Ada 48GB or rented H100.
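The arithmetic behind these examples, and the check against each GPU's memory, can be sketched as follows. The helper names `vram_gb` and `gpus_that_fit` are hypothetical. The dual RTX 3090 entry assumes the model is split evenly across two 24 GB cards and ignores any per-card duplication of overhead. The first example is entered as 8B to match the model named in its title (7B would give ~4.2 GB).

```python
# Total VRAM (GB) of the GPU classes named in the worked examples.
GPUS_GB = {
    "RTX 3060 12GB": 12,
    "2x RTX 3090": 48,  # assumption: 2 x 24 GB with the model split across both cards
    "RTX A6000 48GB": 48,
    "RTX 6000 Ada 48GB": 48,
    "H100 80GB": 80,
}


def vram_gb(params_b, bits, multiplier=1.0, overhead=1.2):
    """Calculator formula: billions of params x bits / 8 x overhead x use-case multiplier."""
    return params_b * bits / 8 * overhead * multiplier


def gpus_that_fit(required_gb):
    """Return the GPU classes whose total VRAM covers the requirement."""
    return [name for name, cap in GPUS_GB.items() if cap >= required_gb]


examples = {
    "8B q4 inference": vram_gb(8, 4),
    "70B q4 inference": vram_gb(70, 4),
    "30B q4 LoRA fine-tune": vram_gb(30, 4, multiplier=2.0),
}
for label, need in examples.items():
    print(f"{label}: ~{need:.1f} GB -> {gpus_that_fit(need)}")
```

Fitting here means only that capacity is at least the estimate. It leaves no headroom for longer contexts, so treat a near-exact fit (42 GB on a 48 GB card) as tight.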
::what this does NOT capture
- ○1.2x overhead multiplier is the honest minimum for short-context inference. Real-world usage with 32K+ context windows, batched serving, or vLLM PagedAttention can push memory 1.5-2x higher than this calculator shows.
- ○Fine-tune multiplier of 2x assumes LoRA or QLoRA — adapter-only training. Full parameter fine-tuning needs ~4x (weights + gradients + Adam optimizer states + activations), which is rarely viable on single-GPU consumer hardware above 13B.
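- ○MoE models like Mixtral 8x7B have ~47B total parameters but only ~13B active per forward pass. The calculator treats them as their active-parameter size for VRAM. This tracks per-token compute and can approximate VRAM when llama.cpp keeps part of the model in system RAM, but every expert must still be loaded somewhere, so it understates total memory and disk footprint.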
- ○Quantization quality varies by method and model. q4 GGUF on Llama 3.1 70B is near-lossless on most benchmarks; q4 on 7B models shows measurable degradation. The calculator does not model quality loss — only memory.
- ○RAM recommendation of 2x VRAM assumes you may offload some layers to CPU when VRAM is tight. Pure GPU inference can run with 1.5x VRAM in RAM; CPU-heavy offload (llama.cpp with -ngl partial) wants 3x+.
- ○SSD allocation of paramsB × 4 GB covers fp16 weights plus one quantized variant plus HuggingFace cache. Power users keeping q8, q4, and q2 variants simultaneously should double this.
- ○GPU class recommendations are based on June 2026 used-market pricing on eBay and r/hardwareswap. New-card pricing from NVIDIA is roughly 2-3x the values implied here. H100/H200 cloud rental at $2-4/hr is often cheaper than buying for sub-1000-hour workloads.
- ○The flat 1.2x multiplier does not separately model OS reserve (typically 1-2 GB VRAM lost to display), framework overhead (~500 MB for PyTorch baseline), or KV cache growth with context length (linear in tokens × layers × KV heads × head_dim).
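To see why long contexts outgrow a flat overhead multiplier, here is a rough sketch of KV cache sizing. The function name `kv_cache_bytes` is hypothetical. It assumes an unquantized fp16 cache (2 bytes per value), and it counts key/value heads rather than query heads, which matters for models that use grouped-query attention. The shape used is Llama 3.1 8B's published configuration (32 layers, 8 KV heads, head dimension 128).

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2, batch=1):
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * batch * tokens * layers * kv_heads * head_dim * bytes_per_value


# Llama 3.1 8B shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(ctx, layers=32, kv_heads=8, head_dim=128) / 2**30
    print(f"{ctx:>7} tokens: {gib:.1f} GiB")
# prints 1.0 GiB, 4.0 GiB, and 16.0 GiB respectively
```

Under these assumptions, the cache costs 128 KiB per token. At 8K context that is about 1 GiB, comparable to the 20% overhead on ~4 GB of q4 weights. At 32K it already approaches the size of the weights themselves.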