Macro of a graphics card heatsink with parallel cooling fins and a bio-cyan LED.

AtomEons / Learn / calc / tools / hardware-calculator

::calculator · Required GPU, RAM, and storage to run open-weight LLMs on your own hardware

Local Model Hardware Calculator

Running a local LLM is bounded by three numbers: how many parameters the model has, how many bits per parameter you keep after quantization, and whether you're just running inference or also computing gradients. Everything else — GPU class, RAM headroom, SSD size — falls out of that math. The core rule is unforgiving. A model's weight footprint is parameters × bits per parameter ÷ 8 bytes. A 70B model in full fp16 (16 bits per weight) is ~140 GB before you've loaded a single token of context. Quantize that same model to q4 (4 bits per weight) and it drops to ~35 GB — suddenly fits on a single 48 GB workstation card. Drop to q2 and you're at ~17.5 GB, runnable on consumer hardware, at a measurable cost in output quality. Inference-only workloads need roughly the weight footprint plus 20% overhead for KV cache, activations, and CUDA workspace. Fine-tuning is a different animal: full fine-tune needs ~4x the inference footprint (weights + gradients + optimizer states + activations), and even LoRA/QLoRA — the cheap path — typically needs ~2x because you still hold the base model plus adapter gradients plus optimizer momentum. This calculator gives you the lower bound. Real-world deployments push higher because of long context windows (KV cache scales with sequence length), batch size, framework overhead (vLLM, llama.cpp, and Hugging Face Transformers each have different memory profiles), and OS reserve. We use a 1.2x multiplier as the honest minimum. For context on the alternative — API spend — June 2026 frontier pricing sits at Claude Sonnet 4.5 at $3/M input and $15/M output, GPT-4o at $2.50/M and $10/M, and Gemini 1.5 Pro at $1.25-$5/M input and $5-$15/M output (tier-dependent). Local makes sense when your token volume crosses the breakeven point against amortized hardware, or when latency, privacy, or offline operation matter more than raw model quality. A used RTX 3090 at $700 hits parity with the Sonnet API around 50M tokens of mixed input/output — fast for power users, never for casual ones. Numbers below assume modern quantization (GGUF, AWQ, GPTQ) and standard inference stacks. If you're doing speculative decoding, MoE routing, or exotic precision (FP8 on H100s), your mileage will differ.

::inputs

Model sizeB params

Total parameter count of the open-weight model you want to run

Quantizationbits/weight

Lower bits = smaller footprint, more quality loss. q4 is the sweet spot for most.

Use case

Fine-tuning multiplies VRAM needs by 2x (LoRA/QLoRA) vs inference-only

::result

VRAM required

7.8

System RAM recommended

15.6

SSD allocation

::how this calculates

VRAM in GB = parameters (in billions) × bits per weight ÷ 8 × 1.2 overhead × use-case multiplier (1x for inference, 2x for LoRA fine-tune). System RAM should be ~2x VRAM to handle model loading, OS reserve, and CPU offload for layers that don't fit on GPU. SSD allocation is parameters × 4 GB to hold the weights, multiple quantization variants, and HuggingFace cache. The 1.2x multiplier accounts for KV cache, activations, and framework overhead at modest context lengths — push past 8K context or batch size 1 and real usage climbs higher.

::worked examples

Hobbyist running Llama 3.1 8B locally for code completion

paramsB: 7quantBits: 4useCase: 1

7B at q4 inference = ~4.2 GB VRAM, fits on a used RTX 3060 12GB ($250) with room for 8K context. Total build under $800. Recommended GPU class: RTX 3060 12GB or better.

Power user running Llama 3.1 70B at q4 on a workstation

paramsB: 70quantBits: 4useCase: 1

70B at q4 inference = ~42 GB VRAM. Needs an RTX A6000 48GB ($4,500 used) or dual RTX 3090s with NVLink (~$1,400 used). This is the realistic floor for Claude-Sonnet-class output quality locally. Recommended GPU class: RTX A6000 48GB or 2x RTX 3090.

Researcher fine-tuning Mixtral 8x7B with QLoRA

paramsB: 30quantBits: 4useCase: 2

30B at q4 with 2x fine-tune multiplier = ~36 GB VRAM. Fits on a single RTX 6000 Ada 48GB ($6,800 new) or rented H100 80GB at $2/hr on Lambda or RunPod. SSD allocation 120 GB handles base weights plus checkpoints. Recommended GPU class: RTX 6000 Ada 48GB or rented H100.

::what this does NOT capture

○1.2x overhead multiplier is the honest minimum for short-context inference. Real-world usage with 32K+ context windows, batched serving, or vLLM PagedAttention can push memory 1.5-2x higher than this calculator shows.
○Fine-tune multiplier of 2x assumes LoRA or QLoRA — adapter-only training. Full parameter fine-tuning needs ~4x (weights + gradients + Adam optimizer states + activations), which is rarely viable on single-GPU consumer hardware above 13B.
○MoE models like Mixtral 8x7B have ~47B total parameters but only ~13B active per forward pass. The calculator treats them as their active-parameter size for VRAM, which matches real llama.cpp behavior but understates total disk footprint.
○Quantization quality varies by method and model. q4 GGUF on Llama 3.1 70B is near-lossless on most benchmarks; q4 on 7B models shows measurable degradation. The calculator does not model quality loss — only memory.
○RAM recommendation of 2x VRAM assumes you may offload some layers to CPU when VRAM is tight. Pure GPU inference can run with 1.5x VRAM in RAM; CPU-heavy offload (llama.cpp with -ngl partial) wants 3x+.
○SSD allocation of paramsB × 4 GB covers fp16 weights plus one quantized variant plus HuggingFace cache. Power users keeping q8, q4, and q2 variants simultaneously should double this.
○GPU class recommendations are based on June 2026 used-market pricing on eBay and r/hardwareswap. New-card pricing from NVIDIA is roughly 2-3x the values implied here. H100/H200 cloud rental at $2-4/hr is often cheaper than buying for sub-1000-hour workloads.
○Numbers do not include OS reserve (typically 1-2 GB VRAM lost to display), framework overhead (~500 MB for PyTorch baseline), or KV cache scaling with context length (linear in tokens × layers × heads × head_dim).

← back to /learn·/tools index →