LLM Quantization Explained: FP32 to INT4, RAM Math & Hardware Reality


Why This Topic Matters

Every ML engineer has said: "Let me just run this model locally."

Then they hit an OOM error, check the model card, and discover a 70B-parameter model needs ~140 GB of VRAM just for the weights in FP16. That's two 80 GB A100s, or four of the 40 GB ones.

Quantization is the engineering decision that bridges the gap between research-grade models and deployment reality. It appears in ML system design interviews, MLOps rounds, and increasingly in GenAI infrastructure discussions.

Interviewers do not just ask "what is quantization?" They ask:

  • "How much RAM does LLaMA-2 70B need in INT4? Walk me through your math."
  • "Your model loses accuracy after INT8 quantization. What do you check first?"
  • "Why does GPTQ outperform naive INT8 on transformer models?"

If you cannot answer with numbers, you are not ready.


Core Mental Model

Before the math, build the right intuition.

A neural network is just a massive list of floating-point numbers called weights. Every weight takes up space in memory. Quantization means storing those numbers with less precision — trading some accuracy for a huge reduction in memory and compute cost.

Think of it like this:

Full precision (FP32) is like writing 3.14159265358979 on a whiteboard.
INT8 quantization is like writing 3.14.
The number is slightly wrong, but you used far less space and can read it much faster.

The real engineering question is: how wrong is "slightly wrong," and at what scale does that error compound?


MCQ — Try This Before Reading Further

A LLaMA-2 7B model is loaded in FP16 precision. A team wants to switch to INT4 quantization using GPTQ. Which statement is MOST accurate?

A) Memory usage drops by 50% compared to FP16
B) Memory usage drops by 75% compared to FP16
C) Memory usage drops by 75%, but inference speed always decreases
D) INT4 is lossless because transformers are naturally low-precision

Correct Answer: B

Why A is wrong: A 50% drop would be INT8 (half of FP16's 2 bytes → 1 byte). INT4 uses 0.5 bytes per parameter, which is a 75% drop from FP16's 2 bytes.

Why C is wrong: INT4 often increases inference throughput on supported hardware (e.g., NVIDIA Ampere with INT4 Tensor Cores) because more weights fit in L2 cache and memory bandwidth is the real bottleneck in LLM inference, not raw compute.

Why D is wrong: Quantization is always lossy. INT4 in particular has a very narrow representable range (16 distinct values with standard quantization), and transformers with outlier activations (a well-documented phenomenon in large models) suffer measurable perplexity degradation.

The hidden concept: The bottleneck in LLM inference is memory bandwidth, not FLOPs. INT4 reduces the bytes transferred from HBM to the compute core, which directly speeds up token generation.


Step-by-Step: The Byte Math Every Engineer Must Know

Step 1 — Understand Floating-Point Formats

| Format | Bits | Bytes | Range | Use Case |
|--------|------|-------|-------|----------|
| FP32 | 32 | 4 | ±3.4 × 10³⁸ | Training, reference |
| BF16 | 16 | 2 | ±3.4 × 10³⁸ (reduced mantissa) | Training on TPU/A100 |
| FP16 | 16 | 2 | ±65504 | Inference on most GPUs |
| INT8 | 8 | 1 | –128 to 127 | Post-training quantization |
| INT4 | 4 | 0.5 | –8 to 7 | Aggressive compression (GPTQ, GGUF) |

The key difference between BF16 and FP16 is not the size but the exponent bits. BF16 preserves the exponent range of FP32 (8 exponent bits), while FP16 has only 5 exponent bits. This is why large-scale training switched to BF16 — overflow errors in FP16 destabilized training on models above ~1B parameters.
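
You can see the difference directly. A minimal sketch, assuming PyTorch is installed:

```python
import torch

x = torch.tensor(70000.0)  # just above FP16's max representable value (~65,504)

print(x.to(torch.float16))   # inf -> FP16 overflows
print(x.to(torch.bfloat16))  # a finite value near 70,000 -> BF16 loses precision, not range
```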


Step 2 — Calculate RAM Requirements for Real Models

The formula is simple:

VRAM (GB) = (Number of Parameters × Bytes per Parameter) / 1,000,000,000

LLaMA-2 7B

| Precision | Bytes/Param | VRAM Required |
|-----------|-------------|---------------|
| FP32 | 4 | ~28 GB |
| FP16/BF16 | 2 | ~14 GB |
| INT8 | 1 | ~7 GB |
| INT4 | 0.5 | ~3.5 GB |

LLaMA-2 70B

| Precision | Bytes/Param | VRAM Required |
|-----------|-------------|---------------|
| FP32 | 4 | ~280 GB |
| FP16/BF16 | 2 | ~140 GB |
| INT8 | 1 | ~70 GB |
| INT4 | 0.5 | ~35 GB |

GPT-3 175B

| Precision | Bytes/Param | VRAM Required |
|-----------|-------------|---------------|
| FP32 | 4 | ~700 GB |
| FP16/BF16 | 2 | ~350 GB |
| INT8 | 1 | ~175 GB |
| INT4 | 0.5 | ~87.5 GB |

Important caveat: These numbers are weights only. Running inference also requires memory for the KV cache, activations, and framework overhead. A practical rule of thumb is to add 20–30% overhead on top of the weight footprint. For LLaMA-2 7B INT4, that means roughly 4.2–4.6 GB total, comfortably fitting in an 8 GB consumer GPU like the RTX 3070.
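
A quick helper makes these tables trivial to reproduce. This is a minimal sketch; the 1.25 overhead factor is just the rule of thumb above, not a measured constant:

```python
def estimate_vram_gb(n_params: float, bytes_per_param: float, overhead: float = 1.25) -> float:
    """Weights plus a flat buffer for KV cache, activations, and framework overhead."""
    return n_params * bytes_per_param * overhead / 1e9

# LLaMA-2 70B at different precisions (weights + ~25% overhead)
for name, b in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {estimate_vram_gb(70e9, b):.1f} GB")
# FP16: 175.0 GB   INT8: 87.5 GB   INT4: 43.8 GB
```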


Step 3 — How Quantization Actually Works Internally

Naive quantization maps a float range to an integer range using a linear scale:

quantized_value = round(float_value / scale_factor)
float_value ≈ quantized_value × scale_factor

The scale_factor is computed per-layer or per-channel and stored separately (this itself takes memory, but it is negligible).
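
Here is what the naive per-tensor version looks like in practice. A NumPy sketch of symmetric INT8 quantization, not any particular library's implementation:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # One scale for the entire tensor: map the largest |weight| to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
print("max abs reconstruction error:", np.abs(dequantize(q, scale) - w).max())
```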

The outlier problem in transformers: Research (Dettmers et al., LLM.int8()) showed that large transformer models develop a small number of "outlier" feature dimensions — activation values roughly 100x larger than the median. If you quantize the entire matrix with one scale factor, the outliers stretch the scale to cover the full INT8 range, crushing the precision of the other 99.9% of values.

INT8 solution (LLM.int8()): Decompose the matrix. Keep outlier columns in FP16, quantize the rest in INT8. This mixed-precision approach preserves accuracy with near-INT8 memory savings.
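
A toy version of that decomposition, to make the idea concrete. This is a sketch only; the real bitsandbytes kernels identify outlier dimensions from activation statistics and run both paths inside fused CUDA kernels. The 6.0 threshold mirrors the paper's default:

```python
import numpy as np

def mixed_precision_matmul(x, w, outlier_threshold=6.0):
    # Feature dimensions whose activations exceed the threshold stay in full precision;
    # everything else goes through the INT8 path.
    outlier_cols = np.abs(x).max(axis=0) > outlier_threshold

    # Full-precision path for the outlier dimensions
    y_outlier = x[:, outlier_cols] @ w[outlier_cols, :]

    # INT8 path for the remaining dimensions (per-tensor scales, for brevity)
    x_r, w_r = x[:, ~outlier_cols], w[~outlier_cols, :]
    sx, sw = np.abs(x_r).max() / 127.0, np.abs(w_r).max() / 127.0
    xq = np.round(x_r / sx).astype(np.int8)
    wq = np.round(w_r / sw).astype(np.int8)
    y_int8 = (xq.astype(np.int32) @ wq.astype(np.int32)).astype(np.float32) * sx * sw

    return y_outlier + y_int8

x = np.random.randn(8, 512).astype(np.float32)
x[:, 0] *= 100          # plant one outlier dimension
w = np.random.randn(512, 256).astype(np.float32)
print(mixed_precision_matmul(x, w).shape)  # (8, 256)
```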

INT4 solution (GPTQ): Use a second-order optimization (Hessian approximation) to find the quantization that minimizes the output error layer by layer, compensating for quantization error as it propagates forward.


Hardware Reality: What Can You Actually Run?

Consumer Hardware

| GPU | VRAM | Max Model (INT4) |
|-----|------|------------------|
| RTX 3060 | 12 GB | LLaMA-2 7B (with headroom) |
| RTX 3090 / 4090 | 24 GB | LLaMA-2 13B comfortably |
| RTX 6000 Ada | 48 GB | LLaMA-2 34B or Mixtral 8x7B |
| 2× RTX 4090 (NVLink not required for inference) | 48 GB usable | LLaMA-2 34B or Mistral-7B with large context |

Professional / Cloud Hardware

| GPU | VRAM | Max Model |
|-----|------|-----------|
| A10G (AWS g5) | 24 GB | LLaMA-2 7B in FP16; 13B quantized (INT8/INT4) |
| A100 40 GB | 40 GB | LLaMA-2 13B in FP16 (with room for a fine-tuned variant) |
| A100 80 GB | 80 GB | LLaMA-2 70B in INT4; FP16 70B needs multiple GPUs |
| 2× A100 80 GB | 160 GB | LLaMA-2 70B comfortably in FP16 (tensor parallelism) |
| 8× A100 80 GB | 640 GB | GPT-3 175B in FP16 |
| H100 80 GB | 80 GB | Falcon 40B in FP16; LLaMA-2 70B only quantized |

The Tesla T4 Trap (Interview Favorite)

Many engineers assume a 16 GB T4 GPU (common on GCP, AWS) can run LLaMA-2 7B in FP16 since 14 GB fits in 16 GB. In practice, CUDA context + framework overhead + KV cache for even moderate context windows pushes this over the limit. The correct answer is: use INT8 (~7 GB weights + overhead ≈ 10 GB), which fits safely with room for context.
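
If a T4 (or any 16 GB card) is what you have, the usual path is 8-bit loading through bitsandbytes via Hugging Face transformers. This is illustrative rather than canonical; exact argument names depend on your transformers/bitsandbytes versions, and the model ID assumes you have access to the gated Llama weights:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated repo: requires an accepted license

bnb_config = BitsAndBytesConfig(load_in_8bit=True)  # LLM.int8() under the hood

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",          # place weights on the GPU, spill to CPU if necessary
    torch_dtype=torch.float16,  # non-quantized modules stay in FP16
)
```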


Interview-Style Scenario

Scenario: Your team is deploying a LLaMA-2 13B chat model on a single AWS g5.2xlarge instance (1× A10G, 24 GB VRAM). The product requirement is to support a 4096-token context window. Should you load the model in FP16 or INT4?

What a strong candidate notices:

  • LLaMA-2 13B in FP16 = 13B × 2 bytes = 26 GB. This already exceeds the 24 GB A10G. Eliminated.
  • LLaMA-2 13B in INT8 = 13B × 1 byte = 13 GB weights. Add 20% overhead → ~15.6 GB. KV cache for 4096 tokens with 40 heads and 128 head dim at FP16: 40 layers × 2 (K+V) × 4096 × 5120 × 2 bytes ≈ 3.3 GB (worked through in the sketch after this list). Total ≈ ~19 GB. Tight but feasible.
  • LLaMA-2 13B in INT4 = 13B × 0.5 bytes = 6.5 GB weights. Total with overhead and KV cache ≈ ~12 GB. Comfortable.
  • Strong answer: INT4 with GPTQ or GGUF is the right default. The perplexity loss (~0.1–0.3 on common benchmarks) is acceptable for a chat use case. INT8 is a fallback if quality is noticeably degraded. FP16 requires a hardware upgrade (A100 80 GB or multi-GPU).
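
The KV-cache figure in the INT8 bullet is just arithmetic. A sketch using LLaMA-2 13B's published shape (40 layers, hidden size 5120 = 40 heads × 128 head dim), with the cache kept in FP16:

```python
def kv_cache_bytes(n_layers: int, hidden_size: int, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    # Each layer caches one K vector and one V vector of size hidden_size per token.
    return n_layers * 2 * batch_size * seq_len * hidden_size * bytes_per_value

gb = kv_cache_bytes(n_layers=40, hidden_size=5120, seq_len=4096) / 1e9
print(f"KV cache @ 4096 tokens: {gb:.2f} GB")  # ~3.36 GB, the ~3.3 GB figure used above
```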

Common Mistakes and Fixes

Mistake 1: Confusing model size on disk with VRAM requirement
A GGUF INT4 file might be 4.1 GB on disk. Engineers assume the whole deployment needs 4.1 GB of VRAM. Wrong — that figure covers the weights alone; the runtime also needs the KV cache, dequantization buffers, and framework overhead, which typically add 20–30% or more. Always calculate from parameters × bytes plus overhead, not from file size alone.

Fix: Use the formula. n_params × bytes_per_param × 1.2 gives a safe lower-bound VRAM estimate.


Mistake 2: Assuming quantization always reduces inference speed
On CPUs and older GPUs without INT8/INT4 Tensor Core support, quantized operations may be dequantized back to FP16 before the matrix multiply, negating the speed benefit while still saving memory. Engineers apply INT4 and expect 4× speedup, but measure no throughput gain.

Fix: Check whether your hardware has native INT8/INT4 GEMM support. NVIDIA Ampere and later have INT8 Tensor Cores. For CPU inference (llama.cpp), INT4 is heavily optimized via SIMD and gives real speedup.
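
A quick runtime check, as a PyTorch sketch; compute capability 8.0 and above corresponds to Ampere and newer:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")
    if major >= 8:
        print("Ampere or newer: native INT8 Tensor Core GEMM available.")
    else:
        print("Older GPU: quantization still saves memory, but matmuls may "
              "dequantize to FP16 first, so do not expect a big speedup.")
```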


Mistake 3: Applying one global scale factor for INT8 quantization
This destroys accuracy on models above ~6B parameters due to the outlier activation problem described above.

Fix: Use per-channel or per-group (blockwise) quantization with outlier decomposition. LLM.int8() from bitsandbytes handles this automatically.


Mistake 4: Quantizing embedding layers aggressively
Embedding tables are lookup tables, not matrix multiplications. Quantizing them to INT4 causes a disproportionate vocabulary-level accuracy loss. Most production quantization schemes (GPTQ, AWQ) skip embeddings or quantize them separately at INT8.

Fix: Always check which layers are quantized, and use whatever option your quantization tool provides to skip or exclude modules (flag names vary by library).
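
One way to verify after loading, as a sketch: it assumes a `model` object created with one of the quantized configs shown earlier and simply inspects module class names rather than relying on any library-specific API:

```python
# The embedding and LM-head modules should still be ordinary nn.Embedding /
# nn.Linear (or at most an 8-bit variant), not 4-bit quantized layers.
for name, module in model.named_modules():
    if "embed" in name.lower() or "lm_head" in name.lower():
        print(f"{name}: {type(module).__name__}")
```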


Tradeoffs and When Not to Use Quantization

When to avoid or be cautious with INT4:

  • Fine-tuning: You cannot fine-tune an INT4 model directly. QLoRA is the workaround — it quantizes the base model to 4-bit and trains low-rank adapters in FP16/BF16 (see the sketch after this list).
  • High-stakes generation: Legal document generation, medical summarization, or code generation where a subtle numerical error causes a real failure. Run evaluation benchmarks (MMLU, HumanEval) before deploying INT4 in these contexts.
  • Very small models (< 3B parameters): These have limited redundancy. Quantization to INT4 typically causes visible quality degradation on instruction-following tasks.
  • Latency-critical API inference at scale: When you are serving hundreds of requests per second, activation quantization at inference time (versus weight-only quantization) adds overhead. Measure before deploying.
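
For the fine-tuning case, the standard QLoRA recipe looks roughly like the sketch below, using bitsandbytes plus peft. The hyperparameters (rank, target modules) are illustrative defaults, not recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the small adapter weights are trainable
```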

Interview Follow-Up Questions

  1. What is the difference between weight quantization and activation quantization? Which is harder to do accurately, and why?
  2. How does QLoRA allow fine-tuning of a 4-bit quantized model without modifying the quantized weights?
  3. Why does GPTQ use second-order (Hessian-based) optimization rather than simply rounding to the nearest INT4 value?
  4. A model quantized to INT8 shows high perplexity increase on coding tasks but not on general QA. What is the most likely cause?
  5. Explain tensor parallelism vs. pipeline parallelism for a 70B model across 4 GPUs. When would you choose each?

Practical Checklist

Before quantizing any LLM for deployment, verify these:

  • Calculate exact VRAM needed: params × bytes_per_format × 1.25 (25% overhead buffer)
  • Check target hardware for native INT8/INT4 Tensor Core support (do not assume speedup)
  • Run perplexity evaluation on a domain-representative dataset before and after quantization
  • Use GPTQ or AWQ for INT4 — never naive round-to-nearest INT4
  • Use LLM.int8() or bitsandbytes for INT8 — never a single global scale factor
  • Do not quantize embedding and LM-head layers to INT4 by default
  • For CPU inference, use llama.cpp with GGUF Q4_K_M format — it has hand-optimized SIMD kernels
  • For production multi-GPU, use vLLM with tensor parallelism and FP16 before reaching for quantization

Key Takeaways

  • Every parameter in FP32 costs 4 bytes; FP16/BF16 costs 2 bytes; INT8 costs 1 byte; INT4 costs 0.5 bytes. Multiply by parameter count to get raw VRAM, then add 20–30% overhead.
  • LLaMA-2 70B requires ~35 GB VRAM in INT4, ~70 GB in INT8, and ~140 GB in FP16 — hardware selection follows directly from this math.
  • The memory bandwidth bottleneck, not FLOPs, is why quantization speeds up LLM token generation on modern hardware.
  • Outlier activations in transformers make naive INT8 quantization dangerous above ~6B parameters; use LLM.int8() or GPTQ.
  • Never deploy a quantized model to production without domain-specific perplexity and downstream task evaluation.

Practice more questions like this on DistillPrep — GenAI Interview MCQs
