A team wants to fine-tune LLaMA-2-70B for a domain-specific task. Full fine-tuning with Adam in fp32 requires roughly 1,120 GB of GPU memory just for parameter state (70B params × 16 bytes: 4 B weights + 4 B gradients + 8 B for Adam's two moment estimates), before counting activations. They have 4×A100 80GB GPUs (320 GB total). A colleague proposes LoRA. An engineer asks: "How does LoRA allow us to fine-tune a 70B model on 320 GB of GPU memory?" What is the precise memory reduction mechanism?
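
The arithmetic behind the question can be sketched as a back-of-the-envelope calculator. The key mechanism: gradients and optimizer states are allocated only for *trainable* parameters, so freezing the 70B base model (stored once, e.g. in bf16) and training only small low-rank adapters shrinks the dominant Adam/gradient terms by orders of magnitude. The ~0.5% trainable fraction below is an illustrative assumption (roughly what rank-16 adapters on the attention projections yield), not a fixed property of LoRA:

```python
def finetune_memory_gb(total_params, trainable_params,
                       weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Rough steady-state training memory in GB, ignoring activations.

    weight_bytes: bytes/param for the stored model weights (whole model)
    grad_bytes:   bytes/param for gradients (trainable params only)
    optim_bytes:  bytes/param for Adam's two fp32 moments (trainable only)
    """
    weights = total_params * weight_bytes       # full model, always resident
    grads   = trainable_params * grad_bytes     # only trainable params get gradients
    optim   = trainable_params * optim_bytes    # Adam m and v
    return (weights + grads + optim) / 1e9

N = 70e9  # LLaMA-2-70B parameter count

# Full fine-tuning, everything in fp32: 70B x (4 + 4 + 8) bytes ~ 1,120 GB
full_ft = finetune_memory_gb(N, N, weight_bytes=4, grad_bytes=4)

# LoRA: base weights frozen in bf16; only low-rank adapters train
# (assumed here to be ~0.5% of total params, for illustration).
lora_params = 0.005 * N
lora_ft = finetune_memory_gb(N, lora_params)

print(f"full fp32 fine-tune: {full_ft:7.1f} GB")   # ~1120 GB, far over 320 GB
print(f"LoRA fine-tune:      {lora_ft:7.1f} GB")   # ~143 GB, fits in 4xA100
```

Note what does *not* shrink: the frozen bf16 base weights (~140 GB) still dominate LoRA's footprint, which is why the 4×A100 budget works but a single 80 GB GPU would still need further tricks such as quantizing the frozen base (QLoRA).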