You are fine-tuning via QLoRA. The base model weights are stored in 4-bit NormalFloat (NF4). During the forward pass, PyTorch matrix multiplication has no kernel that can multiply 4-bit quantized weights against 16-bit activations. What specific hardware or algorithmic trick allows QLoRA to function?
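
The trick the question is driving at is on-the-fly dequantization: the frozen NF4 weights are never used in 4-bit arithmetic. Instead, each weight block is dequantized to bfloat16 (the "compute dtype") just before the matmul, so the multiplication itself runs entirely in 16-bit, and the only trained parameters are the bf16 LoRA adapters. Below is a minimal pure-PyTorch sketch of this idea. The block size, shapes, and helper names (`quantize_nf4`, `dequantize_nf4`, `qlora_forward`) are illustrative inventions, not the bitsandbytes API; in real QLoRA the codebook lookup and rescaling happen inside a fused CUDA kernel rather than eagerly in Python. The 16-entry codebook matches the NF4 values published with the QLoRA paper.

```python
import torch

# Toy dimensions for illustration only.
IN, OUT, BLOCK = 64, 32, 16

# NF4's 16 levels are fixed quantiles of a standard normal distribution,
# scaled so the extremes land exactly on -1 and +1.
NF4_CODEBOOK = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def quantize_nf4(W: torch.Tensor):
    """Block-wise quantization: per-block absmax scale + nearest NF4 code."""
    blocks = W.reshape(-1, BLOCK).float()
    absmax = blocks.abs().amax(dim=1).clamp(min=1e-8)   # one scale per block
    normed = blocks / absmax.unsqueeze(-1)              # now in [-1, 1]
    # Nearest-neighbour match against the codebook yields the 4-bit code.
    codes = (normed.unsqueeze(-1) - NF4_CODEBOOK).abs().argmin(dim=-1)
    return codes.to(torch.uint8), absmax

def dequantize_nf4(codes: torch.Tensor, absmax: torch.Tensor) -> torch.Tensor:
    """Map 4-bit codes back to bf16: codebook lookup, then rescale per block."""
    vals = NF4_CODEBOOK[codes.long()]                   # (num_blocks, BLOCK)
    return (vals * absmax.unsqueeze(-1)).to(torch.bfloat16)

def qlora_forward(x, codes, absmax, lora_A, lora_B, scaling):
    # The trick: dequantize the frozen base weight to bf16 *on the fly*,
    # so the actual matmul runs entirely in 16-bit.
    W = dequantize_nf4(codes, absmax).reshape(OUT, IN)
    base = x @ W.t()
    # The LoRA adapters live in bf16 and are the only trained parameters.
    return base + (x @ lora_A.t() @ lora_B.t()) * scaling
```

A quick usage check, again with made-up shapes (rank `r = 8`; `lora_B` starts at zero so the adapter initially contributes nothing, as in LoRA):

```python
W_fp = torch.randn(OUT, IN)
codes, absmax = quantize_nf4(W_fp)          # stored once, stays frozen

x = torch.randn(4, IN, dtype=torch.bfloat16)
r = 8
lora_A = torch.randn(r, IN, dtype=torch.bfloat16) * 0.01
lora_B = torch.zeros(OUT, r, dtype=torch.bfloat16)

y = qlora_forward(x, codes, absmax, lora_A, lora_B, scaling=2.0)
print(y.shape, y.dtype)  # torch.Size([4, 32]) torch.bfloat16
```

Note the memory asymmetry this buys: the base weights sit in memory at ~4 bits per parameter (plus the small per-block scales), and the bf16 copy exists only transiently during each layer's forward pass.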