A developer deploys a 7B LLM for an API service. They notice that generating the first token of a response takes 80ms (prefill latency), while each subsequent token takes 12ms (decode latency). Users complain that short responses feel slow, even though total latency is acceptable for long responses. A colleague says: "The first-token latency and per-token latency have completely different bottlenecks — optimizing one doesn't optimize the other." What are the two different bottlenecks, and why does the KV cache matter for one but not the other?
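
For context on how a split like 80ms/12ms would be observed in practice, here is a minimal measurement sketch using Hugging Face Transformers. The model name, prompt, and number of decode steps are illustrative assumptions, not details from the scenario above; it times the prompt forward pass (prefill, which also builds the KV cache) separately from each single-token step (decode, which reuses the cache).

```python
# Sketch: measure prefill vs. per-token decode latency for a causal LM.
# Assumes a CUDA GPU and a 7B-class model; adjust model_name as needed.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical choice of 7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()

prompt = "Explain the difference between prefill and decode latency."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    # Prefill: one forward pass over the entire prompt; this populates the KV cache.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model(**inputs, use_cache=True)
    torch.cuda.synchronize()
    prefill_ms = (time.perf_counter() - t0) * 1000

    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(-1)

    # Decode: each step feeds only the newest token plus the cached keys/values.
    decode_ms = []
    for _ in range(20):  # arbitrary number of decode steps for averaging
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        torch.cuda.synchronize()
        decode_ms.append((time.perf_counter() - t0) * 1000)
        past = out.past_key_values
        next_token = out.logits[:, -1:].argmax(-1)

print(f"prefill: {prefill_ms:.1f} ms")
print(f"mean decode step: {sum(decode_ms) / len(decode_ms):.1f} ms")
```

The two numbers this prints correspond directly to the 80ms first-token figure and the 12ms per-token figure in the question.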