A developer deploys a 7B LLM for an API service. They notice that generating the first token of a response takes 80ms (prefill latency), while each subsequent token takes 12ms (decode latency). Users complain that short responses feel slow, even though total latency is acceptable for long responses. A colleague says: "The first-token latency and per-token latency have completely different bottlenecks — optimizing one doesn't optimize the other." What are the two different bottlenecks, and why does the KV cache matter for one but not the other?
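
For context on how a split like 80ms/12ms would be observed in practice, here is a minimal measurement sketch using Hugging Face Transformers. The model name, prompt, and number of decode steps are illustrative assumptions, not details from the scenario above; it times the prompt forward pass (prefill, which also builds the KV cache) separately from each single-token step (decode, which reuses the cache).

```python
# Sketch: measure prefill vs. per-token decode latency for a causal LM.
# Assumes a CUDA GPU and a 7B-class model; adjust model_name as needed.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical choice of 7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).cuda().eval()

prompt = "Explain the difference between prefill and decode latency."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    # Prefill: one forward pass over the entire prompt; this populates the KV cache.
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model(**inputs, use_cache=True)
    torch.cuda.synchronize()
    prefill_ms = (time.perf_counter() - t0) * 1000

    past = out.past_key_values
    next_token = out.logits[:, -1:].argmax(-1)

    # Decode: each step feeds only the newest token plus the cached keys/values.
    decode_ms = []
    for _ in range(20):  # arbitrary number of decode steps for averaging
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        out = model(input_ids=next_token, past_key_values=past, use_cache=True)
        torch.cuda.synchronize()
        decode_ms.append((time.perf_counter() - t0) * 1000)
        past = out.past_key_values
        next_token = out.logits[:, -1:].argmax(-1)

print(f"prefill: {prefill_ms:.1f} ms")
print(f"mean decode step: {sum(decode_ms) / len(decode_ms):.1f} ms")
```

The two numbers this prints correspond directly to the 80ms first-token figure and the 12ms per-token figure in the question.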