A team training a 7B-parameter LLM has a compute budget of 1.4×10²¹ FLOPs. Their initial plan was to train on 200B tokens. A colleague who has read the Chinchilla paper objects: "You're significantly over-parameterized for your compute budget." What does the Chinchilla scaling law predict as the compute-optimal allocation for this budget, and why does it matter?
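
A minimal sketch of the arithmetic, assuming the standard approximations from the Chinchilla paper (Hoffmann et al., 2022): training compute C ≈ 6·N·D FLOPs for a model with N parameters trained on D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter (D ≈ 20N):

```python
import math

# Assumptions (Hoffmann et al., 2022): C ≈ 6*N*D FLOPs,
# compute-optimal at roughly D ≈ 20*N (20 tokens per parameter).
C = 1.4e21  # compute budget in FLOPs

# Substituting D = 20*N into C = 6*N*D gives C = 120*N^2, so:
N_opt = math.sqrt(C / 120)  # optimal parameter count
D_opt = 20 * N_opt          # optimal token count

print(f"Optimal parameters: {N_opt / 1e9:.1f}B")  # ~3.4B
print(f"Optimal tokens:     {D_opt / 1e9:.1f}B")  # ~68B

# For comparison, the planned run (7B params on 200B tokens):
planned = 6 * 7e9 * 200e9   # = 8.4e21 FLOPs
print(f"Planned run: {planned:.2e} FLOPs vs budget {C:.2e}")
```

Under these assumptions the budget supports roughly a 3.4B-parameter model trained on about 68B tokens, so the colleague's point stands: 7B parameters is well above compute-optimal for 1.4×10²¹ FLOPs. Note also that the planned 7B/200B run would itself cost about 8.4×10²¹ FLOPs, roughly six times the stated budget.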