After fine-tuning a GPT-2 model (117M parameters) on customer support data, you re-evaluate it on general-purpose benchmarks (HellaSwag, WinoGrande) and find that its scores have dropped significantly relative to the base model. What is this phenomenon called, and what is the simplest architectural fix that prevents it while still allowing task-specific adaptation?