A deployed text classification model suddenly shows increased loss across all classes simultaneously. Infrastructure metrics are normal (no latency spike, no error rate). Feature distribution PSI scores are below alert thresholds. The team suspects a silent failure but can't identify the cause. What diagnostic framework systematically identifies the root cause, and what class of monitoring gap does this reveal?
A Redeploy the model — increased loss without clear cause is always a deployment artifact B Systematic silent failure diagnosis: (1) Layer-by-layer isolation — check raw input text samples (has the upstream text extraction/cleaning pipeline changed? Unicode normalization? HTML stripping added unexpectedly?); (2) Tokenizer validation — if the model uses a tokenizer, verify the tokenizer version hasn't been updated silently (tokenizer upgrades can change token IDs for common words, causing systematic prediction distribution shifts); (3) Preprocessing regression test — run a fixed set of test inputs through the full preprocessing → model pipeline and compare outputs to a saved baseline; a hash mismatch on deterministic test cases proves a preprocessing change; (4) Model file integrity — verify the deployed model file hash matches the expected hash (deployment systems occasionally deploy wrong artifact versions); (5) Batch normalization/LayerNorm mode — verify model is in eval() mode (not training mode) in PyTorch; training mode uses batch statistics, causing non-deterministic outputs under variable batch sizes; the monitoring gap revealed: standard monitoring (feature PSI, accuracy) misses silent infrastructure mutations (tokenizer upgrades, preprocessing code changes, batch norm mode bugs) because they affect the model in ways that don't show up in input feature distributions C Increase training data volume and retrain — the model may have drifted D The increased loss is from seasonal patterns — wait for the season to pass