Text Preprocessing
A pipeline tokenizes the sentence "Dr. Smith lives in Washington D.C." by splitting on whitespace, so the downstream model receives 6 tokens, with the periods still attached to "Dr." and "D.C.". A second pipeline first applies a rule-based sentence tokenizer, which produces 1 sentence, and then word-tokenizes that sentence. Why does naive whitespace tokenization fail here compared to the rule-based sentence tokenizer?
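The contrast above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the `ABBREVIATIONS` set and the `rule_based_sentences` helper are hypothetical stand-ins for what a real rule-based sentence tokenizer (e.g. one with a learned or curated abbreviation list) does internally.

```python
import re

text = "Dr. Smith lives in Washington D.C."

# 1. Whitespace tokenization: punctuation stays glued to the words,
#    and nothing distinguishes an abbreviation period from a full stop.
ws_tokens = text.split()
print(ws_tokens)  # ['Dr.', 'Smith', 'lives', 'in', 'Washington', 'D.C.']

# 2. Naive sentence splitting on '.' treats every period followed by
#    whitespace as a boundary, so it breaks after "Dr.".
naive_sents = [s for s in re.split(r"(?<=\.)\s+", text) if s]
print(naive_sents)  # two fragments, not one sentence

# 3. A rule-based splitter consults an abbreviation list before
#    committing to a sentence boundary (hypothetical minimal version).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "d.c.", "u.s."}

def rule_based_sentences(text: str) -> list[str]:
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # Only end a sentence on a period that is NOT an abbreviation.
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(rule_based_sentences(text))  # the single original sentence
```

The whitespace splitter yields 6 tokens with punctuation fused to "Dr." and "D.C.", while the abbreviation-aware splitter correctly keeps the input as one sentence before word tokenization.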