A team migrates their NLP pipeline from a character-level tokenizer to a BPE (Byte-Pair Encoding) tokenizer and notices the model trains faster and achieves lower perplexity on the same dataset. Their intern attributes this entirely to the larger vocabulary size. What is the more precise mechanism behind BPE's advantage over character-level tokenization for language modeling?
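For context, the scenario above hinges on what BPE merging actually does to the token sequence. The toy sketch below (not the team's pipeline; `learn_merges`, the sample corpus, and the merge count are illustrative assumptions) learns a few greedy merges from scratch and compares the resulting sequence length against a purely character-level view of the same text:

```python
# Toy BPE sketch: greedily merge the most frequent adjacent symbol pair and
# compare total token count against character-level tokenization.
from collections import Counter

def learn_merges(corpus, num_merges):
    """Learn greedy BPE-style merges on a whitespace-split toy corpus."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        # Replace every occurrence of the best pair with a merged symbol.
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return words

corpus = "lower lowest newer newest wider widest " * 50
char_tokens = sum(len(w) for w in corpus.split())
bpe_words = learn_merges(corpus, num_merges=20)
bpe_tokens = sum(len(word) * freq for word, freq in bpe_words.items())
print(f"character-level tokens: {char_tokens}")
print(f"tokens after 20 BPE merges: {bpe_tokens}")
```

Running this shows the BPE view of the text is several times shorter than the character view, which is the quantity the question is probing rather than vocabulary size alone.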