Topic: BERT and Variants (easy)
BERT's Masked Language Model (MLM) pretraining randomly masks 15% of tokens and trains the model to predict them. An engineer asks why BERT doesn't just mask all tokens (100%) and predict the whole sequence at once. What fundamental issue would this cause?
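For reference, here is a minimal sketch of the masking step the question describes, assuming a toy whitespace tokenizer and a string "[MASK]" placeholder rather than BERT's real WordPiece vocabulary; the 80/10/10 replacement rule from the original paper is omitted. Setting `mask_prob` to 1.0 reproduces the 100% scenario the question asks about.

```python
import random

MASK_TOKEN = "[MASK]"  # toy stand-in for BERT's mask token id
MASK_PROB = 0.15       # fraction of tokens selected for prediction, as in the question

def mask_tokens(tokens, mask_prob=MASK_PROB, seed=None):
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so the MLM loss is computed only over the masked slots.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)   # model must reconstruct this position
            labels.append(tok)
        else:
            masked.append(tok)          # stays visible as context for the prediction
            labels.append(None)
    return masked, labels

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, seed=0))
# With mask_prob=1.0 every position becomes [MASK], leaving no unmasked
# context, which is the situation the question asks you to reason about.
```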