A team switching from an LSTM-based translation model to a transformer-based one notices that training time drops dramatically, even though the dataset size was increased. The hardware is unchanged. Which structural property of the transformer is the *primary* cause of this training speedup?
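
A minimal sketch may help ground the contrast the question is probing (PyTorch is an assumption here; the sizes and module choices are illustrative, not taken from the question). It shows the per-timestep recurrence an LSTM is forced into versus the single-pass batched matrix computation of self-attention:

```python
import torch
import torch.nn as nn

seq_len, batch, d_model = 128, 32, 256
x = torch.randn(seq_len, batch, d_model)  # (time, batch, features)

# LSTM: h_t depends on h_{t-1}, so the time dimension must be walked step by
# step; this loop is inherent to the recurrence (nn.LSTM merely hides it).
cell = nn.LSTMCell(d_model, d_model)
h = torch.zeros(batch, d_model)
c = torch.zeros(batch, d_model)
for t in range(seq_len):              # sequential: step t waits on step t-1
    h, c = cell(x[t], (h, c))

# Self-attention: every position attends to every other via batched matrix
# multiplies, so the entire sequence is processed in one parallel pass.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8)
out, _ = attn(x, x, x)                # parallel: no dependence across steps
```

On a GPU the second pattern keeps all sequence positions in flight at once, which is why per-step wall-clock training time can fall even as the dataset grows.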