Select Topic
Introduction to Neural Networks (15)
Neurons and Perceptrons (15)
Activation Functions (15)
Forward Propagation (15)
Loss and Cost Functions (15)
Backpropagation (15)
Optimizers (15)
ANN Architectures (15)
Regularization and Normalization (15)
Weight Initialization (15)
CNN Architectures (15)
RNN / LSTM / GRU (15)
Attention and Transformers (DL) (15)
Self-Supervised and Contrastive Learning (15)
Graph Neural Networks (15)
Transfer Learning (15)
Easy
Attention and Transformers (DL)
Scaled dot-product attention computes: Attention(Q, K, V) = softmax(QK^T / √d_k)V. Why is the scaling factor √d_k used, and what happens without it?
A
√d_k scaling prevents numerical overflow in the softmax input
B
Without scaling, the dot products q·k have variance proportional to d_k: if the components of q and k are independent with zero mean and unit variance, each of the d_k terms contributes variance ≈ 1, so the sum has variance d_k. For d_k = 64, the logits therefore have std ≈ √64 = 8. Logits that large push the softmax into its saturation region, where the output is nearly one-hot and the gradient is close to zero. Dividing by √d_k restores roughly unit variance in QK^T/√d_k. Without the scaling, large d_k leads to vanishing gradients through the attention softmax during training.
C
√d_k scaling is applied to reduce the computational cost of the softmax operation
D
The scaling prevents the attention weights from summing to values greater than 1
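A minimal NumPy sketch (not part of the original question) illustrating the variance argument behind option B; the choice of d_k = 64 and the random, unit-variance Q and K are assumptions made only for the demonstration:

import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_k = 64                 # key/query dimension (assumed, matching the example above)
n_queries, n_keys = 1, 16

# Queries and keys with zero-mean, unit-variance components.
Q = rng.standard_normal((n_queries, d_k))
K = rng.standard_normal((n_keys, d_k))

logits = Q @ K.T                 # unscaled logits: variance ~ d_k, std ~ sqrt(d_k) = 8
scaled = logits / np.sqrt(d_k)   # scaled logits: variance ~ 1

print("std of unscaled logits:", logits.std())   # roughly 8
print("std of scaled logits:  ", scaled.std())   # roughly 1

# Saturation check: with unscaled logits the softmax is nearly one-hot,
# which is the regime where gradients through the attention weights vanish.
print("max attention weight, unscaled:", softmax(logits).max())
print("max attention weight, scaled:  ", softmax(scaled).max())

Running the sketch typically shows the unscaled softmax concentrating almost all of its mass on a single key, while the scaled version stays spread out, which is the behavior the explanation in option B describes.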