Deep Learning Practice

Hard: Weight Initialization

You train a 110-layer deep residual network. With standard Kaiming initialization on every conv layer, the forward-pass variance explodes at initialization: after 110 layers, the output feature map's values are about 10^15× larger than the input's. The skip connections do not prevent this, because each residual branch adds its own variance on top of the identity path. Which initialization fix specifically targets deep residual networks, and how does zero-initializing the γ of the last BN in each residual block solve the variance explosion?
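The effect the question describes can be reproduced in a small toy model. The sketch below is illustrative only and makes several assumptions that are not part of the question: blocks are fully connected rather than convolutional, each block computes `x + gamma * W2 @ relu(W1 @ x)`, and the scalar `gamma` merely stands in for the scale of the last BN in the branch (no normalization is actually performed). The names `forward_gain`, `kaiming_matrix`, and the width of 32 are hypothetical choices for the demo. With `gamma = 1` every branch adds variance, so the mean square of the activations grows multiplicatively with depth; with `gamma = 0` every block starts as the identity and the signal passes through unchanged.

```python
import math
import random


def kaiming_matrix(n, rng):
    """n x n weight matrix with Kaiming-normal entries, std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / n)
    return [[rng.gauss(0.0, std) for _ in range(n)] for _ in range(n)]


def matvec(w, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mean_square(x):
    """Mean of squared entries, used as a proxy for activation variance."""
    return sum(v * v for v in x) / len(x)


def forward_gain(num_blocks, gamma, width=32, seed=0):
    """Return mean_square(output) / mean_square(input) for a toy residual stack.

    Each block computes x + gamma * W2 @ relu(W1 @ x). The scalar `gamma`
    is an assumed stand-in for the last BN's scale in the residual branch.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = mean_square(x)
    for _ in range(num_blocks):
        w1 = kaiming_matrix(width, rng)
        w2 = kaiming_matrix(width, rng)
        hidden = [max(0.0, v) for v in matvec(w1, x)]  # ReLU
        branch = matvec(w2, hidden)
        x = [xi + gamma * bi for xi, bi in zip(x, branch)]  # skip + branch
    return mean_square(x) / start


if __name__ == "__main__":
    blocks = 54  # roughly 110 layers at two weight layers per block
    print("gamma = 1:", forward_gain(blocks, gamma=1.0))
    print("gamma = 0:", forward_gain(blocks, gamma=0.0))
```

Running the script prints a very large gain for `gamma = 1` and a gain of exactly 1 for `gamma = 0`. The toy model's exact growth factor should not be expected to match the 10^15× figure in the question, since it depends on the block structure assumed here.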
