Activation Functions (easy)
A sigmoid activation outputs values in (0, 1). You use it in the hidden layers of a 10-layer deep network. During training you observe that gradients in the first 3 layers are approximately 10⁻⁶, while gradients in the last 3 layers are approximately 0.1. What causes this disparity, and what is the standard fix?
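The magnitude gap in the question can be reproduced with a quick back-of-the-envelope sketch: the sigmoid derivative is at most 0.25 (attained at x = 0), and backpropagation multiplies one such factor per layer, so the gradient factor shrinks geometrically toward the early layers. The snippet below is a minimal illustration of that multiplication, not a full training run.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, at x = 0

# Backprop chains one derivative factor per layer. Even in the
# best case (0.25 per layer), ten layers shrink the gradient by
# roughly a factor of 10^6 — matching the disparity in the question.
grad = 1.0
for layer in range(1, 11):
    grad *= sigmoid_derivative(0.0)  # best-case factor of 0.25
    print(f"after layer {layer:2d}: gradient factor ≈ {grad:.2e}")
```

Note that 0.25¹⁰ ≈ 9.5 × 10⁻⁷, so even under the most favorable assumption the early-layer gradients land near the 10⁻⁶ reported in the question; in practice saturated units push the per-layer factor well below 0.25.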