How a simple activation function changed deep learning
Activation functions are at the heart of each neural unit: once the inputs have been multiplied by the weights and a bias added, this linear value is passed through a nonlinear function, and the result becomes the input for the next layers of the network. In the past just a couple of functions were used, the sigmoid and the hyperbolic tangent: these two functions are quite smooth at the extremes, and at some point it was discovered that a much simpler function could do the job better. This post summarizes the content of “Deep Sparse Rectifier Neural Networks”, presented at AISTATS 2011 and published in the JMLR Workshop and Conference Proceedings, where Xavier Glorot, Antoine Bordes and Yoshua Bengio changed the deep learning landscape once again.
Below you can see the definitions and a picture of the most common activation functions:


The sigmoid gently goes from 0 to 1, while the tanh goes from -1 to 1; I had already encountered them while reading about Long Short-Term Memory (LSTM) networks, where they are extensively used to block or propagate memories of previous events. Notice there are a lot of exp functions to compute; the hinge-shaped rectifier, by contrast, is trivial and does not require complex hardware. The hinge just says: “below this hyperplane I do not care what happens”.
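As an illustration (mine, not from the paper), the three activations can be sketched in a few lines of NumPy; the contrast in computational cost is visible right away, since only the rectifier avoids exponentials:

```python
import numpy as np

def sigmoid(x):
    # logistic function: squashes any input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # hyperbolic tangent: squashes any input into (-1, 1)
    return np.tanh(x)

def rectifier(x):
    # rectifier (ReLU): max(0, x) -- no exponentials, just a comparison
    return np.maximum(0.0, x)
```

All three accept scalars or whole NumPy arrays, so they can be applied to a full layer of pre-activations at once.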
Coming back to the paper: in 2006 Hinton introduced his layer-by-layer pre-training procedure, which allowed deep networks to be trained from a good starting point. The authors were investigating why the usual training procedure was not working well and why Hinton’s was better. They decided to experiment with the simple rectifier function, and they found that it gave very good results; so good that in many cases the pre-training procedure was no longer needed.
One caveat is that rectifiers also need an L1 regularization: since the function is purely linear to the right of 0, the model can be affected by unbounded-activation problems, as the coefficients can grow and grow with no penalty for very large values. The regularization keeps the learned parameters small by penalizing models with very high coefficients. Sigmoids and tanh, by contrast, saturate at 1 in that case.
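A minimal sketch of that L1 term (my own illustration, with a hypothetical `lam` strength, not a value from the paper): it is simply the sum of absolute coefficient values, scaled and added to the training loss.

```python
import numpy as np

def l1_penalty(weight_matrices, lam=1e-4):
    # L1 regularization term: lam * sum of absolute coefficient values.
    # Added to the loss, it discourages unboundedly large weights,
    # which the rectifier's linear right side would otherwise permit.
    return lam * sum(np.abs(w).sum() for w in weight_matrices)
```

Because the penalty grows linearly with each coefficient's magnitude, it also pushes many coefficients exactly to zero, reinforcing the sparsity discussed below.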
Having an asymmetric function like the rectifier is not a limitation: if something symmetric needs to be learned, it can be done with a pair of rectifiers with swapped signs. A few more neurons may be needed, but with a much simpler activation function. The paper draws an interesting comparison with the “leaky integrate-and-fire” model used to represent the activation of real biological neurons: this model also presents an asymmetry at 0, which makes tanh models implausible, since they force an unnatural antisymmetry in the learned model.
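The pair-of-rectifiers trick is easy to verify numerically (this small demo is mine, not code from the paper): summing two sign-swapped rectifiers recovers the symmetric absolute value, and subtracting them recovers the identity.

```python
import numpy as np

def relu(x):
    # the rectifier: max(0, x)
    return np.maximum(0.0, x)

x = np.linspace(-3.0, 3.0, 7)

# Two rectifiers with swapped signs build a symmetric function:
assert np.allclose(relu(x) + relu(-x), np.abs(x))

# ...and their difference reproduces the identity, so nothing is lost:
assert np.allclose(relu(x) - relu(-x), x)
```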
The paper also makes another point: in nature the percentage of neurons producing something other than 0 is very low (1 to 4%), while with sigmoids all of them output something around 0.5. A model with rectifiers is much closer to what happens in nature: when something is not interesting, 0 is propagated, no energy is required, and clear paths are selected through the network. Sparsity is interesting also because it is a sign of disentangling: only some changes to some variables affect the output, whereas in a very entangled model a small change to any variable affects the output. A disentangled model is therefore more robust and explainable, since only some paths are taken to compute the result.
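The difference is easy to see on synthetic data (my illustration, using hypothetical zero-mean Gaussian pre-activations): a rectifier outputs exact zeros for every negative input, while a sigmoid never outputs exactly zero, so every unit keeps "computing something".

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.normal(size=10_000)  # hypothetical pre-activation values

relu_out = np.maximum(0.0, pre)
sigmoid_out = 1.0 / (1.0 + np.exp(-pre))

relu_sparsity = np.mean(relu_out == 0.0)     # fraction of exact zeros
sigmoid_sparsity = np.mean(sigmoid_out == 0.0)  # sigmoid is never exactly 0
```

With zero-mean inputs roughly half the rectifier outputs are exact zeros; in a trained network with biases and L1 regularization the fraction of silent units can be pushed much higher.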
Another point is the vanishing gradient problem: since the rectifier does not saturate (its derivative is exactly 1 for any positive input), the gradient is propagated back to the previous layers undiminished, and training is possible even with many, many layers stacked. A potential problem, conversely, is that when the rectifier outputs 0 there is no gradient propagated at all: it is just a flat zero line! But after some experiments comparing it with a smoothed version of the rectifier, the authors realized that this is not really a problem. In some architectures, like denoising autoencoders for unsupervised learning, the softplus function or a normalization is needed to avoid this zeroing-out problem; see the paper for details.
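The smoothed version mentioned in the paper is the softplus; a quick sketch (mine) shows why it sidesteps the flat-zero region: its derivative is the sigmoid, which is strictly positive everywhere, so a little gradient always flows.

```python
import numpy as np

def softplus(x):
    # smooth approximation of the rectifier: log(1 + exp(x))
    return np.log1p(np.exp(x))

def softplus_grad(x):
    # derivative of softplus is the sigmoid: strictly positive,
    # so some gradient flows even where the rectifier is flat at 0
    return 1.0 / (1.0 + np.exp(-x))
```

For large positive inputs softplus approaches the rectifier itself, while for large negative inputs it approaches 0 without ever reaching a gradient of exactly zero.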
The authors report the results of many experiments. On image recognition, rectifiers were able to learn a very good model without the pre-training procedure, producing a good sparse model along the way. On tasks such as sentiment analysis, instead, pre-training is still useful, although it becomes less so when many labeled samples are available.
The authors have done a great job making this complex matter clear and understandable; I really encourage you to read the original article if you have time.