
What is “Dropout” in neural networks


Neural networks are composed of layers of neurons and are trained using back-propagation. Since these structures contain thousands of elements and millions of connections, it is hard for structure and specialization to emerge in subsets of neurons during training. Neurons often co-adapt: they produce the desired output, but by relying on odd relations in the input data. The “dropout” technique, introduced in 2014, promotes neuron independence and specialization, improving prediction performance.

I am referring to this paper:

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research 15 (2014): 1929–1958.

It is easy to explain what dropout is: suppose you have a 2-layer network, one layer with 100 neurons and the other with just 10. During normal training, all the parameters of these neurons are tuned together. With dropout, instead, you randomly turn each neuron on or off, independently of the others, throughout the training. For instance, the first minibatch of samples will be used to train neurons 1, 7, 8, 14… while the second minibatch will be used for neurons 2, 5, 7, 11, 14… It is as if you trained a different model each time, but this is not entirely true, because each neuron keeps its own parameters for the whole training. Let p_1 = 80% be the probability of keeping a neuron active in layer 1, and p_2 = 50% the probability of keeping a neuron active in layer 2. In this case you do not train the full 100 × 10 network all the time: at each step you train, on average, a sub-network of 80 × 5 neurons.

On the left, the original network; on the right, one dropout step: only 50% of the neurons in layer 2 (top) and 80% in layer 1 (bottom) are kept. The white neurons will not participate in the training during this step.
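To make the masking concrete, here is a minimal sketch in Python/NumPy of one dropout training step for the 100 × 10 example above. The ReLU nonlinearity, the 784-dimensional input (a flattened 28 × 28 MNIST image) and the random initial weights are my assumptions for illustration, not the paper’s actual implementation.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for the 2-layer example: 100 neurons, then 10.
W1, b1 = 0.01 * rng.normal(size=(100, 784)), np.zeros(100)
W2, b2 = 0.01 * rng.normal(size=(10, 100)), np.zeros(10)

p1, p2 = 0.8, 0.5  # keep probabilities for layer 1 and layer 2

def train_forward(x):
    # Sample one Bernoulli mask per layer: a fresh sub-network for this minibatch.
    m1 = rng.binomial(1, p1, size=100)
    m2 = rng.binomial(1, p2, size=10)
    h1 = np.maximum(0.0, W1 @ x + b1) * m1  # dropped neurons output exactly 0
    out = (W2 @ h1 + b2) * m2
    return out

out = train_forward(rng.normal(size=784))  # one fake input, just to run the sketch

Only the weights of the neurons that were kept receive a gradient update for this minibatch; the masked ones simply sit out this step and keep their current parameters.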

There is one point to be considered about the learned parameters. At the end of training, how do you use the model? You simply turn all the units on, BUT YOU SCALE THE LEARNED WEIGHTS BY p, because you trained many, many smaller models, and their weights must be adjusted to maintain the right proportions. The final model has more neurons than those used during training: in the example above, 100 × 10 instead of 80 × 5. You are creating a larger model by combining an ensemble of many smaller ones. This differs from the usual ensemble approach, because the models trained here all have the same architecture and share parameters with each other; in classical ensembles, you put together models with different architectures, trained independently. One note about Restricted Boltzmann Machines / Deep Belief Networks: you use p only on the hidden layer, not on the visible one; see the paper for details.
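Continuing the sketch above (same assumed weights and keep probabilities), prediction uses all the units and scales each layer’s output by its keep probability, which corresponds to the weight scaling described in the paper.

def predict(x):
    # All units are on; each layer's output is scaled by its keep probability,
    # so it matches the average output of the sub-networks seen during training.
    h1 = p1 * np.maximum(0.0, W1 @ x + b1)
    return p2 * (W2 @ h1 + b2)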

What are the effects of dropout on the trained model? The best thing I can do to explain it is to copy and paste a couple of pictures from the original paper.

Taken from page 16 of the original paper. On the left, some image filters learned with classical training; on the right, the filters obtained with dropout. The training task was recognizing handwritten digits from the MNIST data set.

In the picture above you can clearly see that the dropout filters are very specialized: each works on a specific part of the image. The classical-training filters, instead, are blurry and pick up weird correlations between distant pixels.

Another important point is about activation and model sparsity: see for instance figure 8 in the original paper.

By applying dropout we reduce the number of neurons used to compute the result: fewer neurons are involved, each with a higher activation. This is beneficial because it makes the model more robust: each output depends on fewer inputs, so a small change in something unrelated will not change the prediction. The model will also consume less energy to compute the result, because many neurons will not do any work and will just produce a 0 output.

The suggested value of p is 80% for the input layer and 50% for the hidden layers, but this is just another hyperparameter and you will need to run some experiments to tune it.
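If you use a modern framework instead of writing the masks by hand, beware that the conventions differ slightly. As a small, hypothetical PyTorch sketch of a 784 → 100 → 10 network: torch.nn.Dropout takes the probability of dropping a unit (so an 80% keep probability becomes p=0.2), and it uses “inverted dropout”, rescaling the kept activations by 1/(1-p) during training so that no weight scaling is needed at prediction time.

import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(p=0.2),   # keep ~80% of the input units
    nn.Linear(784, 100),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # keep ~50% of the hidden units
    nn.Linear(100, 10),
)

model.train()  # dropout is active during training
model.eval()   # dropout becomes the identity at prediction time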

So why not apply dropout every time? Training will require more time! You are training many different models and putting them together; in the end you will have a better model, but it will take 2 or 3 times as long to train.

If you read the original paper you will find the details of the math behind this approach. Activating the neurons randomly can be seen as sampling from a Bernoulli distribution, and you could use other distributions, such as a Gaussian, instead. Averaging the models together can be done in other ways than just scaling the parameters, but this method works quite well in the end. The whole method can be seen as a regularization approach, like adding noise in denoising autoencoders. You will also learn about the interaction with pre-trained models (see my previous post on deep belief networks), and see many examples of where dropout can be used and how much it improves performance.

Written by Giovanni

May 14, 2023 at 6:56 am
