Weight initialization in neural network training
The neural network training algorithm changes the weights associated with neuron inputs and the biases so that the network learns to reproduce the desired output. But how do you initialize all these weights before training starts? Is there a way to initialize them that speeds up learning?
From my previous reading I understood that neural networks implement non-convex functions, that is, functions with multiple local minima. These minima can be a trap: if the training algorithm does not explore enough of the parameter space, the model will get stuck and produce sub-optimal results. Ideally we would like the training to discover the global minimum of the function, so that it can produce the best possible results. We search for a minimum because it represents the smallest possible difference between predicted and training outputs.
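A toy example makes the trap concrete. The function below (my own illustration, not from the paper) has two minima; plain gradient descent ends up in a different one depending only on where it starts:

```python
# Toy non-convex "loss": f(x) = x^4 - 3x^2 + x has two minima,
# only one of which is the global minimum (near x = -1.30).
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Run plain gradient descent from starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# Two different starting points converge to different minima:
x_left = gradient_descent(-2.0)   # reaches the global minimum, x ≈ -1.30
x_right = gradient_descent(2.0)   # stuck in the local minimum, x ≈ 1.13
print(x_left, f(x_left))
print(x_right, f(x_right))
```

Neither run is "wrong"; gradient descent did its job in both cases. The only difference is the initialization, which is exactly why choosing the starting point matters.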
The role of parameter initialization is therefore to choose where the minimum search starts in the parameter space. If we start the search close to the global minimum, training will be quick and give the best results. If we start from the wrong place, training might not converge at all.
But is a good initialization really possible? After all, we run a long and complex training precisely because it is not possible to figure out the good model directly. This week I read this paper:
A Weight Initialization Method Associated with Samples for Deep Feedforward Neural Network
Yanli Yang, Yichuan He
ICCDE ’20, January 4–6, 2020, Sanya, China © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7673-0/20/01
https://doi.org/10.1145/3379247.3379253
The authors acknowledge that the most widely used way to initialize the parameters is just a random choice. After all, if you do not know where the minimum is, any place is as good a starting point as another. They also mention other methods, like stacked autoencoders or deep belief networks, but only in a short paragraph that did not satisfy my curiosity.
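In practice the "random choice" is usually not uniform noise but a Gaussian scaled by the layer width; one common scheme is Xavier/Glorot initialization. A minimal sketch (my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    """Xavier/Glorot-style random initialization: Gaussian weights
    with variance scaled by layer width, biases at zero."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    W = rng.normal(0.0, std, size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return W, b

# A small feedforward network with layer sizes 4 -> 8 -> 1,
# with purely random parameters: this would be the paper's W0.
layers = [init_layer(4, 8), init_layer(8, 1)]
```

The scaling keeps activations from blowing up or vanishing as depth grows, but the starting point is still random: it carries no information about the training data.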
The authors suggest initializing a deep feedforward neural network with this procedure:
- First of all, you randomly draw values for all the parameters; let's call this set of parameters W0, the weights at time 0.
- You then train the network for just one epoch. At this point the parameters will have evolved into W1, the weights at epoch/time 1. Notice that these weights incorporate some information coming from the training samples, though not much, as it is just one epoch.
- Restart the training from W* = a W0 + b (W1 - W0), where a and b are two coefficients, say a = 0.9 and b = 0.3.
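The three steps above can be sketched in a few lines. I use a tiny linear model with SGD purely as a stand-in (the paper applies the recipe to deep feedforward networks, and the coefficient values are just the examples from this post):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression data, only to make the sketch runnable.
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

def one_epoch(w, lr=0.01):
    """One SGD pass over the data; returns the updated weights."""
    w = w.copy()
    for xi, yi in zip(X, y):
        grad = 2 * (xi @ w - yi) * xi   # gradient of squared error
        w -= lr * grad
    return w

# Step 1: random initialization W0.
w0 = rng.normal(size=3)

# Step 2: a single epoch of training gives W1, which carries
# a little information from the training samples.
w1 = one_epoch(w0)

# Step 3: restart training from the blend W* = a*W0 + b*(W1 - W0).
a, b = 0.9, 0.3
w_star = a * w0 + b * (w1 - w0)
```

Note that W* is neither W0 nor W1: it shrinks the random component (a < 1) and adds only a fraction (b) of the direction that one epoch of training moved the weights in.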
This W* therefore has a random component and a data-driven component. According to their experiments this can shorten training by hundreds of epochs. This is amazing, as W1 comes from just one epoch of training; actually, it puzzles me. Why is simply continuing the training from W1 worse than restarting from W*? The paper does not provide a theoretical explanation; it just concludes that more experimentation is needed. It reminds me of curriculum learning: you start training the model on simpler examples and then move to complex ones, trying to push the parameter search closer to the optimal minimum. With this initialization you maybe start moving in the right direction, but that seems just an intuition. I am quite unsatisfied with the lack of theory behind the presented results; in the following weeks I will look for more references to this technique to seek some confirmation.