Long Short-Term Memory (LSTM)
After reading the short-term weather forecast paper, I was curious about LSTMs and decided to learn more about them. On Wikipedia I found a reference to this article:
LSTM: A Search Space Odyssey. Klaus Greff, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, Jürgen Schmidhuber. DOI: 10.1109/TNNLS.2016.2582924
It was nice to see that all the authors are from a university really close to my home town! The CV of Jürgen Schmidhuber is really impressive too.
It was probably not the best pick as a first article, because it is not really about the origins of LSTM; instead it reports on a huge experiment with different variants, measuring and comparing their performance. On the other hand, it was interesting to see how they managed to compare all these variants fairly, and how they did the hyperparameter tuning.
But what are LSTM neural networks for? Looking at the three use cases used to compare the variants' performance, I see:
- recognising 61 different sounds humans can pronounce, using the TIMIT speech corpus
- recognising handwritten text. The IAM Online Handwriting Database contains handwritten vector images and the ASCII text corresponding to each picture. It is not just a picture: it also contains information on when the pen was lifted.
- J. S. Bach chorales, where the machine learning model has to learn how harmonies are created (https://axon.cs.byu.edu/Dan/673/papers/allan.pdf)
So they are all tasks where evolution over time is very important.
The LSTM block itself is really complicated! First of all there is not just an input and an output: there are four different input branches: the block input, the input gate, the forget gate, and the output gate. Each of these takes not only the current input, but also the block output at t-1. Internally a cell value c is computed and propagated with a time delay to the forget gate and to the input gate, and without delay to the output gate; these connections are called peepholes. Finally there is the block output. Sorry, it is complex and difficult to describe.
The idea is that a memory cell maintains a state over time, and nonlinear gates regulate how information flows into and out of the cell. I reproduce here a picture similar to the one in the article to give an idea, but since the article is available on arXiv it is much better to check the original one directly, on page 2.

In formulas:
z_t = nonlin(W_z x_t + R_z y_t-1 + b_z)
i_t = nonlin(W_i x_t + R_i y_t-1 + p_i ⊙ c_t-1 + b_i)
f_t = nonlin(W_f x_t + R_f y_t-1 + p_f ⊙ c_t-1 + b_f)
c_t = z_t ⊙ i_t + c_t-1 ⊙ f_t
o_t = nonlin(W_o x_t + R_o y_t-1 + p_o ⊙ c_t + b_o)
y_t = nonlin(c_t) ⊙ o_t
Here the W and R are the input and recurrent weight matrices, the b are bias vectors, the p are peephole weight vectors, and ⊙ is the elementwise product. The nonlinearity is the logistic sigmoid for the three gates and tanh for the block input and output. So complex!
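To make the equations more concrete, here is a minimal NumPy sketch of a single forward step of the vanilla LSTM with peepholes, following the formulas above. The names (lstm_step, W, R, p, b) are mine, not from the paper, and training is ignored entirely; it is just a way to see the data flow.

```python
# Minimal sketch of one vanilla-LSTM forward step with peepholes.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, y_prev, c_prev, W, R, p, b):
    """One time step. W, R, b are dicts keyed by 'z', 'i', 'f', 'o'; p by 'i', 'f', 'o'."""
    # Block input
    z_t = np.tanh(W['z'] @ x_t + R['z'] @ y_prev + b['z'])
    # Input and forget gates, with peepholes on c_{t-1}
    i_t = sigmoid(W['i'] @ x_t + R['i'] @ y_prev + p['i'] * c_prev + b['i'])
    f_t = sigmoid(W['f'] @ x_t + R['f'] @ y_prev + p['f'] * c_prev + b['f'])
    # Cell state: gated new content plus gated old state
    c_t = z_t * i_t + c_prev * f_t
    # Output gate, with peephole on the *current* cell state c_t
    o_t = sigmoid(W['o'] @ x_t + R['o'] @ y_prev + p['o'] * c_t + b['o'])
    # Block output
    y_t = np.tanh(c_t) * o_t
    return y_t, c_t

# Tiny usage example with random weights
rng = np.random.default_rng(0)
n_in, n_cells = 4, 3
W = {k: rng.normal(size=(n_cells, n_in)) for k in 'zifo'}
R = {k: rng.normal(size=(n_cells, n_cells)) for k in 'zifo'}
p = {k: rng.normal(size=n_cells) for k in 'ifo'}
b = {k: np.zeros(n_cells) for k in 'zifo'}
y, c = np.zeros(n_cells), np.zeros(n_cells)
for x in rng.normal(size=(5, n_in)):   # a 5-step input sequence
    y, c = lstm_step(x, y, c, W, R, p, b)
```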
The forget gate is there to let the network learn when to reset itself; my understanding, for the handwriting example, is that the network can learn when a character is completed and a new one is starting, for instance when the text is written in cursive. The peephole connections were introduced to make it easier to learn precise timings.
With such a complex schema it is clear that you can make many small changes to it and ask whether each one is an improvement or not. That is precisely the scope of the article: they compared the vanilla model with 8 variants.
Since each variant has a different schema and different hyperparameters, they first of all had to identify good settings for each model and for each set of training data: a huge amount of work, but needed to ensure that you compare the best possible results you can achieve with each model.
The variants compared are: no input gate (always = 1), forget gate always 1, output gate always 1, no input activation function (the tanh on the block input is dropped), no output activation function (h(c_t) = c_t), coupling the input and forget gates together (f_t = 1 − i_t, see the sketch below), removing the peepholes, and full gate recurrence (many more coefficients, feeding the values of i, f and o at t−1 back into all the gates).
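As an illustration of how small these variants are in code, coupling the input and forget gates would only change one line in the sketch above. A hedged sketch (reusing np, sigmoid and the weight dictionaries from the earlier example; cifg_step is my own name, not from the paper):

```python
# Sketch of the coupled input/forget-gate variant: the forget gate is
# no longer learned separately but derived from the input gate.
def cifg_step(x_t, y_prev, c_prev, W, R, p, b):
    z_t = np.tanh(W['z'] @ x_t + R['z'] @ y_prev + b['z'])
    i_t = sigmoid(W['i'] @ x_t + R['i'] @ y_prev + p['i'] * c_prev + b['i'])
    f_t = 1.0 - i_t   # coupled gate: no W_f, R_f, p_f, b_f to learn
    c_t = z_t * i_t + c_prev * f_t
    o_t = sigmoid(W['o'] @ x_t + R['o'] @ y_prev + p['o'] * c_t + b['o'])
    y_t = np.tanh(c_t) * o_t
    return y_t, c_t
```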
Often these changes are not so important: after training, the different networks reach more or less the same performance. This is not true if you remove the output activation function or the forget gate, so from the results those two components are very important. Coupling the input and forget gates together does not hurt performance but reduces the number of parameters to learn, so it seems a good idea to try it; the same holds for removing the peepholes. Notice also that the very complex full gate recurrence does not bring any advantage.
Another interesting part of the article is about how they tuned the hyperparameters, such as the learning rate, momentum, and Gaussian input noise. They used the fANOVA framework, which let them draw some conclusions on how to tune these parameters.
The learning rate is the most important parameter: start with a high value and decrease it until performance stops improving; smaller values will not do better. Adding Gaussian noise to the inputs was not useful, and they found that it is not necessary to invest much time in tuning the momentum, as it does not significantly improve performance.
They also worked on understanding the influence of one hyperparameter on another, and luckily they seem quite independent, so they can be tuned individually.
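In code, this advice amounts to a coarse logarithmic sweep. A hypothetical sketch (pick_learning_rate is my own name, and train_and_score stands in for whatever training-plus-validation routine you use):

```python
# Hypothetical coarse learning-rate search: start high, divide by a
# fixed factor, and stop when validation performance no longer improves.
def pick_learning_rate(train_and_score, start=1.0, factor=10.0, max_trials=6):
    best_lr, best_score = None, float("-inf")
    lr = start
    for _ in range(max_trials):
        score = train_and_score(lr)   # e.g. validation accuracy for this lr
        if score <= best_score:       # smaller values stopped helping
            break
        best_lr, best_score = lr, score
        lr /= factor
    return best_lr
```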