
Long short-term memory (LSTM) – understanding more


Reading just one article on this subject was really not enough for me to understand it. There are still many points that are not clear, so I decided to spend more time on LSTM: read another article, and also search the internet for more high-level explanations.

I have read the other article referenced by Wikipedia on the subject:

Learning to Forget: Continual Prediction with LSTM. Technical Report IDSIA-01-99, January 1999. Felix A. Gers, Jürgen Schmidhuber, Fred Cummins.

The article is from 1999, and it describes why they introduced the forget gate into the LSTM model and explains how this change made the model much more powerful. The reasoning is that, in a recurrent neural network, the output error has to be back-propagated to tune the weights, and this becomes a problem considering how the state is memorized.

The cell state in the picture is just a delay: the value at time t will be used at time t+1. If it is updated at each step, its value will eventually explode, because errors keep being summed up from the beginning of time. Also, the task you have to solve may not require remembering everything, and it is legitimate at some point to forget what has happened before. When I have to decide whether to turn left or right, it does not matter whether yesterday at the same time I was turning left; what matters is the current trip, not yesterday's trip.
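To make the accumulation problem concrete, here is a tiny toy sketch of my own (not from the paper): a cell state that only accumulates grows without bound, while one that is occasionally reset by a forget-gate-like factor stays small. The constant input and the reset condition are made up just for the illustration.

```python
# Toy illustration: a cell state that only accumulates vs. one that can be reset.
c_no_forget, c_with_forget = 0.0, 0.0
for t in range(100):
    x = 0.5                        # some constant input arriving at every step
    reset = (t % 10 == 0)          # pretend an unrelated new "trip" starts every 10 steps
    c_no_forget = c_no_forget + x                                 # grows forever: 50.0 after 100 steps
    c_with_forget = (0.0 if reset else 1.0) * c_with_forget + x   # never exceeds 5.0
print(c_no_forget, c_with_forget)
```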

This becomes much clearer considering the benchmark they have chosen, the Embedded Reber Grammar (see this explanation: https://cnl.salk.edu/~schraudo/teach/NNcourse/reber.html).


So each string in this grammar can start with BT or BP. If it starts with BT it has to finish with TE; if it starts with BP it has to end with PE. The model then has to remember this information until the end of the string, which has a variable length because, as you can see in the other blocks of the grammar, there are loops. The same grammar block is repeated in the top and bottom branch, but it still contains loops, so it is necessary to remember what happened in the previous inputs. In the paper they explain that, if you generate 8000 strings, there is a big chance you will get at least one string of 50 or more symbols, so the LSTM will need to remember the BP or BT prefix for a long time!
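To make the grammar concrete, here is a small sketch of how ERG strings could be generated. The transition table is my own transcription of the diagram on the Salk page linked above, so treat it as an illustration rather than the exact setup used in the paper.

```python
import random

# Reber grammar transitions: state -> list of (symbol, next_state)
REBER = {
    0: [('B', 1)],
    1: [('T', 2), ('P', 3)],
    2: [('S', 2), ('X', 4)],
    3: [('T', 3), ('V', 5)],
    4: [('X', 3), ('S', 6)],
    5: [('P', 4), ('V', 6)],
    6: [('E', None)],
}

def reber_string():
    """Generate one string of the plain Reber grammar."""
    state, out = 0, []
    while state is not None:
        symbol, state = random.choice(REBER[state])
        out.append(symbol)
    return ''.join(out)

def embedded_reber_string():
    """Generate one ERG string: B, then T or P, then a full Reber string
    (keeping its own B and E here), then the SAME T or P, then E."""
    branch = random.choice('TP')
    return 'B' + branch + reber_string() + branch + 'E'

if __name__ == '__main__':
    for _ in range(5):
        print(embedded_reber_string())
```

Generating a few thousand strings this way also makes it easy to see how often the inner loops produce long strings.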

The LSTM model with forget gate is not only able to predict the next symbol of the input stream for the Embedded Reber Grammar, but can also handle the even more complex problem of recognizing sequences of ERG messages. They call this problem the Continuous Embedded Reber Grammar (CERG), where one ERG string follows the other without interruptions. The forget gate will learn when an ERG sequence is complete and a new one has to start, and will reset the memory cell state; in this way the previous message will not have an effect on the current phrase.

So at the beginning of the training the forget gate will be in a state close to 1, which means keep the memory untouched, and the memory c will be initialized to 0. At some point during the training, the forget gate will learn the correct combination of input and previous state that must trigger a reset; in that case the forget gate will output 0 and make the memory start from scratch. The memory follows a formula like c_{t+1} = … + forget_gate * c_t, so when the forget gate is 0 it clears the memory.
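Here is a minimal NumPy sketch of one LSTM step with a forget gate, just to show where that formula sits in the computation. The weight names W and b are placeholders I made up, not the notation of the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of an LSTM cell with a forget gate (illustrative weight names)."""
    z = np.concatenate([x, h_prev])           # current input plus previous output
    f = sigmoid(W['f'] @ z + b['f'])          # forget gate: ~1 keeps the memory, ~0 resets it
    i = sigmoid(W['i'] @ z + b['i'])          # input gate: lets new information in
    g = np.tanh(W['g'] @ z + b['g'])          # candidate update to the cell state
    o = sigmoid(W['o'] @ z + b['o'])          # output gate
    c = f * c_prev + i * g                    # c_t = input_gate*candidate + forget_gate*c_{t-1}
    h = o * np.tanh(c)
    return h, c

# Tiny usage example with random weights: 7 inputs (a one-hot symbol), 2 memory cells.
n_in, n_cells = 7, 2
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_cells, n_in + n_cells)) for k in 'figo'}
b = {k: np.zeros(n_cells) for k in 'figo'}
h, c = np.zeros(n_cells), np.zeros(n_cells)
x = np.eye(n_in)[0]                           # one-hot encoded symbol
h, c = lstm_step(x, h, c, W, b)
```

The input gate i and output gate o in the sketch are the other two gates described next; since everything is element-wise, c can just as well be a vector holding several memory cells.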

The role of the other gates in the schema, the input and the output gates, is similar. Their values go from 0 to 1 so that, when a value is close to zero, irrelevant noise and inputs will not reach the cell and change the state.

To keep things simple I have spoken of c_t, but actually nothing prevents introducing several cells c_1,t, c_2,t, etc. and having many memory cells. This is actually what is done in the paper, where the model that recognizes sequences of ERG strings is composed of 4 LSTM blocks, each of them having 2 cells c1 and c2.

If you want a clearer and longer explanation I suggest reading https://medium.com/analytics-vidhya/lstms-explained-a-complete-technically-accurate-conceptual-guide-with-keras-2a650327e8f2 where all the pieces are introduced step by step, and the usual picture of the LSTM element is decomposed and explained.

(Image from Wikipedia)
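Since that guide uses Keras, here is a minimal sketch of my own showing how a next-symbol predictor for ERG-like sequences could be set up. The number of memory cells (8) and all other sizes are arbitrary placeholders, not the configuration of the paper.

```python
from tensorflow import keras

n_symbols = 7  # the Reber alphabet: B, T, P, S, X, V, E

model = keras.Sequential([
    keras.Input(shape=(None, n_symbols)),                 # variable-length one-hot symbol sequences
    keras.layers.LSTM(8, return_sequences=True),          # 8 memory cells (arbitrary choice)
    keras.layers.Dense(n_symbols, activation="softmax"),  # probability of the next symbol
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```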

Another nice article to read is this one: http://colah.github.io/posts/2015-08-Understanding-LSTMs/.

Still, after all of this, I ask myself how these models should be sized, and whether there are hints on how many memory cells c to use or how many h elements to consider. I will continue reading to see if I find something interesting.

Written by Giovanni

October 2, 2022 at 8:05 am

Posted in Varie
