Gated Recurrent Unit (GRU) vs Long Short-Term Memory (LSTM)
The paper on neuro-symbolic learning contained a reference to Gated Recurrent Units: this week I decided to understand more about them and see how they differ from LSTM. First of all, you may think of using GRU or LSTM when your problem requires remembering a state for a period of time: for instance, in text analysis, predicting the use of masculine/feminine or singular/plural forms; in a sentence the verb must agree with the subject (she writes every day / they write every day).
Both units are recurrent: at each input step they carry a piece of information forward, but LSTM is by far more complex than GRU. The two models' equations are usually expressed with different notations, and it took me some time to adapt them to be able to do a comparison.
Let c be the recurrent state that is carried forward as time passes, and x the input of a cell. Of course you may have many inputs and many states: you should think of these variables as vectors, not just scalars. To express the formulas I will also use W as weights for the input variable x, R as weights for the state c, and b as biases – these too are vectors and matrices.
There is a simplified version of GRU with these equations:

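Written with the W/R/b notation introduced above, they can be sketched as follows (this is my reconstruction from the description below, so the subscripts and the exact placement of the gate are my own choice):

```latex
\begin{aligned}
f_t       &= \sigma\left(W_f x_t + R_f c_{t-1} + b_f\right) \\
\hat{c}_t &= \tanh\left(W_c x_t + R_c \,(f_t \odot c_{t-1}) + b_c\right) \\
c_t       &= (1 - f_t) \odot c_{t-1} + f_t \odot \hat{c}_t
\end{aligned}
```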
A forget variable f is computed using the current input and the previous cell state. Here sigma is the sigmoid function; we want f to be always near 0 or near 1. When it is near 1 the cell must forget its state, for instance when a new phrase begins and you need to forget the previous subject number.
c hat stands for the candidate future cell state and is computed using the current input and the previous cell state (but the latter only if you must forget the previous state); this time it is passed through tanh to stay between -1 and 1.
c, the true next state, is computed in this way: if I do not need to forget, f is near zero, so I keep the previous c. If I need to forget, I take the new candidate state c hat.
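To make the three formulas concrete, here is a minimal NumPy sketch of a single step of this simplified GRU; the function and parameter names (simplified_gru_step, Wf, Rf and so on) are mine, chosen only for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def simplified_gru_step(x, c_prev, Wf, Rf, bf, Wc, Rc, bc):
    # f: forget variable, pushed near 0 or 1 by the sigmoid
    f = sigmoid(Wf @ x + Rf @ c_prev + bf)
    # c_hat: candidate next state, squashed into [-1, 1] by tanh;
    # the previous state enters it only through f * c_prev
    c_hat = np.tanh(Wc @ x + Rc @ (f * c_prev) + bc)
    # keep the old state when f is near 0, switch to the candidate when f is near 1
    return (1.0 - f) * c_prev + f * c_hat

# Tiny usage example with random weights (hidden size 4, input size 3)
rng = np.random.default_rng(0)
h, d = 4, 3
x, c_prev = rng.normal(size=d), np.zeros(h)
Wf, Wc = rng.normal(size=(h, d)), rng.normal(size=(h, d))
Rf, Rc = rng.normal(size=(h, h)), rng.normal(size=(h, h))
bf, bc = np.zeros(h), np.zeros(h)
print(simplified_gru_step(x, c_prev, Wf, Rf, bf, Wc, Rc, bc))
```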
GRU units learn when it is time to forget the state; otherwise they remember what happened before, even for a long time.
There is a more complex version of GRU that has two distinct variables, reset and update:

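In the same notation, the standard GRU equations look like this (r is the reset variable, u the update variable; the convention on whether u or 1 − u multiplies the candidate varies between papers):

```latex
\begin{aligned}
r_t       &= \sigma\left(W_r x_t + R_r c_{t-1} + b_r\right) \\
u_t       &= \sigma\left(W_u x_t + R_u c_{t-1} + b_u\right) \\
\hat{c}_t &= \tanh\left(W_c x_t + R_c \,(r_t \odot c_{t-1}) + b_c\right) \\
c_t       &= (1 - u_t) \odot c_{t-1} + u_t \odot \hat{c}_t
\end{aligned}
```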
It should be easier now to decipher it; if not, there is a great video by Andrew Ng that explains very well how GRU works.
Let’s now have a look at LSTM

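These are the usual vanilla LSTM equations, again rewritten with W/R/b (z is the block input, i, f and o are the input, forget and output gates, c is the cell state and y the output; the original figure may use a slightly different layout):

```latex
\begin{aligned}
z_t &= \tanh\left(W_z x_t + R_z y_{t-1} + b_z\right) \\
i_t &= \sigma\left(W_i x_t + R_i y_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + R_f y_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o x_t + R_o y_{t-1} + b_o\right) \\
c_t &= i_t \odot z_t + f_t \odot c_{t-1} \\
y_t &= o_t \odot \tanh(c_t)
\end{aligned}
```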
As you see there are many more concepts: the LSTM cell also has an output variable y and a block input z, and the forget gate f rules the way the cell state is updated. Both the output and the cell state have to be carried on to the next computation step. Does this complexity pay off? It seems sometimes yes and sometimes no… but LSTM will be slower to train, as it requires many more coefficients.
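A quick way to see the extra coefficients is to count the parameters of the two layer types; here is a small PyTorch sketch (assuming PyTorch is installed; the sizes are arbitrary):

```python
import torch.nn as nn

input_size, hidden_size = 100, 128

gru = nn.GRU(input_size, hidden_size)
lstm = nn.LSTM(input_size, hidden_size)

def count_params(module):
    return sum(p.numel() for p in module.parameters())

# GRU has 3 gate/candidate blocks, LSTM has 4, so LSTM carries about 4/3 of the weights
print("GRU parameters: ", count_params(gru))   # 3 * (hidden*(input+hidden) + 2*hidden)
print("LSTM parameters:", count_params(lstm))  # 4 * (hidden*(input+hidden) + 2*hidden)
```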
About the comparison between GRU and LSTM, I have found this article:
Mario Toledo and Marcelo Rezende. 2020. Comparison of LSTM, GRU and Hybrid Architectures for usage of Deep Learning on Recommendation Systems. In 2020 The 4th International Conference on Advances in Artificial Intelligence (ICAAI 2020), October 09–11, 2020, London, United Kingdom. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3441417.3441422
The problem the authors approach is what to recommend to an e-commerce site user when you do not yet have information on which products they like – the cold start problem. They took a set of page navigations (20 million pages) on a Chinese web site and trained GRU and LSTM models to predict which products will be interesting for the user.
Some cited studies report GRU training to be 20-30% faster, with performance higher than LSTM. On their side, the authors trained GRU and LSTM with 128 cells and different hyper-parameters (the optimizer, batch size…). They did not find a great difference in training times, but generally GRU performed better in their case. The most influential hyper-parameter was the optimizer: better to use RMSProp, in their experiments.
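Just to give an idea of the kind of setup involved, here is a hypothetical PyTorch sketch with the 128-cell size and the RMSProp optimizer they mention; this is not the authors' code, and every name and number in it (NextItemModel, n_items, the batch shape) is invented for illustration.

```python
import torch
import torch.nn as nn

n_items, hidden_size = 50_000, 128      # hypothetical catalogue of product/page ids

class NextItemModel(nn.Module):
    """Predict the next visited item from a session of page navigations."""
    def __init__(self, rnn_type="gru"):
        super().__init__()
        self.embed = nn.Embedding(n_items, hidden_size)
        rnn_cls = nn.GRU if rnn_type == "gru" else nn.LSTM
        self.rnn = rnn_cls(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_items)

    def forward(self, item_ids):             # item_ids: (batch, seq_len)
        out, _ = self.rnn(self.embed(item_ids))
        return self.head(out[:, -1, :])      # scores for the next item

model = NextItemModel("gru")                 # or "lstm" for the comparison
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a random mini-batch: 32 sessions of 10 page views each
batch = torch.randint(0, n_items, (32, 10))
target = torch.randint(0, n_items, (32,))    # the next page actually visited
optimizer.zero_grad()
loss = loss_fn(model(batch), target)
loss.backward()
optimizer.step()
```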