Giovanni Bricconi

My site on WordPress.com

Archive for February 2023

Again on Gated Recurrent Units (GRU)


The article and the explanations I found last week were quite clear, but I wanted to read the paper that introduced GRU:

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

I was expecting a paper telling all the good things a GRU can do, but I was quite surprised to find a paper on automatic language translation. The GRU was introduced in that context, and it then became a useful tool in many other scenarios.

It has been a good opportunity to learn something about automatic language translation, at least as it was done ten years ago. At the time, the most effective way of doing translation was Statistical Machine Translation (SMT): these systems need tens of gigabytes of memory to hold their language models. The paper instead uses a neural approach: neural networks can learn how to map a phrase from language A to language B. According to the authors, the memory required by the neural models is much smaller than what SMT requires. The neural translation approach was not introduced in this paper; it had been proposed by other authors a couple of years earlier.

The translation works this way: the input is a sequence of words in a source language (English). The input phrase has a variable length, but it ends with a full stop, a question mark, etc. A neural encoder module maps this phrase to a fixed-length representation, a vector of size d. This vector is used by a neural decoder module that maps it to a sequence of words in the target language (French). It is as if all the possible knowledge in a phrase could be mapped onto a finite set of variables z_1…z_d, and this vector could be universal across all languages: a fascinating idea! Of course, if the size d is small, the machine will not be able to handle long and complex phrases. This is what the authors found: the translators are quite good with phrases of 10 to 20 words, then performance starts to decrease. Also, whenever the machine encounters a new word it does not know, it will not be able to produce a good translation. The machine has to be trained with pairs of phrases in the source and target languages, therefore new words are a problem.
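
To fix ideas, here is a minimal sketch of the encoder-decoder idea (my own toy PyTorch version, not the paper's model; the vocabulary and vector sizes are invented): the encoder squeezes a variable-length source phrase into a single vector of size d, and the decoder produces the target phrase conditioned on that vector.

import torch
import torch.nn as nn

d = 256  # size of the fixed-length phrase representation

class Encoder(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.gru = nn.GRU(d, d, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len) word ids
        _, h = self.gru(self.embed(src))
        return h                             # (1, batch, d): the whole phrase as one vector

class Decoder(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.gru = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, tgt, h):               # teacher forcing: feed the reference words
        y, _ = self.gru(self.embed(tgt), h)  # h conditions the whole generation
        return self.out(y)                   # (batch, tgt_len, vocab): next-word scores

enc, dec = Encoder(10000), Decoder(12000)
src = torch.randint(0, 10000, (2, 15))       # two 15-word source phrases
tgt = torch.randint(0, 12000, (2, 18))       # their reference translations
logits = dec(tgt, enc(src))                  # train with cross-entropy on these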

Luckily for English and French speakers, there are good corpora that can be used to train the neural networks: I doubt it would be possible to train in this way a German-to-Greek model, or any other pair of European languages excluding English. This is just another example of cultural bias in AI.

Another interesting aspect is how to evaluate translation quality: a method called BLEU is the standard in this field. I had a quick look at “Understand the BLEU Score” in the Google Cloud documentation. You need a source phrase and a reference translation; the candidate translation is then compared to the reference, looking for matching single words, pairs of words, up to groups of 4 words. The groups must appear both in the candidate and in the reference phrase; there is also a penalty if the candidate translation is too short. An understandable translation should have a score of 30 to 40; a good one scores above 40. The results in the paper score about 30, when the phrases are not too long.
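
Just to make the mechanics concrete, here is a rough sketch of BLEU (my simplification: a single reference and none of the smoothing that real implementations apply): clipped n-gram precisions for n = 1 to 4, combined geometrically and multiplied by a brevity penalty.

import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    log_precision = 0.0
    for n in range(1, 5):
        c, r = ngrams(cand, n), ngrams(ref, n)
        matches = sum((c & r).values())          # counts are clipped by the reference
        total = max(sum(c.values()), 1)
        if matches == 0:
            return 0.0                           # real BLEU smooths this case instead
        log_precision += math.log(matches / total) / 4
    # brevity penalty: 1 if the candidate is long enough, else an exponential decay
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(log_precision)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0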

Coming back to the paper, it actually introduces a “gated recursive convolutional neural network”: so there are gated units, as I expected, but I did not expect to see them composed into a convolutional network. There are two reasons for this: the first is that the input phrases have variable lengths, and to handle this the gated units are composed in a sort of pyramid. The second is that convolutional networks share the same parameters across units, drastically reducing the number of parameters to be learned. Citing the authors: “…another natural approach to dealing with variable-length sequences is to use a recursive convolutional neural network where the parameters at each level are shared through the whole network (see Fig. 2 (a)).” In the picture below you can see an example of how the GRU units are laid out:

Figure 6a in the paper

Another interesting aspect is that in the paper the GRU units have a left and a right input: the left word and the right word. The hidden state h (or c, as cell state) is updated by choosing the left input, or the right input, or by resetting it to a combination of the two.
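
In my notation, reconstructed from the paper (so take the exact indices with a grain of salt), the update rule says that the activation at position j and level t of the pyramid is a gated combination of a fresh candidate h̃ (computed from the two children) and the children themselves:

h_j^{(t)} = \omega_c \, \tilde{h}_j^{(t)} + \omega_l \, h_{j-1}^{(t-1)} + \omega_r \, h_j^{(t-1)}, \qquad \omega_c + \omega_l + \omega_r = 1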

The gating coefficients in the equation are normalized, summing to one, so the next state depends mostly on one of the three terms. The index j is a positional counter running left to right; the index t is, as usual, the time evolution (here, the level of the pyramid).
As you can see in the pyramid example above, some connections between the GRU units are missing: their gating coefficients are near zero for the current input, so that contribution is not taken into account. In a certain way, the structure is parsing the phrase “Obama is the President of the United States.” as “<Obama, <is the President, of the United States.>>”. The structure has been derived purely from training, and nothing has been done to teach the system English grammar! Very remarkable.

Nothing is said in the paper about the decoder unit, so I don’t know if it is just a plain neural network or a recurrent one. In the end the state vector z should represent the phrase meaning, and there should be no need to recur on the previous state.

Written by Giovanni

February 28, 2023 at 11:44 am

Posted in Varie

Gated Recurrent Unit (GRU) vs Long short-term memory (LSTM)


The paper on neuro-symbolic learning contained a reference to Gated Recurrent Units: this week I decided to understand more about them and see how they differ from LSTMs. First of all, you may think of using a GRU or an LSTM when your problem requires remembering a state for a period of time: for instance, in text analysis, predicting the use of masculine/feminine or singular/plural forms; in a phrase there must be agreement with the subject (she writes every day / they write every day).

Both units are recurrent: at each input step they carry along a piece of information; but the LSTM is by far more complex than the GRU. The models’ equations are expressed with different notations in different sources, and it has taken me some time to adapt them so I could compare them.

Let c be the recurrent state that is carried along as time passes, and x be the input of a cell. Of course you may have many inputs and many states: you must think of these variables as vectors, not just scalars. In the formulas I will also use W as the weights for the input variable x, R as the weights for the state c, and b as the biases; W and R are matrices, b is a vector.

There is a simplified version of the GRU that has these equations:
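
(My reconstruction, in the notation above and consistent with the description that follows; ⊙ denotes the element-wise product.)

f_t = \sigma(W_f x_t + R_f c_{t-1} + b_f)
\hat{c}_t = \tanh(W_c x_t + R_c (f_t \odot c_{t-1}) + b_c)
c_t = (1 - f_t) \odot c_{t-1} + f_t \odot \hat{c}_t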

A forget variable f is computed using the current input and the previous cell state. Here sigma is the sigmoid function; we want it to be always near 0 or near 1. When it is near 1, the cell must forget its state: for instance, a new phrase begins and you need to forget the previous subject’s number.

c hat stands for the candidate future cell state; it is computed using the current input and the previous cell state (but only to the extent that you must forget the previous state), and this time it is passed through tanh to lie between -1 and 1.

c, the true next state, is computed this way: if I need not forget, f is near zero, so I keep the previous c. If I need to forget, I take the new candidate state c hat.

GRUs learn when it is the case to forget the state; otherwise they remember what happened before, even for a long time.

There is a more complex version of the GRU that has two distinct gates, reset and update:
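
(Again a reconstruction of mine; note that the sign convention for the update gate z varies between sources, and I keep the convention used above, where a gate near 1 means taking the candidate.)

r_t = \sigma(W_r x_t + R_r c_{t-1} + b_r)
z_t = \sigma(W_z x_t + R_z c_{t-1} + b_z)
\hat{c}_t = \tanh(W_c x_t + R_c (r_t \odot c_{t-1}) + b_c)
c_t = (1 - z_t) \odot c_{t-1} + z_t \odot \hat{c}_t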

It should be easier now to decipher it; if not, there is a great video by Andrew Ng that explains very well how the GRU works.

Let’s now have a look at the LSTM:
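
(These are the standard vanilla LSTM equations, without peephole connections, rewritten in the same W/R/b notation; beware that here f near 1 means keeping the old state, the opposite of the forget variable above.)

z_t = \tanh(W_z x_t + R_z y_{t-1} + b_z)    (block input)
i_t = \sigma(W_i x_t + R_i y_{t-1} + b_i)    (input gate)
f_t = \sigma(W_f x_t + R_f y_{t-1} + b_f)    (forget gate)
o_t = \sigma(W_o x_t + R_o y_{t-1} + b_o)    (output gate)
c_t = z_t \odot i_t + c_{t-1} \odot f_t
y_t = \tanh(c_t) \odot o_t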

As you see there are many more concepts: the LSTM cell also has an output variable y, a block input z, and a forget gate f that rules the way the cell state is updated. Both the output and the cell state have to be carried on to the next computation step. Does this complexity pay off? It seems sometimes yes and sometimes no… but the LSTM will be slower to train, as it requires many more coefficients.
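
As a quick sanity check of the “many more coefficients” claim, PyTorch’s built-in layers can be compared directly (a sketch of mine; the sizes are arbitrary). For the same input and hidden size the LSTM has four gate blocks against the GRU’s three, so roughly 4/3 as many parameters:

import torch.nn as nn

gru = nn.GRU(input_size=128, hidden_size=128)
lstm = nn.LSTM(input_size=128, hidden_size=128)

count = lambda m: sum(p.numel() for p in m.parameters())
print("GRU :", count(gru))    # 99072  = 3 gates x (two 128x128 matrices + two biases)
print("LSTM:", count(lstm))   # 132096 = 4 gates x (two 128x128 matrices + two biases)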

About the GRU vs LSTM comparison, I have found this article:

Mario Toledo and Marcelo Rezende. 2020. Comparison of LSTM, GRU and Hybrid Architectures for usage of Deep Learning on Recommendation Systems. In 2020 The 4th International Conference on Advances in Artificial Intelligence (ICAAI 2020), October 09–11, 2020, London, United Kingdom. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3441417.3441422

The problem the authors approach is what to recommend to an e-commerce site user when you do not yet have information on which products they like: the cold-start problem. They took a set of page navigations (20 million pages) on a Chinese web site and trained GRU and LSTM models to predict which products will be interesting for the user.

Some cited studies report GRU training to be 20-30% faster, with performance higher than the LSTM’s. On their side, the authors trained GRU and LSTM models with 128 cells, using different hyper-parameters (the optimizer, the batch size…). They did not find a great difference in training times, but the GRU generally performed better in their case. The most influential hyper-parameter was the optimizer: better to use RMSProp, in their experience.

Written by Giovanni

February 18, 2023 at 3:33 pm

Posted in Varie

World Artificial Intelligence Cannes Festival WAICF


This week the WAICF was held in Cannes: a three-day marathon, with many exhibitors and a lot of presentations. The very good news was that it was possible to register for free, with some limitations; I was thus able to attend part of the presentations and visit many stands without having to pay for a ticket. There were so many presentations that my brain overflowed, and it is very difficult to report all the things I saw.

In general, most of the presentations cited ChatGPT at least a few times: ChatGPT has contributed to changing the way people look at AI. Before it, AI was far from people’s imagination; now most of us have realized that AI is here to stay and that it will have a real impact on our lives. Of course, not all that glitters is gold, and despite a lot of limitations, people are starting to think about AI’s positive and negative implications for our lives.

There is a concrete fear that AI algorithms will be used against common people’s interests: many will lose their jobs because “intelligent” machines can replace them, even in creative jobs that everybody believed could not be automated. For instance, in a game development company, many creative artists and developers have reported that they fear what AI can do to them: Stable Diffusion can quickly generate many impressive images, and ChatGPT can spot many bugs in source code… But this is not the only source of concern: AI is already applied in many services and can easily be used to take advantage of the customers. Is the recommended hotel just the one that optimizes the web site’s profit? Is some sort of bias introduced because you have been identified as belonging to a user group that can pay more for the same service? For instance, an algorithm could, for a female audience, increase lipstick prices and reduce drill prices: in the end the average prices could look equitable, but it is just cheating.

The ethical problem was approached in many presentations. Somebody may be interested in visiting the AI for food site or Omdena: if AI can be used for bad purposes, it is also true that you can use it to do great things, like United Nations projects, promoting gender equality, or predicting cardiac arrests.

In his talk, Luc JULIA, chief scientific officer of Renault, compared AI to a hammer: you can use it for good purposes (put a nail in a wall) or bad purposes (hit your noisy neighbor), but there is a very important thing: the hammer has a handle. It is our responsibility to hold the handle and decide what we want to do with it. My consideration: it is quite unfair if only big companies hold the handle, hence the need for some sort of regulation.

How should AI usage be regulated by law? What should be encouraged, what should be forbidden, and how can a customer fight a big company if they have somehow been damaged? Proving that a system is discriminatory requires a lot of effort from the offended party: sampling the system’s behavior, identifying the contexts in which it is unfair, demonstrating that the damage is not marginal but worth a judge’s attention… Also, creating a new law can take years, and the evolution in this domain is so fast that there is a concrete risk the law is born already obsolete. Nor should the same level of regulation be applied everywhere: an autonomous vehicle requires much stricter regulation than a recommender system, of course. The legislator has to choose the right tool: just a recommendation, incentives if certain behaviors are respected, or fines in other cases.

Coming back to the hammer metaphor, how good is the tool we have? Stuart Russell, from UC Berkeley, gave a talk on Artificial General Intelligence, reporting many cases where it is possible to make ChatGPT or other systems fail just because they don’t really understand the meaning behind the questions. For instance, a Guardian reporter asked a simple logic question about having 20 dollars and giving 10 to a friend, asking how many dollars there were in total: according to ChatGPT there were 30 dollars. There are new paradigms we can explore in the future to build better systems, like probabilistic programming and assistance-game theory. The last one is very fascinating: what is the risk of machines taking control over the hammer’s handle, as they can evolve so fast and accumulate a level of knowledge no human can achieve? A wrongly specified objective can lead to disaster, but in assistance games the machine is just there to help its master and does not know its own real goal: so it just tries to help and not to interfere too much.

Another subject of interest was bias: we are training AI systems with real data. These systems are built to optimize some function; for instance, a classifier is trained to assign a class to a sample in the same way it happens in reality. But what if reality is unfair? We all know women’s salaries are lower than men’s; we definitely do not want AI systems to perpetuate the same injustice when used in production. Nobody has a solution for this today, and it is indeed a good business opportunity. No company wants a bad reputation because it applied some discrimination, and few companies have the capacity to develop bias filtering themselves. As AI is democratized, there will be a need for standard bias-cleaning applications, in many contexts.

The same applies to general AI models: AI is being democratized; some companies or organizations will be able to craft complex models, and all the others will in the end buy something precooked and apply it in their applications. A real AI project requires a lot of effort in many phases: data collection, data verification, feature extraction, training, resource management (training requires complex infrastructure), monitoring. This is the reason all the IT players are focusing on creating AI platforms where customers will be able to implement their models (SAS Viya, Azure Machine Learning, IBM Watson…).

Another issue with bias is cultural bias: ChatGPT is trained on English content, but not everybody is a fluent English speaker. Which solutions exist for other languages? Aleph Alpha was presenting its model working on multiple languages (German, French, Italian…). One interesting thing I saw is that they put a big effort into trustability: the presenter showed it was possible to identify, in the source text used for training, why an answer had been chosen. One can then decide if the answer was just randomly correct, or if it was correct for a good reason. Their system is also multi-modal: you can index text and images together, and text is extracted from the images as well. Nearly all documents have both, so it is interesting to have one system that can work on that.

Trust is a word that was often used in the presentations. If we start having AI algorithms applied in reality, we want them to be somehow glass-box algorithms. A regulator must be able to inspect them, when needed, to understand if something illegal has been done (for instance, penalizing employees who have taken too many sick-leave days). The company developing an algorithm must understand how reliable it is, and whether it is using the information we expect it to use: if a classifier is deciding a picture represents an airplane just because there is a lot of blue sky, well, it is not a good tool. We also must be sure that personal or restricted information does not leak into an AI model that is then reused without our permission, or for a purpose we do not approve: our personal information is an invaluable asset.

The Credit Mutuel bank is developing many projects with AI, and what they reported is interesting: they conducted experiments with humans evaluating a task alone, AI alone on the same task, and then two other scenarios, one where humans could decide whether to use AI, and one where they were forced to see the AI’s propositions. The most successful scenario was the last one: the AI hammer is useful, and we need to realize that it is here to stay. We also need to understand who is accountable for an AI system: people have to know what they can and cannot do with AI, and this must be clear and homogeneous within the same company. When you have multiple projects, you realize that you need a company policy on AI. Somebody from Graphcore also suggested focusing AI projects on domains where the cost of error is low: using AI to automate writing summaries of long documents is much less dangerous than developing an airplane autopilot; they are just two different planets in terms of accountability.

To conclude, I would also like to cite Patricia REYNAUD-BOURET’s work on simulating how the human brain works. She is a mathematician working on simulating real neurons: she described to us how a neuron works, that what matters is how many stimuli are received in a time range, and that this makes a neuron activate and propagate a signal to other neurons. We have about 10^11 neurons in our brain, but is it possible to simulate their activity on a computer, maybe even a laptop? With some mathematical assumptions it is possible to do something, at least for specific brain areas. This and similar work will be useful to understand diseases like epilepsy… We should all remember that pure research is something we need to foster, because in the end it will unlock incredible results.

Written by Giovanni

February 12, 2023 at 5:24 pm

Posted in Varie

The Neuro-Symbolic Concept Learner


Reading the Explainable AI paper, I found a reference to the neuro-symbolic approach: extracting and working with symbols would indeed make neural-network predictions human-interpretable. One referenced article was about answering questions on simplified still-life scenes using neural networks; for instance, “is the yellow cube of the same material as the red cylinder?”.

https://cs.stanford.edu/people/jcjohns/clevr/

The picture above is taken from the CLEVR dataset project. They provide images with simple geometric objects, paired with questions and answers, to enable ML model benchmarking. The shapes used and the structure of the questions are deliberately limited and well defined, to make the problem approachable.

Having been exposed, a long time ago, to languages like Prolog and CLIPS, I was expecting some mix of neural networks and symbolic programs to answer the questions: in my mind they were quite complementary. Symbolic programming to analyze the question and evaluate its result, neural networks to extract the scene features… but I was wrong: in the following paper everything is done in a much more neural-network way.

Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. 2019. The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences from Natural Supervision. In ICLR 2019.

http://nscl.csail.mit.edu/

The neuro-symbolic concept learner (NS-CL) is composed of three modules: a neural perception module that extracts latent features from the scene, a semantic parser that analyzes the questions, and a program executor that provides the answer. What surprised me, and I am still not clear on how it works, is that all the modules are implemented with neural networks, and it is therefore possible to train them in the neural-network way. Citing the authors:

…We propose the neuro-symbolic concept learner (NS-CL), which jointly learns visual perception, words, and semantic language parsing from images and question-answer pairs.

Starting from an image perception module pre-trained on the CLEVR dataset, the other modules are trained in a curriculum fashion: the training set is structured so that in a first phase only simple questions are proposed, and in later steps things get more complicated. First come questions on object-level concepts like color and shape, then relational questions such as “how many objects are left of the red cube”, etc.

The visual perception module extracts concepts like in the following picture taken from the paper:

Relations between objects are encoded in a similar way. Each property will be a probabilistic value: of being a cube, of being red, of being above the sphere… With this probabilistic representation it is possible to construct a program that uses the probabilities to compute the result. For instance, you can define a filter operation that filters all the cube objects, selecting the objects that have a high probability of being a cube and discarding the others. The coefficients of this filter operation will be learned from the training data set.

A question is decomposed into a sequence of operations like: Query(“color”, Filter(“cube”, Relation(“left”, Filter(“sphere”, scene)))) → tell me the color of the cube to the left of the sphere. All the operations work with probabilities and concept embeddings.
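
To get a feel for how such operations can work on probabilities, here is a toy sketch of mine (plain NumPy, not the paper’s code; the scene numbers are invented, and I answer a count question because it needs one operator less than the color query above). Every object keeps a probability per concept, and the operators combine those values instead of making hard choices, so the whole pipeline stays differentiable:

import numpy as np

# Invented scene: per-object concept probabilities for three detected objects,
# as a perception module might output them.
scene = {
    "cube":   np.array([0.95, 0.02, 0.10]),   # P(object i is a cube)
    "sphere": np.array([0.03, 0.97, 0.05]),   # P(object i is a sphere)
}
# P(object i is left of object j), also predicted from the image.
left_of = np.array([[0.0, 0.9, 0.4],
                    [0.1, 0.0, 0.2],
                    [0.6, 0.8, 0.0]])

def filter_(mask, concept):
    # Keep each object to the degree it matches the concept.
    return mask * scene[concept]

def relation(mask, rel):
    # P(object i is <rel> at least one object selected by mask).
    return 1 - np.prod(1 - rel * mask[None, :], axis=1)

def count(mask):
    # Expected number of selected objects.
    return float(mask.sum())

all_objects = np.ones(3)
# Count(Filter("cube", Relation("left", Filter("sphere", scene))))
answer = count(filter_(relation(filter_(all_objects, "sphere"), left_of), "cube"))
print(round(answer, 2))   # expected number of cubes left of the sphere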

It is not clear to me how the parsing and the execution work; the authors say they used bidirectional GRUs for that. The parser too is trained from the questions: in my understanding, by generating parse trees and discarding those whose execution does not lead to the correct answer. This part is too short in the paper; I will try to dig more into it in the future. I also feel some examples are missing on how the features are represented.

Anyway, as the execution is decomposed into stages that have a symbolic meaning (filter, relation,…), it is easy to understand “why” the ML has chosen an answer. If that answer is not correct, you can look backward through the execution and see whether the attribute extraction was wrong or the problem came from some other stage. Much more XAI-oriented than a plain neural network. There are a lot of interesting references to have a look at in this article; I will try to dig further.

Written by Giovanni

February 5, 2023 at 11:32 am

Posted in Varie