Giovanni Bricconi

My site on WordPress.com

Archive for March 2023

Language models based on automatic feature extraction


Can a neural network, analyzing a large text corpus, group together on its own words with similar meanings or roles in a phrase?

Last week I was reading an article on Curriculum Learning, and one of the examples was about predicting the next word in a phrase, given the previous five words. The example explained that each word was mapped to an m-dimensional vector, and these vectors were used to predict the next word. I did not know about this technique, so this week I searched for a paper to learn more. Nowadays there are many models already available that do this, but I wanted to understand the origin of the technique, and I read the referenced article:

A Neural Probabilistic Language Model. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin.

Journal of Machine Learning Research 3 (2003) 1137–1155

Ok, dating from 2003 it is certainly not the most recent article on the subject, but it explains this technique clearly and step by step. Language corpora are composed of streams of millions of words, taken from news, from Wikipedia, or from other sources. It is possible to analyze these streams and come up with some statistics: how many distinct words are known, which are the rarest or the most common, which words follow one specific word, and so on. Usually the rarest words are dropped – or rather replaced with a special symbol – words with uppercase and lowercase letters are considered different, and punctuation marks are taken into account. With this information it is possible to build an n-gram model: let V be the vocabulary set; you can estimate

P(w_t = v | w_(t-1), w_(t-2), …, w_(t-n+1)), for each v in V

where w_t is the word at position t; you use the context of the last n-1 words to predict the next one. Of course this is a simplification, and you cannot push n to a big number, because the problem becomes unmanageable and the samples become very sparse. Another important problem is that this model does not generalize: if the corpus contains “the dog bites the sheep”, nothing in the n-gram model tells you that “the dog bites the goat” is also a valid phrase.
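To make the idea concrete, here is a minimal sketch (my own toy example, not the paper's setup) of a trigram model that estimates next-word probabilities purely by counting:

```python
from collections import Counter, defaultdict

# Toy corpus: a tiny stand-in for the millions of words of a real corpus.
corpus = "the dog bites the sheep . the dog sees the goat .".split()

# Count trigram continuations: (w_{t-2}, w_{t-1}) -> Counter of w_t
continuations = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    continuations[(w1, w2)][w3] += 1

def next_word_probs(w1, w2):
    """Maximum-likelihood estimate of P(w_t | w_{t-2}=w1, w_{t-1}=w2)."""
    counts = continuations[(w1, w2)]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

print(next_word_probs("the", "dog"))  # {'bites': 0.5, 'sees': 0.5}
```

In this toy corpus P(goat | bites, the) comes out as zero, even though “the dog bites the goat” is a perfectly valid phrase – exactly the generalization problem described above.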

For the n-gram model, each word is different and there is no relation between goat and sheep; we would like a better model where you can represent a sort of distance between words, so that you can represent “the dog bites the sheep” as “the dog bites X” where X is similar to “sheep”. We know from school that there are grammatical classes such as nouns and verbs, and words can be classified in many ways, so that you have words that represent animals, things, concepts, etc. These classifications come from human experience and it would be extremely cumbersome to compile a database with these relations. The core idea presented in “A Neural Probabilistic Language Model” is that:

You can fix in advance a number of dimensions m, and let a neural network learn how to place each word in these dimensions. This is achieved by maximizing the probability of predicting the next word in a phrase, given the previous words. The m dimensions are real numbers and not discrete values; in this way it is possible to compute word similarities using a distance formula.
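For instance, once the vectors have been learned, similarity can be measured with cosine distance. The vectors below are made up just to illustrate the operation; real feature vectors come out of training:

```python
import numpy as np

# Hypothetical learned feature vectors (m = 4 here just for illustration;
# the paper uses m = 30..100). Real vectors are produced by training, not chosen by hand.
embeddings = {
    "sheep": np.array([0.9, 0.1, 0.4, 0.0]),
    "goat":  np.array([0.8, 0.2, 0.5, 0.1]),
    "bites": np.array([0.0, 0.9, 0.1, 0.7]),
}

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["sheep"], embeddings["goat"]))   # high: similar words
print(cosine_similarity(embeddings["sheep"], embeddings["bites"]))  # low: different roles
```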

You will not end up with clearly interpretable dimensions – there will be no “noun” or “verb” dimension, because these are concepts from our culture, while this m-dimensional space simply comes out of a mathematical optimization. You could indeed think of initializing these m-dimensional vectors with part-of-speech information, but that would certainly require a lot of work.

The paper presents experiments with m=30, m=60 or m=100 dimensions, which is by far smaller than the number of words in a vocabulary (for instance 17,000). You can think of it as a table with 17,000 rows, one for each word, and 30/60 real-valued columns that map the word into the m-space. Notice that this mapping does not change with the word’s position in the stream; it is a kind of convolutional operation applied to words, so the number of parameters to be trained stays tractable. The next-word prediction is then made by taking the m-dimensional vectors of the last n words in the stream and combining them with a shallow neural network, with a softmax layer on top to choose the next word.

Notice that, for the training to converge, a regularization term has to be introduced: the learned parameters will not diverge to infinite values because parameters with a large norm are penalized.
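A rough sketch of this architecture in PyTorch might look as follows. This is not the authors' exact configuration (the original model also has direct connections from the features to the output, and its own training setup); the sizes are illustrative and weight decay stands in for the regularization term:

```python
import torch
import torch.nn as nn

V, m, context_size, h = 17000, 60, 5, 50   # vocabulary, feature dims, previous words, hidden units

class NeuralProbabilisticLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(V, m)               # the V x m lookup table, shared across positions
        self.hidden = nn.Linear(context_size * m, h)  # shallow hidden layer over the concatenated context
        self.out = nn.Linear(h, V)                    # one score per vocabulary word

    def forward(self, context):                       # context: (batch, context_size) word indices
        x = self.embed(context).flatten(1)            # (batch, context_size * m)
        return self.out(torch.tanh(self.hidden(x)))   # logits; softmax is applied inside the loss

model = NeuralProbabilisticLM()
# weight_decay plays the role of the regularization that keeps parameter norms bounded
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()                       # softmax + negative log-likelihood of the next word

context = torch.randint(0, V, (32, context_size))     # a fake batch of 5-word contexts
target = torch.randint(0, V, (32,))                   # the word that actually follows each context
loss = loss_fn(model(context), target)
loss.backward()
optimizer.step()
```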

Another interesting feature is that this technique could also handle new unknown words, words that never appear in the training vocabulary V. You use the model to predict which word should appear in place of the unknown one: say you take the 10 most likely words. You then take the m-dimensional vectors of these 10 words and mix them together, averaging them weighted by the predicted word probabilities. This becomes the candidate representation for the new unknown word, and you could inject it back into the model and use it in the future.
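In code the mixing step could look like the sketch below (my interpretation of the description above; `logits` and `embedding_table` are hypothetical names for the model's output scores and its V×m feature matrix):

```python
import torch

def embed_unknown_word(logits, embedding_table, k=10):
    """Candidate representation for an unknown word: probability-weighted average
    of the feature vectors of the k words the model would have predicted there."""
    probs = torch.softmax(logits, dim=-1)
    top_p, top_idx = probs.topk(k)                 # the k most likely known words
    top_p = top_p / top_p.sum()                    # renormalize over the k candidates
    return (top_p.unsqueeze(1) * embedding_table[top_idx]).sum(dim=0)   # shape (m,)

# Example with random tensors, just to show the shapes involved
V, m = 17000, 60
print(embed_unknown_word(torch.randn(V), torch.randn(V, m)).shape)  # torch.Size([60])
```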

Nowadays other techniques are used for this prediction problem. Using LSTM or GRU units in the neural layer allows the network to remember some state for a long time – a whole long phrase – and this improves performance when dealing with masculine/feminine agreement and so on. It also makes it possible to extend the number n of words taken into account in the prediction.

Written by Giovanni

March 26, 2023 at 1:55 pm

Posted in Varie

Should we train Artificial Intelligence Models as we train kids?


Suppose you want to train a model that recognizes handwritten digits: you define some architectures and algorithms that you want to try, and train them using a set of images associated with the corresponding digit – the training corpus. You then evaluate each model’s performance, and choose the best one. This is a very typical approach, but it is not the way we were trained when we were kids.

Going back to the digit example, the training corpus will be composed of a mix of very well written digits, poorly written digits, some very masculine handwriting, some very feminine, etc. Classifying some examples will be more difficult than others; not all the examples belong to the same class of difficulty. What we usually do, with machine learning, is shuffle these examples and throw them all together at the model, hoping that it will find its way out.

Suppose the same approach had been followed with you when you were a kid: do you think being shown oddly written characters in a random order would have been a good way to learn to recognize digits? It is of course the same with other subjects: you were taught sums before subtractions and so on, not both mixed together. When we train humans (or animals), we start with some simple tasks and then increase the complexity, until the student is able to solve difficult problems. This approach is called “Curriculum Learning”.

This week I read this article, which is precisely about this concept:

Curriculum Learning. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.

https://ronan.collobert.com/pub/2009_curriculum_icml.pdf

To support the intuition, the authors start from the fact that the loss function of a neural network is a non-convex function. This has a precise mathematical definition, but to keep things simple consider the picture below:

On the left side you have a convex function: it is easy to spot the minimum of this function, at x=0.5. On the right the function is non-convex: you have many local minima. The problem with these functions is that the learning algorithm can be trapped in a local minimum and fail to find another place where the error function has a lower value.
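A toy illustration of the trap (my own example, not from the paper): plain gradient descent on a non-convex one-dimensional function ends up in a different minimum depending on where it starts.

```python
def f(x):                       # non-convex: a global minimum near x ≈ -1.3 and a local one near x ≈ 1.14
    return x**4 - 3 * x**2 + x

def grad(x):                    # derivative of f
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(gradient_descent(-2.0))   # ends near -1.30, the global minimum
print(gradient_descent(+2.0))   # ends near  1.14, trapped in the local minimum
```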

When you train a neural network you start with some initial parameters, and the algorithm should tune them little by little to find the global minimum – the minimum of all the minima. It has been observed that it can be very hard to find a good starting point for the training. For this kind of problem a method called “continuation” exists. With this method you do not start by optimizing the very complicated target function directly; you first start on a simpler one. Once you find a minimum for the simpler one, you use it as the starting point for the next part of the training.

To put it in more mathematical terms, you introduce a parameter λ whose value lies between 0 and 1. When λ is zero, you work on a very simplified problem; when λ is 1 you work on the target problem – the very complex one. You arbitrarily decide how many steps you want to introduce: suppose, for simplicity, that you just have λ=0 and λ=1. You then take your corpus and assign some of the examples to the set with λ=0 and the others to the set with λ=1.

In this way you are creating some “classes”. The model will have to graduate from class λ=0 before going to the next level, a bit like at school. You may think it is difficult to introduce this classification in your corpus: but maybe you can find some measure that estimates an example’s difficulty and do the partition into classes automatically, as in the sketch below. You could also decide that some input parameters are surely relevant and some others only possibly relevant, and then first train the model with fewer inputs, introducing the others later (the lower λ is, the more input parameters are set to zero).
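A minimal sketch of such a two-stage schedule (the `difficulty` and `train_one_epoch` callables are placeholders for your own scoring function and training loop, not code from the paper):

```python
def curriculum_train(model, corpus, difficulty, train_one_epoch, threshold, epochs_per_stage=10):
    """Two-stage curriculum: train on the easy examples first, then on the whole corpus."""
    easy = [ex for ex in corpus if difficulty(ex) <= threshold]   # the λ = 0 stage: a simplified problem
    full = list(corpus)                                           # the λ = 1 stage: the full target problem
    for stage in (easy, full):                                    # graduate from the easy class first
        for _ in range(epochs_per_stage):
            train_one_epoch(model, stage)
    return model

# Toy usage: "difficulty" is the length of a string, "training" just records what it sees.
seen = []
curriculum_train(model=None, corpus=["ab", "abcd", "a", "abcdef"],
                 difficulty=len, train_one_epoch=lambda m, s: seen.append(len(s)),
                 threshold=2, epochs_per_stage=1)
print(seen)   # [2, 4] -> first the two short examples, then all four
```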

The authors give an example of training a neural network to recognize images containing ellipses, triangles and rectangles. They first generate a corpus only with circles, squares and equilateral triangles; the more complex shapes are introduced later, along with less contrasted backgrounds. Another experiment uses Wikipedia phrases: you want to build a model that predicts the next word knowing the previous s words, for instance s=5. First you train a model that works only with the 5,000 most common English words, then you train it again with the 10,000 most common words, and so on.

What are the benefits in the end? In their experiments the authors found that models trained with curriculum learning perform better in the test phase – they probably overfit less. The training speed also improved, because the learner wastes less time on noisy or harder examples. There is no bulletproof strategy for splitting the training into classes – but that is too much to ask of a single article, and it is a debatable subject even for humans: what should you teach a kid in first grade? The article dates from 2009 – there is probably much more to know on this subject if you are interested.

Written by Giovanni

March 19, 2023 at 10:13 am

Posted in Varie

The Neural Logic Machines


A neuro-symbolic model able to learn on its own first-order logic relations and solve some interesting problems, like sorting an array or the block world puzzle.

As promised, this week I have read:

Honghua Dong, Jiayuan Mao, Tian Lin, Chong Wang, Lihong Li, Denny Zhou.

Neural Logic Machines

Seventh International Conference on Learning Representations (ICLR 2019). https://iclr.cc/Conferences/2019/Schedule

https://doi.org/10.48550/arxiv.1904.11694

Compared to other papers I have read, I found it both very clear and very dense. While reading it I was highlighting every single paragraph; each piece seemed to give some important or useful information. Even if it is only 22 pages including the bibliography and the appendix, it will be very hard to summarize it in one single Sunday morning post; but unfortunately this is the short time I can devote to this activity.

I found this article while reading about Neuro-Symbolic systems, and since it is different from the other examples I had found, I wanted to have a look. Being Neuro-Symbolic, it is a neural-network-based model that also includes some symbolic form of reasoning over the extracted features. Reading the introduction, I see that the goal is also to have an inductive learning model: you do not program the logic rules, you provide examples and the model extracts the rules on its own! This is very fascinating, but of course it comes with some hard limitations. Notice also that it will extract some rules, but they will be in the form of neural weights and very likely they will not be human interpretable, even though the rules will be connected with logic operators.

The training of this model has to be similar to the way a human is trained: precisely because the NLM has to extract the rules on its own, it is not possible to start with difficult cases. This way of training is called curriculum learning: the examples are organized in a set of levels sorted by complexity. To pass a level, a model must reach a certain performance: if it does not reach it, it is discarded, otherwise it is kept and continues to the next level. Examples that make a model fail are very useful: they can be kept and added to the training material, so that the next generation will be better trained. Since the architecture is the same, the model instances are differentiated by the random seed used to initialize them. At the end of the paper, there is a nice table showing that for complex problems it is hard to train a good model, as fewer of them are able to graduate!

The reference problem used for this NLM is the classic “block world”: some objects are laid on the ground, others are stacked on top of them; you can move an object only if it has nothing on top of it, and you can put it on the ground or on top of another clear object; you have a start configuration and a target configuration to be reached. This is the kind of problem that can be modeled using first-order logic clauses; but remember that the Neural Logic Machine (NLM) has to create the rules on its own – only some initial facts are provided as training input.

Let’s come to the limitations of this model: you have first-order logic (no rules on sets of objects, just on single instances). It can produce rules with the “exists” or “for all” quantifiers. It cannot produce recursive or cyclic rules. As meta-parameters you have to provide the maximum “arity” of the deduced rules (the maximum number of rule parameters) and the maximum number of rules that are used in a computation. These two limits are called breadth and depth respectively. The authors say NLMs “realize Horn clauses in first-order logic”; for a quick introduction see for instance https://en.wikipedia.org/wiki/Horn_clause

The computation is carried out using tensors that represent the probability of a relation being true – so a true relation between objects a and b will be represented by 1, while a false one will be represented by 0. During the calculation these will be real numbers, but they are generated by sigmoids, so they should not be too close to 0.5.

The initial facts – arity-1 predicates – are represented by a matrix whose size is the number of input objects by the number of facts. For instance, if you have 3 objects and the facts “being red” and “being blue”, you will have a 3×2 matrix.

Binary predicates are represented by [m, m-1, r] tensors, where m is the number of input instances and r is the number of binary relations. This continues with ternary relations as [m, m-1, m-2, r] tensors, and so on. In practice you do not often need high-arity relations; usually ternary relations are enough, so you do not need a high-breadth model.
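In numpy notation the representations described so far could look like this toy example (three objects a, b, c; the predicates are just illustrative choices of mine):

```python
import numpy as np

m = 3                           # three objects: a, b, c
# Arity-1 predicates: an m x r matrix; here r = 2 for "is red" and "is blue".
unary = np.array([[1.0, 0.0],   # a is red
                  [0.0, 1.0],   # b is blue
                  [0.0, 0.0]])  # c is neither

# Arity-2 predicates: an [m, m-1, r] tensor over ordered pairs of distinct objects;
# here r = 1 for a single "on(x, y)" relation. The second index enumerates the other objects:
# for x = a the pairs are (a,b), (a,c); for x = b: (b,a), (b,c); for x = c: (c,a), (c,b).
binary = np.zeros((m, m - 1, 1))
binary[0, 0, 0] = 1.0           # on(a, b): a sits on top of b
print(unary.shape, binary.shape)   # (3, 2) (3, 2, 1)
```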

From this you can also see an important limitation of NLMs: they can work with single instances of objects (one-hot encoded) but they cannot work with real-valued inputs: they can reason about distinct objects in the block world, but they cannot reason about heart rates, pressures, temperatures, etc.

Now that it is clear that relation values between objects can be represented by tensors [m, m-1, …, r], it is time to understand how the computation evolves along the depth of the model. A set of operations is defined to compose these tensors, and the computation is organized in layers. Each layer is implemented by a multi-layer perceptron: given the m possible objects, all the x-arity input combinations are produced, and these are given as input to replicas of the same multi-layer perceptron. Having the same perceptron means that its parameters are shared across all the replicas, so there are not too many of them – like in a convolutional neural network. The “all possible input combinations” part causes problems: if you have 3 objects a, b and c, and a 2-arity relation, you have to compute it for (a,b), (a,c), (b,c), (b,a), (c,a), (c,b). If you do it with more objects you will see that the number of permutations explodes. Ok, machine learning models can have billions of parameters, but this kind of combinatorial growth is scary.

In the end, all possible inputs to a relation are generated, they pass in parallel through the same neural network, and the result is squashed by a sigmoid. The output is then ready to be passed to the next level, but there is more to be done. So far we have considered only arity r, but in a given layer the model actually also considers arities r-1 and r+1. It therefore mixes 2-arity rules with 1-arity and 3-arity ones when computing one level. In this way it implements the and/or relation between predicates. At each level the rules are combined and the model can find more complex dependencies.
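The shared-perceptron idea can be sketched like this (a simplification of mine: the real model also concatenates the expanded and reduced tensors of the neighboring arities before the MLP):

```python
import torch
import torch.nn as nn

m, r_in, r_out = 4, 3, 5                      # 4 objects, 3 input binary predicates, 5 output ones

# The same small MLP is applied to every ordered pair of distinct objects
# (weight sharing, like a convolution over pairs).
mlp = nn.Sequential(nn.Linear(r_in, 8), nn.ReLU(), nn.Linear(8, r_out))

binary_in = torch.rand(m, m - 1, r_in)        # probabilities for every ordered pair
binary_out = torch.sigmoid(mlp(binary_in))    # shared MLP on each of the m*(m-1) pairs, squashed to [0, 1]
print(binary_out.shape)                       # torch.Size([4, 3, 5])
```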

Notice also that the model implements reductions and expansions of rules. With expansion, a relation like “moveable(x)” can implicitly be thought of as “for all y, moveable(x, y)”. Reduction is the opposite: you drop one arity from a relation and compute it in the sense of “there exists at least one” or “for all”.
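On probability tensors these two operations can be realized roughly as below (one common reading: max over the dropped variable for “exists”, min for “for all”; the paper's exact formulation may differ):

```python
import torch

m, r = 4, 2
unary = torch.rand(m, r)                       # e.g. moveable(x)
binary = torch.rand(m, m - 1, r)               # e.g. on(x, y)

# Expansion: moveable(x) becomes moveable(x, y) for every other object y (just a broadcast).
expanded = unary.unsqueeze(1).expand(m, m - 1, r)

# Reduction: drop one variable by quantifying over it on the probability values.
exists_y = binary.max(dim=1).values            # "there exists a y such that on(x, y)"
forall_y = binary.min(dim=1).values            # "for all y, on(x, y)"
print(expanded.shape, exists_y.shape, forall_y.shape)   # (4, 3, 2) (4, 2) (4, 2)
```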

The model can combine relations with different arities, and also takes the “exists” and “for all” quantifiers into account. The more times you allow this composition, the more complex the rules you can compute. You may have noticed that tensors have the form [m, m-1, r] and wonder how you can represent a relation like relation(x, x); well, you have to deal with relation(x, y≠x) and relationSymmetric(x) separately, so two relations are needed to implement it, but the NLM can still handle it. I really invite you to read the full article if you want to understand how the process works, because it is very clear but requires some pages to be explained.

Once trained on a few input objects, an NLM can scale to a higher number of inputs: this is because the relations are arity-based, and scaling to more objects just requires applying the same model to a larger number of permutations. This is a great property: a model trained on 20 instances can scale to 100 instances – but due to the explosion in the number of permutations it cannot scale to huge numbers.

The paper evaluates NLM performance on the block world problem, and also on the family tree problem (who is the father of x, who is the maternal uncle of x). NLMs are able to learn rules that reach 100% success, but they have to be trained with curriculum learning, and for some problems several seeds have to be used, because the problems are complex and it is not guaranteed that a model will graduate from the complex classes.

If you want to see NLMs in action, visit their page on Google Sites: https://sites.google.com/view/neural-logic-machines. It is nice to see that this model is able to learn the rules needed to sort an array, find the shortest path in a graph, or play the block world with many objects.

Written by Giovanni

March 14, 2023 at 12:30 pm

Posted in Varie

Neuro-Symbolic AI: three examples


Some weeks ago I posted about the neuro-symbolic concept learner article, and I wanted to know more about this approach. I then read about GRUs, used in that system, and this week it was time to learn about Neuro-Symbolic AI in general. I read

Neuro-Symbolic AI: An Emerging Class of AI Workloads and their Characterization. Zachary Susskind, Bryce Arden, Lizy K. John, Patrick Stockton, Eugene B. John

https://arxiv.org/abs/2109.06133

The question behind the article is: “is this new class of AI workloads much different from neural network/deep learning models?”. The authors first provide 3 references to Neuro-Symbolic systems, and then investigate their performance. Their first finding is that the symbolic part of the computation is much less parallelizable than the neural network one: the symbolic part requires operations on few parameters and has a complex control flow that is not suitable to run on a GPU. Fortunately the symbolic part is not the one that dominates the response time of those systems. The symbolic part manipulates the features extracted by the neural part, and this is the reason why neuro-symbolic systems are more explainable: it is much easier to understand how the output is decided. This implies that these are really composite systems: one or more neural networks focus on some task, while the symbolic part reuses their output and focuses on producing the answer. Another advantage is that Neuro-Symbolic systems can be trained with fewer samples than purely neural ones.

It is now better to introduce the 3 systems under examination. The Neuro-Symbolic Concept Learner (NSCL) is the same system I described a few weeks ago: its goal is to observe an image containing many solids of different shapes and colors and answer questions like “is the green cube of the same material as the sphere?”. The Neuro-Symbolic Dynamic Reasoning (NSDR) system answers a much more challenging task: observe a 5-second video where solids move, collide, or exit the scene, and answer a natural language question like “what caused the sphere and the cylinder to collide?”. The third example is the Neural Logic Machines (NLM); it is quite different from the previous two because there is no separation between the symbolic and the neural part, even though it is able to perform inductive learning and logic reasoning. NLMs are for instance able to solve block world tasks: given a world representation and a set of rules, they decide the actions needed to reach the desired final state. The description provided was quite vague, and for this reason I plan to read the referenced paper in the next weeks.

The NSCL is composed of a neural image parser extracting object positions and features, a neural question parser processing the input question (using GRUs), and a symbolic executor that processes this information and provides the final answer. The NSDR is composed of a neural video frame parser segmenting the images into objects, a dynamics predictor that learns the physics rules needed to predict what will happen to colliding objects, a question parser, and a symbolic executor. The symbolic executor in NSCL works in a probabilistic way, so it is possible to train it like a neural network; this is not true for the NSDR, where it is a symbolic program that runs only on a CPU, really as a separate sub-module.

To pursue their goal the authors had to profile these systems, and it was not always easy to set them up. For this reason they decided to completely replace some components with other, more manageable ones. This gives an interesting list of modules that can be reused:

Detectron2 is a frame parser that extracts object positions, shapes, materials, sizes, colors… OpenNMT is a question parser able to translate a natural-language question into a set of tokens. PropNet is a dynamics predictor able to predict collisions between objects that may enter and leave the scene.

The paper also describes a useful concept I had missed: computational intensity. Suppose you have to multiply an m×k matrix by a k×n one. In this case the number of operations will be O(mkn), because of the way the algorithm works. Conversely, the memory needed will be O(mk+kn+mn), so you obtain:

intensity = (number of operations) / (memory used) ≈ mkn / (mk + kn + mn)   (see the Methodology section, page 5 of the paper)

When you multiply big, nearly square matrices you get a high intensity, and it is easier to parallelize the work. Conversely, if some dimensions of the matrices are very small (or they are even vectors), the intensity will be low and parallelization will be harder to introduce. The paper explains why the more symbolic parts of the computation are low-intensity and difficult or impossible to parallelize on a GPU; luckily their execution does not require as much time as the parallelizable part.
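Plugging a couple of shapes into the ratio above makes the point (a tiny helper of mine, just to illustrate):

```python
def matmul_intensity(m, k, n):
    """Operations per memory element for an (m x k) @ (k x n) product."""
    return (m * k * n) / (m * k + k * n + m * n)

print(matmul_intensity(1000, 1000, 1000))   # ≈ 333: a big square product, high intensity
print(matmul_intensity(1000, 1000, 1))      # ≈ 1: a matrix-vector product, low intensity
```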

Now I have some more interesting links to investigate, and I will start with the NLM.

Written by Giovanni

March 5, 2023 at 8:19 pm

Posted in Varie