Should we train Artificial Intelligence Models as we train kids?
Suppose you want to train a model that recognizes handwritten digits: you define some architectures and algorithms that you want to try, and train them using a set of images associated with the corresponding digit – the training corpus. You then evaluate each model's performance, and choose the best one. This is a very typical approach, but it is not the way we were trained as kids.
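To make this standard workflow concrete, here is a minimal sketch in Python using scikit-learn's small bundled digits dataset; the two candidate models are arbitrary choices, just to illustrate "define some architectures, train them all, keep the best".

```python
# A minimal sketch of the usual workflow, using scikit-learn's small digits
# dataset as a stand-in for a real handwritten-digit corpus.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two candidate architectures/algorithms to try (illustrative choices).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "small_mlp": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
}

# Train each candidate on the same corpus and keep the best one.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, scores[name])

print("best model:", max(scores, key=scores.get))
```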
With respect to the digit example, the training corpus will be composed of a mix of very well written digits, poorly written digits, some in very masculine handwriting, some in very feminine handwriting, etc. Classifying some examples will be more difficult than classifying others: not all the examples belong to the same class of difficulty. What we usually do in machine learning is shuffle these examples and throw them all at the model together, hoping that it will find its way out.
Suppose the same approach had been used with you when you were a kid: do you think that being shown oddly written characters in no particular order would have been a good way to learn to recognize digits? It is of course the same with other subjects: you were taught sums before subtractions and so on, not both mixed together. When we train humans (or animals), we start with simple tasks and then increase the complexity, until the student is able to solve difficult problems. This approach is called “Curriculum Learning”.
This week I read an article that is precisely about this concept:
Curriculum Learning. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston. Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009.
https://ronan.collobert.com/pub/2009_curriculum_icml.pdf
To support the intuition, the authors start from the fact that the loss function of a neural network is a non-convex function. This has a precise mathematical definition, but to keep things simple, look at the picture below:

On the left side you have a convex function; it is easy to spot its minimum, at x=0.5. On the right the function is non-convex: it has many local minima. The problem with these functions is that the learning algorithm can get trapped in a local minimum and never reach another region where the error function has a lower value.
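As a toy illustration of the trapping problem, here is plain gradient descent on a small non-convex function of one variable (a function picked purely for illustration): depending on where it starts, it settles into a different minimum.

```python
# Gradient descent on a non-convex 1-D function: different starting points
# end up in different minima.
f = lambda x: x**4 - 3 * x**2 + x        # non-convex: two minima
grad = lambda x: 4 * x**3 - 6 * x + 1    # its derivative

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

for start in (-2.0, 2.0):
    x = descend(start)
    print(f"start {start:+.1f} -> x = {x:+.3f}, f(x) = {f(x):+.3f}")
# Starting at -2 reaches the global minimum (near x = -1.30); starting at +2
# stays trapped in the shallower local minimum (near x = +1.13).
```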
When you train a neural network you start with some initial parameters, and the algorithm should tune them little by little to find the global minimum – the lowest of all the minima. It has been observed that it can be very hard to find a good starting point for the training. For this kind of problem there is a method called “continuation”. With this method you do not start by optimizing the very complicated target function directly: you first optimize a simpler one. Once you find a minimum of the simpler function, you use it as the starting point for the next part of the training.
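Here is a rough numerical sketch of that idea, reusing the toy function above. A Gaussian-smoothed version of that particular function happens to have a single minimum, so we can minimize the smoothed version first and use its solution as the starting point on the original function; this is only an illustration of the continuation principle, not the authors' procedure.

```python
# Continuation on the toy function: optimize a smoothed version first,
# then continue from its minimum on the hard function.
hard = lambda x: x**4 - 3 * x**2 + x
hard_grad = lambda x: 4 * x**3 - 6 * x + 1

# Gaussian smoothing with variance 0.5 cancels the x^2 term exactly, leaving
# x^4 + x (plus a constant), which has a single minimum near x = -0.63.
smooth_grad = lambda x: 4 * x**3 + 1

def descend(x, grad, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_easy = descend(2.0, smooth_grad)     # minimum of the simplified problem
x_final = descend(x_easy, hard_grad)   # continue on the real problem
print(x_easy, x_final, hard(x_final))  # ends near the global minimum at -1.30
# Compare: descend(2.0, hard_grad) alone stays in the local minimum near +1.13.
```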
To put it in more mathematical language, you introduce a parameter λ whose value lies between 0 and 1. When λ is 0 you work on a very simplified problem; when λ is 1 you work on the target problem – the very complex one. You arbitrarily decide how many steps you want to introduce: suppose, for simplicity, that you only have λ=0 and λ=1. You then take your corpus and assign some of the examples to the set with λ=0 and the others to the set with λ=1.
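A minimal sketch of this two-stage scheme, again on scikit-learn's digits dataset. The difficulty measure used here (distance of each image from the average image of its digit) is only a stand-in for whatever measure makes sense for your data.

```python
# Two-stage curriculum: train on the "lambda = 0" examples first, then
# continue training on the full corpus.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

# Per-example "difficulty": distance to the average image of its own class.
centroids = np.stack([X[y == c].mean(axis=0) for c in range(10)])
difficulty = np.linalg.norm(X - centroids[y], axis=1)

# lambda = 0 set: for each digit, the half of its examples closest to the
# class centroid; lambda = 1 is the full corpus.
class_median = np.array([np.median(difficulty[y == c]) for c in range(10)])
easy = difficulty <= class_median[y]

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200,
                      warm_start=True, random_state=0)
model.fit(X[easy], y[easy])   # graduate on the lambda = 0 examples first...
model.fit(X, y)               # ...then continue training on everything
```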
In this way you are creating some “classes”. The model will have to graduate from the class λ=0 before moving to the next level, a bit like at school. You may think that it is difficult to introduce this classification into your corpus: but maybe you can find some measure of an example's difficulty, and do the partitioning into classes automatically. You could also decide that some input parameters are surely relevant while others are only possibly relevant – and then train the model first with fewer inputs, introducing the others later (the lower λ is, the more input parameters are set to zero). A sketch of this second idea follows.
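In this sketch the low-λ stage simply zeroes out the inputs you are not sure about; which columns count as “surely relevant” is of course problem-specific, and the mask below is arbitrary.

```python
# Feature-based curriculum: start with only the "surely relevant" inputs,
# then switch everything on and keep training.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)   # 8x8 images flattened to 64 inputs

# Pretend only the central rows of each image are surely relevant
# (purely illustrative choice).
surely_relevant = np.zeros(X.shape[1], dtype=bool)
surely_relevant[16:48] = True

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200,
                      warm_start=True, random_state=0)
model.fit(X * surely_relevant, y)   # low lambda: the other inputs are zeroed
model.fit(X, y)                     # lambda = 1: all inputs switched on
```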
The authors give an example of training a neural network to recognize images of ellipses, triangles and rectangles. They first generate a corpus containing only circles, squares and equilateral triangles. The more complex shapes are introduced later, along with less contrasted backgrounds. Another experiment uses Wikipedia sentences: you want to build a model that predicts the next word given the previous s words, for instance s=5. You first train a model that works only with the 5,000 most common English words, then you train it again with the 10,000 most common words, and so on.
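Here is a rough sketch of how such a vocabulary curriculum could be set up; `sentences` and `train_on` are placeholders for your own corpus and training step, not anything taken from the paper.

```python
# Vocabulary curriculum: each stage trains only on sentences made entirely
# of the N most frequent words, with N growing between stages.
from collections import Counter

def vocabulary_curriculum(sentences, train_on, sizes=(5_000, 10_000, 20_000)):
    counts = Counter(word for sentence in sentences for word in sentence)
    ranked = [word for word, _ in counts.most_common()]
    for size in sizes:
        vocabulary = set(ranked[:size])
        stage = [s for s in sentences if all(w in vocabulary for w in s)]
        train_on(stage)   # keep the model's parameters from stage to stage
```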
What are the benefits in the end? In their experiments the authors found that the models trained with curriculum learning perform better in the test phase – they probably overfit less. The training speed is also improved, because the learner wastes less time on noisy or harder examples. There is no bulletproof strategy for splitting the training into classes – but that is too much to ask of a single article, and it is a very debatable subject even for humans: what should you teach a kid in first grade? The article dates from 2009 – there is probably much more to know on this subject if you are interested.