What is Semantic Hashing?
Suppose you have a large collection of text documents: how could you identify those that are very similar to a given one? Traditional text hashing does not help, as it assigns a completely different value to texts with small differences; you would instead want a magic function that gives similar codes to similar contents. In 2008 Ruslan Salakhutdinov and Geoffrey Hinton published the “Semantic Hashing” paper in the International Journal of Approximate Reasoning, showing how deep neural networks can solve this problem.
Suppose you have a text document: it is composed of words, and these words can be repeated many times. Some words are very similar, some are synonyms, and in many cases stemming can be useful; to a certain degree of approximation cat, cats and catty can be considered the same word. A typical way to analyze texts is to create a vocabulary and count how many times each term appears in each document. This is the basis of the TF-IDF (Term Frequency – Inverse Document Frequency) technique. Counting the words is not enough: rare words – those appearing in fewer documents – are more valuable than common words: paleontology makes a document much more specific than cat. Methods based on this concept are the reference in text retrieval applications.
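To make this concrete, here is a minimal Python sketch (my own toy example, with a three-document corpus and no stemming or stop-word handling) of the word counts and of the inverse-document-frequency weighting described above:

import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "paleontology studies fossils of ancient animals",
]

# Build the vocabulary and the raw term-frequency counts.
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
tf = [Counter(doc) for doc in tokenized]

# Inverse document frequency: rare words get a higher weight.
n_docs = len(docs)
idf = {w: math.log(n_docs / sum(1 for doc in tokenized if w in doc)) for w in vocab}

# TF-IDF vector for each document (one entry per vocabulary word).
tfidf = [[tf[i][w] * idf[w] for w in vocab] for i in range(n_docs)]

print(vocab)
print(tfidf[2])  # "paleontology" weighs more than "the", which appears almost everywhere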
In my previous posts I have described Autoencoders and Deep Belief Networks; these concepts are needed to understand how semantic hashing works. Let's keep the vocabulary concept from TF-IDF and represent each document as a vector of numbers v, where each element represents the number of times that word appears in the text. The larger the vocabulary, the longer the vectors v will be. We use v for visible and not w for word, because semantic hashing is based on Restricted Boltzmann Machines, where we have visible and hidden variables.
With autoencoders we were interested in starting from a visible input and mapping it to a short representation with a restricted number of hidden variables, good enough to reconstruct the provided input. In our context we will start from word counts, pass through a hidden hash vector h that represents the semantic content, and we want to make sure that h is good enough to recover the word counts provided in input.
To be useful these h variables must also have another property: they must be binary. For instance, we will map every text into a vector of 128 binary values – and this will represent its semantic content. 128 bits is just two memory words on many processors, so with a ridiculously small quantity of memory you get a very good representation of the documents. Moreover, CPUs manipulate bits at tremendous speed, so comparing billions of such 128-bit codes can happen in milliseconds.
Even better, if you choose smaller sizes, 20 bits for instance, you have 1,048,576 combinations. This is small enough to be used as an address: to each number you assign a list containing the documents associated with that semantic hash. Small bit changes to an address identify similar semantics, so you can decide to collect all the documents that differ by just 4 bits and find very similar documents with nearly no effort.
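As a toy illustration (mine, not the paper's), here is how binary codes can be compared with a Hamming distance and how a short code can be used directly as an address into a table of buckets; the codes below are random placeholders standing in for the output of a trained network:

import random
from collections import defaultdict

random.seed(0)

def hamming(a, b):
    # Number of differing bits between two integer codes.
    return bin(a ^ b).count("1")

# Pretend these are 128-bit semantic hashes produced by the trained network.
codes = [random.getrandbits(128) for _ in range(100_000)]
query = codes[42]
close = [i for i, c in enumerate(codes) if hamming(query, c) <= 4]

# With 20-bit codes the hash itself is an address into a table of buckets.
buckets = defaultdict(list)
for doc_id, code in enumerate(codes):
    buckets[code >> 108].append(doc_id)   # keep only the top 20 bits as the address
similar_docs = buckets[query >> 108]      # all documents sharing the same 20-bit hash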
The actual values of each h element have no human-understandable meaning; they are just cells in the deep belief network's memory. We can map a document to a useful code that we are not able to interpret, but this still makes the task "given document x, give me the most similar ones" achievable in very little time.
Now let's have a look at the paper's content. The tests were made on two text corpora: the 20-newsgroups corpus contains 18,845 postings taken from the Usenet newsgroup collection, and the Reuters Corpus Volume I is an archive of 804,414 newswire stories that have been manually categorized into 103 topics. Each text has been converted into a vector of inputs v and used to train a Deep Belief Network like the one below:

The bottom layer receives the word counts, and the layers above perform the nonlinear mapping that allows passing from a 2000-word vocabulary to a 128-bit semantic hash. Using just two layers, as in a single restricted Boltzmann machine, is not enough to obtain a good mapping. The weights in the layers are initialized randomly, but as you can see the architecture is a stack of 4 RBMs, one above the other. In this case it is possible to use the layer-by-layer learning procedure to speed up the learning. Once this is done, the learned parameters are good enough for fine tuning. The above architecture is converted into an autoencoder:

As before, the original word counts enter from the bottom and are propagated through the neurons up to the 128-bit layer. At this stage some noise is added: this is a trick, since they want h to be made of bits, and the noise forces the sigmoid neurons to work far from the value 0, where the transition is roughly linear. The injected noise makes each sigmoid work either close to 0 or close to 1.
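Here is a minimal numpy sketch of this trick (my own simplification, not the paper's training code): Gaussian noise is added to the pre-sigmoid activations of the code layer during training, pushing the units to saturate at 0 or 1:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def code_layer(pre_activation, noise_std=4.0, training=True):
    # Sigmoid code layer with noise injected before the nonlinearity.
    # The noise pushes the pre-activations away from 0, so the sigmoid outputs
    # end up close to either 0 or 1 and can be thresholded into a binary hash.
    if training:
        pre_activation = pre_activation + rng.normal(0.0, noise_std, pre_activation.shape)
    return sigmoid(pre_activation)

h = code_layer(np.zeros(128))                                          # training: values pile up near 0 or 1
bits = (code_layer(np.zeros(128), training=False) > 0.5).astype(int)   # inference: threshold into bits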
In the post on autoencoders I said the decoder part should be simple, because otherwise the autoencoder can learn an identity function. Here there is a transition from a space of 2000 integers to a 128-bit space and then back to 2000 integers; moreover the weights W1, W2 and W3 are shared between the decoder (top 3 layers) and the encoder (bottom 3 layers). The results produced are very good.
For the model they are using these formulas:

The first one is linked to the text analysis problem. Word counts are considered as independent arrival events, so they chose to use the Poisson distribution. They apply some normalization, multiplying by N, the document length; they say this is important to deal with documents of very different lengths. Ps is the Poisson distribution formula. For the h variables they just use the sigmoid, commonly used in neural networks and well suited to binary values. When they moved from 2 layers to 4 layers, they used the Poisson model only for the input layer. The lambdas and b coefficients are the biases; as usual the w are the neuron weights.
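Reconstructed from the description above (the paper's exact notation may differ slightly), the two formulas should look roughly like:

P(v_i = n | h) = Ps( n, N · exp(λ_i + Σ_j h_j w_ij) / Σ_k exp(λ_k + Σ_j h_j w_kj) )
P(h_j = 1 | v) = σ( b_j + Σ_i w_ij v_i )

where Ps(n, λ) = e^(−λ) λ^n / n! is the Poisson probability and σ is the sigmoid.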
They evaluated Semantic Hashing performance using Precision and Recall, showing that it works very well. Also, combining Semantic Hashing with TF-IDF, it is possible to improve both retrieval quality and retrieval time. The article provides many more details and is not too difficult to read; here is the link if you want to go deeper: https://www.cs.cmu.edu/~rsalakhu/papers/sdarticle.pdf
What’s an Autoencoder?
This term pops up quite often in papers, but what is an Autoencoder? It is a machine learning model trained to reconstruct its input. This seems quite crazy at first: it is just x = f(x), so what is the point of creating a model that is equivalent to the identity? Chapter 14 of Deep Learning by Ian Goodfellow explains in 25 pages why this can become useful. This article is just a short summary, and you are invited to follow the link and read the whole chapter.
As usual let x be a vector of inputs. The autoencoder is internally divided into two parts: an encoder, which computes h=encode(x), and a decoder, which computes y=decode(h); ideally you should obtain x=y=decode(encode(x)). The crucial point is that the hidden vector h does not have the same dimension as x and can have useful properties. For instance x can be thousands of pixels from an image, while h can be composed of just tens of elements. When the size of h is smaller than the size of x we speak of undercomplete autoencoders.
If we allow the decoder to be powerful and complex, like a deep neural network, we will end up with a model that just learns the identity function, and this will not be useful. We will instead allow only the encoder to be complex; the decoder must be simple. We are therefore interested in obtaining a useful representation h of the complex input, and we use the decoder part only because we want to do unsupervised learning: we do not need to define and classify all our inputs along the h dimensions by hand, we just want the model to discover by itself an h that has the useful properties we are interested in.
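As an illustration (a minimal numpy sketch of my own, not taken from the book), this is the shape of an undercomplete autoencoder with a small nonlinear encoder, a simple linear decoder, and plain gradient descent on the reconstruction error:

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 64, 8          # h is much smaller than x: undercomplete

# Encoder and decoder parameters.
W_enc = rng.normal(0, 0.1, (n_in, n_hidden))
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(0, 0.1, (n_hidden, n_in))
b_dec = np.zeros(n_in)

def encode(x):
    return np.tanh(x @ W_enc + b_enc)      # h = encode(x)

def decode(h):
    return h @ W_dec + b_dec               # y = decode(h), kept simple and linear

X = rng.normal(0, 1, (256, n_in))          # toy data
lr = 0.01
for _ in range(200):
    H = encode(X)
    Y = decode(H)
    err = Y - X                            # reconstruction error: y should match x
    # Gradients of the mean squared error (tanh'(a) = 1 - tanh(a)^2).
    grad_W_dec = H.T @ err / len(X)
    grad_b_dec = err.mean(axis=0)
    dH = err @ W_dec.T * (1 - H**2)
    grad_W_enc = X.T @ dH / len(X)
    grad_b_enc = dH.mean(axis=0)
    W_dec -= lr * grad_W_dec
    b_dec -= lr * grad_b_dec
    W_enc -= lr * grad_W_enc
    b_enc -= lr * grad_b_enc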
Which useful properties do we want to impose on h? Sparsity is an interesting one: if h is sparse, a small input change won't influence the h representation much. The encoder will extract a brief representation of the input, and in practice we will use this representation to compare different inputs with each other. At the chapter's end there is a reference to semantic hashing, recognizing similar texts by comparing their h vectors; an interesting topic I would like to describe in my next posts. The decoder can also be used as a generative model: given a change in the h state you can check what the corresponding input is, which is useful to visualize what the model is considering.
Sparsity is obtained by adapting the loss function used during training:
L(x, decode(encode(x))) + regularize(h)
The regularize function can for instance penalize elements with too high a value. In our case we may want to obtain this using rectifier units: a ReLU naturally maps to 0 all elements where h is near zero or negative, while keeping the value for positive elements. The representation we obtain becomes sparse.
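A tiny sketch of that idea (illustrative only): a ReLU code layer combined with an L1 penalty on h, so that the total loss is the reconstruction error plus a term that pushes many code elements to exactly zero:

import numpy as np

def relu(a):
    return np.maximum(0.0, a)          # negative pre-activations become exactly 0

def sparse_autoencoder_loss(x, W_enc, W_dec, lam=0.1):
    # L(x, decode(encode(x))) + lam * regularize(h), with an L1 penalty on h.
    h = relu(x @ W_enc)                # encode(x): sparse thanks to the ReLU
    y = h @ W_dec                      # decode(h)
    reconstruction = np.mean((y - x) ** 2)
    sparsity_penalty = lam * np.sum(np.abs(h))
    return reconstruction + sparsity_penalty

rng = np.random.default_rng(0)
x = rng.normal(size=32)
W_enc = rng.normal(scale=0.1, size=(32, 8))
W_dec = rng.normal(scale=0.1, size=(8, 32))
print(sparse_autoencoder_loss(x, W_enc, W_dec))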
Among autoencoders we also have denoising autoencoders. Here the idea is that we do not feed just x to the model but x+noise, and we still want the model to reconstruct x: x = decode(encode(x+noise)). By doing this we force the model to be robust to small modifications of the input; the model will actually provide a likely x', doing a kind of projection of the input vector onto the inputs seen in the past. The book gives some nice pictures to explain this concept, for instance figure 14.4. The autoencoder has learned to recognize a manifold of inputs, a subset of the input space; when a noisy input comes, it is projected onto this manifold, giving the most promising candidate x'. Citing the authors:
The fact that x is drawn from the training data is crucial, because it means the autoencoder need not successfully reconstruct inputs that are not probable under the data-generating distribution
https://www.deeplearningbook.org/contents/autoencoders.html
Another interesting idea that comes from noise and sparsity is the following: what about using sigmoid units to compute the final h representation, and injecting a bit of noise just before them? The sigmoid saturates at extreme values: the noise will naturally be discarded if the units work far from zero. Injecting the noise forces the h elements to be binary and sparse.
Priors, posteriors and regularization
While reading papers on machine learning I have often seen these terms used. My goal this week was to find a paper on these concepts but in the end I am using Wikipedia to understand them. This is a short summary to avoid forgetting them.

Let X be the result of one or many experiments; X is called the EVIDENCE. In machine learning we want to learn, based on the evidence, a model that can be used to predict new results. The model will be based on PARAMETERS, let's call them θ. Remark: both are vectors, not scalars.
Now we can look at conditional probabilities. P(X | θ) is the probability of the evidence given the parameters; it is called the LIKELIHOOD function. I have encountered this term in a statistics class about estimation: you want to estimate the parameters given the evidence, and then you search for argmax_θ P(X | θ), the θ that maximizes the probability of obtaining the evidence.
Reading https://en.wikipedia.org/wiki/Likelihood_function some details are highlighted: X is considered a random variable and the observation should be labeled x (a vector). It is then possible to distinguish between
L(θ | X=x) = P(X=x | θ)
and
P(θ | X=x).
The first one, L, is the probability of obtaining x given a specific θ. The second one is the probability that θ is the right parameter given the evidence. It is not possible to conclude that they are the same if you don't have more knowledge of the process. The likelihood page gives the example of flipping a coin twice, with θ the probability of getting heads. θ can vary between 0 and 1; a regular coin has θ=0.5. You can plot a graph of L given that two heads have been obtained: for θ=0.5 it is 0.5*0.5, while if θ=1 (always heads) then L=1*1. Actually L=θ*θ in this case and the plot is just a parabolic curve – and it is not a probability! The area under the curve is 1/3, while a probability should be normalized to 1.
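A quick numerical check of this example (my own, in Python): the likelihood of two heads as a function of θ is θ², its maximum is at θ=1, and the area under the curve is 1/3, so it is indeed not a probability density:

import numpy as np

theta = np.linspace(0.0, 1.0, 1001)
L = theta ** 2                      # likelihood of observing two heads

print(theta[np.argmax(L)])          # 1.0: the maximum likelihood estimate
print(np.trapz(L, theta))           # ~0.333: the area under the curve is 1/3, not 1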

What I find very confusing about the notation is that you write L(θ|x) but the two symbols are actually reversed, P(X=x | θ), in the definition. So when you do a maximum likelihood estimation you choose the most promising θ, and in the two-heads example you should pick θ=1 and not the θ=0.5 of a regular coin.
If you reverse the roles and consider P(θ|X=x) you are considering the POSTERIOR, that is the probability of θ given that x has already happened. If you consider just P(θ), unconditioned on the x observations, you are considering the PRIOR. Applying the usual Bayes rule:
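In the notation above:

P(θ | X=x) = P(X=x | θ) * P(θ) / P(x)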

the POSTERIOR is proportional to the LIKELIHOOD times the PRIOR. P(x) is called the MARGINAL and is the probability of having X=x averaged over all possible θ values – a normalization term.

Not easy to remember, but at least the posterior is called posterior because it comes after x is known, and it is obtained from the prior. The Wikipedia page on posterior probability is https://en.wikipedia.org/wiki/Posterior_probability.
Given this knowledge one can make sense of phrases like:
In explicit regularization, independent of the problem or model, there is always a data term, that corresponds to a likelihood of the measurement and a regularization term that corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior, that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more addictive to the data or to enforce generalization (to prevent overfitting)
https://en.wikipedia.org/wiki/Regularization_(mathematics)
I wanted to understand what regularization is, and phrases like the one above did not make sense before checking the prior and posterior definitions. By regularization they mean choosing a target function to be minimized that depends both on the evidence X=x and on the prior P(θ). In deep learning there is often a problem of the θ parameters becoming bigger and bigger; they can diverge to infinity while being estimated with gradient descent. Since it is unrealistic to allow θ values to go to infinity, you can introduce a term that depends on |θ| and penalizes these estimates.
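The generic formula on Wikipedia's page has roughly this shape (quoted from memory, so take it as an approximation): a data term V plus a penalty R on the learned function f, weighted by λ:

min_f Σ_i V(f(x_i), y_i) + λ R(f)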

This comes from the definition on Wikipedia's page, but it does not make the dependence on θ explicit. It should be something more like:
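Something like the following, reconstructed from the description below (the exact symbols may differ):

θ* = argmin_θ Σ_i ( y_i − f_θ(x_i) )² + λ |θ|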

You search for the θ that minimizes a function composed of the prediction error plus the norm of the parameters multiplied by a hyper-parameter lambda. Here y_i is the i-th result and x_i is the i-th input. The sum of squared differences and the use of the Euclidean distance are arbitrary choices, and many experiments have been made with different functions. The formula also recalls the Bayesian one, but you do not see any probabilities in it.
In machine learning papers two regularization functions are often used, L1 and L2. L2 (used in Ridge regression) is the usual |θ| = sqrt(sum θ_i^2), while L1 (used in LASSO) is just sum(abs(θ_i)). You could also introduce L0, which simply counts the number of nonzero parameters. LASSO and L0 help sparsity because they promote zeroing out parameters, and this is generally good because it makes models more explainable and robust: fewer dimensions will influence the model prediction.
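For reference, here is how the three penalties can be computed on a parameter vector (a small numpy sketch of the definitions above):

import numpy as np

theta = np.array([0.0, -3.0, 0.5, 0.0, 2.0])

l2 = np.sqrt(np.sum(theta ** 2))      # L2 norm, penalized in Ridge regression
l1 = np.sum(np.abs(theta))            # L1 norm, penalized in LASSO
l0 = np.count_nonzero(theta)          # L0 "norm": number of nonzero parameters

print(l2, l1, l0)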

In “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” another regularization was also presented and supported because it improved model performance: the MAX-NORM constraint, which keeps the L2 norm bounded, |θ| < c, for an arbitrary constant c (usually 3 or 4). Using it you constrain the θ parameters to stay within a hyper-sphere of radius c, so that the network parameters never get too large. If you want to know more about max-norm the original article is https://home.ttic.edu/~nati/Publications/SrebroShraibmanCOLT05.pdf, but it is a very technical math paper.
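Here is a minimal sketch of how such a constraint can be enforced (my own illustration, not the paper's code): after each gradient update the weight vector is projected back inside the sphere of radius c:

import numpy as np

def max_norm(weights, c=3.0):
    # Project the weight vector back onto the L2 ball of radius c.
    norm = np.linalg.norm(weights)
    if norm > c:
        weights = weights * (c / norm)
    return weights

w = np.array([2.0, 4.0, 4.0])      # norm 6, outside the sphere
print(max_norm(w))                 # rescaled so that its norm is exactly 3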
Reading the regularization page you will also find other interesting points: how regularization is used in least squares estimation, the fact that early stopping can also be seen as a form of regularization, and much more.
What is “Dropout” in neural networks
Neural networks are composed of layers of neuron arrays and are trained using back-propagation. Since these structures are composed of thousands of elements and millions of connections, it is hard for structure and specialization of some neuron sets to emerge during training. Often neurons co-adapt: they provide the desired output, but by relying on odd relations in the input data. The “dropout” technique was introduced in 2014 and promotes neuron independence and specialization, improving prediction performance.
I am referring to this paper:
Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov
Journal of Machine Learning Research 15 (2014) 1929-1958
It is easy to explain what dropout is: suppose you have a 2-layer network, one layer composed of 100 neurons and the other of just 10. During normal training all the parameters of these neurons are tuned together. With dropout, instead, you turn each neuron on or off independently from the others throughout the training. For instance the first minibatch of samples will be used to train neurons 1, 7, 8, 14… while the second minibatch will be used for neurons 2, 5, 7, 11, 14… It is as if each time you were training a different model, but this is not completely true, because a neuron keeps its own parameters during the whole training. Let p_1=80% be the probability of keeping a neuron active in layer 1, and p_2=50% the probability of keeping a neuron active in layer 2. In this case you do not train all the 100 by 10 neurons for the whole training: at each step you train a subnetwork of about 80 by 5 neurons.

There is one point to be considered about the learned parameters. At the end of the training, how do you use the model? You simply turn on all the units, BUT YOU SCALE THE LEARNED WEIGHTS BY P, because you have trained many smaller models and their weights must be adjusted to maintain the right proportions. The final model has more neurons than those used during training, in the example above 100 by 10 instead of 80 by 5. You are creating a larger model by combining together an ensemble of many smaller ones. This differs from the usual ensemble approach, because the models trained here have the same architecture and share parameters between them; in classical ensembles you put together models with different architectures, trained independently. One note about Restricted Boltzmann Machines / Deep Belief Networks: you apply p just to the hidden layer, not to the visible one; see the paper for details.
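A small numpy sketch of both phases (a simplification of the method, not the paper's code): during training each unit of a layer is kept with probability p, while at test time all units are kept and their outputs are scaled by p, which is equivalent to scaling their outgoing weights:

import numpy as np

rng = np.random.default_rng(0)

def layer_forward(x, W, p, training):
    # One dropout layer: keep each unit with probability p during training,
    # use all units scaled by p at test time.
    h = np.maximum(0.0, x @ W)                    # ReLU activations
    if training:
        mask = rng.random(h.shape) < p            # Bernoulli(p) per unit
        return h * mask                           # dropped units output 0
    return h * p                                  # test time: scale by p

x = rng.normal(size=(32, 100))                    # a minibatch
W1 = rng.normal(scale=0.1, size=(100, 10))
train_out = layer_forward(x, W1, p=0.5, training=True)
test_out = layer_forward(x, W1, p=0.5, training=False)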
What are the effects of dropout on the trained model? The best thing I can do to explain it is to copy a couple of pictures from the original paper.

In the picture above you clearly see that the dropout filters are very specialized: they work on specific parts of the image. The filters learned in the classical way are instead blurry and encode weird correlations between distant pixels.
Another important point is about activation and model sparsity: see for instance figure 8 in the original paper.

By applying dropout we reduce the number of neurons used to compute the result: fewer neurons are involved, with higher activations. This is beneficial because it makes the model more robust – one output will depend on fewer inputs, so a small change in something unrelated will not change the prediction. The model will also consume less energy to compute the result, because many neurons will not work and will just produce a 0 output.
The suggested value of p is 80% for the input layer and 50% for the hidden layers, but this is just another hyperparameter and you will need to do some experiments to tune it.
So why not apply dropout every time? Training will require more time! You are training many different models and putting them together; in the end you will have a better model, but it will take 2 or 3 times longer.
If you read the original paper you will find details on the math behind this approach. Randomly activating the neurons can be seen as a Bernoulli draw, and you could use other distributions, like the Gaussian, instead. Averaging the models together can be done in other ways than just scaling the parameters, but this method works quite well in the end. The whole method can be seen as a regularization approach, like adding noise in denoising autoencoders. You will also learn about the interaction with pre-trained models (see my previous post on deep belief networks), and see many examples of where dropout can be used and how much it increased performance.
How a simple activation function changed deep learning
Activation functions are at the heart of each neural unit: the weights are multiplied with the inputs, a bias is added, and this linear value is passed through a nonlinear function, which becomes the input for the next network levels. In the past just a couple of functions were used, the sigmoid and the hyperbolic tangent: these two functions are quite smooth at the extremes, and at some point it was discovered that a much simpler function could do the job better. This post summarizes the content of “Deep Sparse Rectifier Neural Networks”, presented at AISTATS 2011 and published in the JMLR Workshop and Conference Proceedings, where Xavier Glorot, Antoine Bordes and Yoshua Bengio changed again the deep learning landscape.
Below you see the definition and a picture of the most common activation functions:
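For reference, the standard definitions are:

sigmoid(x) = 1 / (1 + e^(−x))
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
rectifier(x) = max(0, x)
softplus(x) = log(1 + e^x), a smooth version of the rectifier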


The sigmoid gently goes from 0 to 1, while the tanh goes from -1 to 1; I had already encountered them while reading about Long Short-Term Memories (LSTM), where they are used extensively to block or propagate memories of previous events. You can see there are a lot of exp functions to compute; by contrast the rectifier is trivial and does not require complex hardware. It just says “below this hyperplane I do not care what happens”.
Coming back to the paper: in 2006 Hinton introduced his layer-by-layer initialization procedure, which allowed training deep networks from a good starting point. The authors were investigating why the usual training procedure was not working well while Hinton's was better. They decided to experiment with the simple rectifier function and found that it gave very good results; so good that in many cases the pre-training procedure was not needed anymore.
One caveat is that rectifiers also need an L1 regularization: since the function to the right of 0 is purely linear, the model can be affected by unbounded-activation problems – the coefficients can grow and grow because there is no penalty for using very high values. The regularization keeps the learned parameters low by penalizing models with very high coefficients. Sigmoids and tanh, in the same situation, simply saturate at 1.
Having an asymmetric function like the rectifier is not a limitation: if something symmetric needs to be learned, it can be done with a pair of rectifiers with swapped signs. A few more neurons may be needed, but with a much simpler activation function. The paper makes an interesting comparison with the “leaky integrate-and-fire” model used to describe the activation of real biological neurons: this model also presents an asymmetry at 0, which makes tanh models implausible – they force an unnatural antisymmetry in the learned model.
The paper also makes another point: the percentage of neurons producing something different from 0 is very low in nature (1 to 4%), while with sigmoids all of them compute something around 0.5. A model with rectifiers is much closer to what happens in nature: when something is not interesting, 0 is propagated, no energy is required, and clear paths are selected in the network. Sparsity is interesting also because it is a sign of disentangling: only some changes to some variables affect the output, whereas in a very entangled model a small change in any variable affects the output. A disentangled model is therefore more robust and explainable: only some paths are taken to compute the result.
Another point is the vanishing gradient problem: since the rectifier does not saturate for positive inputs, the gradient is propagated back to the previous network layers, and training remains possible even with many layers stacked. A potential problem, on the other hand, is that when the rectifier outputs 0 no gradient at all is propagated – it is just a flat zero line! But after some experiments comparing it with a smoothed version of the rectifier, they realized that this is not really a problem. In some architectures, like denoising autoencoders for unsupervised learning, the softplus function or a normalization is needed to avoid this zeroing-out problem; see the paper for details.
The authors report the results of many experiments. On image recognition, the rectifiers were able to learn a very good model without the pre-training procedure, producing a nicely sparse model. In tasks such as sentiment analysis, instead, the pre-training is still useful, even though it becomes less so when many labeled samples are available.
The authors have done a great job in making this complex matter clear and understandable; I really encourage you to read the original article if you have time.
Deep Belief Networks overview
This model has a vast literature and its roots in physics: it is not easy to approach, but it has given me many hints on how neural networks can be initialized and how generative models work.
A couple of weeks ago I was reading a paper on neural network initialization: there was a short reference to deep belief networks (DBN) and I wanted to know more, as I was searching for a more theoretical work. I then decided to read:
A fast learning algorithm for deep belief nets
Geoffrey E. Hinton, Simon Osindero, Yee-Whye Teh
Neural computation, July 2006, https://dl.acm.org/toc/neuc/2006/18/7
The article is one of the most referenced on the subject, but it is very difficult to understand, because it requires an existing background in the terms and the math needed for DBNs. I then found an overview article which gently introduces the subject:
Restricted Boltzmann Machine and Deep Belief Network: Tutorial and Survey
Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, Mark Crowley
arXiv:2107.12521v2 [cs.LG] 6 Aug 2022
This last article starts by reviewing the history of this model and comparing it to other models like Boltzmann machines. In the beginning there was the Ising model, a physical model created to explain the interaction between electrons with different spins. Since it was born in the physics field, this model is linked to concepts like Entropy, Energy and Hamiltonians. It is for this reason that the term energy pops up from time to time when reading the first paper; Boltzmann machines are said to be energy-based models.
But what are these Boltzmann Machines, and how do they relate to neural networks and machine learning? Well, suppose you have two neuron layers: the visible layer and the hidden layer. Neurons in the visible layer are directly connected to the visible inputs, for instance each pixel of an image. The first paper is about recognizing handwritten digits, so let's say the visible neurons are connected to the pixels of the input image. The hidden layer neurons are there to map the visible input to latent variables, for instance the digits you have to recognize. So the number of input neurons is different from that of the hidden layer.
In reality, for the digit recognition problem, the first article proposes a 28×28 input grid connected to 500 neurons in the first layer, connected to a second layer of 500 neurons, connected to a 2000-unit layer, finally connected to a 10-neuron layer – one for each digit. By the way, this is a DBN and not a Boltzmann machine, which has just 2 levels.
Coming back to the Boltzmann Machine: it has not just links between visible and hidden neurons, it also has connections between visible neurons and connections between hidden neurons. To describe a BM you therefore need 3 matrices: one representing the weights of the connections between visible and hidden units, one for the connections between visible units and one for the connections between hidden units. You also need 2 vectors of biases for the 2 neuron sets.
Another original property of this model is that these links are bi-directional: you do not just propagate the values from the input layer to the hidden one, but also the other way back. This is a particularity of Markov networks; when you have directed links instead, you are dealing with Bayesian networks. The procedure used to draw samples from a BM is called Gibbs Sampling: you start from a random point – not necessarily a valid one – and you sample one variable at a time, considering the conditional probability of that variable given all the other ones. You repeat this for all the variables, and you iterate several times until stabilization. The math literature says that this process does not require too many iterations to reach stability.
As you may imagine, allowing interactions between hidden and hidden variables, and between visible and visible variables, complicates the model, and in practice it may not be necessary. We speak of a Restricted Boltzmann Machine when you only have one weight matrix, describing the interactions between visible and hidden variables. If you write the weights of the whole machine as a single matrix over all the units, that matrix is symmetric, because the links are bidirectional, and its diagonal is filled with zeros, as one variable does not depend on itself. The coefficients are learned using maximum likelihood estimation.
It is also remarkable that these networks can be considered associative memories. The authors of the second paper refer here to Hopfield networks: the variables are gated by a threshold theta, if they are above theta they are considered 1, otherwise 0 (or -1).
Coming back to RBMs, they have another important property: conditional independence of the variables, because hidden variables have no direct connections among themselves, and neither do visible variables. Given the hidden layer, each visible variable is independent of the other visible variables, and the same holds for the hidden variables given the visible ones. This makes the formulas tractable and leads to simple multiplications of probabilities; the formulas are derived in the second paper. The second paper also explains how to concretely implement Gibbs sampling: you iteratively generate the hidden variables given the visible variables, then the visible variables given the hidden ones, and repeat this a number of times.
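A compact numpy sketch of that alternating sampling scheme for a binary RBM (an illustration in my own notation, not the papers' code):

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 500
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))   # visible-hidden weights
b_v = np.zeros(n_visible)                                # visible biases
b_h = np.zeros(n_hidden)                                 # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    # Hidden units are conditionally independent given the visible ones.
    p = sigmoid(v @ W + b_h)
    return (rng.random(p.shape) < p).astype(float)

def sample_v_given_h(h):
    # Visible units are conditionally independent given the hidden ones.
    p = sigmoid(h @ W.T + b_v)
    return (rng.random(p.shape) < p).astype(float)

# Gibbs sampling: start from some visible vector and alternate a few times.
v = (rng.random(n_visible) < 0.5).astype(float)
for _ in range(10):
    h = sample_h_given_v(v)
    v = sample_v_given_h(h)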
Given this procedure, it is also possible to do something unexpected: what if, instead of starting from the visible variables (the pixels of the digit image), you start from the hidden variables? The hidden variables represent the latent classification, so you can fix the variable that says the class is an eight and look at which pixels are activated in the image after some iterations. It is in this way that RBMs can be used as generative models; looking at the first paper, you will see some examples of machine-generated digits (figures 8 and 9).
But how does all of this relate to deep learning? In the end here we have just one layer of visible variables and one layer of hidden variables! Let's start with the inputs: they will be the visible layer. The first network layer on top of them will be the hidden layer. You can train this submodel independently from the other layers, as explained in the first paper. When this is completed you move up one layer: the hidden variables become the new visible layer, and the untrained layer above becomes the new hidden variables. You repeat this level by level until you reach the top layer of your deep network.
Reading the first paper – even if it is very complex and requires a lot of math – you will realize that this leads towards network training. It is actually a very good starting point, and after it you will have a partially trained network that does not suffer from the vanishing gradient problem. In a second phase you can use gradient descent to fine tune the network and obtain better results. This was a major breakthrough in 2006 and opened the road to deep learning; before this it was not practically possible to train a deep network.
So to initialize the weights of a neural network you can use RBMs: they have a huge mathematical background justifying them and have proven their capabilities. There are many extensions referenced from the second article: dealing with multivariate variables, continuous variables, introducing time evolution… It has definitely been worth spending these 2 weeks reading about them.
Weight initialization in neural network training
The neural network training algorithm changes the weights associated with the neuron inputs and biases so that the network learns to reproduce the desired output. But how do you initialize all these weights before starting the training? Is there a way to initialize them that speeds up the learning?
From my previous readings I understood that neural networks implement non-convex functions – functions that have multiple local minima. These minima can be a trap: if the training algorithm does not explore enough of the parameter space, the model will not evolve and will produce sub-optimal results. Ideally we would like the training to discover the absolute minimum of the function, so that it can produce the best possible results. We search for a minimum because it represents the minimum possible difference between predicted and training results.
The role of parameter initialization is therefore to choose where the minimum search will start in the parameter space. If we start the search close to the absolute minimum, the training will be quick and provide the best results. If we start from the wrong place, the training might not converge at all.
But is it really possible to do a good initialization? After all, we do a complex and long training precisely because it is not possible to figure out the good model directly. This week I read this paper:
A Weight Initialization Method Associated with Samples for Deep Feedforward Neural Network
Yanli Yang, Yichuan He
ICCDE ’20, January 4–6, 2020, Sanya, China © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7673-0/20/01
https://doi.org/10.1145/3379247.3379253
The authors acknowledge that the most widely used initialization is just a random choice: after all, if you do not know where the minimum is, any place is a good starting point. They also mention other methods, like using stacked autoencoders or deep belief networks, but really too quickly, in a short paragraph; they did not satisfy my curiosity.
The authors suggest initializing a feedforward neural network using this procedure:
- First of all you randomly draw the values of all the parameters; let's call this set of parameters W0 – the weights at time 0
- You then train the network for just one epoch. At this point the parameters will have evolved into W1 – the weights at epoch 1. Notice that these weights incorporate some information coming from the training samples, though not too much, as it is just 1 epoch
- Restart the training from W* = a·W0 + b·(W1 – W0), where a and b are two coefficients, let's say a=0.9 and b=0.3
This W* therefore has a random component and a data-driven component. According to their experiments this reduces training by hundreds of epochs. This is amazing, as W1 comes from just one epoch of training; actually it puzzles me. Why is simply continuing the training from W1 worse than restarting from W*? The paper does not provide a theoretical explanation, it just concludes that more experimentation is needed. It reminds me of curriculum learning: you start training the model with simpler examples and then you move to complex things, trying to move the parameter search closer to the optimal minimum. With this initialization you maybe start moving in the right direction, but it seems just an intuition. I am quite unsatisfied with the lack of theory behind the presented results; I will look for more references to this technique in the following weeks to seek some confirmation.
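In code the three steps look like this (a minimal runnable sketch with a toy linear model standing in for the real network, and a and b chosen as in the example above):

import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem standing in for the real training task.
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

def one_epoch(W, lr=0.01):
    # One epoch of gradient descent on the mean squared error.
    grad = 2 * X.T @ (X @ W - y) / len(X)
    return W - lr * grad

W0 = rng.normal(scale=0.1, size=10)        # random initialization
W1 = one_epoch(W0)                         # weights after a single epoch
a, b = 0.9, 0.3                            # coefficients as in the example above
W_star = a * W0 + b * (W1 - W0)            # restart the full training from here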
AI Rewrites Coding
Do you want an overview of existing tools and some insight into how they work? Read this article on the ACM site: https://cacm.acm.org/magazines/2023/4/271227-ai-rewrites-coding/fulltext
Reading the text you will find references to projects at IBM, Google and Amazon. Existing solutions are limited to generating 15-20 lines of code for you; looking at the video embedded in the page you can get an idea of how this will improve in the future.
In about 20 minutes you will learn how AI/deep learning can provide candidate programs for your problem, while a symbolic process analyzes and prunes those that will not lead to a correct solution. Really amazing.
Trying Kubeflow
This week I decided to focus a bit on learning Kubeflow instead of reading a research paper. During this year I have tried to follow Andrew Ng's advice: read at least one research paper a week – actually he suggests reading 2, but I realize I don't have enough time to do that. At least during these months I have felt happier, and I realized I am really learning new things.
Now I would like to put into practice some of the things I have learned, so I tried to approach Kubeflow. On paper it is very powerful and you could put it on any cloud provider and start cooking your stuff; in reality I feel like I bumped into a wall. Ouch, it hurts!
First you need to install a ton of operators and components into Kubernetes. OK, they provide the templates to do that, and probably you can install everything locally with some minikube-like environment, but we tried to do it on Azure to see how it would really be. The fact is that you need someone to pay for it, you need a credit card on your account, and the big company puts a lot of constraints on how containers and networks should be deployed, which sites you can reach from your pods, and how you can connect to them.
I started thinking that anybody else in a start-up could just get a cloud subscription and install things plain vanilla, while I had to struggle for days and ask for help. In the end I am not even sure all the issues are solved: just some modules start, and I can reach a UI, but only via a bridge server. Not really a productive way of working – especially because I just want to try the tool and understand whether it can be useful in the future.
Once I could access the UI I started having a look at the MNIST example. I have to run the examples on Azure, and if you search the example sources you will see that Azure is not really that well covered. At least the MNIST example seemed clear: you have a notebook that prepares an image for you, containing your model, and you deploy and run it. Easy, no? No.
First you need to do the Kubernetes set-up: create storage accounts, create secrets in a namespace, create a namespace to run the notebooks, configure a docker registry… Also, the example says it has been tested with a specific image version, but searching for it I did not find it. And once Jupyter was working I realized something was missing locally to make the kubectl API work. I had to figure out that you can set up a kubectl connection with some Azure command line tool, and that all the configuration and secrets go into a .kube/config file that you can then copy onto your pod.
Then I started running the code and I saw error messages about module version incompatibilities… the image I am using is probably newer than the example and, bah, I started changing things until something started working.
Now I am stuck with some Kubeflow Fairing issues: the code checks whether a secret exists, and the library used is not OK – it says the method is not present. Surely another version incompatibility issue; not a happy Sunday morning experience.
I started thinking that we should just choose an environment available as a service, like Azure Machine Learning Studio, and then focus on how to run the model anywhere on other cloud providers. Setting up a whole environment like Kubeflow is too complicated for one person in a few days; a more decent amount of time should be allowed. You should also make sure that people with competence on the cloud provider and on security are available around you, because their help will be precious.
Language models based on automatized feature extraction
Can a neural network, by analyzing a large text corpus, group together on its own words with similar meanings or roles in a phrase?
Last week I was reading an article on Curriculum Learning, and one of the examples was about predicting the next word in a phrase, knowing the previous five words. The example said that each word was mapped to an m-dimensional vector, and these vectors were used to predict the next word. I did not know about this technique, so this week I searched for a paper to learn more. Nowadays there are many models already available to do this, but I wanted to understand the origin of the technique, so I read the referenced article:
A Neural Probabilistic Language Model. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin.
Journal of Machine Learning Research 3 (2003) 1137–1155
Ok, dating back to 2003 it is surely not the most recent article on the subject, but it explains this technique clearly and step by step. Language corpora are composed of streams of millions of words, taken from news, from Wikipedia, or from other sources. It is possible to analyze these streams and come up with some statistics: how many words are known, which are the rarest or the most common, which words follow one specific word, and so on. Usually the rarest words are dropped – or better, replaced with a special symbol – words with uppercase and lowercase letters are considered different, and punctuation signs are taken into account. With this information it is possible to build an n-gram model: let V be the vocabulary set, you can compute
P(w_t = v | w_(t-1), w_(t-2), …, w_(t-n+1)), for each word v in V
where w_t is the word at position t; you use the context of the last n-1 words to predict the next one. Of course this is a simplification, and you cannot push n to a big number, because the problem becomes unmanageable and the samples become very sparse. Another important problem is that this model does not generalize: if in the corpus you have “the dog bites the sheep”, nothing in the n-gram model tells you that “the dog bites the goat” is also a valid phrase.
For the n-gram model each word is different and there is no relation between goat and sheep; we would like a better model where you can represent a sort of distance between words, so that you can relate “the dog bites the sheep” to “the dog bites X” where X is similar to “sheep”. We know from school that there are grammatical classes such as nouns and verbs, and words may be classified in many ways, so that you have words that represent animals, things, concepts, etc. These classifications come from human experience, and it would be extremely cumbersome to compile a database with these relations. The core idea presented in “A Neural Probabilistic Language Model” is that:
You can fix in advance a number of dimensions m and let a neural network learn how to place each word in these dimensions. This is achieved by maximizing the probability of predicting the next word in a phrase, given the n previous words. The m dimensions are real numbers and not discrete values; in this way it is possible to compute word similarities using a distance formula.
You will not end up with clearly understandable dimensions – there will not be a “noun” or a “verb” dimension, because these are concepts of our culture, while this m-dimensional space just comes out of a mathematical optimization. Indeed you could think of initializing these m-dimensional vectors with part-of-speech information, but this surely requires a lot of work.
The paper presents experiments with m=30, m=60 or m=100 dimensions; this is by far smaller than the number of words present in a vocabulary (for instance 17,000). You can think of a table with 17,000 rows, one for each word, and 30 to 100 real-number columns that map the word into the m-space. Notice that this mapping does not change with the word position in the stream – it is a kind of convolutional operation done on words – so the number of parameters to be trained is tractable. The next-word prediction is then made by taking the m-dimensional vectors of the last n words in the stream and combining them with a shallow neural network, with a softmax layer on top to choose the next word.
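A compact numpy sketch of this architecture (an illustration with made-up sizes, not the paper's exact model, which also has direct input-to-output connections): an embedding table of |V| rows and m columns, the concatenation of the last few word vectors, a small hidden layer and a softmax over the vocabulary:

import numpy as np

rng = np.random.default_rng(0)
V, m, context, n_hidden = 17_000, 60, 5, 50   # vocabulary size, embedding size, context length, hidden units

C = rng.normal(scale=0.01, size=(V, m))       # the shared word-to-vector table
H = rng.normal(scale=0.01, size=(context * m, n_hidden))
U = rng.normal(scale=0.01, size=(n_hidden, V))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_next(word_ids):
    # Probability distribution over the next word given the last `context` words.
    x = C[word_ids].reshape(-1)               # look up and concatenate the m-vectors
    h = np.tanh(x @ H)                        # shallow nonlinear layer
    return softmax(h @ U)                     # softmax over the whole vocabulary

p = predict_next([12, 47, 1024, 7, 3])        # five made-up word indices
print(p.argmax(), p.max())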
Notice that, for the training to converge, a regularization function has to be introduced: the learned parameters will not diverge to infinite values because parameters with a high norm are penalized.
Another interesting feature is that with this technique you can also handle new, unknown words – words that never appear in the training vocabulary V. You use the model to predict which words should appear in place of the unknown one: let's say you take the 10 most likely words. You then take the m-dimensional vectors of these 10 words and mix them together, averaging them weighted by the word probabilities. This becomes the candidate representation for the new unknown word, and you can inject it back into the model and use it in the future.
Nowadays other techniques are used for this prediction problem. Using LSTM or GRU units in the neural layers makes it possible to remember some state for a long time – a whole long phrase – and this improves performance when dealing with masculine/feminine agreement and so on. It also allows extending the number n of words taken into account in the prediction.