Archive for May 2023
What’s an Autoencoder?
This term pops up quite often in papers, but what is an autoencoder? It is a machine learning model trained to reconstruct its own input. This seems quite pointless at first: it is just x = f(x), so what is the point of creating a model equivalent to the identity? Chapter 14 of Deep Learning by Ian Goodfellow explains in 25 pages why this can become useful. This article is just a short summary, and you are invited to follow the link and read the whole chapter.
As usual, let x be a vector of inputs. Internally the autoencoder is divided into two parts: an encoder, which computes h = encode(x), and a decoder, which computes y = decode(h); ideally you should obtain x = y = decode(encode(x)). The crucial point is that the hidden vector h does not have the same dimension as x and can have useful properties. For instance x can be thousands of pixels from an image, while h can be composed of just tens of elements. When the size of h is smaller than the size of x we speak of undercomplete autoencoders.
If we allow the decoder to be powerful and complex, like a deep neural network, we will end up with a model that just learns the identity function, and this will not be useful. We will instead allow only the encoder to be complex; the decoder must be simple. We are therefore interested in obtaining a useful representation h of the complex input, and we use the decoder part only because we want to do unsupervised learning. We do not need to define and classify all our inputs along the h dimensions by hand; we just want the model to obtain by itself an h that has the useful properties we are interested in.
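Just to fix the idea, here is a minimal PyTorch sketch of an undercomplete autoencoder; the layer sizes and the MSE reconstruction loss are my own choices, not something prescribed by the book:

import torch
import torch.nn as nn

# Undercomplete autoencoder: h is much smaller than x (784 -> 32 here, sizes are arbitrary).
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Linear(32, 784)                # kept deliberately simple, as discussed above

x = torch.rand(16, 784)                     # a fake minibatch of flattened images
h = encoder(x)                              # h = encode(x), the compressed representation
y = decoder(h)                              # y = decode(h), ideally y is close to x
loss = nn.functional.mse_loss(y, x)         # train by minimizing the reconstruction error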
Which useful properties do we want to impose on h? Sparsity is an interesting one: if h is sparse, a small change in the input won't influence the h representation much. The encoder will extract a brief representation of the input, and in practice we will use this representation to compare different inputs with each other. At the end of the chapter there is a reference to semantic hashing, recognizing similar texts by comparing their h vectors; an interesting topic I would like to describe in my next posts. The decoder can also be used as a generative model: given a change in the h state you can check what the corresponding input is, which is useful to visualize what the model is considering.
Sparsity is obtained by adapting the loss function used during training:
L(x, decode(encode(x))) + regularize(h)
The regularize function can for instance penalize elements with too high a value. In our case we may want to obtain this using rectifier units: a ReLU will naturally move to 0 all elements of h that are near zero or negative, while keeping the value of the positive elements. The representation we obtain will become sparse.
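As a hedged sketch of the formula above, with an L1 penalty on h and a ReLU encoder output (the sizes and the lambda value are arbitrary choices of mine):

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())   # ReLU output: negative pre-activations become exact zeros
decoder = nn.Linear(64, 784)
lam = 1e-3                                               # strength of the sparsity penalty (arbitrary value)

def sparse_loss(x):
    h = encoder(x)
    y = decoder(h)
    reconstruction = nn.functional.mse_loss(y, x)        # L(x, decode(encode(x)))
    regularize_h = h.abs().mean()                        # penalize nonzero elements of h (an L1 penalty)
    return reconstruction + lam * regularize_h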
Among autoencoders we also have denoising autoencoders. Here the idea is that we do not feed just x to the model, but x + noise, and we still want the model to reconstruct x: x = decode(encode(x + noise)). By doing this we force the model to be robust to small modifications of the input; the model will actually provide a likely x', doing a kind of projection of the input vector onto the inputs seen in the past. The book gives some nice visual pictures to explain this concept, for instance figure 14.4. The autoencoder has learned to recognize one manifold of inputs, a subset of the input space; when a noisy input comes, it is projected onto this manifold, giving the most promising candidate x'. Citing the authors:
The fact that x is drawn from the training data is crucial, because it means the autoencoder need not successfully reconstruct inputs that are not probable under the data-generating distribution
https://www.deeplearningbook.org/contents/autoencoders.html
Another interesting idea that comes up about noise and sparsity is the following: what about using sigmoid units to compute the final h representation, and injecting a bit of noise just before them? Sigmoids saturate at extreme values: the noise will naturally be discarded if they work far from zero. Injecting the noise forces the h elements to become binary and sparse.
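Putting the last two ideas together, here is a rough PyTorch sketch: a corrupted input plus noise injected just before the sigmoid code. All sizes, noise levels and the use of Adam are my own assumptions, not values taken from the book.

import torch
import torch.nn as nn

enc_hidden = nn.Linear(784, 256)
enc_out = nn.Linear(256, 32)
decoder = nn.Linear(32, 784)
params = list(enc_hidden.parameters()) + list(enc_out.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def training_step(x):
    noisy = x + 0.2 * torch.randn_like(x)                      # denoising: corrupt the input, keep the clean x as target
    pre_h = enc_out(torch.relu(enc_hidden(noisy)))
    h = torch.sigmoid(pre_h + 0.5 * torch.randn_like(pre_h))   # noise before the sigmoid pushes h towards 0 or 1
    y = decoder(h)
    loss = nn.functional.mse_loss(y, x)                        # reconstruct the CLEAN input from the noisy one
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()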
Priors, posteriors and regularization
While reading papers on machine learning I have often seen these terms used. My goal this week was to find a paper on these concepts but in the end I am using Wikipedia to understand them. This is a short summary to avoid forgetting them.

Let X be the result of one or many experiments; X is called the EVIDENCE. In machine learning we want to learn, based on the evidence, a model that can be used to predict new results. The model will be based on PARAMETERS, let's call them θ. Remark: both are vectors, not scalars.
Now we can look at conditional probabilities:
P(X | θ) is the probability of the evidence given the parameters; it is called the LIKELIHOOD function. I have encountered this term in a statistics class about estimation: you want to estimate the parameters given the evidence, and then you search for argmax_θ P(X | θ), the θ that maximizes the probability of obtaining the evidence.
Reading https://en.wikipedia.org/wiki/Likelihood_function some details are highlighted: X is considered a random variable and the observation should be labeled x (a vector). It is then possible to distinguish between
L(θ | X=x) = P(X=x | θ)
and
P(θ | X=x).
The first one, L, is the probability of obtaining x given a specific θ. The second one is the probability that θ is the right parameter given the evidence. It is not possible to conclude that they are the same if you don't have more knowledge of the process. The likelihood page gives the example of flipping a coin twice, with θ the probability of getting heads. θ can vary between 0 and 1; a regular coin has θ = 0.5. You can plot a graph of L given that two heads have been obtained: for θ = 0.5 it is 0.5 * 0.5, and if θ = 1 (always heads) then L = 1 * 1. Actually L = θ * θ in this case and the plot is just a parabolic curve, and it is not a probability! The area under the curve is 1/3, while a probability should be normalized to 1.

What I find very confusing about the notation is that you write L(θ | x) but the two symbols are actually reversed, P(X=x | θ), in the definition. So when you do a maximum likelihood estimation you choose the most promising θ, and with the two-heads observation you should pick θ = 1 and not the θ = 0.5 of a regular coin.
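A tiny Python check of the coin example (the grid resolution is my arbitrary choice):

import numpy as np

theta = np.linspace(0, 1, 1001)
L = theta ** 2                  # L(theta | x = two heads) = P(X = two heads | theta)

print(theta[np.argmax(L)])      # maximum likelihood estimate: 1.0
print(L.mean())                 # approximate area under the curve on [0, 1]: about 1/3, so L is not a density in theta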
If you reverse the arguments and consider P(θ | X=x) you are considering the POSTERIOR, the probability of θ given that x has already happened. If you consider just P(θ), unconditioned on the observations, you are considering the PRIOR. Applying the usual Bayes rule:

P(θ | X=x) = P(X=x | θ) * P(θ) / P(X=x)

the POSTERIOR is proportional to the LIKELIHOOD times the PRIOR. P(X=x) is called the MARGINAL and is the probability of observing X=x over all possible θ values; it acts as a normalization term.

Not easy to remember, but at least the posterior is called posterior because it comes after x is known, and it is obtained from the prior. The Wikipedia page on posterior probability is https://en.wikipedia.org/wiki/Posterior_probability.
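Continuing the coin example with a small numeric sketch, just to see the prior at work; the grid and the bell-shaped prior favouring fair coins are my own choices:

import numpy as np

theta = np.linspace(0, 1, 1001)
likelihood = theta ** 2                  # P(X = two heads | theta)
prior = theta * (1 - theta)              # an unnormalized prior that favours fair coins (Beta(2, 2) shape)
unnorm = likelihood * prior              # numerator of Bayes' rule
posterior = unnorm / unnorm.sum()        # dividing by the (discrete) marginal normalizes the posterior

print(theta[np.argmax(posterior)])       # posterior mode is 0.75: the prior pulls the estimate away from 1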
Given this knowledge one can give a sense to phrases like:
In explicit regularization, independent of the problem or model, there is always a data term, that corresponds to a likelihood of the measurement and a regularization term that corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior, that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more addictive to the data or to enforce generalization (to prevent overfitting)
https://en.wikipedia.org/wiki/Regularization_(mathematics)
I wanted to understand what regularization is, and phrases like the one above did not make sense before checking the prior and posterior definitions. By regularization they mean choosing a target function to be minimized that depends both on the evidence X=x and on the prior P(θ). In deep learning there is often a problem of the θ parameters becoming bigger and bigger: they can diverge to infinity while being estimated with gradient descent. It is unrealistic to allow θ values to go to infinity, so you can introduce a term that depends on |θ| and penalizes such estimates.

min over f of: sum_i V(f(x_i), y_i) + λ R(f)

This is the definition from Wikipedia's page, but it does not make the dependence on θ explicit. It should be something more like:

argmin over θ of: sum_i (y_i - f(x_i; θ))^2 + λ |θ|
You search for the θ that minimizes a function composed of the prediction error plus the modulus of the parameters multiplied by a hyper-parameter lambda. Here y_i is the i-th result and x_i is the i-th input. The sum of squared differences and the use of the Euclidean distance are arbitrary, and many experiments have been made with different functions. The formula is also reminiscent of the Bayesian one, but you do not see any probabilities there.
In machine learning papers two regularization functions are often used, L1 and L2. L2 (also known as ridge regression) is the usual |θ| = sqrt(sum θ_i^2), while L1 (aka LASSO) is just sum(abs(θ_i)). You could also introduce L0, which just counts the number of nonzero parameters. LASSO and L0 help sparsity because they promote zeroing out parameters, and this in general is good because it makes models more explainable and robust: fewer dimensions will influence the model prediction.
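A quick numeric illustration of the three penalties on a made-up parameter vector:

import numpy as np

theta = np.array([0.0, -3.0, 0.5, 0.0, 2.0])

l2 = np.sqrt(np.sum(theta ** 2))   # Euclidean norm, used in ridge regression
l1 = np.sum(np.abs(theta))         # LASSO penalty, pushes parameters to exactly zero
l0 = np.count_nonzero(theta)       # just counts the nonzero parameters

print(l2, l1, l0)                  # about 3.64, 5.5, 3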

In “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” yet another regularization was presented and advocated because it was improving model performance: the MAX-NORM, which is just L2 with |θ| < c, an arbitrary constant (usually 3 or 4). Using it you constrain the θ parameters to stay within a hyper-sphere of radius c, so that the network parameters never get too high. If you want to know more about max-norm the original article is https://home.ttic.edu/~nati/Publications/SrebroShraibmanCOLT05.pdf , but it is a very technical math paper.
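Here is a rough sketch of how such a constraint could be applied after each gradient update; the one-column-per-neuron convention and c = 3 are assumptions on my side:

import numpy as np

def max_norm(W, c=3.0):
    # Project each neuron's incoming weight vector back inside a ball of radius c.
    norms = np.linalg.norm(W, axis=0, keepdims=True)        # one norm per output unit (column)
    factor = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # shrink only the columns whose norm exceeds c
    return W * factor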
Reading the regularization page you will also find other interesting points: how regularization is used in least squares estimation, the fact that early stopping can also be seen as a form of regularization, and much more.
What is “Dropout” in neural networks
Neural networks are composed of layers of neuron arrays and are trained using back-propagation. Since these structures are composed of thousands of elements and millions of connections, it is hard for structure and specialization of some neuron sets to emerge during training. Often neurons co-adapt: they provide the desired output, but by relying on odd relations in the input data. The “dropout” technique was introduced in 2014 and promotes neuron independence and specialization, improving prediction performance.
I am referring to this paper:
Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov
Journal of Machine Learning Research 15 (2014) 1929-1958
It is easy to explain what dropout is: suppose you have a 2-layer network, one layer composed of 100 neurons and the other of just 10. During normal training all the parameters of these neurons are tuned together. With dropout instead, you turn each neuron on or off independently of the others throughout the training. For instance the first minibatch of samples will be used to train neurons 1, 7, 8, 14, … while the second minibatch will be used for neurons 2, 5, 7, 11, 14, … It is as if each time you were training a different model, but this is not completely true because a neuron keeps its own parameters during the whole training. Let p_1 = 80% be the probability of keeping a neuron active in layer 1, and p_2 = 50% the probability of keeping a neuron active in layer 2. In this case you do not train all the 100 by 10 neurons for the whole training; at each step you train a subnetwork of about 80 by 5 neurons.

There is one point to be considered about the learned parameters. At the end of training, how do you use the model? You simply turn on all the units, BUT YOU SCALE THE LEARNED WEIGHTS BY P, because you trained many, many smaller models and their weights must be adjusted to maintain the right proportions. The final model has more neurons than those used during training; in the example above it is 100 by 10 instead of 80 by 5. You are creating a larger model by combining together an ensemble of many smaller ones. This differs from the usual ensemble approach, because the models trained here have the same architecture and share parameters between them. In classical ensembles, you put together models with different architectures trained independently. One note about Restricted Boltzmann Machines / Deep Belief Networks: you use p just on the hidden layer, not on the visible one; see the paper for details.
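A minimal NumPy sketch of the mechanism described above; the function name, shapes and random seed are my own choices. The paper scales the weights by p at test time, which is equivalent to scaling the activations as done here.

import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(h, p_keep, train=True):
    if train:
        mask = rng.random(h.shape) < p_keep   # Bernoulli(p_keep) sample: which units stay on for this minibatch
        return h * mask                       # dropped units output exactly 0
    return h * p_keep                         # test time: all units on, scaled so the expected activation matches training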
What are the effects of dropout on the trained model? The best thing I can do to explain it is to copy and paste a couple of pictures from the original paper.
[Picture from the paper: features learned by the hidden units without dropout and with dropout.]
In the picture above you clearly see that the filters learned with dropout are very specialized: they work on specific parts of the image. The filters learned in the classical way instead are blurry and pick up weird correlations between distant pixels.
Another important point is about activation and model sparsity: see for instance figure 8 in the original paper.
[Figure 8 from the paper: distribution of hidden unit activations without dropout and with dropout.]
By applying dropout we reduce the number of neurons used to compute the result: fewer neurons are involved, each with a higher activation. This is beneficial because it makes the model more robust: one output will depend on fewer inputs, so a small change in something unrelated will not make the prediction change. The model will also consume less energy to compute the result, because many neurons will not work and will just produce a 0 output.
The suggested value of p is 80% for the input layer and 50% for the hidden layers, but this is just another hyperparameter and you will need to do some experiments to tune it.
So why not apply dropout every time? Training will require more time! You are training many different models and putting them together; in the end you will have a better model, but it will take 2 or 3 times longer.
If you read the original paper you will find details on the math behind this approach. Randomly activating the neurons can be seen as a Bernoulli draw, and you could use other distributions, like a Gaussian, instead. Averaging the models together can be done in other ways than just scaling the parameters, but this method works quite well in the end. The whole method can be seen as a regularization approach, like adding noise in denoising autoencoders. You will also learn about the interaction with pre-trained models (see my previous post on deep belief networks), and see many, many examples of where dropout can be used and how much it increased the performance.
How a simple activation function changed deep learning
Activation functions are at the heart of each neural unit: once the weights are multiplied with the inputs and a bias is added, this linear value is passed through a nonlinear function, and the result becomes the input value for the next network layers. In the past just a couple of functions were used, the sigmoid and the hyperbolic tangent: these two functions are quite smooth at the extremes, and at some point it was discovered that a much simpler function could do the job better. This post summarizes the content of “Deep Sparse Rectifier Neural Networks”, which appeared in 2011 in the AISTATS proceedings, where Xavier Glorot, Antoine Bordes and Yoshua Bengio changed the deep learning landscape once again.
Below you see the definition of the most common activation functions (the original post also shows a plot of them here):
sigmoid(x) = 1 / (1 + e^(-x))
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
rectifier(x) = max(0, x)
The sigmoid gently goes from 0 to 1, while the tanh goes from -1 to 1; I had already encountered them while reading about Long Short-Term Memory (LSTM) networks, where they are extensively used to block or propagate memories of previous events. You can see there are a lot of exp functions to compute; by contrast, the hinge is trivial and does not require complex hardware. The hinge just says “below this hyperplane I do not care what happens”.
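In NumPy terms, just as a small illustration of the computational difference (the function names are mine):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # smooth, saturates at 0 and 1, needs an exponential

def tanh(x):
    return np.tanh(x)                 # smooth, saturates at -1 and 1, also exponential-based

def rectifier(x):
    return np.maximum(0.0, x)         # the hinge: a single comparison, no exponentials at all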
Coming back to the paper: in 2006 Hinton introduced his layer-by-layer pre-training procedure, which allowed training deep networks from a good starting point. The authors were investigating why the usual training procedure was not working well while Hinton's was better. They decided to experiment with the simple rectifier function and found that it gave very good results; so good that in many cases the pre-training procedure was not needed anymore.
One caveat is that rectifiers also need an L1 regularization: since the function to the right of 0 is purely linear, the model can be affected by unbounded-activation problems; the coefficients can grow and grow in value, as there is no penalty for using very high coefficients. The regularization keeps the learned parameters low by penalizing models with very high coefficients. Sigmoids and tanh, in that situation, simply saturate at 1.
Having an asymmetric function like the rectifier is not a limitation: in case something symmetric needs to be learned, it is possible to do it with a pair of rectifiers with the sign swapped. Some more neurons may be needed, but with a much simpler activation function. The paper makes an interesting comparison with the “leaky integrate-and-fire” model used to represent the activation of real biological neurons: this model also presents an asymmetry at 0, which makes tanh models implausible, since they force an unnatural antisymmetry into the learned model.
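A tiny toy check of the “pair of rectifiers with the sign swapped” idea (my own illustration, not from the paper):

import numpy as np

def rectifier(x):
    return np.maximum(0.0, x)

x = np.linspace(-3, 3, 7)
symmetric = rectifier(x) + rectifier(-x)       # equals |x|: an even (symmetric) function built from two rectifiers
antisymmetric = rectifier(x) - rectifier(-x)   # equals x: an odd (antisymmetric) function, what tanh imposes by design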
The paper also describes another point: the percentage of neurons producing something different from 0 is very low in nature (1 to 4%), while with sigmoids all of them are computing something around 0.5. A model with rectifiers is much closer to what happens in nature: when something is not interesting, 0 is propagated, no energy is required, and clear paths are selected in the network. Sparsity is an interesting point also because it is a sign of disentangling: only some changes to some variables affect the output, while in a very entangled model a small change in any variable affects the output. A disentangled model is therefore more robust and explainable: only some paths are taken to compute the result.
Another point is the vanishing gradient problem: as the rectifier does not saturate (its slope stays constant on the positive side), the gradient is propagated back to the previous network layers, and training is possible even with many, many layers stacked. A potential problem, conversely, is that when the rectifier outputs 0 there is no gradient propagated at all: it is just a flat zero line! But after some experiments comparing with a smoothed version of the rectifier, the authors realized that this is not really a problem. In some architectures, like denoising autoencoders for unsupervised learning, the softplus function or some normalization is needed to avoid this zeroing-out problem; see the paper for details.
The authors report the results of many experiments. On image recognition, the rectifiers were able to learn a very good model without the pre-training procedure, creating a good sparse model along the way. In tasks such as sentiment analysis, instead, the pre-training is still useful, even though it becomes less useful when many labeled samples are available.
The authors have done a great job in making this complex matter clear and understandable, I really encourage you to read the original article if you have time.