Priors, posteriors and regularization
While reading machine learning papers I have often seen these terms used. My goal this week was to find a paper on these concepts, but in the end I am using Wikipedia to understand them. This is a short summary to avoid forgetting them.

Let X be the result of one or more experiments; X is called the EVIDENCE. In machine learning we want to learn a model, based on the evidence, that can be used to predict new results. The model will be based on PARAMETERS, let’s call them θ. Remark: both are vectors, not scalars.
Now we can look at conditional probabilities: P(X | θ) is the probability of the evidence given the parameters, and it is called the LIKELIHOOD function. I encountered this term in a statistics class about estimation: you want to estimate the parameters given the evidence, so you search for argmax_θ P(X | θ), the θ that maximizes the probability of obtaining the evidence.
Reading https://en.wikipedia.org/wiki/Likelihood_function some details are highlighted: X is considered a random variable and the observation should be labeled x (a vector). It is then possible to distinguish between

L(θ | x) = P(X=x | θ)

and

P(θ | X=x).
The first one, L, is the probability of obtaining x given a specific θ. The second one is the probability that θ is the right parameter given the evidence. It is not possible to conclude that they are the same if you don’t have more knowledge of the process. The likelihood page gives the example of flipping a coin twice, with θ the probability of getting heads. θ can vary between 0 and 1; a fair coin has θ=0.5. You can plot a graph of L given that two heads have been observed: for θ=0.5 it is 0.5*0.5, and if θ=1 (always heads) then L=1*1. Actually L=θ*θ in this case, so the plot is just a parabolic curve, and it is not a probability! The area under the curve is 1/3, while a probability density should integrate to 1.
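To make this concrete, here is a minimal sketch in Python with numpy (my own choice of tooling, not something from the likelihood page) that evaluates L(θ) = θ² on a grid and checks the claims above:

import numpy as np

theta = np.linspace(0.0, 1.0, 1001)  # candidate values for P(heads)
L = theta ** 2                       # likelihood of observing two heads

print(L[theta == 0.5])      # [0.25] for a fair coin
print(L.mean())             # mean height over [0,1] ~ area under the curve: ~1/3, not 1
print(theta[np.argmax(L)])  # 1.0: the theta that maximizes L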

What I find very confusing about the notation is that you write L(θ | x) but the two symbols are actually reversed, P(X=x | θ), in the definition. So when you do maximum likelihood estimation you choose the most promising θ, and after two heads in a row you should pick θ=1 and not the θ=0.5 of a fair coin.
If you reverse the roles and consider P(θ | X=x) you are looking at the POSTERIOR, the probability of θ given that x has already been observed. If you consider just P(θ), unconditioned on the observation x, you are looking at the PRIOR. Applying the usual Bayes rule:

P(θ | X=x) = P(X=x | θ) * P(θ) / P(x)

the POSTERIOR is proportional to the LIKELIHOOD times the PRIOR. P(x) is called the MARGINAL and is the probability of observing X=x averaged over all possible θ values – a normalization term.

Not easy to remember, but at least the posterior is called posterior because it comes after x is known, and it is obtained from the prior. The Wikipedia page on posterior probability is https://en.wikipedia.org/wiki/Posterior_probability.
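As a numeric illustration, here is a sketch that applies the rule on a grid of θ values for the same two-heads experiment; the flat prior is my own assumption, picked only to keep the example simple:

import numpy as np

theta = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(theta)          # flat prior: every theta equally plausible
prior /= prior.sum()                 # normalize so it sums to 1 on the grid

likelihood = theta ** 2              # P(two heads | theta)
unnormalized = likelihood * prior    # numerator of Bayes rule
marginal = unnormalized.sum()        # P(x): sum over all theta values
posterior = unnormalized / marginal  # a proper distribution again

print(theta[np.argmax(posterior)])   # 1.0: with a flat prior, MAP == MLE

With a prior concentrated around 0.5 the maximum of the posterior would be pulled back toward the fair coin, which is exactly the stabilizing effect the regularization quote below talks about.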
Given this knowledge one can give a sense to phrases like:
“In explicit regularization, independent of the problem or model, there is always a data term, that corresponds to a likelihood of the measurement and a regularization term that corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior, that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses to be more addictive to the data or to enforce generalization (to prevent overfitting)”
https://en.wikipedia.org/wiki/Regularization_(mathematics)
I wanted to understand what regularization is, and phrases like the above did not make sense before checking the prior and posterior definitions. By regularization they mean choosing a target function to be minimized that depends both on the evidence X=x and on the prior P(θ). In deep learning there is often a problem of the θ parameters becoming bigger and bigger; they can diverge to infinity while being estimated with gradient descent. Since it is unrealistic to allow θ values to go to infinity, you can introduce a term that depends on |θ| and penalizes such estimates.

min_f [ Σ_i V(f(x_i), y_i) + λ R(f) ]

This is the definition from Wikipedia’s page, but it does not make explicit the dependence on θ. It should be something more like:

θ* = argmin_θ [ Σ_i (y_i − f(x_i, θ))² + λ |θ| ]

You search for the θ that minimizes a function composed of the prediction error plus the norm of the parameters multiplied by a hyperparameter λ. Here y_i is the i-th result and x_i is the i-th input. The sum of squared differences and the use of the Euclidean norm are arbitrary choices, and many experiments have been made with different functions. The formula is also reminiscent of the Bayesian one, but you do not see any probabilities in it.
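To see the effect, here is a small sketch of this objective for a linear model f(x, θ) = θ·x, minimized with plain gradient descent; the synthetic data and the λ value are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # 50 inputs x_i with 3 features each
y = X @ np.array([2.0, -1.0, 0.0]) + 0.1 * rng.normal(size=50)

lam = 0.1                                    # the hyperparameter lambda
theta = np.zeros(3)
for _ in range(2000):
    residual = X @ theta - y
    grad = 2 * X.T @ residual + 2 * lam * theta  # gradient of error + lam*|theta|^2
    theta -= 0.001 * grad

print(theta)  # close to [2, -1, 0]; the penalty keeps the estimates from growing

Note that the sketch penalizes the squared norm |θ|², which is the common choice because it keeps the gradient simple; penalizing |θ| directly would work too.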
In machine learning papers two regularization functions are often used, L1 and L2. L2 (used in ridge regression) is the usual norm |θ| = sqrt(Σ θ_i²), while L1 (used in the LASSO) is just Σ |θ_i|. You could also introduce L0, which simply counts the number of nonzero parameters. LASSO and L0 help sparsity because they promote zeroing out parameters, and this is generally good because it makes models more explainable and robust: only a few dimensions will influence the model’s prediction.
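Written out for a concrete θ (example values of my own), the three penalties are:

import numpy as np

theta = np.array([3.0, 0.0, -4.0, 0.0])

l2 = np.sqrt(np.sum(theta ** 2))  # L2 norm: sqrt(9 + 16) = 5
l1 = np.sum(np.abs(theta))        # L1 norm: 3 + 4 = 7
l0 = np.count_nonzero(theta)      # L0 "norm": 2 nonzero parameters

print(l2, l1, l0)  # 5.0 7.0 2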

In “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” yet another regularization was presented and endorsed because it improved model performance: the MAX-NORM, which is just the L2 norm constrained to |θ| < c for an arbitrary constant c (usually 3 or 4). Using it you constrain the θ parameters to stay within a hypersphere of radius c, so that the network parameters never get too large. If you want to know more about max-norm the original article is https://home.ttic.edu/~nati/Publications/SrebroShraibmanCOLT05.pdf , but it is a very technical math paper.
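As I understand it, the constraint can be enforced by projecting θ back inside the radius-c ball after each update; this is a sketch under that assumption, with c = 3 following the “usually 3 or 4” remark above:

import numpy as np

def max_norm_project(theta, c=3.0):
    """Rescale theta so that its L2 norm never exceeds c."""
    norm = np.linalg.norm(theta)
    if norm > c:
        theta = theta * (c / norm)
    return theta

theta = np.array([4.0, 3.0])    # |theta| = 5, outside the ball
print(max_norm_project(theta))  # [2.4 1.8], whose norm is exactly 3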
Reading the regularization page you will also find other interesting points: how regularization is used in least squares estimation, the fact that early stopping can also be seen as a form of regularization, and much more.