Archive for April 2023
Deep Belief Networks overview
This model has a vast literature and its roots in physics: it is not easy to approach, but it has given me many hints on how neural networks can be initialized and how generative models work.
A couple of weeks ago I was reading a paper on neural network initialization: there was a short reference to deep belief networks (DBN) and I wanted to know more, as I was searching for more theoretical work. I decided then to read:
A fast learning algorithm for deep belief nets
Geoffrey E. Hinton, Simon Osindero, Yee-Whye Teh
Neural computation, July 2006, https://dl.acm.org/toc/neuc/2006/18/7
The article is one of the most referenced on the subject, but it is very difficult to understand, because it requires already having a background in the terminology and the math used in DBNs. I then found an overview article, which gently introduces the subject:
Restricted Boltzmann Machine and Deep Belief Network: Tutorial and Survey
Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, Mark Crowley
arXiv:2107.12521v2 [cs.LG] 6 Aug 2022
This last article starts by reviewing the history of this model and comparing it to other models like Boltzmann machines. In the beginning there was the Ising model, a physical model created to explain the interaction between electrons with different spins. Having been born in the field of physics, this model is linked to concepts like entropy, energy and Hamiltonians. It is for this reason that the term energy pops up from time to time when reading the first paper; Boltzmann machines are said to be energy-based models.
But what are these Boltzmann machines, and how do they relate to neural networks and machine learning? Well, suppose you have two neuron layers: the visible layer and the hidden layer. Neurons in the visible layer are directly connected to the visible inputs – for instance, each pixel of an image. The first paper is about recognizing handwritten digits, so let's say the visible neurons are connected to the pixels of the input image. The hidden layer neurons are there to map the visible input to latent variables, for instance the digits you have to recognise. So the number of input neurons can differ from that of the hidden layer.
In reality, for the digit recognition problem, the first article proposes a 28×28 input grid connected to 500 neurons in the first layer, connected to a second layer of 500 neurons, connected to a 2000-unit layer, finally connected to a 10-neuron layer – one for each digit. By the way, this is a DBN and not a Boltzmann machine, which has just two layers.
Coming back to the Boltzmann machine, it does not just have links between visible and hidden neurons; it also has connections between visible neurons and connections between hidden neurons. To describe a BM you therefore need three matrices: one representing the weights of the connections between visible and hidden units, one for connections between visible units, and one for connections between hidden units. You will also need two vectors of biases, one for each neuron set.
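As a sketch of how these parameters fit together – the shapes and the exact form of the energy function are my own illustration, not taken verbatim from the papers:

```python
import numpy as np

n_visible, n_hidden = 784, 500  # e.g. 28x28 pixels, 500 hidden units
rng = np.random.default_rng(0)

# Boltzmann machine parameters: three weight matrices and two bias vectors.
W_vh = rng.normal(0, 0.01, (n_visible, n_hidden))   # visible-hidden weights
W_vv = rng.normal(0, 0.01, (n_visible, n_visible))  # visible-visible weights
W_hh = rng.normal(0, 0.01, (n_hidden, n_hidden))    # hidden-hidden weights
b = np.zeros(n_visible)                             # visible biases
c = np.zeros(n_hidden)                              # hidden biases

def energy(v, h):
    """Energy of a joint configuration (v, h); lower energy = more probable.
    A restricted Boltzmann machine simply drops the W_vv and W_hh terms."""
    return -(v @ W_vh @ h
             + 0.5 * v @ W_vv @ v
             + 0.5 * h @ W_hh @ h
             + b @ v + c @ h)
```

The probability of a configuration is proportional to exp(-energy), which is where the link to statistical physics comes from.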
Another original property of this model is that these links are bidirectional: you do not just propagate values from the input layer to the hidden layer, but also the other way back. This is a peculiarity of Markov networks; when you have directed links instead, you are dealing with Bayesian networks. The procedure used to draw samples from a BM is called Gibbs sampling: you start from a random point – not necessarily a valid one – and you sample one variable at a time, using the conditional probability of that variable given all the others. You repeat this for all the variables, and you iterate several times until stabilization. The math literature says that this process will not require too many iterations to reach stability.
As you may imagine, allowing interactions between hidden and hidden variables, and between visible and visible variables, complicates the model and may not be necessary in practice. We speak of a Restricted Boltzmann Machine when you only have one weight matrix, the one describing the interaction between visible and hidden variables. The links remain bidirectional – the same weight is used in both directions – and there are no self-connections, as a variable does not depend on itself. The coefficients are learned using maximum likelihood estimation.
It is also remarkable that these networks can be considered associative memories. The authors of the second paper refer to Hopfield networks in this context: each variable is thresholded by a coefficient theta – if its value is above theta it is set to 1, otherwise to 0 (or -1).
Coming back to RBMs, they have another important property: conditional independence of the variables, because hidden and visible variables have no direct connections among themselves. Each visible variable is independent of the other visible variables given the hidden ones, and the same holds for the hidden variables. This makes the formulas tractable and leads to simple products of probabilities; they are derived in the second paper. The second paper also explains how to concretely implement Gibbs sampling: it is an iterative generation of the probabilities of the hidden variables given the visible variables, followed by the generation of the probabilities of the visible variables given the hidden variables, repeated several times.
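The alternating procedure can be sketched in a few lines of numpy for an RBM with binary units – a minimal illustration, not the papers' exact implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c, rng):
    """One Gibbs sweep in an RBM with binary units.

    Thanks to conditional independence, p(h | v) and p(v | h)
    factorize, so each whole layer is sampled in one shot.
    """
    p_h = sigmoid(v @ W + c)                  # probabilities of hidden units
    h = (rng.random(p_h.shape) < p_h) * 1.0   # sample hidden layer
    p_v = sigmoid(h @ W.T + b)                # probabilities of visible units
    v = (rng.random(p_v.shape) < p_v) * 1.0   # sample visible layer
    return v, h

# Toy RBM: 6 visible and 3 hidden binary units, random weights.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

v = rng.integers(0, 2, n_visible).astype(float)  # random starting point
for _ in range(100):                             # iterate until "stable"
    v, h = gibbs_step(v, W, b, c, rng)
```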
Given this procedure, it is also possible to do something unexpected: what if, instead of starting from the visible variables (the pixels of the digit image), you start from the hidden variables? The hidden variables represent the latent classification, so you could fix the variable that says the class is an eight and look at which pixels are activated in the image after some iterations. It is in this way that RBMs can be used as generative models and, looking at the first paper, you will see some examples of machine-generated digits (figures 8 and 9).
But how does all of this relate to deep learning? After all, here we have just one layer of visible variables and one layer of hidden variables! Let's start with the inputs: they will be the visible layer. The first network layer on top of them will be the hidden layer. You can train this submodel independently from the other layers, as explained in the first paper. When this is completed you move up one layer: the hidden variables become the new visible layer, and the untrained layer above becomes the new hidden layer. You repeat this level by level until you reach the top layer of your deep network.
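The greedy layer-by-layer scheme could look roughly like this. The inner RBM training here uses contrastive divergence (CD-1), the common shortcut to the maximum-likelihood gradient; the sizes, learning rate and epoch count are illustrative, not the papers' settings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=5, lr=0.1, seed=0):
    """Train one RBM with contrastive divergence (CD-1)."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    b = np.zeros(n_visible)
    c = np.zeros(n_hidden)
    for _ in range(n_epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + c)                 # up: p(h | v)
        h0 = (rng.random(p_h0.shape) < p_h0) * 1.0
        p_v1 = sigmoid(h0 @ W.T + b)               # down: one reconstruction
        p_h1 = sigmoid(p_v1 @ W + c)               # up again
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        b += lr * (v0 - p_v1).mean(axis=0)
        c += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b, c

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pretraining: each trained hidden layer
    becomes the visible layer of the next RBM."""
    params, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden)
        params.append((W, b, c))
        layer_input = sigmoid(layer_input @ W + c)  # propagate activations up
    return params

# Toy data: 20 random binary "images" of 16 pixels, stacking a 16-8-4 DBN.
data = (np.random.default_rng(1).random((20, 16)) > 0.5) * 1.0
params = pretrain_dbn(data, [8, 4])
```

The resulting weights would then be used to initialize the corresponding layers of a feedforward network before fine-tuning.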
Reading the first paper – even if it is very complex and requires a lot of math – you will realize that this leads toward network training. It is actually a very good starting point: after it, you will have a partially trained network that does not suffer from the vanishing gradient problem. In a second phase you can use gradient descent to fine-tune the network and obtain better results. This was a major breakthrough in 2006 and opened the road to deep learning; before it, it was not practically possible to train a deep network.
So, to initialize the weights of a neural network you can use RBMs: they have a huge mathematical background justifying them and have proven their capabilities. There are many extensions referenced in the second article: dealing with multivariate variables, continuous variables, introducing time evolution… It has definitely been worth spending these two weeks reading about them.
Weight initialization in neural network training
The neural network training algorithm changes the weights associated with neural inputs and biases so that the network learns to reproduce the desired output. But how do you initialize all these weights before starting the training? Is there a way to initialize them to speed up the learning?
From my previous reading I understood that neural networks implement non-convex functions – functions that have multiple local minima. These minima can be a trap: if the training algorithm does not explore enough of the parameter space, the model will not evolve and will produce sub-optimal results. Ideally we would like the training to discover the absolute minimum of the function, so that it can produce the best possible results. We search for a minimum because it represents the smallest possible difference between predicted and training results.
The role of parameter initialization is therefore to choose where the minimum search will start in the parameter space. If we start the search close to the absolute minimum, the training will be quick and provide the best results. If we start from the wrong place, the training might not converge at all.
But is it really possible to do a good initialization? After all, we do a complex and long training precisely because it is not possible to figure out the right model directly. This week I read this paper:
A Weight Initialization Method Associated with Samples for Deep Feedforward Neural Network
Yanli Yang, Yichuan He
ICCDE ’20, January 4–6, 2020, Sanya, China © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7673-0/20/01
https://doi.org/10.1145/3379247.3379253
The authors acknowledge that the most widely used way to initialize the parameters is just a random choice: after all, if you do not know where the minimum is, any place is as good a starting point as another. They also mention other methods, like using stacked autoencoders or deep belief networks, but only in a short paragraph that did not satisfy my curiosity.
The authors suggest initializing a feedforward neural network using this procedure:
- First of all, you randomly draw values for all the parameters; let's call this set of parameters W0 – the weights at time 0
- You then train the network for just one epoch. At this point the parameters will have evolved into W1 – the weights at epoch 1. Notice that these weights incorporate some information from the training samples – not too much, as it is just one epoch.
- Restart the training with W* = a·W0 + b·(W1 – W0), where a and b are two coefficients, let's say a = 0.9 and b = 0.3
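The recipe above can be sketched in a few lines; `train_one_epoch` is a stand-in for one epoch of your usual training loop, and the values of a and b are just the illustrative ones mentioned above:

```python
import numpy as np

a, b = 0.9, 0.3  # illustrative coefficients

def init_weights(shape, train_one_epoch, seed=0):
    rng = np.random.default_rng(seed)
    W0 = rng.normal(0, 0.1, shape)    # step 1: purely random initialization
    W1 = train_one_epoch(W0)          # step 2: one epoch of normal training
    return a * W0 + b * (W1 - W0)     # step 3: blended restart point W*

# Toy usage: pretend one epoch of training halves the weights.
W_star = init_weights((3, 2), lambda W: 0.5 * W)
```

With this toy `train_one_epoch`, W* ends up at 0.9·W0 + 0.3·(-0.5·W0) = 0.75·W0: still mostly random, with a small data-driven correction.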
This W* therefore has a random component and a data-driven component. From their experiments, this can shorten training by hundreds of epochs. This is amazing, as W1 comes from just one epoch of training – actually, it puzzles me. Why is simply continuing the training from W1 worse than restarting from W*? The paper does not provide a theoretical explanation; it just concludes that more experimentation is needed. It reminds me of curriculum learning: you start training the model with simpler examples and then move on to more complex ones, trying to move the parameter search closer to the optimal minimum. With this initialization you maybe start moving in the right direction, but it seems just an intuition. I am quite unsatisfied with the lack of theory behind the presented results; in the following weeks I will look for more references to this technique to seek some confirmation.
AI Rewrites Coding
Do you want to have an overview of existing tools and some insight on how it works? Read this article on the ACM site: https://cacm.acm.org/magazines/2023/4/271227-ai-rewrites-coding/fulltext
Reading the text you will find references to projects at IBM, Google and Amazon. Existing solutions are limited to generating 15-20 lines of code for you; watching the video embedded in the page you can get an idea of how this will be improved in the future.
In about 20 minutes it explains how AI/deep learning can provide candidate programs for your problem, while a symbolic process analyzes and prunes those that will not lead to a correct solution. Really amazing!
Trying Kubeflow
This week I decided to focus a bit on learning Kubeflow instead of reading a research paper. During this year I have tried to follow Andrew Ng's advice: read at least one research paper a week – actually he suggests reading two, but I realize I don't have enough time for that. At least during these months I have felt happier, and I realized I am really learning new things.
Now I would like to put into practice some of the things I have learned, so I tried to approach Kubeflow. On paper it is very powerful: you can put it on any cloud provider and start cooking your stuff; in reality I feel like I bumped into a wall. Ouch, it hurts!
First you need to install a ton of operators and components into Kubernetes. OK, they provide the templates to do that, and you could probably install everything locally with a Minikube-like environment, but we tried to do it on Azure to see how it would really be. The thing is, you need someone to pay for it – you need a credit card in your account – and the big company puts a lot of constraints on how containers and networks should be deployed, which sites you can reach from your pods, and how you can connect to them.
I started thinking that anybody else in a start-up can just get a cloud subscription and install everything plain vanilla, while I had to struggle for days and ask for help. In the end I am not even sure all the issues are solved: only some modules start, and I can reach a UI, but only via a bridge server. Not really a productive way of working – especially because I just want to try the tool and understand whether it can be useful in the future.
Once in the UI, I started having a look at the MNIST example. I have to run the examples on Azure, and if you search the example sources you will see that Azure is not really that present. At least the MNIST example seemed clear: you have a notebook that prepares an image containing your model, and you deploy and run it. Easy, no? No.
First you need to do the Kubernetes set-up: create storage accounts, create secrets in a namespace, create a namespace to run the notebooks, configure a Docker registry… Also, the example says it has been tested with a specific image version, but searching for it I did not find it. And once Jupyter was working, I realized something was missing locally to get the kubectl API working. I had to figure out that you can set up a kubectl connection with the Azure command line tool, and that all the configuration and secrets go into a .kube/config file that you can then copy onto your pod.
Then I started running the code and I saw error messages about module version incompatibilities… the image I am using is probably newer than the example, so, well, I started changing things until something started working.
Now I am stuck with some Kubeflow Fairing issues: the code checks whether a secret exists, and the library used is not OK – it says the method is not present. Surely another version incompatibility issue; not a happy Sunday morning experience.
I started thinking that we should just choose an environment available as a service, like Azure Machine Learning Studio, and then focus on how to run the model anywhere on other cloud providers. Setting up a whole environment like Kubeflow is too complicated for one person in a few days; a more decent amount of time should be allowed. You should also make sure that people with expertise in the cloud provider and in security are available around you, because their help will be precious.