IBM Data Science training
Learn Python, Pandas, Scipy and Matplotlib quickly, with many practical examples.

For a while I had been reading about neural networks, mainly research papers and books. I think this has been very good because I learned a lot about successful architectures and how they really work. It allowed me to understand a bit of what’s behind ChatGPT and other deep learning tools.
It is good to have the theory, but at some point you would also like to experiment a bit and see what you can do by yourself. This seems difficult with deep neural networks because you have to pay for long trainings and run many experiments with different hyper-parameters. I also realized I lacked experience with many of the tools used: I come from C++ and Java development, while most machine learning tools are Python based, run in Jupyter notebooks and require math concepts I do not have.
For these reasons I started looking for some training material: having had a good experience with Coursera I decided to pay for one of the trainings available there: IBM Data Science. It is actually a group of ten smaller courses:
- Machine Learning with Python
- Tools for Data Science
- Python Project for Data Science
- Python for Data Science, AI & Development
- Applied Data Science Capstone
- Data Visualization with Python
- Data Science Methodology
- What is Data Science?
- Data Analysis with Python
- Databases and SQL for Data Science with Python
Somebody will not be happy seeing the IBM name, thinking it is just about IBM tools and services. This is not true: I did all the courses and the IBM-specific parts are very limited. DB2 or IBM Watson tools do appear during the training, but always as optional modules and in a non-invasive way. If you want you can do nearly all the labs with a local Jupyter instance on your laptop. In the end they have been really fair in presenting the content, and all the labs are well implemented. The cloud environment they proposed worked smoothly and was reliable.
The training is full of closed-question quizzes: you have to listen carefully to the videos, but the questions are not very difficult. Doing the whole training took me roughly 3 weeks at 8 hours a day. Of course it is a remote training and you can follow it little by little whenever you have time, but it is definitely not a small, quick training!
You can start with limited Python knowledge, and little by little you become proficient with Pandas, SciPy, NumPy, Dash, Jupyter… It has been very useful to me. The contents are mainly practical; I still need to find something more about the math needed to evaluate the quality of a model.
The final chapter, “Applied Data Science Capstone”, puts all the puzzle pieces together and makes you work on a case study where you do data collection, data wrangling, exploratory data analysis and data visualization, write Python code to create machine learning models (including support vector machines and decision trees), evaluate the results of those models for predictive analysis, and compare their strengths and weaknesses.
If you want to learn machine learning tools like Python, Pandas, SciPy and Matplotlib quickly, IBM Data Science is definitely something to try.
Vision Transformers (ViT)
The concept of the Transformer has been very successful in natural language processing applications, but it is not limited to that domain. With “AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE” Alexey Dosovitskiy, Lucas Beyer et al. show that transformers can be applied to image recognition. The paper was presented in 2021 at ICLR (International Conference on Learning Representations) and can be found on arXiv.
Usually CNNs (convolutional neural networks) are used for image processing, because they can learn local image features that simplify higher level tasks. The article reports that, when trained with a large amount of data (>14 million images), transformers can do as well as or better than CNNs. Learning local image features, a few pixels apart, is not that important after all.
In NLP a transformer operates on a stream of tokens; with images the authors have done something similar. Each image is divided into patches, and the patches are fed to the transformer. The image from the paper explains clearly what happens.

The image on the bottom left is divided into 9 patches that are given to the transformer as input, just like a phrase in NLP. The first token is special: it is not an image patch, but represents the image class to be learned. The patches pass through a linear projection (learned during training) that transforms them into D-sized latent vectors.
Positional encoding needs to be added to the patches, to allow the transformer to learn about inter-patch relations. According to the authors a complex two-dimensional encoding is not needed: a simple 1D scheme is enough. Images of different resolutions can also be handled; the transformer simply has to deal with a longer stream of patches.
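To make the patch-embedding step concrete, here is a minimal NumPy sketch (not the authors’ code: the projection and positional embeddings are random placeholders instead of learned weights) of how a 224×224 image becomes the token sequence fed to the encoder:

```python
import numpy as np

# Toy sketch: turn a 224x224 RGB image into the token sequence a ViT consumes.
# Sizes follow the ViT-Base/16 description: 16x16 patches, hidden size D = 768.
patch, D = 16, 768
image = np.random.rand(224, 224, 3)                     # stand-in input image

n = 224 // patch                                        # 14 patches per side
patches = image.reshape(n, patch, n, patch, 3).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(n * n, patch * patch * 3)     # 196 flattened patches

W_proj = np.random.randn(patch * patch * 3, D) * 0.02   # linear projection (learned in reality)
tokens = patches @ W_proj                               # 196 x D latent vectors

cls_token = np.zeros((1, D))                            # special class token prepended
tokens = np.vstack([cls_token, tokens])                 # 197 x D

pos_embed = np.random.randn(tokens.shape[0], D) * 0.02  # 1D positional embeddings (learned in reality)
tokens = tokens + pos_embed                             # ready for the transformer encoder
print(tokens.shape)                                     # (197, 768)
```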
Vision Transformers are usually referred to with a notation like ViT-Base, ViT-Large or ViT-Huge; this refers to the number of layers used. In the paper a ViT-Base has 12 layers and uses a hidden size D of 768, with 12 attention heads, giving 86 million parameters. The ViT-Huge has instead 632 million parameters. Sometimes the size of the input patch is also specified: ViT-L/16 means that a large vision transformer is used with 16×16 image patches. To train such big models it is important to use a huge quantity of images, 300 million in the case of ViT-Huge models, otherwise their performance will be inferior to other, simpler models. Figure 3 in the paper shows that the corpus size is important to achieve good results.
Zero-shot learning
While reading the “Joint-Embedding Predictive Architecture (JEPA)” paper I found that Vision Transformer (ViT) models are used as model building blocks. I then wanted to understand what attention is, so I read other papers and summarized them in these 2 posts: “Paying attention, but to what” and “Transformer: attention is all you need“. This time I searched more about the model implementation and I found this article about OpenAI CLIP: a model that can be instructed in natural language to perform a great variety of classification benchmarks on images, without directly optimizing for the benchmark’s performance. This is one example taken from the CLIP web page:

In the above example CLIP has been trained using pairs of images and labels taken from the web: for instance it has been given a dog picture paired with the “Pepper the aussie pup” label, together with 400 million other image-text pairs. The learned text-to-picture model can then be used to predict that a new unseen image belongs to the “a photo of a dog” class rather than the “a photo of a plane” class. Given “a photo of a dog” and “a photo of a plane”, CLIP creates two internal representations of these phrases and matches them with the features extracted from a new input image. It chooses the maximum-probability phrase as the correct one. The purpose is to have a system that can be instructed with natural language to perform a wide range of tasks on images.
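As a rough sketch of this zero-shot scoring step (this is not the real CLIP code: the two encoders below are random stand-ins, only the cosine-similarity-plus-softmax scoring reflects how CLIP picks the best phrase):

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for CLIP's real image and text encoders (here: random vectors).
rng = np.random.default_rng(0)
def image_encoder(image):
    return rng.normal(size=512)

def text_encoder(text):
    return rng.normal(size=512)

labels = ["a photo of a dog", "a photo of a plane"]
image_features = l2_normalize(image_encoder("new_unseen_image.jpg"))
text_features = l2_normalize(np.stack([text_encoder(t) for t in labels]))

# Cosine similarity between the image and each candidate phrase, turned into
# probabilities; the highest one is the predicted class.
probs = softmax(100.0 * text_features @ image_features)
print(dict(zip(labels, probs.round(3))))
```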
The article is very long and it will require reading a lot of references to understand it fully; for the time being it guided me in learning about the Zero-shot learning (ZSL) concept.
In supervised machine learning you have a set of input examples paired with class labels. Creating such input sets is very expensive and has one limitation: you cannot deal with new classes unseen at training time, and you deal poorly with classes that have very few instances. With Zero-shot learning you want to use a pre-trained model on a new, different set of classes not known at training time. For instance you want to use a model that has been trained to recognize “horses” and “stripes” to recognize zebras: it should be possible because the system already has some knowledge about the two concepts and you could define a zebra as “a horse with stripes”. This is called zero-shot learning because the original model was never given a zebra picture as input during training. Similarly you can define One-shot and Few-shot problems, where the original model has been exposed to just a few pictures of that class. Of course this concept is not limited to image recognition and you can think about zero-shot problems for text or audio processing models.
For ZSL to be possible the model must be able to deal with reusable features that can be composed to describe the new target classes that you want to recognize. CLIP uses natural language to convey this kind of features.
As ZSL uses a different set of classes at training and at classification time, it can be seen as a case of Transfer learning: the knowledge built on one task is re-purposed on another, different task, avoiding an expensive training from scratch. This is especially important in domains, such as image recognition, where training a new model is particularly expensive because you need to process millions of images. Here the classification domains are different at training and at run time, so ZSL is a case of Heterogeneous transfer learning.
Usually transfer learning is achieved by taking an original deep-learning model and removing some of the top layers: for instance those that are fully connected, followed by the soft-max stage. These layers are then replaced with new ones that are trained on the new “transfer” task, while all the other layers remain untouched (frozen). A model can finally be fine-tuned to a task, changing all layer weights as a final training step.
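A minimal sketch of this recipe, assuming a Keras/TensorFlow setup (the base network, layer sizes and class count are just examples):

```python
import tensorflow as tf

# Take a pretrained network, drop its fully connected top + softmax,
# freeze what remains and train new layers on the transfer task.
base = tf.keras.applications.ResNet50(include_top=False,   # remove the top layers
                                      weights="imagenet",
                                      pooling="avg")
base.trainable = False                                      # freeze the pretrained layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(128, activation="relu"),          # new layers for the transfer task
    tf.keras.layers.Dense(5, activation="softmax"),         # e.g. 5 new target classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Optional final fine-tuning step: unfreeze everything and train with a small learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy")
```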
If you want to learn more about transfer learning you can read the “transfer learning guide“. If you are interested in ZSL you can read the “zero shot learning guide“, which provides a classification of different ZSL approaches and the description of many real applications of Zero-Shot learning.
Transformer: Attention is all you need
I have spent some time reading about Transformers, a neural network architecture first described in the paper “Attention is all you need”, presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017) by researchers at Google and the University of Toronto. The original paper is just 9 pages, but many concepts described there require further investigation.
During the past weeks I was learning about automatic translation: reading papers about LSTMs (long short-term memory networks) and GRUs (gated recurrent units) used to compute an internal state h, which is then used by a second neural network layer (the decoder) to write the text translation. These systems are effective but they are serial: they scan the source text from the beginning to the end, step by step, to compute intermediate states. This limits their scalability; ideally we would like to process the inputs in parallel to achieve higher computing performance.
For instance in “Convolutional Sequence to Sequence Learning” by Jonas Gehring, Michael Auli, David Grangier, Denis Yarats and Yann N. Dauphin of Facebook AI Research, a different approach is followed. Convolutional networks can analyze overlapping sub-sequences of the input text. Convolution layers can also be stacked one on top of the other, so that higher layers can access information about a wider and wider set of original inputs. If the first layer considers 3 inputs at a time, and the second layer considers 3 outputs of the first layer, then units in the second layer actually have access to 5 input tokens. The memory state computed by the GRU or LSTM is replaced by the capability of looking at many inputs simultaneously. The results reported in the paper have been obtained with 15 layers in the encoder and 15 layers in the decoder (see paragraph 5.1), certainly needed to process long phrases.
This second paper also contains 2 other interesting points: the need for position encoding and the importance of attention.

The convolution does not appropriately take into account the position of a token in a phrase: words at the beginning play a different role than words at the end. To solve this issue they used the following approach: each input word is mapped to an embedding representation, a vector of some length f. For each position in the input sequence they also compute a position vector of the same length f that is added to the original embedding. In this way the same input vector carries information on both the token and its position. It seems quite weird not to just keep the position as some new dimension in the input vectors, but according to them it is an effective approach (as shown in the following table).

The first paper further describes how position embeddings can be generated, with a really surprising formula based on sinusoids.

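My transcription of the sinusoidal formula from the paper, where pos is the token position and i indexes the embedding dimensions:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```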
Using the relative position in the phrase, or the absolute position, does not work well: the variance of these vectors keeps changing for different lengths and this is not good for learning. The sinusoids have the advantage of carrying information on both the absolute and relative position, keeping the variance bounded. A gentle introduction to positional encoding in transformer models provides many useful pictures and code examples to understand the concept.
The second important point is the impact of attention. Attention requires a computational effort quadratic in the input length; in the case of translation this is not a bottleneck, and the Facebook team reported learning performance degraded by just 4% when introducing attention in 4 layers. Table 5 of the paper shows that the more attention layers you introduce, the more the translation score improves. With attention in 5 layers they achieve a BLEU score of 21.63, while using only one attention layer gives scores between 20.24 and 21.31.
The Google team decided to introduce an architecture without recurrent neural networks or convolutional neural networks: the results they provide show that the attention mechanism alone is enough to obtain very good results.

The encoder and the decoder are composed of 6 identical layers like those shown above. The inputs and the previous outputs are passed in after adding the positional encoding, the sinusoidal function described a few paragraphs above. Each encoder unit is composed of 2 sublayers: one uses attention with residual connections, the other a feed-forward network, again with residual connections. A residual connection means taking the input and adding it to the new component calculated by the neural network; each unit can forward the input as is, or be trained to change it a bit through the neural component. The decoder units are more complicated because the past outputs have to be processed with attention, and the result is then mixed with the inputs. Also, the output must be masked because the decoder must use only the previous outputs to compute the new value.
The multi-head attention component is the core concept, but reading the paper was not enough to understand it well. You can compute different attentions in parallel, and each one will deal with a different aspect of your data. To understand it, it is better to watch the video “Getting meaning from text self attention step by step“: the video is very well done and also describes the BERT model. Different attentions are calculated in parallel; it is as if the model could access many different simpler models, and mixing them together makes it possible to achieve outstanding results. In the paper, for instance, 8 parallel attention “heads” are used.
Each input word is associated with an embedding vector describing its meaning. These embedding vectors are compared with all the others present in the input phrase, giving scores that are then normalised by the softmax operation. These values are the attention scores, which are then used to create the new embedding vectors used by the next layers.
The paper also uses the terms keys, values and queries when speaking about attention. The origin of these terms is not explained, and it was necessary to search a bit to get an explanation. This conversation on Stack Exchange provides some clarity: the terms come from the search engine world, where the query is the input coming from the user and the keys are indexed elements that allow finding some values – the links shown in the search engine results. In the end this is not really important; it is better to watch the “attention step-by-step” video to get the idea of how it works. When the attention is calculated, the embeddings for each word are not all the same: there can be query embeddings, key embeddings and value embeddings for each word. This gives the model great flexibility, and since the model has multiple heads, there will be multiple key-query-value vectors for each word in each layer.
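To fix the ideas, here is a minimal NumPy sketch of a single attention head (the paper’s scaled dot-product attention); a multi-head layer repeats this with different learned projections and concatenates the results. The weights here are random placeholders, not trained ones:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 512, 64
X = np.random.randn(seq_len, d_model)          # one embedding vector per input token

W_q = np.random.randn(d_model, d_k) * 0.01     # learned projections (random here)
W_k = np.random.randn(d_model, d_k) * 0.01
W_v = np.random.randn(d_model, d_k) * 0.01

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # query, key and value vectors per token
scores = Q @ K.T / np.sqrt(d_k)                # how much each token attends to each other token
attention = softmax(scores, axis=-1)           # normalised attention weights
output = attention @ V                         # new embeddings: mixtures of the value vectors
print(output.shape)                            # (5, 64)
```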
Concluding with author’s words “For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.“
Paying attention: but to what?
One technique for generating language translations using neural networks consists in using two stages, one called the encoder and one the decoder. The encoder synthesizes the input text into a fixed-size vector, the context, that is then used by the decoder to decide the next word to be produced in the translation. It is intuitive that the size of the vector will limit the translation quality, but actually we do not need to pay attention to the same parts of the input text for each word we are going to produce. The translator will have to pay attention to singular-plural agreement, pronouns, etc. In “NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE” Dzmitry Bahdanau, Kyung Hyun Cho and Yoshua Bengio show how the decoder can focus on different parts of the source text to improve the translation quality.
The model has been trained on a set of English and French phrases of different sizes, 50 words maximum in the paper. A comparison plot in the paper shows that the new model is able to translate well even 60-word phrases, with no degradation as the phrases become longer.

How is it possible that the model focuses on different things while writing the text translation? Let’s look a bit into the formulas. The input text is composed of x1, x2, x3… tokens (words or punctuation). The input encoder produces a hidden state h1, h2, … associated with each single input token. These h elements are of fixed size. Usually just the last h element is used by the decoder to decide the next output word y, but we can think about using a more complicated state.


Instead of using the last known h, we can use a function of all the h states; here c stands for the context and i is the index of the current output word y. The context used for the output word yi will depend on all the h encoder states, weighted with the alpha weights. Here j stands for the index in the input text x. The important thing is that the context depends on the position i. The decoder, which produces the translation, will use this context ci, the last produced word y and its own internal state s to decide the next output.


si is the internal decoder state. Looking at the end of the paper we see how the alpha weights can be computed.



Here a is the alignment model; v, W and U are learned by the neural network. For the output word yi, the context ci depends on the hidden states of all the inputs, but selects some of them as the most prominent using the alpha i,j weights. The alpha coefficients depend on the decoder state s and on the encoder states h. Notice that they depend only on the last s state, but on all the h states – some of which will be more important than others.
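Putting the pieces together, my transcription of the formulas from the paper is:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
e_{ij} = a(s_{i-1}, h_j) = v^{\top} \tanh(W s_{i-1} + U h_j)
```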
Another interesting thing in the paper is that they process the input text x bidirectionally. So they do not just compute h left to right, from the first to the last token, but they also consider the opposite direction, from the end of the text to the beginning. I suppose this is because not just the words before the input token are important, but also those that follow it.

The paper appendix reports many details about the formulas, showing the activation functions used, the fact that Gated Recurrent Units were used, details on the learning procedures and so on. Useful to understand what these abstract functions f, g and a really are.
RISE: Randomized Input Sampling for Explanation (of black-box models). Why a black sheep is a cow
I recently attended a presentation about machine learning explainability. The team from Sopra-Steria was presenting their work on using submarine sonar sensors to detect internal equipment failures. Each machine in the submarine emits some noise; by analyzing these noises they wanted to detect when a component was starting to fail, for instance a defective ball bearing in a pump. Once the model was built and shown to obtain good prediction scores, they faced many questions coming from navy engineers. They had studied the field for years, had a lot of background and wanted to understand why the neural network was giving a specific result. A difficult challenge they solved using RISE.
RISE: Randomized Input Sampling for Explanation of Black-box Models
Vitali Petsiuk, Abir Das, Kate Saenko
https://arxiv.org/abs/1806.07421
Let’s leave the submarine sound world and come to the paper, which is instead centered on image classification explainability. Look at the picture below:

Here the question was: “why in this picture does the AI model detect a sheep and a cow, and not just sheep?”. As these kinds of models have millions of parameters, understanding why from that point of view is impossible. The authors used a black-box approach, which produced the 2nd and 3rd pictures, showing that the model is unable to recognize the black sheep as a sheep. As the approach is black-box, it can be applied to any model, not just neural networks.
The idea is surprisingly easy to explain. Given the original image we have some classification probabilities: 26% sheep and 17% cow. Let’s focus just on the cow probability; what happens if I hide a patch of the original image and reapply the same AI model? I will obtain a different probability: let’s say 16.9% if I hide a part of the water, and 15% if I hide the black sheep’s legs.
If we repeat this mask-and-evaluate loop many, many times, we can do a pixel-by-pixel average and decide that some pixels are more important because they drive up the probability. In the end we can paint the more important ones in red and the others in blue, obtaining the interesting picture above.
Of course I am over-simplifying the problem: how many times do I have to do this? How big must the picture patches be? How do I mask the image: turning the pixels to gray? to black? blurring them? Turning the pixels to black and using a sort of sub-sampling grid to decide where to put the patches seems to be the better approach.
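Here is a rough NumPy sketch of the mask-and-average idea (it is not the authors’ implementation: the real RISE upsamples the random grids more smoothly with random shifts, and `model_prob` is a stand-in for your classifier):

```python
import numpy as np

def rise_saliency(image, model_prob, n_masks=2000, grid=7, p_keep=0.5):
    """Average many random binary masks, each weighted by the class probability
    the model gives to the masked image. `model_prob(img)` is assumed to return
    the probability of the class of interest."""
    H, W = image.shape[:2]
    saliency = np.zeros((H, W))
    for _ in range(n_masks):
        # small random grid, upsampled to image size (nearest neighbour for simplicity)
        small = (np.random.rand(grid, grid) < p_keep).astype(float)
        mask = np.kron(small, np.ones((H // grid + 1, W // grid + 1)))[:H, :W]
        prob = model_prob(image * mask[..., None])   # hide part of the image, re-run the model
        saliency += prob * mask                      # important pixels keep the probability high
    return saliency / n_masks

# Usage with a made-up model that simply likes bright pixels:
fake_model = lambda img: img.mean()
heatmap = rise_saliency(np.random.rand(224, 224, 3), fake_model, n_masks=100)
```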
To evaluate and compare RISE with other methods (such as LIME) the authors presented the “deletion” metric. Look at this picture:

On the x-axis there is a measure of how much of the original image has been hidden before applying the AI model; on the y-axis a measure of the classification probability. Removing a very small part of the image, but from the importance hot-spot, makes the probability drop. It means that the RISE method is doing well at identifying the hot-spot.
A complementary metric can be introduced by reversing the approach: how much the probability rises when giving back more and more pixels; this is the insertion metric.
To conclude: you can obtain nice images explaining the hot-spots that made a class be chosen, but you have to evaluate the model on thousands of “altered” inputs for a single input instance. In the submarine case, a 30-second sound track needed half an hour of computation to provide an explanation.
Joint-Embedding Predictive Architecture (JEPA): efficient learning of highly semantic image representations?
Some weeks ago I saw in a Yann LeCun thread on LinkedIn a note on this new paper from Meta: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture, by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas, available on the Meta AI site. The paper is about learning highly semantic representations, but what does that actually mean? First of all JEPA uses the self-supervised learning approach: it learns to capture relationships between its inputs, in other terms it learns one part of the input from another part. More concretely, JEPA is trained on a set of images; each image is split into one context and some targets. Targets are rectangular regions of various sizes that do not overlap with the context (otherwise it would be too easy to predict them). Using the context, JEPA learns a context embedded representation that predicts well the embedded representations of its targets. The key point here is that JEPA does not try to predict pixel by pixel what the target’s content is, but rather some low-dimensional new representation. Working on this embedding representation makes JEPA much faster than other techniques, and more semantic too.

The authors included one picture that helps compare JEPA with other existing techniques:

In a joint-embedding architecture two inputs are encoded into embedding representations s, and the system learns to use similar representations for similar inputs. Conversely, in generative architectures an image is encoded and then a decoder, with some control input z, predicts pixel by pixel the other image y. JEPA is the third picture: both the original and the target image get encoded into the embeddings s, and the system learns to represent similar inputs with similar s coordinates.
Other invariance-based methods exist for this kind of task, but they require providing hand-crafted pairs of similar images for training. As you saw, JEPA just asks for one single image and randomly generates the context and the targets, which is much more manageable.
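As a toy sketch of the training objective (not the paper’s code: tiny linear layers stand in for the ViT encoders and for the EMA target encoder, and the predictor is not conditioned on target positions as in the real I-JEPA):

```python
import torch
import torch.nn as nn

dim, emb = 768, 128
context_encoder = nn.Linear(dim, emb)
target_encoder = nn.Linear(dim, emb)      # in the paper this is an EMA copy of the context encoder
predictor = nn.Linear(emb, emb)

context_patches = torch.randn(16, dim)    # visible context block
target_patches = torch.randn(4, dim)      # masked target block (does not overlap the context)

with torch.no_grad():                     # targets are encoded without gradients
    target_repr = target_encoder(target_patches)

context_repr = context_encoder(context_patches).mean(dim=0)      # summary of the context
predicted = predictor(context_repr).expand_as(target_repr)       # predict the target embeddings

loss = ((predicted - target_repr) ** 2).mean()   # distance in embedding space, not pixel space
loss.backward()
```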
Working on semantic representations also makes JEPA much faster; in the paper’s figure 1 you can see its performance compared to other approaches such as iBOT, CAE, data2vec etc. We are anyway speaking of thousands of GPU training hours.
Once JEPA is pretrained, it can be reused as a building block to implement specific tasks such as object counting in a scene, depth prediction, etc. It can also be used as input to a generative model, to allow visually comparing the targets with some fakes based on the embedding representation, as in the picture below:

The paper provides many details on how JEPA was implemented and trained, and many references to other approaches, in just 17 pages.
Principal Component Analysis – reducing data dimensionality
Your data may have been collected using many dimensions, but do all of them convey useful information? There are some techniques that can help in reducing the number of dimensions to handle – thus simplifying the problem and speeding up the analysis. Here I report some links to pages about Principal Component Analysis (PCA), Correspondence Analysis and Multiple Correspondence Analysis.
Let’s start from Principal Component Analysis (PCA). I found a reference to it in the t-Stochastic Neighbor Embedding paper. In that context it was used to reduce the number of dimensions to analyze, as that technique has an n² complexity and it is therefore important to limit the number of dimensions used.
PCA was introduced by Harold Hotelling a long time ago, in 1933, and you can have a look at the original article on hathitrust.org: Analysis of a Complex of Statistical Variables into Principal Components. This method works only for numerical continuous dimensions; there is no way to analyze categorical data with it. The method applies only linear transformations, so it has limitations, but it is widely used.
Instead of reading the long original paper, I found some pages that explain how it works in simple terms: A Step-by-Step Explanation of Principal Component Analysis (PCA). The core idea is to linearly combine the data axes and find a new axis/dimension that captures the maximum of the data variance. The process is then repeated over and over again; at the end you have an ordered list of orthogonal axes, each associated with how much variance it captures. You can then decide to keep only the dimensions that cover x% of the initial variance, to simplify the model. You can also discover constant relations in your data, by looking at dimensions with little or no variance. However, the new dimensions will be a mix of the original ones, so they will not be as easy to interpret as before.
Notice also that PCA requires standardizing ( (value − mean)/std-deviation ) the data in each dimension: PCA is sensitive to high values and outliers, and you do not want to get fooled by that. Once that is done, PCA computes the data covariance matrix and its eigenvectors and eigenvalues. The eigenvectors are the new dimensions, and the eigenvalues tell you which are the most important dimensions (it is the absolute value that matters, negative values are possible). So instead of searching for the most important dimensions one by one, you obtain them with an efficient math procedure; the one-by-one dimension procedure above was just to simplify the understanding.
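A minimal NumPy sketch of these steps on made-up data:

```python
import numpy as np

X = np.random.rand(200, 9)                              # 200 samples, 9 original dimensions

X_std = (X - X.mean(axis=0)) / X.std(axis=0)            # standardize each dimension
cov = np.cov(X_std, rowvar=False)                       # 9x9 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)         # eigh: the covariance matrix is symmetric

order = np.argsort(eigenvalues)[::-1]                   # sort by explained variance
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
explained = eigenvalues / eigenvalues.sum()
print(explained.round(3))                               # share of variance per new axis

k = 2                                                   # keep only the top components
X_reduced = X_std @ eigenvectors[:, :k]                 # recast the data onto the new dimensions
```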
As a final step, once you have selected the few dimensions you want to use, you will have to recast the original data points onto the new dimensions. You will then have the new data to be used during machine learning. See step 5 in A Step-by-Step Explanation of Principal Component Analysis (PCA).
“Principal component analysis: a review and recent developments” is a longer read with equations, and it contains an interesting example. Kuehneotherium is a prehistoric animal; some teeth fossils have been found and measured, and the page explains how PCA can be used to get insight from the data.
Kuehneotherium is one of the earliest mammals and remains have been found during quarrying of limestone in South Wales, UK [12]. The bones and teeth were washed into fissures in the rock, about 200 million years ago, and all the lower molar teeth used in this analysis are from a single fissure. However, it looked possible that there were teeth from more than one species of Kuehneotherium in the sample.
Principal component analysis: a review and recent developments
The input data set had 9 dimensions, but after applying PCA they saw that just 2 dimensions can explain 78.8% and 16.7% of the data variance… the other dimensions are not really needed. The page also introduces a type of plot called a biplot: take the 2 most important dimensions and project all the data points onto this plane, so you can see how the data are arranged (if 2 axes explain enough variance). In this plane you can also plot vectors that represent the original data dimensions, and their orientation will give you insight into which variable is more correlated with the others. Visit the site to get more information about biplots: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792409/figure/RSTA20150202F2/

PCA can’t handle categorical data, but for this kind of problem another interesting technique exists: Correspondence Analysis (CA). To apply this technique we start from a contingency table: for instance a frequency table with rows representing some brands and columns representing how often customers associate that brand with a feature (e.g. Valentino – luxury). Correspondence Analysis: What is it, and how can I use it to measure my Brand? (Part 1 of 2) is a nice article providing the feature/brand example. In their example 5 imaginary soda brands are associated with 3 qualities: tasty, aesthetic, and economic.
The CA method requires a few steps: calculate the observation proportions (number of times brand A has been associated with tasty / all the observations). Once that is done you calculate the mass: just the sum of proportions for a single brand or a single quality. With this information you can calculate the expected proportion, which is the product of a brand weight and a quality weight, and compare it to the actual observation (the difference is called the residual). These numbers will be very small, but dividing by the expected proportion gives you something much easier to understand: for example, brand A is seen as associated with tasty 30% more than expected.
This is useful but not really the same as PCA; PCA was helping you drop dimensions, while so far we just have a way to interpret column and row relations in the contingency table. The following step is to apply the Singular Value Decomposition to the data. At this step the data are cast onto dimensions that capture decreasing amounts of variance, and each brand and each quality gets some coordinates. Just keeping the 2 most important dimensions lets you plot the brand positions and the quality positions in 2 dimensions: you will visually see how closely a brand is associated with tastiness, cheapness,…
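A rough NumPy sketch of the whole procedure on a made-up 5 brands × 3 qualities table (the numbers are invented, and this is the textbook CA recipe rather than the article’s exact code):

```python
import numpy as np

N = np.array([[30, 12,  8],
              [10, 25, 15],
              [ 5,  8, 40],
              [20, 20, 10],
              [12,  6, 30]], dtype=float)    # brands x qualities contingency table

P = N / N.sum()                              # observation proportions
r = P.sum(axis=1)                            # row masses (brands)
c = P.sum(axis=0)                            # column masses (qualities)
expected = np.outer(r, c)                    # expected proportions under independence
residuals = P - expected                     # how far each cell is from "no association"

# standardized residuals, then SVD to get the new dimensions
S = residuals / np.sqrt(np.outer(r, c))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

brand_coords = (U * sing) / np.sqrt(r)[:, None]       # rows plotted on the first 2 axes
quality_coords = (Vt.T * sing) / np.sqrt(c)[:, None]  # columns plotted on the same axes
print(brand_coords[:, :2], quality_coords[:, :2])
```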

See for instance also the article How do I judge brands performance… for more insight on how to read this kind of plot.
When data can be arranged not just in tables, but in cubes or structures with a higher number of dimensions, you can apply Multiple Correspondence Analysis (MCA). But just taking some time to read about CA vs MCA makes clear that MCA is much more complex to use, and in reality CA is used much more.
What to do when you have mixed categorical and numerical variables? It does not seem an easy task: searching a bit I found FactoMineR: this R package is able to do analysis, for instance, on a set of wines characterized by categorical values such as brand combined with an average acidity score (the average vote provided by judges). Honestly I did not look much into it; just in some slides there are notes about comparing covariance matrices of different sets of categorical variables.
Is it possible to represent high-dimensional data in 2D? More details on t-distributed Stochastic Neighbor Embedding (t-SNE)
Last week I read an article on Stochastic Neighbor Embedding; it was quite old and a more recent and longer one was available. I decided to complete the reading to understand a bit better how it is possible to represent in a 2D plot data that have 10 or more dimensions.
This week’s article is Visualizing Data using t-SNE by Laurens van der Maaten and Geoffrey Hinton, edited by Yoshua Bengio, which appeared in the Journal of Machine Learning Research: here is the link to the page and the pdf https://www.jmlr.org/papers/v9/vandermaaten08a.html
Using this technique you can, for instance, take the pictures in the MNIST data set (handwritten digit pictures) and, from the pixel data alone, place each picture on a plane with all the sevens together, close to the nines, etc… something quite magical.

Well, it is possible, but it will require a lot of CPU to be performed: it is really complex to achieve and requires optimizing a non-convex function, which is a painful task that requires trying many parameters before getting a good picture. Anyway, it is quite magical to see the pictures arranged together by visual similarity.
How does it work? Well, it is all about the probability that one data point will pick another one as its neighbor, because they are near in the hyper-space. Yes, you need to compare each datapoint with all the others, so you have an n² complexity, and you will not be able to plot a map of a million datapoints.
Each point X will have an associated position x in the 2D picture: ideally we would like that when the X points are far apart, the x points are too, and vice versa. A phrase in the paper is illuminating: “in 10 dimensions it is possible to have 11 equidistant neighbors, and there is no way to model this faithfully in a two-dimensional map”. So what t-SNE does is privilege a bit the goal of keeping similar points close, while the other goal of keeping dissimilar points apart is less important. The idea is that in the hyper-space you have clusters of data, and you want to see them in the 2D plot.
Coming to math, what is the difference between SNE and t-SNE?

p is the probability that point i chooses j as its neighbor in SNE. Notice that p_ij is not equal to p_ji; with t-SNE this is simplified using:

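My transcription of the formula:

```latex
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
```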
Where n is the number of datapoints. Doing this operation also imposes a minimum value on p_ij, which is good for handling cases such as outliers that are far from all the other points. The σ is chosen point by point, using a meta-parameter called perplexity, which more or less specifies how many neighbors you want for each point (5-50 seems a good range). σ has to vary because the density varies from place to place.
The other difference between SNE and t-SNE is the probability of 2 points being close in 2D:


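My transcription of the two formulas:

```latex
q_{j|i} = \frac{\exp\!\left(-\lVert y_i - y_j \rVert^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert y_i - y_k \rVert^2\right)}, \qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```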
Points i and j have probability p_ij of being neighbors in the high-dimensional space; they are mapped to 2D image points, where they have probability q_ij of being neighbors. The first equation is for SNE, and again it is asymmetrical; the second one is for t-SNE and has been obtained using the Student’s t-distribution. The second equation solves various problems:
- It keeps a slight repulsion between distant points, and this helps scatter the points in the 2D space. SNE had a “crowding” problem: the points were too often placed near the image center.
- It forces points that are distant in the high-dimensional space to move away from each other in the 2D space too. This was not the case with SNE, as shown in figure 1 of the paper.
- It makes the gradient used to optimize the 2D point locations faster to compute – no exponentials there.
The paper also presents a comparison between t-SNE and other methods on a couple of test datasets. t-SNE works better, but the authors still report some issues:
- as the complexity is quadratic, you can’t apply it directly to huge data sets
- the choice of the Student’s t-distribution may not generalize well to dimensions other than 2 or 3.
- using the Euclidean distance in the high-dimensional space can work poorly for many data sets
- The cost function to optimize is non-convex: you will need to choose some hyper-parameters and wait maybe some hours to see the result
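In practice you do not need to implement all of this yourself: scikit-learn ships a t-SNE implementation where the perplexity is exactly the hyper-parameter mentioned above. A minimal sketch, assuming scikit-learn and matplotlib are installed (the digits dataset is a small 8×8 cousin of MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()
X, y = digits.data, digits.target          # 1797 samples, 64 dimensions each

embedding = TSNE(n_components=2,           # we want a 2D map
                 perplexity=30,            # roughly "how many neighbors per point"
                 init="pca",               # PCA initialization of the map
                 random_state=0).fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=5, cmap="tab10")
plt.show()                                 # digits of the same class should cluster together
```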
Is it possible to represent high-dimensional data in 2D?
Last week I was reading an article on Semantic Hashing and I found an interesting picture: it represented in 2D some clusters of text documents. Each dot in the picture was one document; some dots were near, some far, and each cluster had dots of a different color. It does not seem much, but how do you decide where to put a document on a 2D plane? There were no axes at all in the picture! Moreover, in that paper each document was considered equivalent to a vector of word counts, and the vector length was the size of the dictionary. To sum up, the picture was projecting points from a space with thousands of dimensions onto a 2D surface; a remarkable achievement!

Fortunately the paper contained a reference to another one: Stochastic Neighbor Embedding (SNE) by Geoffrey Hinton and Sam Roweis, which appeared in Advances in Neural Information Processing Systems 15 (NIPS 2002). Here is the PDF link https://papers.nips.cc/paper_files/paper/2002/file/6150ccc6069bea6b5716254057a194ef-Paper.pdf . If you search for Stochastic Neighbor Embedding on Wikipedia you will also find a reference to a more recent and longer paper on t-SNE – an evolution of SNE using the Student’s t-distribution.
I did not know it, but there are many techniques that try to address the problem of displaying high-dimensional data in 2D/3D. In the end even a painter, using perspective, represents 3D data on a 2D surface; of course here the problem is much more complex because you have to deal with hundreds of dimensions. In machine learning problems you often have clusters of data that you want to identify: they live in high-dimensional spaces but they are somehow restricted to a region. Here comes the idea: you want to find a criterion that maps close data points to close dots in 2D. In the SNE paper the authors start from the usual Euclidean distance to derive a probabilistic estimation of the neighbor-ness between data pairs.

Here d is the distance between documents i and j, divided by a sigma constant – x can be the word count vectors. Notice that sigma has the i index: it varies from document to document. The reason for this choice seems to be the varying data density: somewhere you have closer documents, and in other regions of the space the documents are much sparser. The sigmas should be chosen to have a good number of neighbors for each point. I understood that this is not easy to do, and also that d_ij will be different from d_ji with this model.

From the distance they derive a probability that document i chooses document j as its neighbor. Notice that the exponent has a minus sign, so a high distance gives a very low probability; and again, since d_ij ≠ d_ji, you will also have p_ij ≠ p_ji.
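Written explicitly, my transcription of the two formulas from the paper:

```latex
d_{ij}^2 = \frac{\lVert x_i - x_j \rVert^2}{2\sigma_i^2}, \qquad
p_{ij} = \frac{\exp\!\left(-d_{ij}^2\right)}{\sum_{k \neq i} \exp\!\left(-d_{ik}^2\right)}
```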
Each document x is mapped to a point y in the cartesian plane. You can more or less define the neighbor probability in the plane with the same formula used for p_ij; in this case they decided to have a single sigma, and not one for each data point. From this they define an optimization problem:

If p_ij is small (i and j are not neighbors), you do not care much what the distance between the i and j points in the cartesian plane will be (q_ij can be small or high, it won’t change much). If instead p_ij is high – i and j are close – you want q_ij to be high too: the dots in the cartesian plane should be close. The usual x and y axes here have absolutely no meaning; just the distance matters.
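The cost function being minimized is a sum of Kullback-Leibler divergences (my transcription):

```latex
C = \sum_i \mathrm{KL}\!\left(P_i \,\middle\|\, Q_i\right) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
```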
In the end, to obtain the amazing picture presented in the paper, the method starts by assigning a random place to each document in the plane. Then, using a gradient descent technique, the y coordinates are gradually changed to reduce the C function: this means that the method will not be super fast, as it has to work for a while to find a good representation. Also, finding some good sigmas does not seem easy. But this was just the original idea; I started reading the paper on t-SNE and it seems many of these difficulties have been addressed. I plan to finish reading it next week and blog about my findings, stay tuned.