Archive for July 2023
RISE: Randomized Input Sampling for Explanation (of black-box models). Why a black sheep is a cow
I recently attended a presentation about machine learning explainability. The team from Sopra-Steria presented their work on using submarine sonar sensors to detect internal equipment failures. Each machine in a submarine emits some noise; by analyzing these noises they wanted to detect when a component was starting to fail, for instance a defective ball bearing in a pump. Once the model was built and shown to obtain good prediction scores, they faced many questions from navy engineers. The engineers had studied the field for years, had a lot of background knowledge, and wanted to understand why the neural network was giving a specific result. A difficult challenge they solved using RISE.
RISE: Randomized Input Sampling for Explanation of Black-box Models
Vitali Petsiuk, Abir Das, Kate Saenko
https://arxiv.org/abs/1806.07421
Let’s leave the submarine sound world and come to the paper, which is instead centered on image classification explainability. Look at the picture below:

Here the question was: “why in this picture does the AI model detect a sheep and a cow, and not just sheep?”. As these kinds of models have millions of parameters, understanding why from that point of view is impossible. The authors used a black-box approach, which produced the 2nd and 3rd pictures, showing that the model is unable to recognize the black sheep as a sheep. As the approach treats the model as a black box, it can be applied to any model, not just neural networks.
The idea is surprisingly easy to explain. Given the original image we have some classification probabilities: 26% sheep and 17% cow. Let’s focus just on the cow probability: what happens if I hide a patch of the original image and reapply the same AI model? I will obtain a different probability. Let’s say 16.9% if I hide a part of the water, and 15% if I hide the black sheep’s legs.
If we repeat this patch-and-evaluate loop many, many times, we can compute a pixel-by-pixel average and decide that some pixels are more important because they drive the probability up. In the end we can paint the most important pixels in red and the others in blue, obtaining the interesting picture above.
Of course I am over-simplifying the problem: how many times do I have to do this? How big must the picture patches be? How do I patch the image: turning the pixels to gray? to black? blurring them? Turning the pixels to black and using a sort of sub-sampling grid to decide where to put the patches seems to be the best approach.
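Just to fix the idea, here is a minimal sketch of the mask-and-average loop in Python/numpy. It assumes a hypothetical `model` function that takes a batch of H×W×3 images and returns an array of class probabilities; the real paper adds refinements (random mask shifts, GPU batching) that are omitted here.

```python
import numpy as np
from scipy.ndimage import zoom  # bilinear upsampling of the coarse masks

def rise_saliency(model, image, class_idx, n_masks=4000, grid=7, p_keep=0.5):
    """Importance map for one class of one image, in the spirit of RISE."""
    H, W = image.shape[:2]
    saliency = np.zeros((H, W), dtype=np.float64)
    for _ in range(n_masks):
        # coarse random grid: 1 = keep the pixels, 0 = hide them (turn to black)
        coarse = (np.random.rand(grid, grid) < p_keep).astype(np.float64)
        mask = zoom(coarse, (H / grid, W / grid), order=1)   # upsample to image size
        masked = image * mask[..., None]                     # apply the mask
        prob = model(masked[None])[0, class_idx]             # score on the masked input
        saliency += prob * mask                              # weight the mask by the score
    # normalize: each pixel is kept roughly n_masks * p_keep times on average
    return saliency / (n_masks * p_keep)
```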
To evaluate and compare RISE with other methods (such as LIME) the authors presented the “deletion” metric. Look at this picture:

On the x-axis there is a measure of how much of the original image has been hidden before applying the AI model; on the y-axis the classification probability. Removing a very small part of the image, but from the importance hot-spot, makes the probability drop. It means that the RISE method is doing well at identifying the hot-spot.
A complementary metric can be introduced by reversing the approach: how much does the probability rise as more and more pixels are given back? This is the insertion metric.
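As a rough sketch of how such a curve could be computed, again with the same hypothetical `model` function and the saliency map from the previous snippet:

```python
import numpy as np

def deletion_curve(model, image, class_idx, saliency, n_steps=100):
    """Class probability as the most important pixels are progressively blanked out."""
    H, W = image.shape[:2]
    order = np.argsort(saliency.ravel())[::-1]       # most important pixels first
    probs = [model(image[None])[0, class_idx]]       # probability with nothing removed
    work = image.copy()
    flat = work.reshape(-1, image.shape[-1])         # flat view on the working copy
    step = max(1, (H * W) // n_steps)
    for i in range(0, H * W, step):
        flat[order[i:i + step]] = 0                  # blank out the next batch of pixels
        probs.append(model(work[None])[0, class_idx])
    # a curve that drops quickly (small area under it) means a good explanation;
    # the insertion metric is the mirror image: start blank and add pixels back.
    return np.array(probs)
```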
To conclude: you can obtain nice images explaining the hot-spots that made a class be chosen, but you have to evaluate the model on thousands of “altered” inputs for a single input instance. In the submarine case, a 30-second sound track needed half an hour of computation to produce an explanation.
Joint-Embedding Predictive Architecture (JEPA): efficient learning of highly semantic image representations?
Some weeks ago I saw in a Yann LeCun thread on LinkedIn a note about this new paper from Meta: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture, by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas, available on the Meta AI site. The paper is about learning highly semantic representations, but what does that actually mean?

First of all, JEPA uses the self-supervised learning approach: it learns to capture relationships between its inputs, in other terms it learns one part of the input from another part. More concretely, JEPA is trained on a set of images; each image is split into one context and some targets. Targets are rectangular regions of various sizes that do not overlap with the context (otherwise predicting them would be too easy). Using the context, JEPA learns a context embedding from which it can predict well the embedded representations of its targets. The key point here is that JEPA does not try to predict the target’s content pixel by pixel, but rather a low-dimensional representation of it. Working on this embedded representation makes JEPA much faster than other techniques, and more semantic too.

The authors included a picture that helps compare JEPA with other existing techniques:

In a joint-embedding architecture, two inputs are encoded into embedding representations s, and the system learns to use similar representations for similar inputs. Conversely, in generative architectures an image is encoded and then a decoder, with some control input z, tries to predict the other image y pixel by pixel. JEPA is the third picture: both the original and the target image get encoded into embeddings s, and a predictor, given some extra information z, learns to predict the target embedding from the context embedding.
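To make the “predict in embedding space” point concrete, here is a very rough PyTorch sketch of what one training step could look like. The modules `context_encoder`, `target_encoder` (a momentum copy, updated without gradients) and `predictor` are placeholders for whatever ViT-style networks you plug in; the names, shapes and exact loss are illustrative, not the paper’s code.

```python
import torch
import torch.nn.functional as F

def jepa_step(patches, ctx_idx, tgt_idx,
              context_encoder, target_encoder, predictor, optimizer):
    """One training step: predict target embeddings from the context embedding."""
    # encode only the visible context patches
    ctx_emb = context_encoder(patches[:, ctx_idx])         # (B, n_ctx, D)
    # the target encoder sees the full image, then we keep only the target
    # patch embeddings; no gradient flows through the targets
    with torch.no_grad():
        tgt_emb = target_encoder(patches)[:, tgt_idx]      # (B, n_tgt, D)
    # predict the target embeddings from the context, given the target positions
    pred_emb = predictor(ctx_emb, tgt_idx)                 # (B, n_tgt, D)
    # the loss is a distance in embedding space, not a pixel-by-pixel reconstruction
    loss = F.mse_loss(pred_emb, tgt_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```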
Other invariance-based methods exist for this kind of task, but they require providing hand-crafted similar images to train on. As you saw, JEPA just asks for a single image and randomly generates the context and the targets, which is much more manageable.
Working on the semantic representation also makes JEPA much faster: in the paper’s Figure 1 you can see its performance compared to other approaches such as iBOT, CAE, data2vec, etc. We are anyway speaking of thousands of GPU training hours.
Once JEPA is pretrained, it can be reused as a building block for specific tasks such as object counting in a scene, depth prediction, etc. It can also be used as the input of a generative model, which allows visually comparing the targets with fakes generated from the embedded representation, as in the picture below:

The paper provides many details on how JEPA was implemented and trained, and many references to other approaches, in just 17 pages.
Principal Component Analysis – reducing data dimensionality
Your data may have been collected using many dimensions, but do all of them convey useful information? There are some techniques that can help in reducing the number of dimensions to handle, thus simplifying the problem and speeding up the analysis. Here I report some links to pages about Principal Component Analysis (PCA), Correspondence Analysis and Multiple Correspondence Analysis.
Let’s start with Principal Component Analysis (PCA). I found a reference to it in the t-Stochastic Neighbor Embedding paper. In that context it was used to reduce the number of dimensions to analyze, as that technique has quadratic complexity, so it is important to limit the number of dimensions used.
PCA was introduced by Harold Hotelling a long time ago, in 1933, and you can have a look at the original article on hathitrust.org: Analysis of a Complex of Statistical Variables into Principal Components. This method works only for numerical, continuous dimensions; there is no way to analyze categorical data with it. The method applies only linear transformations, so it has limitations, but it is widely used.
Instead of reading the long original paper, I found some pages that explain how it works in simple terms: A Step-by-Step Explanation of Principal Component Analysis (PCA). The core idea is to linearly combine the data axes and find a new axis/dimension that captures the maximum of the data variance. The process is then repeated over and over; at the end you have an ordered list of orthogonal axes, each associated with how much variance it captures. You can then decide to keep only the dimensions that cover x% of the initial variance, to simplify the model. You can also discover constant relations in your data by looking at dimensions with little or no variance. However, the new dimensions will be a mix of the original ones, so they will not be as easy to interpret as before.
Notice also that PCA requires standardizing the data in each dimension ( (value-mean)/std-deviation ): PCA is sensitive to large values and outliers, and you do not want to get fooled by that. Once that is done, PCA computes the data covariance matrix and its eigenvectors and eigenvalues. The eigenvectors are the new dimensions, and the eigenvalues tell you which dimension is the most important (the larger the eigenvalue, the more variance that axis captures). So instead of searching for the most important dimensions one by one, you obtain them all with an efficient mathematical procedure; the one-by-one description above was just to simplify the understanding.
As a final step, once you have selected the few dimensions you want to use, you have to project the original data points onto the new dimensions. This gives you the new data to be used during machine learning. See step 5 in A Step-by-Step Explanation of Principal Component Analysis (PCA).
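Here is a minimal numpy sketch of the whole recipe (standardize, covariance, eigen-decomposition, projection), just to show how few lines it takes; `X` is any (n_samples, n_features) array.

```python
import numpy as np

def pca(X, n_components=2):
    """Minimal PCA on an (n_samples, n_features) array, following the steps above."""
    # 1. standardize each dimension: (value - mean) / std-deviation
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. covariance matrix of the standardized data
    cov = np.cov(Z, rowvar=False)
    # 3. eigenvectors = new axes, eigenvalues = variance captured by each axis
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]           # most variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()         # share of variance per axis
    # 4. project (recast) the data onto the selected axes
    scores = Z @ eigvecs[:, :n_components]
    return scores, explained, eigvecs[:, :n_components]
```

On a data set like the 9-dimensional tooth measurements discussed below, `explained[:2]` would tell you how much variance the first two components keep.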
“Principal component analysis: a review and recent developments” is a longer read with equations, and it contains an interesting example. Kuehneotherium is a prehistoric animal; some teeth fossils have been found and measured, and the page explains how PCA can be used to get insight into the data.
Kuehneotherium is one of the earliest mammals and remains have been found during quarrying of limestone in South Wales, UK [12]. The bones and teeth were washed into fissures in the rock, about 200 million years ago, and all the lower molar teeth used in this analysis are from a single fissure. However, it looked possible that there were teeth from more than one species of Kuehneotherium in the sample.
Principal component analysis: a review and recent developments
The input data set had 9 dimensions, but after applying PCA they saw that just 2 dimensions explain 78.8% and 16.7% of the data variance… the other dimensions are not really needed. The page also introduces a type of plot called a biplot: take the 2 most important dimensions and project all the data points onto this plane, so you can see how the data are arranged (if 2 axes explain enough variance). In this plane you can also plot vectors that represent the original data dimensions, and their orientation will give you insight into which variables are most correlated with each other. Visit the site to get more information about biplots: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792409/figure/RSTA20150202F2/

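If you want to reproduce that kind of plot yourself, here is a hedged matplotlib sketch that reuses the `pca` helper from the previous snippet; the arrow lengths are scaled arbitrarily for readability.

```python
import matplotlib.pyplot as plt

def biplot(X, feature_names, scale=3.0):
    """Scatter the data on PC1/PC2 and overlay the original variables as arrows."""
    scores, explained, components = pca(X, n_components=2)   # pca() from the snippet above
    plt.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.6)
    for name, (dx, dy) in zip(feature_names, components):    # one row per original variable
        plt.arrow(0, 0, dx * scale, dy * scale, color="red", head_width=0.05)
        plt.annotate(name, (dx * scale, dy * scale))
    plt.xlabel(f"PC1 ({explained[0]:.1%} of variance)")
    plt.ylabel(f"PC2 ({explained[1]:.1%} of variance)")
    plt.show()
```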
PCA can’t handle categorical data, but for this kind of problem another interesting technique exists: Correspondence Analysis (CA). To apply this technique we start from a contingency table: for instance a frequency table with rows representing some brands and columns representing how frequently customers associate that brand with a feature (e.g. Valentino – luxury). Correspondence Analysis: What is it, and how can I use it to measure my Brand? (Part 1 of 2) is a nice article providing the feature/brand example. In their example 5 imaginary soda brands are associated with 3 qualities: tasty, aesthetic, and economic.
The CA method requires a few steps: first calculate the observation proportions (number of times brand A has been associated with tasty, divided by all the observations). Once that is done you calculate the mass: just the sum of proportions for a single brand or a single quality. With this information you can calculate the expected proportion, which is the product of a brand mass and a quality mass, and compare it to the actual observation (the difference is called the residual). These numbers will be very small, but dividing the residual by the expected proportion gives you something much easier to understand: for example, brand A is associated with tasty 30% more than expected.
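Here is a small numpy sketch of those first steps on a made-up contingency table (the numbers are invented, not the linked article’s data):

```python
import numpy as np

# made-up contingency table: rows = imaginary brands, columns = qualities
counts = np.array([[70, 30, 20],     # brand A: tasty / aesthetic / economic
                   [20, 60, 40],     # brand B
                   [10, 20, 80]])    # brand C

P = counts / counts.sum()                  # observation proportions
row_mass = P.sum(axis=1)                   # mass of each brand
col_mass = P.sum(axis=0)                   # mass of each quality
expected = np.outer(row_mass, col_mass)    # expected proportions under independence
residual = P - expected                    # raw residuals (tiny numbers)
relative = residual / expected             # e.g. +0.30 = "30% more associated than expected"
```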
This is useful but not really the same as PCA: PCA helped you drop dimensions, while so far we just have a way to interpret column and row relations in the contingency table. The following step is to apply the Singular Value Decomposition to the data. At this step the data are cast to dimensions that capture decreasing amounts of variance, and each brand and each quality gets some coordinates. Keeping just the 2 most important dimensions lets you plot the positions of the brands and of the qualities in 2 dimensions: you will visually see how closely a brand is associated with tastiness, cheapness, …
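Continuing the sketch above (reusing `P`, `expected`, `row_mass` and `col_mass`), the SVD step boils down to a few lines; this follows the textbook CA recipe of decomposing the standardized residuals, and the exact scaling conventions vary between implementations.

```python
import numpy as np

# standardized residuals, then SVD
S = (P - expected) / np.sqrt(expected)
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# principal coordinates: scale by the singular values, divide by sqrt(mass)
row_coords = (U * sv) / np.sqrt(row_mass)[:, None]     # one point per brand
col_coords = (Vt.T * sv) / np.sqrt(col_mass)[:, None]  # one point per quality

# plotting row_coords[:, :2] and col_coords[:, :2] on the same axes gives the
# 2-D map of brands and qualities; points close together suggest association.
```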

See also, for instance, the article How do I judge brand performance… for more insight on how to read this kind of plot.
When data can be arranged not just in tables but in cubes, or in a higher number of dimensions, you can apply Multiple Correspondence Analysis (MCA). But just spending some time reading about CA vs MCA makes it clear that MCA is much more complex to use, and in practice CA is much more widely used.
What to do when you have mixed categorical and numerical variables? It does not seem an easy task: searching a bit I found FactoMineR: this R package is able to analyze, for instance, a set of wines characterized by categorical values such as brand, combined with numerical values such as an average acidity score (the average vote given by judges). Honestly I did not look much into it; just some slides with notes about comparing the covariance matrices of different sets of categorical variables.