Giovanni Bricconi


Back on Hyper-parameter importance


Reading the LSTM article comparing different architectures (“LSTM: A Search Space Odyssey”), I was curious to better understand their approach to identifying important parameters. They referenced this article:

An Efficient Approach for Assessing Hyperparameter Importance. Frank Hutter, Holger Hoos, Kevin Leyton-Brown. Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32.

The technique used is called f-ANOVA, which stands for functional analysis of variance. Before looking at it I wanted to read a bit about plain ANOVA, just to understand a bit more: as the reading is about statistics, it was really scary to me – really not my favourite subject.

Luckily the web is full of introductory pages; I read this one: “A Simple Introduction to ANOVA (with applications in Excel)“. In a nutshell, an ANOVA example goes like this:

A recent study claims that using music in a class enhances the concentration and consequently helps students absorb more information. We take three different groups of ten randomly selected students (all of the same age) from three different classrooms. Each classroom was provided with a different environment for students to study. Classroom A had constant music being played in the background, classroom B had variable music being played and classroom C was a regular class with no music playing. After one month, we conducted a test for all the three groups and collected their test scores.

See “A Simple Introduction to ANOVA” for how the test is actually carried out.

Here the ANOVA test compares more than two categories (we have groups A, B and C) and checks whether the averages are the same or whether there is a significant difference between the group averages. If you have just two categories you can use a different test; moreover, the test itself only answers whether the difference is significant – if it is, you need to dig deeper to understand it.
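The classroom example above can be sketched with scipy's one-way ANOVA test. The scores below are made-up illustration data, not taken from the study:

```python
# One-way ANOVA on three groups, using scipy.stats.f_oneway.
from scipy.stats import f_oneway

# Hypothetical test scores for ten students per classroom:
# A = constant music, B = variable music, C = no music.
scores_a = [78, 82, 85, 80, 79, 83, 81, 84, 77, 80]
scores_b = [72, 75, 71, 74, 73, 76, 70, 74, 72, 73]
scores_c = [74, 77, 73, 76, 75, 78, 72, 76, 74, 75]

f_stat, p_value = f_oneway(scores_a, scores_b, scores_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests at least one group mean differs;
# ANOVA alone does not tell you which group, or why.
```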

Trying to dig deeper, I quickly found many references to complex statistical subjects, so I concluded that it is mandatory to use a tool that does the analysis for you: otherwise the risk of making mistakes is too high, and even interpreting the results is not easy.

After that I came back to the article on f-ANOVA, and luckily for me it was easier to understand than the ones on statistics.

The idea is more or less this: we have a loss function that depends on many hyper-parameters, and we would like to understand how much each parameter influences it. That way we can spend more time on the important ones.

As in the previous post on searching for good hyper-parameter values, the real loss function f has to be approximated with another one, f’, that can be evaluated quickly. Since hyper-parameters can be continuous (e.g. drop rate), ordinal (number of layers) or categorical (training algorithm), they propose using random forest models to build f’.

To quickly understand what a random forest is, you can read https://www.javatpoint.com/machine-learning-random-forest-algorithm . Many decision trees are built; at each node there is a test on a hyper-parameter value, and attached to each leaf there is a predicted value c_i. The final prediction f’ is obtained by averaging the c_i.
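A minimal sketch of fitting such a random-forest surrogate f’ with scikit-learn (this is my own illustration, not the paper's implementation; the hyper-parameters and the "true" loss below are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Configurations: [dropout rate (continuous), number of layers (ordinal),
# training algorithm (categorical, encoded 0=SGD, 1=Adam)].
n = 200
drop = rng.uniform(0.0, 0.8, n)
layers = rng.integers(1, 5, n)
algo = rng.integers(0, 2, n)
X = np.column_stack([drop, layers, algo])

# A made-up "true" loss, driven mostly by the dropout rate.
loss = (drop - 0.3) ** 2 + 0.05 * layers + 0.02 * algo + rng.normal(0, 0.01, n)

# f' averages the leaf predictions c_i of many decision trees.
surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, loss)
print(surrogate.predict([[0.3, 2, 0]]))  # cheap to evaluate anywhere
```

Once fitted, f’ can be queried thousands of times per second, which is what makes the variance analysis below practical.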

Usually, to tune a parameter we just explore the f values we can achieve while keeping all the other hyper-parameters fixed: this is bad, because we only explore the response of f locally around those fixed values. We do not know whether, with other fixed values, the picture would be very different. Now that we have an estimated f’ we can do much better.

For instance we can choose a single hyper-parameter i and evaluate the variance of f’ over all possible values of the other parameters. This is feasible because f’ can be computed quickly, and it turns out the result is linked to how much of the possible configuration space the values of i account for, compared with all possible parameter configurations. In the end you can decompose the variance V of f’ over the search space into a sum of contributions V_i, one per parameter.
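The per-parameter idea can be roughly sketched with a Monte-Carlo shortcut (this is a simplification, not the exact fANOVA decomposition from the paper): for each parameter i, average f’ over random values of the others, then measure the variance of that marginal curve. Names, data and the surrogate below are all invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Surrogate f' fitted on invented data where parameter 0 dominates the loss.
X = rng.uniform(0, 1, (300, 3))
y = (X[:, 0] - 0.3) ** 2 + 0.05 * X[:, 1]
surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def marginal_variance(i, n_grid=20, n_samples=200):
    """Variance of f' marginalised over all parameters except i."""
    grid = np.linspace(0, 1, n_grid)
    others = rng.uniform(0, 1, (n_samples, 3))
    marginals = []
    for v in grid:
        pts = others.copy()
        pts[:, i] = v          # sweep parameter i
        marginals.append(surrogate.predict(pts).mean())
    return np.var(marginals)

contributions = [marginal_variance(i) for i in range(3)]
print(contributions)  # parameter 0 should account for most of the variance
```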

Once the most important parameters are selected, you can even take two of them and plot in 3D how the variance changes, to understand how they interact.

This is way too hard to do without a tool. Next week I will look for existing tools for this and for hyper-parameter search, to understand whether these analyses are really feasible.

Written by Giovanni

October 22, 2022 at 8:22 am

Posted in Varie
