Better than grid or random search
I continued reading the links on hyper-parameter optimization and I found another interesting article:
Algorithms for Hyper-Parameter Optimization, James Bergstra, Rémi Bardenet, Yoshua Bengio, Balázs Kégl, https://proceedings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
As I saw with Ray, there are many algorithms that can be exploited, and I wanted to learn more. Is there a better way to tune a model than just drawing parameter values and trying them? The article begins with this passage:
Several recent advances to the state of the art in image classification benchmarks have come from better configurations of existing techniques rather than novel approaches to feature learning. Traditionally, hyper-parameter optimization has been the job of humans because they can be very efficient in regimes where only a few trials are possible. Presently, computer clusters and GPU processors make it possible to run more trials and we show that algorithmic approaches can find better results
The authors suggest that choosing the hyper-parameters should be regarded as an outer loop around the learning process: in the outer loop you select the hyper-parameters, in the inner loop you train the model, and you keep the combination that gives the best results. Once the best hyper-parameters are selected you can do a more thorough training with more examples, and finally you have a model that can be exploited. It is therefore important to allocate enough resources and CPU time to the hyper-parameter selection, otherwise you will end up with a sub-optimal model.
But how can we do better than random or grid search? In my understanding, grid and random search work in this way: you select some parameter values, you train the model on a sample of the data (not all of it, because that would take too much time), and you continue until your results (the loss function value) stop improving much. Training the model is the expensive step.
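To make the outer/inner loop idea concrete, here is a minimal sketch of random search in plain Python (my own illustration; train_and_evaluate is a toy stand-in for a real, expensive training run):

import random

def train_and_evaluate(params):
    # Toy stand-in for the expensive inner loop: in reality this would
    # train a model on a data sample and return its validation loss.
    return (params["learning_rate"] - 0.01) ** 2 + 0.1 * params["n_layers"] + 0.01 * random.random()

def sample_params():
    # Draw one random hyper-parameter combination.
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),
        "n_layers": random.randint(1, 3),
    }

best_params, best_loss = None, float("inf")
for trial in range(50):                # outer loop: pick hyper-parameters
    params = sample_params()
    loss = train_and_evaluate(params)  # inner loop: train and score the model
    if loss < best_loss:
        best_params, best_loss = params, loss

print(best_params, best_loss)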
The “trick” is that, after some random/grid iterations, you can start building a model of the loss function based on the parameters tried and the results achieved (the authors do so after 30 random experiments). If this model is much cheaper to evaluate than the training, you can then use some other algorithm to propose interesting parameter combinations to test, for instance a gradient-driven method. You choose the next candidate combination in a smarter way than just drawing a random one.
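This is how I picture the trick, as a sketch and not the paper's exact algorithm: after a warm-up of random trials, fit a cheap surrogate of the loss on the (parameters, loss) pairs collected so far, rank many candidate combinations with it, and pay the real training cost only for the most promising one. I use scikit-learn's GaussianProcessRegressor as the cheap model and a greedy selection, without the expected-improvement criterion the paper actually uses.

import random
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def train_and_evaluate(params):
    # Toy stand-in for an expensive training run (same idea as above).
    learning_rate, n_layers = params
    return (learning_rate - 0.01) ** 2 + 0.1 * n_layers + 0.01 * random.random()

def sample_params():
    return (10 ** random.uniform(-4, -1), random.randint(1, 3))

# Warm up with random trials (the paper starts from 30 random experiments).
history = []
for _ in range(30):
    params = sample_params()
    history.append((params, train_and_evaluate(params)))

for _ in range(20):
    X = np.array([p for p, _ in history])
    y = np.array([loss for _, loss in history])
    # Fit a cheap model of the loss as a function of the parameters.
    surrogate = GaussianProcessRegressor().fit(X, y)
    # Scoring candidates with the surrogate is cheap, so we can try many.
    candidates = [sample_params() for _ in range(1000)]
    predictions = surrogate.predict(np.array(candidates))
    best_candidate = candidates[int(np.argmin(predictions))]
    # Only the chosen candidate pays the real training cost.
    history.append((best_candidate, train_and_evaluate(best_candidate)))

best_params, best_loss = min(history, key=lambda entry: entry[1])
print(best_params, best_loss)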
Reality is more complicated than this, and an algorithm that wants to do so needs to take into account the relations between the parameters it explores. For instance, the number of neurons in layer 2 makes sense only if you decide that your model will have 2 layers. Parameters can be continuous (e.g. the dropout rate), ordinal (the number of layers), or discrete (the training algorithm). The authors say that the hyper-parameter configuration space is graph-structured.
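To see what “graph-structured” means in practice, here is a small sketch of a sampler (my own illustration) where some parameters exist only once an earlier choice has been made:

import random

def sample_configuration():
    # Sample a configuration where some parameters depend on earlier choices.
    config = {
        "optimizer": random.choice(["sgd", "adam"]),   # discrete
        "dropout_rate": random.uniform(0.0, 0.5),      # continuous
        "n_layers": random.randint(1, 3),              # ordinal
    }
    # The size of layer 2 only exists if the model has at least 2 layers:
    # the space is a tree/graph, not a flat box of independent parameters.
    for layer in range(1, config["n_layers"] + 1):
        config[f"units_layer_{layer}"] = random.choice([64, 128, 256])
    return config

print(sample_configuration())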
The paper continues by describing two approaches, one based on Gaussian processes and another on Parzen estimators. With Gaussian processes the loss function is modeled with Gaussian variables; once the model is built, a new candidate point can be searched with exhaustive grid search or with more sophisticated algorithms; the authors refer to EDA and CMA-ES. With the Parzen estimator approach (the tree-structured Parzen estimator, TPE), a reference loss value y* is chosen: not the best value obtained so far, but a quantile of the observed values. The model then predicts which parameter combinations are likely to give a loss below y*, which is roughly the reverse of the Gaussian process approach, where you predict the loss that a given parameter combination will produce. From the experiments reported, the second approach is the one that gives better results (their experiments involved up to 32 hyper-parameters).
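To make the Parzen estimator idea concrete, here is a toy one-dimensional sketch (a heavy simplification of TPE, not the paper's full algorithm): split the trials at the quantile y*, fit one density on the parameter values of the “good” trials and one on the rest, and prefer candidates where the ratio between the two densities is large.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Pretend we already ran some trials on a single continuous hyper-parameter x
# (say, a log learning rate) and observed these losses.
x_observed = rng.uniform(-4, -1, size=40)
losses = (x_observed + 2.0) ** 2 + rng.normal(0, 0.05, size=40)

# y* is not the best loss seen so far but a quantile of the observed losses
# (here the 15th percentile), splitting the trials into "good" and "bad".
y_star = np.quantile(losses, 0.15)
good = x_observed[losses < y_star]
bad = x_observed[losses >= y_star]

# Parzen (kernel density) estimators for the two groups.
l = gaussian_kde(good)   # density of x among the good trials
g = gaussian_kde(bad)    # density of x among the rest

# Score candidates by l(x)/g(x): a high ratio suggests a low loss.
candidates = rng.uniform(-4, -1, size=1000)
ratio = l(candidates) / g(candidates)
next_x = candidates[np.argmax(ratio)]
print("next candidate:", next_x)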
The software they developed is called “Hyperopt”: https://github.com/jaberg/hyperopt
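For reference, this is roughly what using Hyperopt looks like, based on its documentation; the objective below is a toy stand-in for a real training run, and the nested hp.choice encodes the graph-structured space:

from hyperopt import fmin, tpe, hp, Trials

# A graph-structured space: the per-layer sizes exist only under the
# corresponding choice of the number of layers.
space = {
    "dropout_rate": hp.uniform("dropout_rate", 0.0, 0.5),
    "optimizer": hp.choice("optimizer", ["sgd", "adam"]),
    "architecture": hp.choice("architecture", [
        {"n_layers": 1, "units_1": hp.quniform("u1_a", 32, 256, 32)},
        {"n_layers": 2, "units_1": hp.quniform("u1_b", 32, 256, 32),
                        "units_2": hp.quniform("u2_b", 32, 256, 32)},
    ]),
}

def objective(params):
    # Toy objective standing in for training a model and returning its
    # validation loss.
    penalty = 0.0 if params["optimizer"] == "adam" else 0.1
    return (params["dropout_rate"] - 0.2) ** 2 + 0.01 * params["architecture"]["n_layers"] + penalty

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=trials)
print(best)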