Optuna: efficiently tune your model
I wanted to find a tool implementing the TPE algorithm to explore model hyper-parameter spaces. After searching a bit I found the original implementation, but it showed no recent activity. After a few more tries I found Optuna and the article that introduces it:
Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, Masanori
Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization
Framework. In The 25th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. ACM,
New York, NY, USA, 9 pages. https://doi.org/10.1145/3292500.3330701
The article is quite recent and references other similar tools that can be useful: Spearmint and GPyOpt use Gaussian Processes, Hyperopt employs the tree-structured Parzen estimator (TPE), SMAC uses random forests, and Google Vizier, Katib, and Ray Tune also support pruning algorithms.
What makes Optuna special? Its approach to defining the hyper-parameter space to explore. The example provided by the authors is very clear: they propose to tune a neural network that can have up to 4 layers, where each layer can have up to 128 units. What is the difficulty here? You will always have a number of neurons for layer 1, but you will have a number of neurons for layer 3 only if layers 2 and 3 exist. With Hyperopt, for instance, you need choice constructs to describe this conditional relation, and the code is not so easy to read. Optuna instead uses an approach called define-by-run:

Using Optuna you define a study and ask it to optimize the objective over 100 trials. The objective function defines which hyper-parameters are pertinent and computes the loss. The trial argument is the current test instance: when the objective asks for the number of layers (up to 4), Optuna uses the past experiments to decide which candidate value is best to try next. Then, with a simple for loop, you ask Optuna for the most promising number of neurons for each layer. This is easy because you do not need to deal with conditions: at that point in the code you know you have x layers, and you can just define the parameters needed. The rest of the code uses the parameters, reads the train and test data, and computes the model performance.
With Hyperopt instead you have to do something like this:

You can find the complete examples in the paper. It is easy to see that, as the number of options to explore grows, a model written with Optuna will be easier to understand and modify.
Optuna also has something more to offer: an efficient pruning algorithm. Unpromising trials are abandoned before a complete train/test cycle, freeing resources to explore other configurations. The pruning algorithm used is called Asynchronous Successive Halving (ASHA) and is an extension of Successive Halving. To understand how it works, it is best to read the article describing Successive Halving:
Kevin Jamieson, Ameet Talwalkar. Non-stochastic Best Arm Identification and Hyperparameter Optimization. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain.
Here, Arm refers to a specific hyper-parameter instance. Intermediate loss values are compared across arms, and each round only the best half of the arms survives. In this setting the question is not whether the algorithm will identify the best arm, but how fast it does so relative to a baseline method, since no specific statistical assumptions are made. The plots published in the paper are really encouraging.
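The core of Successive Halving can be sketched in a few lines of plain Python (a toy version, assuming each arm exposes a function that consumes some budget and returns its current loss):

```python
import random

def make_arm(target):
    # A fake hyper-parameter instance: its observed loss hovers near `target`.
    def step(budget):
        return target + 0.1 * random.random()
    return step

def successive_halving(arms):
    """Toy Successive Halving: give every surviving arm some budget,
    observe the intermediate loss, keep the better half, repeat."""
    survivors = list(arms)
    while len(survivors) > 1:
        losses = {arm_id: arms[arm_id](budget=1) for arm_id in survivors}
        survivors.sort(key=lambda a: losses[a])  # best (lowest loss) first
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]

arms = {i: make_arm(target=i) for i in range(8)}  # arm 0 is the best
best = successive_halving(arms)
```

ASHA extends this idea by promoting arms asynchronously, so workers never sit idle waiting for a whole round to finish.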
Coming back to Optuna, you can choose the algorithm used to select the next candidate parameters: you have TPE and also a mix of TPE and CMA-ES. According to the authors, GPyOpt can obtain better results, but it is one order of magnitude slower than Optuna.
Other Optuna advantages are that you can choose the backend used to store the trial results (for instance a database), and that it is easy to install and to use in conjunction with Jupyter and Pandas.
To conclude, you can also use Optuna for tasks not related to neural networks and machine learning: the authors report that it has been used to find optimal parameters for RocksDB (which has many, many parameters to tune) and ffmpeg configurations for video encoding.