Mhhh, choosing hyperparameters: this week I am disappointed
Last week I tried to learn more about LSTMs, and I was asking myself if there are some rules for choosing the hyperparameters. Well, not really rules, but at least some hints: I tried searching for “tuning lstm”, “lstm sizing”, etc. and I did not find anything promising. Finally I found this article:
R.R. Rajalaxmi, L.V. Narasimha Prasad, B. Janakiramaiah, C.S. Pavankumar, N. Neelima, and V.E. Sathishkumar. 2022. Optimizing Hyperparameters and Performance Analysis of LSTM Model in Detecting Fake News on Social media. ACM Trans. Asian Low-Resour. Lang. Inf. Process. Just Accepted (January 2022). https://doi.org/10.1145/3511897
I discovered that some fake news datasets exist, and some researchers are trying to distinguish fake news from true news, using many different models, not just neural networks.
The ISOT dataset is composed of 21’417 true news items and 23’481 fake ones. The LIAR dataset instead classifies news into more categories: true, false, mostly true, half true, barely true and pants on fire. LIAR has just 6’058 news items in total; actually this is the size reported in the paper, while on the LIAR site the size seems to be about double, maybe because the dataset comes already split into train, test and validation sets.
I was curious to know what a “pants on fire” fake news item looks like; here is one:
Wisconsin sold fresh water to California, helping that states dairy industry, and lost a bunch of jobs (and) a bunch of farmland.
LIAR dataset
Ok now I have an idea 🙂
The paper’s authors built an LSTM model to predict whether a news item is fake or real; there are not many details, so I suppose the time dimension here is the word position in the sentence, but that is just my guess. Their problem was to pick the right combination of hyperparameters in order to obtain the most accurate model.
To clarify, here are some examples (a rough sketch of such a model, with these knobs exposed, follows the list):
Number of neurons: from 64 to 512
Drop-out rate: from 0.1 to 0.9
Which optimizer to use: adagrad, adadelta, sgd…
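Since the paper gives no implementation details, here is just a minimal sketch of what I imagine such a model could look like in Keras, with the three hyperparameters above passed in as arguments; the vocabulary size, embedding size and sequence length are made-up values of mine, not the authors’ setup.

```python
# A sketch of a small LSTM text classifier where the hyperparameters listed
# above (number of neurons, drop-out rate, optimizer) are plain arguments.
# vocab_size, the embedding size and seq_len are invented values, not from the paper.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dropout, Dense

def build_model(units=128, dropout_rate=0.3, optimizer="adam",
                vocab_size=20000, seq_len=100):
    model = Sequential([
        Input(shape=(seq_len,)),            # each news item: seq_len word indices
        Embedding(input_dim=vocab_size, output_dim=64),
        LSTM(units),                        # "number of neurons" of the LSTM layer
        Dropout(dropout_rate),              # drop-out rate
        Dense(1, activation="sigmoid"),     # fake (1) vs true (0)
    ])
    model.compile(optimizer=optimizer,
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(units=256, dropout_rate=0.5, optimizer="adagrad")
```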
I had already found the drop-out rate in other papers: it selects a percentage of neurons to ignore during the training. You may want to do this to prevent the model from overfitting the training data: randomly turning off parts of your network in the end gives a less fragile model.
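As far as I understand it, during training this roughly amounts to multiplying the activations by a random on/off mask and rescaling the survivors; here is my own toy illustration with NumPy, not how a framework actually implements it.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.random(10)   # pretend these are the outputs of 10 neurons
rate = 0.3                     # drop-out rate: fraction of neurons to ignore

mask = rng.random(10) >= rate              # roughly 30% of neurons switched off this step
dropped = activations * mask / (1 - rate)  # rescale the surviving activations
print(dropped)                             # zeros where neurons were dropped
```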
Finally, selecting a good hyperparameter combination was done with grid search and random search. With grid search you simply tabulate a long list of possible combinations, take the time to train a model for each one, and then compare the results and pick the best. I have already done this many times in a naive way.
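If I wanted to redo my naive attempts a bit more systematically, it would look roughly like the sketch below; build_model comes from my earlier sketch, and X_train, y_train, X_val, y_val are some vectorized news data I am not showing here.

```python
from itertools import product

# every combination of the three hyperparameters gets its own training run
units_options = [64, 128, 256, 512]
dropout_options = [0.1, 0.3, 0.5]
optimizer_options = ["adagrad", "adadelta", "sgd"]

best_score, best_params = 0.0, None
for units, rate, opt in product(units_options, dropout_options, optimizer_options):
    model = build_model(units=units, dropout_rate=rate, optimizer=opt)
    model.fit(X_train, y_train, epochs=3, verbose=0)
    _, accuracy = model.evaluate(X_val, y_val, verbose=0)
    if accuracy > best_score:
        best_score, best_params = accuracy, (units, rate, opt)

print(best_params, best_score)
```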
With random search instead, you randomly generate the combinations to be tested. According to the authors, this approach generally works better.
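In code the only real difference is that you sample the combinations instead of enumerating them all; same assumptions as above (build_model and the data arrays are mine):

```python
import random

random.seed(42)
n_trials = 10   # how many random combinations to try

best_score, best_params = 0.0, None
for _ in range(n_trials):
    params = {
        "units": random.choice([64, 128, 256, 512]),
        "dropout_rate": round(random.uniform(0.1, 0.9), 2),
        "optimizer": random.choice(["adagrad", "adadelta", "sgd"]),
    }
    model = build_model(**params)
    model.fit(X_train, y_train, epochs=3, verbose=0)
    _, accuracy = model.evaluate(X_val, y_val, verbose=0)
    if accuracy > best_score:
        best_score, best_params = accuracy, params

print(best_params, best_score)
```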
So in the end I did not find the hints I was looking for; I just came across a Wikipedia page where some smarter methods are also listed: https://en.wikipedia.org/wiki/Hyperparameter_optimization. Maybe next time I will explore some of them, or I will have a look at the ANOVA method used in the big paper on selecting LSTM architectures: but that one seems to require a lot of statistics knowledge I don’t have.