
Machine learning algorithms in 7 days: what was left


The Christmas holidays and the flu have not been conducive to continuing my training, but at least I finished watching the video

Machine Learning Algorithms in 7 Days, by Shovon Sengupta

Video link

In this post I will just describe the content of the remaining chapters, so that anyone interested can have a look at the video or at the examples. I believe the examples are very good, showing the right direction when trying something new; they are available for free on GitHub: https://github.com/PacktPublishing/Machine-Learning-Algorithms-in-7-Days

The training also covers the following subjects:

  • Decision tree
  • Random forest
  • K-means algorithm
  • K-nearest neighbors
  • Naive Bayes
  • ARIMA time series analysis

The samples show how to use Jupyter notebooks locally with the scikit-learn package to train the models.

The decision tree chapter describes the concept of a purity measure: a new split in the tree is introduced only if it helps classify the training examples better. For instance the Gini index can be used for categorical variables, but also statistical tests like the chi-square; for numeric variables the F-test can be used. Ideally each terminal node in the tree should output a single category. The video also introduces, very quickly, the concept of pruning: it is necessary to reduce the number of splits to avoid overfitting. Given the names of the pruning strategies you can dig further on the internet, as the description is too quick. Decision trees have the advantage of being easy to interpret, but they can be unstable with respect to small changes in the input.
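As a minimal sketch of these ideas (not taken from the course notebooks; the dataset and the ccp_alpha value are just illustrative), here is a scikit-learn decision tree that uses the Gini index as purity measure and cost-complexity pruning to limit overfitting:

# Minimal decision tree sketch: Gini impurity for the splits,
# cost-complexity pruning (ccp_alpha) to keep the tree small.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(
    criterion="gini",   # purity measure used to evaluate candidate splits
    ccp_alpha=0.01,     # pruning strength, illustrative value
    random_state=42,
)
tree.fit(X_train, y_train)

print(export_text(tree))                  # the tree is easy to inspect
print("test accuracy:", tree.score(X_test, y_test))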

The random forest chapter describes an incremental approach built on top of decision trees. Multiple decision trees are generated by selecting different input features: the final prediction is obtained by averaging the results (for regression) or by taking the most voted class (for classification) across all the trees in the forest. The video does not mention XGBoost at all, the algorithm I described some time ago.
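Again as a rough sketch rather than the course's exact code, a random forest in scikit-learn where each split considers a random subset of the features and the majority vote decides the class:

# Minimal random forest sketch: many trees, each split limited to a
# random subset of features; the majority vote gives the prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # random feature subset considered at each split
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))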

The K-means chapter describes this unsupervised learning algorithm. Using a distance measure, K points (the cluster centroids) are chosen and their positions are iteratively updated so that they partition the input data as well as possible. Being based on distance, it is important to normalize the input data so that one dimension does not take much more importance than the others: if one dimension ranges from 1 to 100 and another from 1 to 10, it is clear that the squared error is biased. You also need to deal with categorical data, for example with one-hot encoding. The number of centroids K is a very important parameter for this algorithm, and the video describes some methods to find a good value for it.
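A minimal sketch of the idea, assuming scikit-learn as in the other chapters (the range of K values is arbitrary): the data is normalized first, then K-means is run for several values of K and the inertia is printed, which is the starting point for the common elbow heuristic for picking K:

# Minimal K-means sketch: normalize the inputs, then compare the
# inertia (sum of squared distances to the centroids) for several K.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # avoid one dimension dominating

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    print(f"K={k}  inertia={km.inertia_:.1f}")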

The KNN (K-nearest neighbors) chapter introduces the concept of finding the samples closest to a point, and discusses the K-D tree and ball tree algorithms for finding the neighbors more efficiently than with brute force. The nearest neighbors found are then used to classify, or to predict a value for, the input point.
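A minimal sketch, not from the course notebooks, of a K-nearest-neighbors classifier in scikit-learn that uses a K-D tree instead of brute force to find the neighbors:

# Minimal KNN sketch: classify each test point by the majority class
# among its 5 nearest neighbors, found through a K-D tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))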

Naive Bayes, often used for text classification, is quite effective at binary or multi-class classification. Each training sample has many attribute dimensions, and the assumption here is that each attribute value is not correlated with the other dimensions. This is needed to apply Bayes' formula; of course it will not be true in general, but applying this simplification still allows obtaining interesting results. This is the example used to explain the concept: you want to predict whether an email is spam or not, and of this email you know the content, the words. You also have a classified mail corpus, where you know whether each mail is spam or not and which words were present. The algorithm uses statistics on word counts to calculate the probability of a new mail being spam. Of course a mail containing the word lottery will probably also contain the word prize, but with naive Bayes you won't take this correlation into account and will just look at the distributions of prize and lottery separately. There are several types of naive Bayes algorithm: Gaussian for continuous data, multinomial for count data, and Bernoulli for boolean features.
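To make the spam example concrete, here is a minimal sketch (the tiny corpus is invented for illustration): word counts are extracted with CountVectorizer and a multinomial naive Bayes model estimates the probability that a new mail is spam:

# Minimal naive Bayes sketch: word-count features plus MultinomialNB,
# trained on an invented toy corpus of spam and non-spam mails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

mails = [
    "win the lottery prize now",
    "claim your free prize today",
    "meeting agenda for monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mails)        # word-count features

model = MultinomialNB().fit(X, labels)
new_mail = vectorizer.transform(["free lottery prize"])
print("P(not spam), P(spam):", model.predict_proba(new_mail)[0])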

The final ARIMA chapter describes how to deal with time series. In this case the input sequence can present a trend and a seasonality, and you want to predict the values that follow the end of the sequence. You apply some techniques to make the input sequence stationary: removing exponential trends, taking the difference between successive values, etc. The goal is to find a set of parameters that lets the ARIMA model work well, finally producing a residual error that is white noise. The ARIMA model usually has 3 parameters: the number of autoregressive terms (p), the number of non-seasonal differences (d), and the number of moving-average terms (q). Applying it does not seem trivial, but the results produced by the example are quite impressive.
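As a minimal sketch, assuming the statsmodels library (the course notebooks may set things up differently), an ARIMA(p, d, q) model fitted on an invented series and used to forecast the next values; the order (1, 1, 1) is purely illustrative:

# Minimal ARIMA sketch: fit an ARIMA(1, 1, 1) on a toy trended series
# and forecast the next few values.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=200))  # trend + noise

model = ARIMA(series, order=(1, 1, 1))  # (autoregressive, differences, moving average)
fit = model.fit()
print(fit.forecast(steps=5))            # predict the next 5 values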

Written by Giovanni

January 6, 2023 at 5:26 pm

Posted in Varie
