
Nothing easier than a linear regression?


This week someone suggested that I watch this video:

Machine Learning Algorithms in 7 Days

Video link

By Shovon Sengupta

It’s a five-and-a-half-hour video on machine learning algorithms, authored by a principal data scientist at Fidelity Investments. Of course I did not have five spare hours this week to follow it all; I just started watching the first “chapter”, on linear regression. Shovon Sengupta makes the subject fascinating, adding a lot of references to statistical tests. Actually the chapter goes so fast that you need to take some extra time searching for explanations and definitions to understand it better. Luckily there is support material, and you have access to a git repo with the Jupyter notebooks used, so you can read and test them later.

So what is the recipe to cook a linear regression?

  • Clean the input data: check for missing data and decide what to do with outliers, as the linear regression will be influenced by them. I hope there will be another section in the video on outliers, because this part went by too quickly – but the sample code contains at least one suggestion.
  • Check for multicollinearity: this is interesting and well explained. It may happen that some of the input variables (explanatory variables) have a linear correlation between them, so they are not really independent. As an extreme example, you could have two input columns, one with the temperature in degrees Fahrenheit and the other in degrees Celsius. You must not use both of them, or the linear regression algorithm may not work well.
  • Select the features: you should analyze the explanatory variables to decide which ones to use – those that have the most influence on the output variable. The training explains the Ridge and Lasso methods.
  • Build the model on some training data.
  • Validate the model performance on test data (a minimal sketch of these steps follows the list).
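
Here is a minimal sketch of the recipe in Python, on toy data of my own (not taken from the course), using pandas, statsmodels and scikit-learn:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.model_selection import train_test_split

    # Toy data for illustration: two explanatory variables and a noisy linear target
    rng = np.random.default_rng(42)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    df["y"] = 3 * df["x1"] - 2 * df["x2"] + rng.normal(scale=0.5, size=200)

    df = df.dropna()                        # step 1: handle missing data
    X = sm.add_constant(df[["x1", "x2"]])   # explanatory variables plus intercept
    y = df["y"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    model = sm.OLS(y_train, X_train).fit()  # build the model on training data
    print(model.summary())                  # R², adjusted R², F statistic, Durbin-Watson...

    predictions = model.predict(X_test)     # validate on test data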

So, even if it is just a linear regression, doing it well requires many non-obvious things. Here I list for myself the definitions needed to understand the samples provided in the training.

R² measures the proportion of variance explained by the model: one minus the sum of squared prediction errors, divided by the error you would get pretending that the model is just the constant average value:

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

where ŷᵢ are the model predictions and ȳ is the average of the observed outputs.

You can use the better adjusted R² index, which takes into account the number of samples n and the number of input variables p:

adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)
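
Both indices are easy to compute; a small sketch, reusing y_test and predictions from the recipe above:

    from sklearn.metrics import r2_score

    def adjusted_r2(y_true, y_pred, p):
        # Adjusted R² for a model with p explanatory variables
        n = len(y_true)
        r2 = r2_score(y_true, y_pred)
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    print(adjusted_r2(y_test, predictions, p=2))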

You use R² to calculate the variance inflation factor, which is just 1/(1 − R²); here the R² is the one you get regressing each explanatory variable against all the others. It is an indicator useful to check whether there is multicollinearity: the lower the better (values above 5 or 10 are usually considered suspicious).
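
statsmodels computes the indicator column by column; a sketch reusing the X matrix from the recipe above:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # One VIF per explanatory variable (the intercept column is skipped)
    for i, name in enumerate(X.columns):
        if name != "const":
            print(name, variance_inflation_factor(X.values, i))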

The F statistic checks whether there is a significant relationship between the predicted output and the chosen explanatory variables. The higher the value, the stronger the relationship.
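
With statsmodels the statistic and its p-value are part of the fitted results (reusing the model from the recipe above):

    # F statistic and its p-value from the fitted OLS results
    print(model.fvalue, model.f_pvalue)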

The Durbin-Watson test is instead used to test for the presence of autocorrelation in the residuals:

It will always have a value ranging between 0 and 4. A value of 2.0 indicates there is no autocorrelation detected in the sample. Values from 0 to less than 2 point to positive autocorrelation and values from 2 to 4 mean negative autocorrelation. 

DW test statistic values in the range of 1.5 to 2.5 are relatively normal. Values outside this range could, however, be a cause for concern

From <https://www.investopedia.com/terms/d/durbin-watson-statistic.asp>
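
The statistic is printed in the OLS summary, or can be computed directly on the residuals (reusing the fitted model from the recipe above):

    from statsmodels.stats.stattools import durbin_watson

    # Durbin-Watson statistic of the model residuals
    print(durbin_watson(model.resid))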

Another concept to check is the prediction error homoscedasticity: does the error just come from a normal distribution with constant variance, or does it evolve with the explanatory variables? With the Breusch-Pagan test you can check whether the error variance depends on the explanatory variables: the null hypothesis is homoscedasticity, so a low p-value signals heteroscedasticity.
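
statsmodels provides this test as het_breuschpagan; a sketch reusing the fitted model and training matrix from the recipe above:

    from statsmodels.stats.diagnostic import het_breuschpagan

    # Breusch-Pagan: regresses the squared residuals on the explanatory variables
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X_train)
    print(lm_pvalue)  # a small p-value is evidence of heteroscedasticity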

The last interesting concept is regularization: the coefficients identified by the regression algorithm can tend to explode, or to include variables that cause overfitting of the sample data. It is important to favor simpler models, and this can be done by introducing a penalty term in the loss function:

Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty, λ Σⱼ |βⱼ|, to the sum of squared errors.

Ridge adds an L2 penalty, λ Σⱼ βⱼ², instead.

The Lasso method has the advantage of removing low-importance variables from the model (their coefficients shrink exactly to zero), while Ridge makes them very small but never zero. See <https://www.datacamp.com/tutorial/tutorial-lasso-ridge-regression> for a nice description.
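
Both penalties are available in scikit-learn; a sketch reusing the training split from the recipe above, where alpha plays the role of the penalty weight λ:

    from sklearn.linear_model import Lasso, Ridge

    # scikit-learn adds its own intercept, so the constant column is dropped
    features = X_train.drop(columns="const")
    lasso = Lasso(alpha=0.1).fit(features, y_train)
    ridge = Ridge(alpha=0.1).fit(features, y_train)
    print(lasso.coef_)  # low-importance coefficients end up exactly at zero
    print(ridge.coef_)  # coefficients shrink towards zero but stay non-zero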

But concretely, how do you use all these concepts? Luckily the training points to this git repository, where a nice Python notebook applies many of these checks step by step on some sample data, and produces many graphs that help in understanding the concepts: https://github.com/PacktPublishing/Machine-Learning-Algorithms-in-7-Days/blob/master/Section%201/Code/MLA_7D_LR_V1.ipynb

Nothing easier than a linear regression? Not at all, if you don’t want to just scratch the surface!

Written by Giovanni

December 11, 2022 at 12:02 pm

Posted in Varie
