Giovanni Bricconi

My site on WordPress.com

Principal Component Analysis – reducing data dimensionality


Your data may have been collected using many dimensions, but do all of them convey useful information? There are techniques that can help reduce the number of dimensions to handle, thus simplifying the problem and speeding up the analysis. Here I report some links to pages discussing Principal Component Analysis (PCA), Correspondence Analysis and Multiple Correspondence Analysis.

Let’s start from Principal Component Analysis (PCA). I found a reference to it in the t-Distributed Stochastic Neighbor Embedding (t-SNE) paper. In that context it was used to reduce the number of dimensions to analyze, as that technique has O(n²) complexity, so it is important to limit the number of dimensions it has to work with.
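To give an idea of the pattern, here is a minimal sketch (my own, not taken from the t-SNE paper) of pre-reducing data with PCA before running t-SNE; the random data set and the choice of 30 components are arbitrary assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))          # 500 samples, 200 original dimensions

# pre-reduce with PCA so that t-SNE (quadratic in cost) works on fewer dimensions
X_reduced = PCA(n_components=30).fit_transform(X)
X_embedded = TSNE(n_components=2, perplexity=30).fit_transform(X_reduced)

print(X_embedded.shape)                  # (500, 2) points ready for plotting
```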

PCA was introduced by Harold Hotelling a long time ago, in 1933, and you can have a look at the original article on hathitrust.org: Analysis of a Complex of Statistical Variables into Principal Components. This method works only for continuous numerical dimensions; there is no way to analyze categorical data with it. The method applies only linear transformations, so it has limitations, but it is widely used.

Instead of reading the long original paper, I have found some pages that explain how it works in simple terms: A Step-by-Step Explanation of Principal Component Analysis (PCA). The core idea is to linearly combine the data axes and find a new axis/dimension that captures the maximum of the data variance. The process is then repeated over and over again; at the end you have an ordered list of orthogonal axes, each associated with how much variance it captures. You can then decide to keep only the dimensions that cover x% of the initial variance, to simplify the model. You can also discover constant relations in your data by looking at dimensions with little or no variance. Anyway, the new dimensions will be a mix of the original ones, so they will not be as easy to interpret as the originals.

Notice also that PCA requires you to standardize the data in each dimension ( (value − mean) / standard deviation ): PCA is sensitive to large values and outliers, and you do not want to get fooled by that. Once that is done, PCA computes the data covariance matrix and its eigenvectors and eigenvalues. The eigenvectors are the new dimensions, and the eigenvalues tell you which dimension is the most important (for a covariance matrix the eigenvalues are never negative, and the larger the eigenvalue, the more variance that axis captures). So instead of searching for the most important dimensions one by one, you obtain them all with an efficient math procedure. The one-by-one description above was just to simplify the understanding.
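Here is a minimal sketch of that procedure in plain NumPy (my own illustration, not code from the linked article), using a random data set just to show the steps:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # 100 samples, 5 dimensions

# 1. standardize each dimension: (value - mean) / standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. covariance matrix of the standardized data
cov = np.cov(Z, rowvar=False)

# 3. eigenvectors (new axes) and eigenvalues (variance captured by each axis)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: cov is symmetric

# 4. sort the axes from most to least important
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print(eigenvalues / eigenvalues.sum())            # fraction of variance per axis
```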

As a final step, once you have selected the few dimensions you want to use, you have to recast the original data points onto the new dimensions. This gives you the new data to be used during machine learning. See step 5 in A Step-by-Step Explanation of Principal Component Analysis (PCA).
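Continuing the sketch above, the recasting is just a matrix product between the standardized data and the selected eigenvectors; scikit-learn can also do the whole pipeline for you (again, this is only an illustration, not the article’s own code):

```python
# keep the first k axes and project the data onto them
k = 2
X_new = Z @ eigenvectors[:, :k]        # shape (100, 2): the recast data points

# the same pipeline done by scikit-learn on the standardized data
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_new_sk = pca.fit_transform(Z)
print(pca.explained_variance_ratio_)   # variance fraction captured by each axis
```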

“Principal component analysis: a review and recent developments” is a longer read, with equations, that contains an interesting example. Kuehneotherium is a prehistoric animal; some tooth fossils have been found and measured, and the page explains how PCA can be used to get insight into the data.

Kuehneotherium is one of the earliest mammals and remains have been found during quarrying of limestone in South Wales, UK [12]. The bones and teeth were washed into fissures in the rock, about 200 million years ago, and all the lower molar teeth used in this analysis are from a single fissure. However, it looked possible that there were teeth from more than one species of Kuehneotherium in the sample.

Principal component analysis: a review and recent developments

The input data set had 9 dimensions, but after applying PCA they saw that just 2 dimensions explain 78.8% and 16.7% of the data variance respectively… the other dimensions are not really needed. The page also introduces a type of plot called a biplot: take the 2 most important dimensions and project all the data points onto this plane, so you can see how the data are arranged (if 2 axes explain enough variance). In this plane you can also plot vectors that represent the original data dimensions, and their orientation will give you insight into which variable is most correlated with the others. Visit the site to get more information about biplots: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792409/figure/RSTA20150202F2/

Biplot example, taken from Wikipedia. The red arrows show the strong correlation between petal length and petal width. The numbers in the graph are the labels assigned to the data samples.
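A rough sketch of how such a biplot can be produced (my own example on the iris data set, not the code behind the Wikipedia figure): project the samples onto PC1/PC2 and overlay the original variables as arrows built from the PCA loadings.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
Z = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2)
scores = pca.fit_transform(Z)            # sample coordinates in the PC1/PC2 plane

plt.scatter(scores[:, 0], scores[:, 1], s=10, alpha=0.5)
for i, name in enumerate(iris.feature_names):
    # each arrow is one original dimension, drawn from its loadings on PC1/PC2
    x, y = pca.components_[0, i], pca.components_[1, i]
    plt.arrow(0, 0, x * 2, y * 2, color="red", head_width=0.05)
    plt.text(x * 2.2, y * 2.2, name, color="red")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```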

PCA can’t handle categorical data, but for this kind of problem another interesting technique exists: Correspondence Analysis (CA). To apply this technique we start from a contingency table: for instance a frequency table with rows representing some brands and columns representing how often customers associate that brand with a feature (e.g. Valentino – luxury). Correspondence Analysis: What is it, and how can I use it to measure my Brand? (Part 1 of 2) is a nice article providing the feature/brand example. In their example, 5 imaginary soda brands are associated with 3 qualities: tasty, aesthetic, and economic.

The CA method requires a few steps: calculate the observation proportions (number of times brand A has been associated with tasty / all the observations). Once that is done, you calculate the mass: just the sum of the proportions of a single brand (row) or of a single quality (column). With this information you can calculate the expected proportion, which is the product of a brand mass and a quality mass, and compare it to the actual observation (the difference is called a residual). These numbers will be very small, but if you divide the residual by the expected proportion you get something much easier to understand, for example: brand A is associated with tasty 30% more than expected.
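A toy sketch of these first steps (invented counts, not the numbers from the linked article):

```python
import numpy as np

# hypothetical contingency table: 5 soda brands x 3 qualities (tasty, aesthetic, economic)
counts = np.array([
    [30, 10,  5],   # brand A
    [10, 25, 10],   # brand B
    [ 5, 10, 30],   # brand C
    [20, 15, 10],   # brand D
    [ 8, 12, 20],   # brand E
])

P = counts / counts.sum()              # observation proportions
row_mass = P.sum(axis=1)               # mass of each brand
col_mass = P.sum(axis=0)               # mass of each quality

expected = np.outer(row_mass, col_mass)   # expected proportion for each cell
residual = P - expected                   # how far reality is from independence

# relative residual: e.g. +0.30 means "30% more associated than expected"
print((residual / expected).round(2))
```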

This is useful but not really the same as PCA; PCA helped you drop dimensions, while so far we only have a way to interpret row and column relations in the contingency table. The following step is to apply the Singular Value Decomposition to the data. At this step the data are cast onto dimensions that capture decreasing amounts of variance, and each brand and each quality gets some coordinates. Keeping just the 2 most important dimensions lets you plot the positions of the brands and of the qualities in 2 dimensions: you can visually see how closely a brand is associated with tastiness, cheapness, …
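Continuing the toy sketch, this is roughly what the SVD step looks like; I follow the usual CA recipe (SVD of the standardized residuals, then principal coordinates), which the linked article may present slightly differently:

```python
# standardized residuals, then SVD
S = residual / np.sqrt(np.outer(row_mass, col_mass))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# principal coordinates for brands (rows) and qualities (columns)
brand_coords   = (U * sv) / np.sqrt(row_mass)[:, None]
quality_coords = (Vt.T * sv) / np.sqrt(col_mass)[:, None]

# keep only the first 2 dimensions to plot brands and qualities together
print(brand_coords[:, :2])     # one 2D point per brand
print(quality_coords[:, :2])   # one 2D point per quality
```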

See also, for instance, the article How do I judge brands performance… for more insight on how to read this kind of plot.

When data can be arranged not just in tables, but in cubes or structures with a higher number of dimensions, you can apply Multiple Correspondence Analysis (MCA). But just spending some time reading CA vs MCA makes it clear that MCA is much more complex to use, and in practice CA is used far more often.

What to do when you have mixed categorical and numerical variables? It does not seem an easy task: searching a bit, I found FactoMineR: this R package can analyze, for instance, a set of wines characterized by categorical values such as brand, combined with an average acidity score (the average vote given by judges). Honestly I did not look much into it; some slides just contain notes about comparing the covariance matrices of different sets of categorical variables.

Written by Giovanni

July 2, 2023 at 7:34 am

Posted in Varie
