PCA for ML business insights: using dimensionality reduction to identify error profiles in your predictions

Ana Preciado
5 min read · Jul 5, 2023

Introduction:

Image by vecstock on Freepik. Website at: https://www.freepik.com/free-photo/fresh-coffee-steams-wooden-table-close-up-generative-ai_40950852.htm#query=coffee&position=2&from_view=search&track=sph

Machine learning is said to be both an art and a science. The science comes from the application of statistical algorithms, and the art comes from integrating industry knowledge into the use case. In this article, I want to propose a technique that leverages dimensionality reduction (science) to help discover the factors driving prediction error in a holistic manner (art). The value of this exercise comes from knowing that the components resulting from the reduction will be highly abstract, but will allow us to see, in a summarized manner, the different structures within our data. By bringing in our machine learning predictions, we can also identify the structures in the data where the majority of the error lies. From there, we complement the analysis with our industry knowledge and brainstorm what could be driving the error for these groups, as well as ways we could gather information with predictive value for them.

As said before, the main advantage of this exercise is the holistic bird's-eye view of the nature of the data (and of the error of our predictions) in a two-dimensional space. For the example in this article, we'll be using the open dataset available at the following Kaggle link: https://www.kaggle.com/datasets/volpatto/coffee-quality-database-from-cqi. It contains quality information on coffee beans from different countries. For the sake of simplicity, I want to focus on Category Two Defects (defects pertaining to taste), and only leverage features pertaining to the different taste scores the beans obtain (this way all features are on a similar scale, and we don't have to undergo additional transformations).

Because the objective is analysis, I won't be splitting the data into training and testing sets, and I will be using a rather simple algorithm (a decision tree with a max depth of two) to ensure there are enough errors in the data to analyze.

Step 1: quick data description, and the creation of a mock model

Original Schema:

‘ID’, ‘Country of Origin’, ‘Farm Name’, ‘Lot Number’,
‘Mill’, ‘ICO Number’, ‘Company’, ‘Altitude’, ‘Region’, ‘Producer’,
‘Number of Bags’, ‘Bag Weight’, ‘In-Country Partner’, ‘Harvest Year’,
‘Grading Date’, ‘Owner’, ‘Variety’, ‘Status’, ‘Processing Method’,
‘Aroma’, ‘Flavor’, ‘Aftertaste’, ‘Acidity’, ‘Body’, ‘Balance’,
‘Uniformity’, ‘Clean Cup’, ‘Sweetness’, ‘Overall’, ‘Defects’,
‘Total Cup Points’, ‘Moisture Percentage’, ‘Category One Defects’,
‘Quakers’, ‘Color’, ‘Category Two Defects’, ‘Expiration’,
‘Certification Body’, ‘Certification Address’, ‘Certification Contact’

Features used

As said before, I will be ignoring the majority of the features in order to focus only on those that pertain to coffee bean taste:
‘Aroma’, ‘Overall’, ‘Flavor’, ‘Aftertaste’, ‘Acidity’, ‘Body’, ‘Balance’, ‘Uniformity’, ‘Moisture Percentage’, ‘Quakers’.

Target variable defined:

This exercise predicts a simplified version of the ‘Category Two Defects’ variable available in the original schema. I converted this column into a binary target by mapping all observations with defects to 1, and keeping observations without defects at 0.

Simple algorithm — decision tree:

I will set the max depth of the decision tree to two because I want this exercise to focus not on the quality of the predictions, but rather on the value that leveraging dimensionality reduction for error analysis can provide.

Code cells pertaining to this section
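The original code cells are not reproduced here, but a minimal sketch of this step could look like the following. The CSV filename, and the target, prediction, and error column names are assumptions on my part; adjust them to your own download of the Kaggle dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Taste-related features listed above
FEATURES = ['Aroma', 'Overall', 'Flavor', 'Aftertaste', 'Acidity',
            'Body', 'Balance', 'Uniformity', 'Moisture Percentage', 'Quakers']

# Load the Kaggle CSV (filename/path is an assumption; point it at your local copy)
df = pd.read_csv('coffee_quality.csv')
df = df.dropna(subset=FEATURES + ['Category Two Defects'])

# Binary target: 1 if the lot has any Category Two Defects, 0 otherwise
df['target'] = (df['Category Two Defects'] > 0).astype(int)

X = df[FEATURES]
y = df['target']

# Deliberately shallow tree, fit on all the data (no train/test split),
# so that there are enough errors left to analyze
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(X, y)

df['prediction'] = model.predict(X)
df['error'] = (df['prediction'] != df['target']).astype(int)
print(df['error'].mean())  # share of misclassified observations
```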

Step 2: Finding underlying structures in the data through PCA

Through a PCA plot, we can see that our data indeed has an underlying structure:
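As a rough sketch of how such a plot can be produced (reusing the df, FEATURES, and error columns assumed in the previous step):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Two-component PCA over the taste features (already on a similar scale)
pca = PCA(n_components=2)
components = pca.fit_transform(df[FEATURES])
df['component_1'] = components[:, 0]
df['component_2'] = components[:, 1]

# Scatter plot of the two components, highlighting misclassified observations
fig, ax = plt.subplots(figsize=(8, 6))
correct = df[df['error'] == 0]
wrong = df[df['error'] == 1]
ax.scatter(correct['component_1'], correct['component_2'], alpha=0.4, label='Correct')
ax.scatter(wrong['component_1'], wrong['component_2'], alpha=0.8, label='Error')
ax.set_xlabel('Component 1')
ax.set_ylabel('Component 2')
ax.legend()
plt.show()
```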

The majority of the observations (and of the error) have a value of 0 or less for the first component. However, we can clearly see that the model seems to be performing well for observations to the right of that threshold. I personally find this view valuable because it gives us a bird's-eye view of the structures within our observations. Even though component 1 and component 2 have rather abstract definitions, the plot still raises the question of what makes the different component ranges different, as there is a discernible difference in error along the component 1 axis.

If we break it down into quantiles, we obtain the following summary of the percentage of observations with error and the total number of observations with error across the different component 1 quantile groups:
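A summary along those lines can be computed with pandas. This sketch assumes the component_1 and error columns from the snippets above, and the choice of five quantile groups is mine; the original analysis may have used a different number.

```python
import pandas as pd

# Bin component 1 into quantile groups and summarize the error per group
df['c1_quantile'] = pd.qcut(df['component_1'], q=5, labels=False)

summary = (
    df.groupby('c1_quantile')['error']
      .agg(error_rate='mean', errors='sum', observations='count')
)
print(summary)
```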

Step 3: Bringing this back to the original features

This seems great, but it is useless if we don't have a way to bring it back to an actionable context. This is why we'll proceed to plot the median of the different feature values for each of the quantiles that we previously explored. For this section of the analysis, feel free to leverage any tool you feel most comfortable with.

In order to plot all the features used in one single visualization, I used the min-max scaler transformer available in scikit-learn (and added a feature pertaining to the number of bags per bean type).
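A sketch of that visualization, assuming the quantile column from the previous snippet and pulling in the ‘Number of Bags’ column from the original schema:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

plot_cols = FEATURES + ['Number of Bags']

# Scale every feature to the [0, 1] range so they fit on one chart
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(df[plot_cols]),
    columns=plot_cols,
    index=df.index,
)
scaled['c1_quantile'] = df['c1_quantile']

# Median scaled value of each feature per component-1 quantile group
medians = scaled.groupby('c1_quantile').median()
medians.plot(figsize=(10, 6), marker='o')
plt.xlabel('Component 1 quantile group')
plt.ylabel('Median (min-max scaled)')
plt.show()
```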

The visualization hints that there are indeed some visible patterns across the component 1 scale. To the left (the cases where the model is having the most trouble predicting), the different flavors have higher scores as well as lower humidity. To the right, humidity is higher, flavor scores are lower, and there is also a higher number of bags for the different beans. From here, the data science team can communicate with the industry experts to ask whether there is any difference in the cases with low humidity, higher flavor scores, or a lower number of bags that could be driving the error.

For example, does a lower number of bags mean that the coffee beans in question had a lower number of samples for sensory scores than their counterparts? Could this under-representation be driving the error? What is the relationship between humidity and sensory flavor information, and why do beans with a lower number of bags seem to have higher humidity? From there, the industry experts can start considering the potential causes of the model error and brainstorm potential new variables or features with predictive value for these cases.

Conclusion

Machine learning is an art as well as a science, and it should not be decoupled from industry expertise during development. Tools such as dimensionality reduction can help provide a summarized view of the underlying structures of the data in order to spot the areas of said structure where the algorithm is having the most trouble predicting. From there, we can identify patterns within those groups in order to have a holistic view of the types of profiles the model struggles with the most. Here, industry experts can contribute their knowledge to find the leading causes of the error and help identify variables that could enable better predictions for these cases.
