Why should I compare recall to precision instead of the false positive rate when dealing with imbalanced data?

Ana Preciado
6 min read · Feb 13, 2021

Background photo created by bedneyimages — www.freepik.com

You may have heard that commonly used classification metrics such as the ROC curve and the Gini coefficient, which compare recall (the true positive rate) to the false positive rate, are not as well suited to evaluating imbalanced datasets as the precision-recall curve (which instead compares recall to precision). Moreover, you may feel as confused as I did when I first started wrapping my head around all of this. TPR, FPR, recall, precision? How does everything connect? What is the difference? Why does it matter? Through this article, we'll gain some intuition about these otherwise very dry and technical concepts, and better understand why using one set of comparisons versus the other (TPR vs. FPR in the ROC curve, TPR vs. precision in the precision-recall curve) can make all the difference when it comes to suiting your analysis needs.

Let's begin by crafting our own highly imbalanced dataset and training a dummy model for our evaluation exercises:

we import the packages we'll be using
we create a highly imbalanced dataset by selecting only 10% of the positive observations in the original dataset
we train a random forest model using our highly imbalanced dataset
we make predictions and save them as a separate dataframe for future analysis (a sketch of these steps follows below)
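A minimal sketch of those four steps, assuming the breast cancer dataset from scikit-learn as a stand-in for the original data; the variable names (preds, y_score) and the 10% downsampling seed are assumptions of this sketch rather than the original code:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Load a stand-in dataset (the article does not name its original data source)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 2. Keep only ~10% of the positive observations to create a heavy imbalance
rng = np.random.RandomState(42)
pos_idx = y[y == 1].index
keep_pos = rng.choice(pos_idx, size=int(0.1 * len(pos_idx)), replace=False)
keep = y[y == 0].index.union(pd.Index(keep_pos))
X_imb, y_imb = X.loc[keep], y.loc[keep]

# 3. Train a random forest on the imbalanced data
X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.3, stratify=y_imb, random_state=42
)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# 4. Save predictions and predicted probabilities in a dataframe for later analysis
preds = pd.DataFrame({
    "y_true": y_test.values,
    "y_pred": clf.predict(X_test),
    "y_score": clf.predict_proba(X_test)[:, 1],  # probability of the positive class
})
```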

What is precision and what is recall? What do I gain by contrasting these two metrics when evaluating a model?

Precision refers to the proportion of all predicted positive observations that are truly positive:

True Positives / (True Positives + False Positives)

Recall is the proportion of all truly positive observations that were predicted to be positive:

True Positives / (True Positives + False Negatives)
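Both definitions fall straight out of the confusion matrix; a quick sketch in code, reusing the hypothetical preds dataframe from above:

```python
from sklearn.metrics import confusion_matrix

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(preds["y_true"], preds["y_pred"]).ravel()

precision = tp / (tp + fp)  # of everything we predicted positive, how much really was
recall = tp / (tp + fn)     # of everything truly positive, how much we captured
print(f"precision={precision:.3f}, recall={recall:.3f}")
```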

Why is the relationship between the two worth analyzing?

Because it would be easy to maximize recall by simply predicting every observation as positive. We would have a useless model that labels all the truly positive observations correctly but also produces a large number of false positives. To get a more holistic picture, we want to know not only the recall (the proportion of all truly positive observations that we predicted to be positive) but also the precision (the proportion of the observations we predicted positive that were truly positive).

Likewise, we don't only want to know how many of the observations we predicted positive were correct; we would also like to know how many of the total truly positive observations the model captures as such. If we only look at precision, we may end up with a model that is almost always correct when it predicts an observation as positive, but that also has a large number of false negatives (such a model is likely only flagging the observations that are too obviously positive, and why would you need a machine learning model for that?). This is why it is important to complement precision with recall.
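The useless "predict everything as positive" model described above is easy to check in code: recall is 1.0 by construction, while precision collapses to the share of positives in the data. A sketch, again using the hypothetical preds dataframe:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# An "always positive" baseline: perfect recall, but precision equals the positive rate
always_positive = np.ones_like(preds["y_true"])
print("recall:   ", recall_score(preds["y_true"], always_positive))
print("precision:", precision_score(preds["y_true"], always_positive))
```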

Note: Unlike in the ROC curve, a higher precision does not necessarily mean a lower recall (or a higher recall a lower precision). In some cases, recall may stay constant while precision fluctuates, because the denominator for recall (TP + FN) does not depend on the threshold we are using to plot each point of the precision-recall curve. As the scikit-learn documentation explains:

“The relationship between recall and precision can be observed in the stairstep area of the plot — at the edges of these steps a small change in the threshold considerably reduces precision, with only a minor gain in recall.”
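One way to see this behaviour (a sketch, assuming the predicted probabilities saved earlier in preds["y_score"]) is to print a few of the thresholds that scikit-learn's precision_recall_curve evaluates; recall only moves when a threshold crosses a truly positive observation, so it can stay flat while precision changes:

```python
from sklearn.metrics import precision_recall_curve

# One precision/recall pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(preds["y_true"], preds["y_score"])

for p, r, t in list(zip(precision, recall, thresholds))[:5]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```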

Other than summarizing the precision-recall curve with a metric such as the area under the curve (AUC), we can also use the average precision and the F1 score. The average precision is the precision averaged with weights given by the change in recall, and the F1 score is the harmonic mean of precision and recall. The exact formulas for each are:

Average precision = Sum( (R_n - R_(n-1)) * P_n )

where R_n and P_n are the recall and precision at the nth threshold, and R_(n-1) is the recall at the previous threshold.

F1 score = 2 * (Precision * Recall) / (Precision + Recall)
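Both summaries are available in scikit-learn, and the average precision can also be reproduced by hand from the curve itself. A sketch, again using the hypothetical preds dataframe:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, precision_recall_curve

# Average precision from the scores, and the same sum computed manually
ap = average_precision_score(preds["y_true"], preds["y_score"])
precision, recall, _ = precision_recall_curve(preds["y_true"], preds["y_score"])
ap_manual = -np.sum(np.diff(recall) * precision[:-1])  # Sum( (R_n - R_(n-1)) * P_n )

# F1 is computed from the hard 0/1 predictions rather than the scores
f1 = f1_score(preds["y_true"], preds["y_pred"])

print(f"AP={ap:.3f}  AP (manual)={ap_manual:.3f}  F1={f1:.3f}")
```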

How does the precision-recall curve compare with the ROC curve and Gini?

Just like the precision-recall curve explores the relationship between precision and recall, the ROC curve explores the relationship between recall (also called the true positive rate) and the false positive rate.

As we know from before, recall = TP / (TP + FN): the proportion of all truly positive observations that were predicted to be positive.

The false positive rate = FP / (FP + TN): the proportion of all truly negative observations that were predicted to be positive.
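Both rates can be read off the same confusion matrix counts used earlier, or taken per threshold from scikit-learn's roc_curve. A sketch reusing tn, fp, fn, tp and preds from the earlier snippets:

```python
from sklearn.metrics import roc_curve

# One (FPR, TPR) pair per threshold, mirroring the two formulas above
fpr, tpr, thresholds = roc_curve(preds["y_true"], preds["y_score"])

# The same quantities at the default 0.5 cut-off, straight from the confusion matrix
tpr_at_05 = tp / (tp + fn)  # recall / true positive rate
fpr_at_05 = fp / (fp + tn)  # false positive rate
print(f"TPR={tpr_at_05:.3f}  FPR={fpr_at_05:.3f}")
```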

Caveat: if you have a highly imbalanced dataset, you'll have a much larger number of true negatives than true positives, which makes the denominator for the false positive rate much larger than the denominator for the recall. In addition, you'll likely have a pessimistic model (a model that predicts very few observations as positive because it was trained with few truly positive observations). As a result, reaching a high false positive rate at any given threshold becomes considerably more difficult, and the ROC will look inflated, since the recall at each threshold corresponds to a very small false positive rate.

In other words, the ROC focuses on what proportion of the total positive observations and of the total negative observations the positively predicted observations represent at the different prediction thresholds. With a highly imbalanced dataset, this may lead us to believe that we have a good model (our positively predicted observations represent a much larger proportion of the total truly positive observations than of the total truly negative observations) when in reality half the observations our model predicts as positive are not. To avoid being misled this way, a precision-recall curve analysis is preferable for assessing model performance.

Let’s look at the ROC and precision-recall curves for the model we created:
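A sketch of how the two plots can be produced; this assumes scikit-learn 1.0 or newer, where the Display classes replace the older plot_roc_curve / plot_precision_recall_curve helpers, and it works directly from the saved predictions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(12, 5))

# Both displays accept raw scores, so the fitted classifier itself is not needed
RocCurveDisplay.from_predictions(preds["y_true"], preds["y_score"], ax=ax_roc)
PrecisionRecallDisplay.from_predictions(preds["y_true"], preds["y_score"], ax=ax_pr)

ax_roc.set_title("ROC curve")
ax_pr.set_title("Precision-recall curve")
plt.show()
```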

According to the ROC, this model is almost perfect: we have an area under the curve of 0.99 and a Gini of 0.98. However, the precision-recall curve shows that, depending on the threshold, our model's precision can drop to 85% (15% of the observations predicted to be positive are not really positive) with a recall of around 85% as well (15% of the total truly positive observations were not predicted as positive by the model). While this is indeed a good model, it is not perfect, and precision and recall allow us to gain a better understanding than the ROC of its performance on the minority class (positives in our case).
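For reference, the Gini quoted above is just a rescaling of the ROC AUC (Gini = 2 * AUC - 1, consistent with the 0.99 and 0.98 figures). A sketch of how both might be computed, with the exact numbers depending on the data and random seed:

```python
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(preds["y_true"], preds["y_score"])
gini = 2 * auc - 1  # Gini coefficient as a simple rescaling of ROC AUC
print(f"AUC={auc:.2f}  Gini={gini:.2f}")
```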

Technical note:

If you would like to avoid using the plot_precision_recall_curve() function offered by scikit-learn, it is possible to use any other plotting package. Just make sure you set the axis limits correctly, as packages such as seaborn will tend to assign them for you.

Here's an example of a precision-recall curve gone wrong because of carelessness with regard to the x and y axis limits. If you would like to plot the exact same precision-recall curve without directly using the scikit-learn function (maybe you don't have the classifier at hand), here's a code extract from the function's source code that will help you:
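A minimal sketch along those lines, using precision_recall_curve plus a matplotlib step plot with the axis limits pinned to the unit interval, rather than the verbatim source extract:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, _ = precision_recall_curve(preds["y_true"], preds["y_score"])
ap = average_precision_score(preds["y_true"], preds["y_score"])

fig, ax = plt.subplots()
ax.step(recall, precision, where="post")  # the curve is a step function, not a smooth line
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_title(f"Precision-recall curve (AP = {ap:.2f})")

# Pin both axes so the plotting library cannot pick misleading limits
ax.set_xlim(0.0, 1.0)
ax.set_ylim(0.0, 1.05)
plt.show()
```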

Thank you for your time. I hope you found this article useful. I'm on LinkedIn; feel free to write me with any comments or questions.
