*Figure: precision and recall visualised. Image source: Wikimedia Commons.*

Definition

Let us consider an experiment with $P$ positive instances and $N$ negative instances for some condition. The four possible outcomes can be arranged in a confusion matrix as follows:

| Total population | Predicted Positive (PP) | Predicted Negative (PN) |          |
| ---------------- | ----------------------- | ----------------------- | -------- |
| Positive (P)     | True Positive (TP)      | False Negative (FN)     | Recall   |
| Negative (N)     | False Positive (FP)     | True Negative (TN)      |          |
|                  | Precision               |                         | F1 score |

Precision and recall are then defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
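For concreteness, here is a minimal Python sketch (not from the original post; the function names are my own) that computes both metrics directly from confusion-matrix counts:

```python
def precision(tp: int, fp: int) -> float:
    # Of all positive predictions (TP + FP), how many were correct?
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    # Of all actual positives (TP + FN), how many were found?
    return tp / (tp + fn)


# Example confusion matrix: TP = 8, FP = 2, FN = 4, TN = 6
print(precision(tp=8, fp=2))  # 0.8
print(recall(tp=8, fn=4))     # 0.666...
```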

Intuition

Precision measures the ratio of the number of true positives ($TP$, i.e., instances that are actually positive and are also predicted to be positive) to the number of instances that are predicted to be positive ($TP + FP$). We can interpret this as a measure of the quality of the model's positive predictions:

What proportion of positive predictions was actually correct?

That is, precision focuses on the quality of the model’s predictions.

On the other hand, recall measures the ratio of the number of true positives ($TP$) to the number of instances that are actually positive ($TP + FN$). This can be interpreted as:

What proportion of actual positives was identified correctly?

That is, recall focuses on the dataset: the completeness of the predictions.
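To make the two questions above concrete, here is a small sketch using scikit-learn's built-in scorers (assuming scikit-learn is available; the example labels below are made up):

```python
from sklearn.metrics import precision_score, recall_score

# 1 = positive, 0 = negative; made-up labels for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# "What proportion of positive predictions was actually correct?"
print(precision_score(y_true, y_pred))  # 2 TP / 3 PP = 0.666...

# "What proportion of actual positives was identified correctly?"
print(recall_score(y_true, y_pred))     # 2 TP / 4 P = 0.5
```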

Limitations

However, precision and recall are quite limited on their own, because each metric alone is easy to ‘cheat’ for a better score.

Precision

Precision does not take into account negative predictions. This allows the following type of cheating:

  1. Make exactly one positive prediction and make sure it is correct, so that $TP = 1$.
  2. Predict everything else as negative, so that every remaining instance is a $TN$ or an $FN$ (in particular, $FP = 0$).
  3. Then, $\text{Precision} = \frac{TP}{TP + FP} = \frac{1}{1 + 0} = 1$, even though almost every actual positive was missed (see the numeric sketch below).
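A small numeric sketch of this cheat, with made-up counts (100 actual positives, a single correct positive prediction):

```python
# Precision cheat: one correct positive prediction, everything else negative.
tp, fp = 1, 0          # the single positive prediction is correct
fn = 99                # the other 99 actual positives are predicted negative

precision = tp / (tp + fp)  # 1 / 1 = 1.0 -> "perfect" precision
recall = tp / (tp + fn)     # 1 / 100 = 0.01 -> almost nothing was found
print(precision, recall)
```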

Recall

Recall, in turn, takes no account of false positives, which allows an even simpler cheat:

  1. Predict every instance as positive, so that $FN = 0$ and $TP = P$.
  2. Then, $\text{Recall} = \frac{TP}{TP + FN} = \frac{P}{P + 0} = 1$, regardless of how many false positives this produces (see the sketch below).
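And a corresponding sketch of the recall cheat, again with made-up counts (100 positives, 900 negatives, everything predicted positive):

```python
# Recall cheat: predict every instance as positive.
p, n = 100, 900        # actual positives and negatives (made-up counts)
tp, fn = p, 0          # every actual positive is "found"
fp = n                 # ...but every actual negative becomes a false positive

recall = tp / (tp + fn)     # 100 / 100 = 1.0 -> "perfect" recall
precision = tp / (tp + fp)  # 100 / 1000 = 0.1 -> mostly wrong predictions
print(recall, precision)
```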

Notice that precision and recall complement each other: cheating one of them wrecks the other. If precision is pushed to 1 by making a single correct positive prediction, recall collapses to nearly 0; if recall is pushed to 1 by predicting everything positive, precision drops to the base rate of positives in the dataset. The natural question is then whether we can combine the two into a single score that rewards both simultaneously. This is where the F1 score comes into play.

References