1. Introduction
There are several metrics for evaluating machine-learning (ML) models. One that we often calculate when analyzing classifiers is the $F_1$ score, which combines precision and recall into a single value.
In this tutorial, we’ll talk about its generalization, the $F_{\beta}$ score, which can give more weight to either recall or precision.
2. The F1 Score
The $F_1$ score of a classifier is the harmonic mean of its precision $p$ and recall $r$:

(1)   \[ F_1 = \frac{2}{\frac{1}{p} + \frac{1}{r}} = \frac{2 \, p \, r}{p + r} \]
It’s useful because it’s high only when both scores are large, as we can see in its contour plot. Since it gives equal weight to recall and precision, the contours are symmetric around the 45-degree line $p = r$.
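To make Equation (1) concrete, here’s a minimal sketch (using scikit-learn and made-up labels, purely for illustration) that computes the harmonic mean by hand and checks it against the library’s f1_score:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up binary labels, purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 2 TP / (2 TP + 1 FP) = 2/3
r = recall_score(y_true, y_pred)     # 2 TP / (2 TP + 2 FN) = 1/2

# Harmonic mean of p and r, as in Equation (1).
print(2 * p * r / (p + r))           # ~0.5714
print(f1_score(y_true, y_pred))      # same value
```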
2.1. What if Precision and Recall Aren’t Equally Important?
However, there are cases where one of the scores is more important than the other.
We care more about recall if a false negative is a more severe error than a false positive. Automated diagnostic ML tools in medicine illustrate this: there, a false negative is a missed condition, which could be fatal for the patient, whereas a false-positive diagnosis induces stress but can be ruled out with additional testing.
Conversely, precision is more important when a false positive has the higher cost. That’s the case in spam detection. Letting a spam e-mail appear in the inbox may annoy the user, but marking a legitimate e-mail as spam and sending it to trash could result in the loss of a job opportunity.
In such applications, we’d like to have a metric that takes the relative importance of $r$ and $p$ into account. The $F_{\beta}$ score does precisely that.
3. The F-Beta Score
The common formulation of $F_{\beta}$ is:

(2)   \[ F_{\beta} = (1 + \beta^2) \, \frac{p \cdot r}{\beta^2 p + r} \]
It’s a weighted harmonic mean of $p$ and $r$ which uses $1$ and $\beta^2$ as the weights:

(3)   \[ F_{\beta} = \frac{1 + \beta^2}{\frac{1}{p} + \frac{\beta^2}{r}} \]
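To verify that the two forms agree, we can multiply the numerator and denominator of (3) by $p \, r$:

\[ \frac{1 + \beta^2}{\frac{1}{p} + \frac{\beta^2}{r}} = \frac{(1 + \beta^2) \, p \, r}{r + \beta^2 p} = (1 + \beta^2) \, \frac{p \cdot r}{\beta^2 p + r} \]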
If $\beta > 1$, the recall is $\beta$ times more important than precision, and if $\beta < 1$, it’s the other way around. As the contours for $\beta = 2$ show, we can get a high $F_{\beta}$ score if the recall is high enough, even if the precision is low, which aligns with our requirements.
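As a quick illustration of that behavior, here’s a minimal sketch (again scikit-learn with made-up labels) that scores an over-predicting, high-recall classifier with different values of $\beta$:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# A made-up classifier that over-predicts the positive class:
# recall is perfect, but precision is mediocre.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

print(precision_score(y_true, y_pred))        # 4/7 ~ 0.571
print(recall_score(y_true, y_pred))           # 4/4 = 1.0
print(fbeta_score(y_true, y_pred, beta=2))    # ~0.870: rewards the high recall
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.625: penalizes the low precision
```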
But why does $\beta^2$ figure in the equations instead of $\beta$? Isn’t the latter more intuitive?
3.1. Relative Importance of Precision and Recall
The reason we have $\beta^2$ instead of $\beta$ lies in how the relative importance was defined when $F_{\beta}$ was first formulated.
In general, the weighted harmonic mean of $p$ and $r$ using $1$ and $w$ as the weights is:

(4)   \[ F_w = \frac{1 + w}{\frac{1}{p} + \frac{w}{r}} = \frac{(1 + w) \, p \, r}{r + w \, p} \]
To get $F_{\beta}$ from $F_w$, we require the latter to satisfy the condition of relative importance. More precisely, we want $w$ to be such that, at the points at which $p$ and $r$ contribute equally to $F_w$, $r$ is $\beta$ times $p$.
Mathematically, that means that the ratio $\frac{r}{p}$ should be equal to $\beta$ when the partial derivatives $\frac{\partial F_w}{\partial p}$ and $\frac{\partial F_w}{\partial r}$ are the same.
3.2. Derivation
Let’s first find the derivatives:

(5)   \[ \frac{\partial F_w}{\partial p} = \frac{(1 + w) \, r^2}{(r + w \, p)^2} \qquad \frac{\partial F_w}{\partial r} = \frac{(1 + w) \, w \, p^2}{(r + w \, p)^2} \]
From $\frac{\partial F_w}{\partial p} = \frac{\partial F_w}{\partial r}$, we get:

(6)   \[ r^2 = w \, p^2 \]
Requiring the ratio $\frac{r}{p}$ to be $\beta$, we solve Equation (6) for $w$:

(7)   \[ w = \frac{r^2}{p^2} = \beta^2 \]
Plugging $w = \beta^2$ into the weighted harmonic mean (4), we get $F_{\beta}$ as defined by Equations (2) and (3).
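We can also let a computer algebra system re-derive the result. Here’s a small sketch with SymPy (the symbol names are mine) that follows the same steps:

```python
import sympy as sp

p, r, w, beta = sp.symbols("p r w beta", positive=True)

# Weighted harmonic mean of p and r with weights 1 and w, as in Equation (4).
F_w = (1 + w) / (1 / p + w / r)

# Equal contributions: the condition dF/dp = dF/dr (Equations (5) and (6)).
eq = sp.Eq(sp.diff(F_w, p), sp.diff(F_w, r))

# Substitute the relative-importance requirement r = beta * p and solve for w.
print(sp.solve(eq.subs(r, beta * p), w))  # [beta**2], i.e. Equation (7)
```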
3.3. The Effect of Importance
Let’s analyze what happens to $F_{\beta}$ as we vary $\beta$.
Setting $\beta$ to 1, we get the usual $F_1$ score. That covers the case of $p$ and $r$ having equal weights.
If only recall is important, we let $\beta \to \infty$. In that case, we expect $F_{\beta}$ to reduce to $r$. Taking the limit, we get:

(8)   \[ \lim_{\beta \to \infty} F_{\beta} = \lim_{\beta \to \infty} \frac{(1 + \beta^2) \, p \, r}{\beta^2 p + r} = \lim_{\beta \to \infty} \frac{\left( \frac{1}{\beta^2} + 1 \right) p \, r}{p + \frac{r}{\beta^2}} = \frac{p \, r}{p} = r \]
Similarly, if we care only about precision, we set $\beta$ to 0:

(9)   \[ F_0 = (1 + 0) \, \frac{p \cdot r}{0 \cdot p + r} = \frac{p \, r}{r} = p \]
The values of $\beta$ between 0 and $\infty$ represent the intermediate cases.
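A quick numerical check of both limits (again with scikit-learn and made-up labels):

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# As beta grows, F_beta approaches recall; as it shrinks toward 0, precision.
for beta in (0.01, 0.5, 1, 2, 100):
    print(beta, fbeta_score(y_true, y_pred, beta=beta))

print(precision_score(y_true, y_pred))  # ~0.571, matched by small beta
print(recall_score(y_true, y_pred))     # 1.0, matched by large beta
```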
4. Alternative Formulation of the F-Beta Score
A different definition of relative importance would yield a different score.
For instance, we could say that considering the recall score to be $\beta$ times more important than precision means that, when $p = r$, increasing $r$ improves $F_w$ $\beta$ times as much as an equal increase in $p$ does.
Mathematically, this translates to the following condition:

(10)   \[ \left. \frac{\partial F_w}{\partial r} \right|_{p = r} = \beta \left. \frac{\partial F_w}{\partial p} \right|_{p = r} \]
Solving for $w$ (using the derivatives (5) with $p = r$), we get:

(11)   \[ w = \beta \]

From there, we get a metric that is linear in $\beta$:

(12)   \[ \frac{1 + \beta}{\frac{1}{p} + \frac{\beta}{r}} = (1 + \beta) \, \frac{p \cdot r}{\beta \, p + r} \]
It too reduces to $F_1$ when $\beta = 1$, but it uses a different definition of relative importance than the version with $\beta^2$.
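To see how the two definitions differ in practice, here’s a small sketch (the helper names are mine) comparing the standard $\beta^2$ formulation (2) with the linear variant (12):

```python
# Standard F_beta (Equation (2)) vs. the linear-in-beta variant (Equation (12)).
def f_beta_squared(p, r, beta):
    return (1 + beta**2) * p * r / (beta**2 * p + r)

def f_beta_linear(p, r, beta):
    return (1 + beta) * p * r / (beta * p + r)

p, r = 0.5, 0.9
print(f_beta_squared(p, r, 2))  # ~0.776: pulls harder toward the recall of 0.9
print(f_beta_linear(p, r, 2))   # ~0.711
print(f_beta_squared(p, r, 1), f_beta_linear(p, r, 1))  # both ~0.643, i.e. F1
```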
5. Conclusion
In this article, we talked about the $F_{\beta}$ score. We use it to evaluate classifiers when recall and precision aren’t equally important. For instance, that’s the case in spam detection and medicine.
However, the relative importance we quantify with $\beta$ has a formal mathematical definition: at the points where the two scores’ partial derivatives are equal, recall is $\beta$ times as large as precision.