1. Introduction

In this tutorial, we’ll look at how to calculate the F-1 score in a multi-class classification problem. Unlike binary classification, a multi-class problem produces an F-1 score for each class separately.

We’ll also explain how to compute an averaged F-1 score for the classifier as a whole in Python, in case a single score is desired.

2. F-1 Score

The F-1 score is one of the most common measures of how successful a classifier is. It’s the harmonic mean of two other metrics: precision and recall. In a binary classification problem, the formula is:

    [\textrm{F-1 Score} = \frac{2 \times \textrm{Precision} \times \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}}]
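In Python, this is a one-liner; here’s a minimal sketch (the helper name f1 is ours, not a library function):

def f1(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

For example, f1(0.5, 1.0) evaluates to about 0.667, lower than the arithmetic mean of 0.75, since the harmonic mean penalizes an imbalance between the two metrics.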

The F-1 Score metric is preferable when:

  • We have imbalanced class distribution
  • We’re looking for a balanced measure between precision and recall (Type I and Type II errors)

As the F-1 score is more sensitive to data distribution, it’s a suitable measure for classification problems on imbalanced datasets.

3. Multi-Class F-1 Score Calculation

For a multi-class classification problem, we don’t calculate an overall F-1 score. Instead, we calculate the F-1 score per class in a one-vs-rest manner. In this approach, we rate each class’s success separately, as if there were a distinct classifier for each class.
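To make this concrete, here’s a minimal sketch (assuming the confusion matrix cm is a square NumPy array with actual classes in rows and predicted classes in columns; the helper name per_class_f1 is ours) that derives the one-vs-rest counts for every class at once:

import numpy as np

def per_class_f1(cm):
    # cm: square confusion matrix, actual classes in rows, predictions in columns
    tp = np.diag(cm)               # correctly classified samples of each class
    fp = cm.sum(axis=0) - tp       # predicted as the class, but actually another
    fn = cm.sum(axis=1) - tp       # actually the class, but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)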

As an illustration, let’s consider the confusion matrix below with a total of 127 samples:

[Confusion matrix for the four classes a, b, c, and d]

Now let’s calculate the F-1 score for the first class, which is class a. We first need to calculate the precision and recall values:

    [\textrm{Precision}(class=a) = \frac{TP(class=a)}{TP(class=a) + FP(class=a)} = \frac{50}{53} = 0.943]

    [\textrm{Recall}(class=a) = \frac{TP(class=a)}{TP(class=a) + FN(class=a)} = \frac{50}{108} = 0.463]

Then, we apply the formula for class a:

    [\textrm{F-1 Score}(class=a) = \frac{2 \times \textrm{Precision}(class=a) \times \textrm{Recall}(class=a)}{\textrm{Precision}(class=a) + \textrm{Recall}(class=a)} = \frac{2 \times 0.943 \times 0.463}{0.943 + 0.463} = 0.621]

Similarly, we calculate the precision and recall values for the other classes:

    [\textrm{Precision}(class=b) = \frac{8}{35} = 0.228 \ \ \textrm{Recall}(class=b) = \frac{8}{13} = 0.615]

    [\textrm{Precision}(class=c) = \frac{4}{26} = 0.154 \ \ \textrm{Recall}(class=c) = \frac{4}{4} = 1.000]

    [\textrm{Precision}(class=d) = \frac{1}{13} = 0.077 \ \ \textrm{Recall}(class=d) = \frac{1}{2} = 0.500]

These calculations then lead to the per-class F-1 scores:

    [\textrm{F-1 Score}(class=b) = \frac{2 \times 0.228 \times 0.615}{0.228 + 0.615} = 0.333]

    [\textrm{F-1 Score}(class=c) = \frac{2 \times 0.154 \times 1.000}{0.154 + 1.000} = 0.267]

    [\textrm{F-1 Score}(class=d) = \frac{2 \times 0.077 \times 0.500}{0.077 + 0.500} = 0.133]
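As a sanity check, we can reproduce these numbers in a few lines of Python from the counts in the example: the diagonal entries (true positives), the predicted totals per class, and the actual totals per class:

import numpy as np

tp = np.array([50, 8, 4, 1])            # true positives per class (diagonal)
predicted = np.array([53, 35, 26, 13])  # samples predicted as each class
actual = np.array([108, 13, 4, 2])      # actual samples of each class

precision = tp / predicted
recall = tp / actual
print(2 * precision * recall / (precision + recall))
# [0.62111801 0.33333333 0.26666667 0.13333333]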

4. Implementation

In Python, we can use the f1_score function from the scikit-learn library to calculate the per-class scores of a multi-class classification problem.

We need to set the average parameter to None to output the per-class scores.

For instance, let’s assume we have a series of true y values (y_true) and predicted y values (y_pred). Then, let’s output the per-class F-1 scores:

from sklearn.metrics import f1_score

# average=None returns the score of each class separately
f1_score(y_true, y_pred, average=None)

In our case, the computed output is an array of per-class scores, listed in sorted label order (a, b, c, d):

array([0.62111801, 0.33333333, 0.26666667, 0.13333333])

On the other hand, if we want a single F-1 score for easier comparison, we can use one of the averaging methods. To do so, we set the average parameter accordingly.

Here we’ll examine three common averaging methods.

The first method, micro, counts the true positives, false positives, and false negatives globally, over all classes:

f1_score(y_true, y_pred, average='micro')

In our example, we get the output:

0.49606299212598426
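Since every sample belongs to exactly one class, each misclassified sample counts as one false positive and one false negative globally, so the micro-averaged F-1 score reduces to the overall accuracy: the sum of the diagonal entries divided by the total number of samples:

    [\textrm{F-1 Score}_{micro} = \frac{50 + 8 + 4 + 1}{127} = 0.496]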

Another averaging method, macro, takes the unweighted average of the per-class F-1 scores:

f1_score(y_true, y_pred, average='macro')

gives the output:

0.33861283643892337
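This is simply the mean of the four per-class scores we calculated earlier:

    [\textrm{F-1 Score}_{macro} = \frac{0.621 + 0.333 + 0.267 + 0.133}{4} = 0.339]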

Note that the macro method treats all classes as equal, independent of the sample sizes.

As expected, the micro average is higher than the macro average since the F-1 score of the majority class (class a) is the highest.

The third averaging method we’ll consider in this tutorial is weighted. Here, the per-class F-1 scores are averaged using the number of instances in each class as weights:

f1_score(y_true, y_pred, average='weighted')

generates the output:

0.5728142677817446
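Spelling this out with the class supports from our example (108, 13, 4, and 2 actual samples):

    [\textrm{F-1 Score}_{weighted} = \frac{108 \times 0.621 + 13 \times 0.333 + 4 \times 0.267 + 2 \times 0.133}{127} = 0.573]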

In our case, the weighted average gives the highest F-1 score.

Whether to use averaging, and which averaging method to choose, depends on the problem at hand.

5. Conclusion

In this tutorial, we’ve covered how to calculate the F-1 score in a multi-class classification problem.

Firstly, we described the one-vs-rest approach to calculating per-class F-1 scores.

Also, we’ve covered three ways of calculating a single average score in Python.