1. Introduction
Ensemble methods in machine learning involve combining multiple classifiers to improve the accuracy of predictions.
In this tutorial, we’ll explain the difference between hard and soft voting, two popular ensemble methods.
2. Ensemble Classifiers
In traditional machine learning, a single classifier is trained on available data. However, each classifier family has assumptions about the data, and its performance depends on how well these assumptions are met. Additionally, training a model from the same classification family on different subsets of data can result in models of varying performance.
To address this issue, we can train multiple classifiers and combine their outputs when classifying new objects. This usually improves performance but at the cost of increased processing time.
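For instance, here’s a minimal sketch of such an ensemble using scikit-learn’s VotingClassifier (the synthetic dataset and the three base classifiers are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for "available data"
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Three base classifiers from different families
base_classifiers = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

# Combine their outputs by voting when classifying new objects
ensemble = VotingClassifier(estimators=base_classifiers, voting="hard")
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```

Setting voting="soft" switches the same object to the probability-averaging strategy we describe below.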
3. Hard Voting
Let $f_1, f_2, \ldots, f_n$ be the classifiers we trained using the same dataset or different subsets thereof. Each $f_i$ returns a class label $f_i(x)$ when we feed it a new object $x$.
In hard voting, we combine the outputs by returning the mode, i.e., the most frequently occurring label among the base classifiers’ outputs.
For example, if $n = 3$ and $f_1(x) = 1$, $f_2(x) = 1$, and $f_3(x) = 0$, the hard-voting ensemble outputs 1 as it’s the mode.
The final output doesn’t need to be the majority label: in multiclass problems, it can happen that no label achieves an outright majority of the votes.
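To make the mechanics concrete, here’s a small numerical sketch of hard voting over made-up base predictions (the label arrays are purely illustrative):

```python
import numpy as np

# Hypothetical labels predicted by three base classifiers for four objects
predictions = np.array([
    [1, 0, 2, 1],  # f1
    [1, 2, 2, 0],  # f2
    [0, 0, 2, 1],  # f3
])

# Hard voting: for each object (column), pick the most frequent label (the mode)
hard_votes = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, predictions)
print(hard_votes)  # [1 0 2 1]
```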
4. Soft Voting
In soft voting, the base classifiers output probabilities or numerical scores.
4.1. Binary Classification
For instance, in binary classification, the output of logistic regression can be interpreted as the probability that the object belongs to class 1. Similarly, an SVM classifier’s score is the signed distance from the object being classified to the separating hyperplane.
A soft-voting ensemble calculates the average score (or probability) and compares it to a threshold value.
For example, let $p_1 = 0.6$, $p_2 = 0.55$, and $p_3 = 0.1$ be the estimated probabilities that $x$ belongs to class 1. Soft voting outputs a mean probability lower than 0.5:

$$\frac{0.6 + 0.55 + 0.1}{3} \approx 0.417 < 0.5$$
This soft-voting ensemble would assign the label 0 to $x$, in contrast to the hard-voting ensemble from the previous example.
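Here’s a minimal sketch of this averaging-and-thresholding step, reusing the illustrative probabilities from the example above:

```python
import numpy as np

# Illustrative class-1 probabilities from the three base classifiers above
probs = np.array([0.6, 0.55, 0.1])

# Soft voting: average the probabilities, then compare to the 0.5 threshold
soft_label = int(probs.mean() >= 0.5)        # mean ~0.417 -> label 0

# Hard voting on the same outputs thresholds each probability first, then takes the mode
hard_votes = (probs >= 0.5).astype(int)      # [1, 1, 0]
hard_label = int(hard_votes.sum() > len(hard_votes) / 2)  # majority -> label 1

print(soft_label, hard_label)  # 0 1
```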
4.2. Do We Always Use Means?
We aggregate the results by averaging the base scores.
However, it’s also possible to use the median instead of the mean. Since the median is less sensitive to outliers, it will usually represent the underlying set of outputs better than the mean.
Still, that doesn’t imply that the median is always a better choice. For example, let’s say that $f_1$ and $f_2$ estimate near-zero probabilities that the input object $x$ is positive. The remaining classifiers return probabilities greater than 0.5, but none is as confident that $x$ is positive as $f_1$ and $f_2$ are that it isn’t. It may make sense to trust the two classifiers that are pretty confident over the rest. The rationale is that their evidence may be much stronger, which is why their probabilities are near zero.
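The following sketch contrasts the two aggregation rules on made-up probabilities matching this scenario (two near-certain negative votes, three weakly positive ones):

```python
import numpy as np

# Hypothetical class-1 probabilities: f1 and f2 are almost certain x is negative,
# while the other three lean positive with little confidence
probs = np.array([0.02, 0.03, 0.55, 0.6, 0.65])

mean_label = int(np.mean(probs) >= 0.5)      # mean = 0.37   -> label 0
median_label = int(np.median(probs) >= 0.5)  # median = 0.55 -> label 1

print(mean_label, median_label)
```

Here, the mean sides with the two highly confident classifiers, while the median follows the weak majority.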
4.3. Multiclass Classification
In this scenario, each underlying classifier outputs a vector whose $j$th coordinate is the estimated probability that the input object belongs to the $j$th class.
For example, with three classes and three base classifiers, we might get:

$$p^{(1)}(x) = [0.6, 0.3, 0.1] \quad p^{(2)}(x) = [0.4, 0.4, 0.2] \quad p^{(3)}(x) = [0.5, 0.1, 0.4]$$

To combine them, we average the vectors element-wise:

$$\bar{p}(x) = \left[ \frac{0.6 + 0.4 + 0.5}{3}, \; \frac{0.3 + 0.4 + 0.1}{3}, \; \frac{0.1 + 0.2 + 0.4}{3} \right] \approx [0.5, 0.267, 0.233]$$
The first coordinate is the maximum, so we assign $x$ to the first class.
We can use the vector approach in binary classification as well. In that case, we’ll deal with two-dimensional vectors.
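Here’s the element-wise averaging as a short sketch, using the illustrative probability vectors from above:

```python
import numpy as np

# Rows: base classifiers, columns: classes (the illustrative probability vectors above)
prob_vectors = np.array([
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
    [0.5, 0.1, 0.4],
])

# Soft voting: average element-wise, then pick the class with the highest mean probability
mean_probs = prob_vectors.mean(axis=0)        # [0.5, 0.267, 0.233]
predicted_class = int(np.argmax(mean_probs))  # 0, i.e., the first class

print(mean_probs, predicted_class)
```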
5. Conclusion
In this article, we talked about hard and soft voting.
Hard-voting ensembles output the mode of the base classifiers’ predictions, whereas soft-voting ensembles average predicted probabilities (or scores).