This problem is linearly separable. An SVM with a linear kernel would therefore learn the hyperplane that maximally separates these observations, which in this case is a line:

[Figure: the maximum-margin line separating the two classes]

2.4. When the Number of Observations Increases

In this example, the SVM uses 100% of the observations as support vectors. In doing so, it reaches maximum accuracy, whichever metric we use to assess it. The number of support vectors, however, can never be lower than 2, so this quantity does not appear problematic here.
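To make this concrete, here is a minimal sketch in Python with scikit-learn (not part of the original experiment, and using an arbitrary pair of points): with a single observation per class, the fitted linear SVM necessarily keeps both points as support vectors, yet still classifies them perfectly.

```python
# Minimal sketch: with one observation per class, a linear SVM has no choice
# but to use 100% of the observations as support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0]])  # one observation per class
y = np.array([0, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.n_support_)   # -> [1 1]: both observations are support vectors
print(clf.score(X, y))  # -> 1.0: maximum accuracy on this toy problem
```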

As the number of observations increases, however, we may decide that a line that strictly separates the two classes no longer suits us:

[Figure: a larger set of observations for which a strictly separating line is no longer satisfactory]

This is because, when the number of observations is non-trivial, we'll generally need to accept more sophisticated decision boundaries unless the problem preserves its linear separability:

[Figure: a more sophisticated, non-linear decision boundary]

If we don’t, we’re going to lose accuracy. If we do, we’re going to rely on an increased number of support vectors. Either way, we’re forced to trade the two off against each other.

3. A Concrete Example

3.1. Preliminary Data Analysis

We can now study a concrete implementation of an SVM for modeling a common machine learning dataset. As we do so, we’ll observe how the number of support vectors chosen by the SVM affects both its accuracy and the shape of its decision boundary.

The experiment that we discuss here is based upon the Iris dataset, which is a standard dataset for machine learning:

[Figure 1: scatter plot of two Iris features, with one class occupying its own region of the feature space]

If we take two of the four features associated with each observation, we can intuitively note how one of the classes sits in its own region of the feature space, while the other two don’t. This strongly suggests that a linear kernel may not be the most appropriate choice for the SVM, unless we use soft margins and a regularization parameter that allows a sufficient number of support vectors to be selected.
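A possible sketch of this preliminary analysis in Python follows. Note that the choice of petal length and petal width as the two features is our assumption; the article does not state which pair it plots.

```python
# Scatter plot of two Iris features, colored by class, to inspect separability.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, 2:4]  # assumed feature pair: petal length and petal width
y = iris.target

for label, name in enumerate(iris.target_names):
    plt.scatter(X[y == label, 0], X[y == label, 1], label=name)

plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.legend()
plt.show()
```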

The question then becomes: how do we choose the regularization parameter that gives us the best trade-off between the number of support vectors and the accuracy of the classifier for this particular dataset?

3.2. Linear Kernels

We can answer this type of question by studying how the accuracy of the classifier and the number of support vectors change as we slightly modify the regularization parameter. Let’s begin with the standard value of the regularization parameter, in this case C=1:

[Figure 2: linear SVM decision boundaries for C = 1]

From here onward, we’ll always use the Jaccard score as the accuracy metric. With a regularization parameter of 1, the SVM uses 81 support vectors and reaches an accuracy of 0.82 in classifying the flowers of the Iris dataset.
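A hedged sketch of this baseline experiment is shown below. The exact figures (81 support vectors, accuracy 0.82) depend on details the article does not spell out, such as the feature pair and the train/test split, so the numbers printed here may differ.

```python
# Linear SVM with C=1 on two Iris features: count the support vectors and
# compute the Jaccard score on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.metrics import jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target  # assumed feature pair
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

n_sv = clf.n_support_.sum()  # total number of support vectors across classes
acc = jaccard_score(y_test, clf.predict(X_test), average="micro")
print(f"support vectors: {n_sv}, Jaccard score: {acc:.2f}")
```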

3.3. Let’s Change the Regularization Parameter

Now, if we decrease C a little, we’d expect the number of support vectors to increase. This is because C is inversely related to the strength of the regularization: smaller values of C enforce a wider, softer margin. And indeed, if we set C to 0.1, we obtain this:

[Figure 3: linear SVM decision boundaries for C = 0.1]

The number of support vectors has gone up to 115, but the accuracy score has decreased to 0.80. This is because, as mentioned earlier, two of the classes occupy overlapping regions of the feature space. As the number of support vectors increases, the decision boundary to the bottom-right of the picture is moved slightly to the left, and more observations are thus incorrectly classified.

If we invert the line of reasoning, we may also expect the number of support vectors to go down as the parameter C increases. For a value of C equal to 10, for example, we obtain this graph:

[Figure 4: linear SVM decision boundaries for C = 10]

The number of support vectors, in this case, is 75, which is the lowest we’ve recorded so far. The accuracy, however, is 0.82, no higher than the one we observed for C=1. We might be interested to see whether, by incrementally varying the value of C, we reach higher values for the accuracy, lower values for the number of support vectors, or both. As a general rule, we’d expect the accuracy to go up as the number of support vectors goes up, but this may not always be the case.
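Continuing the same sketch, a small loop over a few values of C makes the comparison explicit; it assumes the imports, feature pair, and train/test split from the snippet above.

```python
# Assumes the imports and the train/test split from the previous sketch.
for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    acc = jaccard_score(y_test, clf.predict(X_test), average="micro")
    print(f"C={C}: support vectors={clf.n_support_.sum()}, Jaccard score={acc:.2f}")
```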

3.4. Accuracy and Number of Support Vectors

We can then replicate the procedure we followed above by increasing and decreasing the value of C. In doing so, we can study the relationship between the value of C, the number of support vectors identified, and the resulting accuracy score.

For this particular experiment, we tested all values of C in the interval [0.01, 10), sampled with a resolution of 0.01. This is the resulting chart:

[Figure 5: accuracy and number of support vectors as functions of C for the linear kernel]

Notice how the accuracy score never exceeds a certain value, 0.82 in this case. This is because the problem is intrinsically not linearly separable. We therefore need to accept that some observations will always be incorrectly classified by an SVM with a linear kernel.

Notice also, however, that increasing the number of support vectors does not necessarily improve the performance of the classifier. In fact, the trade-off between the number of support vectors and the accuracy that we assumed to hold does not necessarily apply to this particular case. Once the number of support vectors is above a certain threshold, around 110 in this case, adding further support vectors yields sharply diminishing marginal returns.
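The sweep itself can be sketched in a few lines. The step of 0.01 matches the resolution mentioned above, while the feature pair and the train/test split remain our assumptions.

```python
# Sweep over C in [0.01, 10) with a step of 0.01, recording for each fitted
# model the number of support vectors and the Jaccard score.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data[:, 2:4], iris.target, random_state=0
)

results = []
for C in np.arange(0.01, 10.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X_train, y_train)
    acc = jaccard_score(y_test, clf.predict(X_test), average="micro")
    results.append((C, clf.n_support_.sum(), acc))
```

Plotting the second and third entries of each tuple against C reproduces a chart of the kind shown above; swapping kernel="linear" for kernel="poly", degree=3 gives the non-linear variant discussed next.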

3.5. Non-Linear Kernels

Because we believe that the lower accuracy observed for this problem, as the number of support vectors increases, depends on the problem not being linearly separable, it makes sense to use SVMs with non-linear kernels and see what happens to their accuracy as we change C. We’d expect, in general, an SVM with a non-linear kernel to perform better than a linear SVM. For this experiment, we chose an SVM with a polynomial kernel of degree 3. Here is its graph for C=1:

[Figure 7: polynomial SVM (degree 3) decision boundaries for C = 1]
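A sketch of the corresponding model fit, reusing the feature pair and train/test split assumed in the earlier snippets:

```python
# Degree-3 polynomial kernel at C=1; assumes the imports and the train/test
# split from the sketches above.
clf_poly = SVC(kernel="poly", degree=3, C=1.0).fit(X_train, y_train)
print(clf_poly.n_support_.sum())
print(jaccard_score(y_test, clf_poly.predict(X_test), average="micro"))
```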

As we did earlier, we can then shift the value of C up and down and keep track of the corresponding accuracy score and the number of support vectors. For simplicity, we won’t report any other individual cases. Instead, here are the results of the training for all values of C in the interval [0.01, 10), with a resolution of 0.01:

[Figure 6: accuracy and number of support vectors as functions of C for the polynomial kernel]

As we can see from this graph, the lowest value observed for the accuracy of a polynomial SVM is higher than the lowest value we obtained with a linear SVM. This suggests that, for this particular problem, the non-linear SVM performs slightly better, as we expected.

3.6. Accuracy with Minimal Number of Support Vectors

We can also notice, though, that increasing the number of support vectors for the polynomial SVM leads to decreased accuracy much sooner than was observed with the linear SVM. In particular, there is a marked drop in accuracy once the number of support vectors exceeds 70, which is much lower than the threshold of 110 we observed in the previous case.

Because the highest accuracy score, 0.82, is observed only when the number of support vectors is 69 or 70, we can study the relationship between the regularization parameter C and the associated accuracy in that range:

[Figure 8: accuracy versus C when the number of support vectors is 69 or 70]

We can see that there is a sweet spot around C ≈ 2 for which, even though the number of support vectors stays the same, the accuracy is maximized. This is all the information we need to decide on the optimal trade-off between the accuracy of the classifier and the number of support vectors.
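If the sweep sketched earlier is repeated with the polynomial kernel and its (C, support vector count, accuracy) triples are kept in a results list, locating this sweet spot takes only a couple of lines; the thresholds 69 and 70 are the values reported above.

```python
# Restrict the sweep results to models with 69 or 70 support vectors and pick
# the value of C that maximizes the accuracy among them.
candidates = [(C, n_sv, acc) for C, n_sv, acc in results if n_sv in (69, 70)]
best_C, best_n_sv, best_acc = max(candidates, key=lambda t: t[2])
print(best_C, best_n_sv, best_acc)
```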

3.7. So, Which Kernel and Which Regularization?

Given the considerations made here, it therefore makes sense to use a non-linear kernel for this particular problem. This is because, if we compare the distributions of accuracy versus the number of support vectors for a linear and a non-linear kernel, the non-linear kernel is generally more accurate and uses fewer support vectors:

[Figure 9: accuracy versus number of support vectors for the linear and polynomial kernels]

Having chosen the type of kernel, we can also decide on the optimal value of the regularization parameter. In the previous section, we saw that, for the polynomial kernel, the accuracy was maximized and the number of support vectors was simultaneously minimized for a value of C of approximately 2. That is to say, for this particular dataset, the best approach would be to select a non-linear kernel with that particular regularization parameter.

3.8. What About Other Datasets?

The methodology that we have discussed here is not specific to this particular problem. It falls, instead, under the general category of machine learning heuristics. Knowing this collection of heuristics is important when we’re confronted with the problem of optimizing the hyperparameters of a machine learning model.

In this specific case, we solved the problem by manually testing and assessing the possible values that the parameters of the model can assume. More generally, though, we may want to apply automated methods for searching a model’s hyperparameter configuration space, such as grid search.
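A sketch of this automated alternative with scikit-learn’s GridSearchCV follows; the parameter grid is illustrative rather than the one used in our experiment.

```python
# Grid search over the kernel type and the regularization parameter C.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target  # assumed feature pair

param_grid = {
    "kernel": ["linear", "poly"],
    "degree": [3],                    # ignored by the linear kernel
    "C": [0.1, 0.5, 1, 2, 5, 10],
}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```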

We then manually assessed which combination of accuracy score and model size satisfied our needs. We were lucky to find that, for some particular configuration, it is possible to maximize accuracy and minimize size at the same time. This may not always be the case, however, so caution is advised.

4. Conclusions

In this article, we studied the relationship between the number of support vectors and the performance of a support vector classifier.

We initially discussed the role that support vectors play in determining the shape of the decision boundary. We also introduced the idea that increases in accuracy and decreases in the number of support vectors can be traded off against one another.

Finally, we applied these considerations to study the performance of an SVM classifier on a famous machine learning dataset.