1. Introduction
In this tutorial, we’ll show how to plot the decision boundary of a logistic regression classifier. We’ll focus on the binary case with two classes we usually call positive and negative. We’ll also assume that all the features are continuous.
2. Decision Boundary
Let $\mathcal{X}$ be the space of objects we want to classify using machine learning. The decision boundary of a classifier is the subset of $\mathcal{X}$ containing the objects for which the classifier’s score is equal to its decision threshold.
In the case of logistic regression (LR), the score $\hat{p}(x)$ of an object $x$ is the estimate of the probability that $x$ is positive, and the decision threshold is 0.5:
$$\{x \in \mathcal{X} : \hat{p}(x) = 0.5\}$$
Visualizing the boundary helps us understand how our classifier works and compare it to other classification models.
3. Plotting the Boundary
To plot the boundary, we first have to find its equation.
3.1. The Boundary Equation
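We can derive the equation of the decision boundary by plugging the LR formula into the condition $\hat{p}(x) = 0.5$.
We’ll assume that $x$ is an $(n+1)$-dimensional vector with $x_0 = 1$ to make the LR equation more compact. In preprocessing, we can always prepend $x_0 = 1$ to any $n$-dimensional $x$.
Consequently, we have:
$$\frac{1}{1 + e^{-\theta^T x}} = \frac{1}{2}$$
where $\theta = [\theta_0, \theta_1, \ldots, \theta_n]^T$ is the parameter vector of our LR model. From there, we get:
$$e^{-\theta^T x} = 1 \quad \Longrightarrow \quad \theta^T x = 0 \quad \Longrightarrow \quad \sum_{i=0}^{n} \theta_i x_i = 0$$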
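As a quick numeric check of this derivation, here is a minimal sketch in plain Python. The parameter values and the function name `lr_probability` are made up for illustration and don't come from any fitted model. It shows that the estimated probability is exactly 0.5 when the weighted sum of the features is zero, and that it falls on either side of 0.5 otherwise:

```python
import math

def lr_probability(theta, x):
    """Estimated probability of the positive class; x must start with x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters [theta_0, theta_1, theta_2], chosen only for illustration.
theta = [-3.0, 1.0, 2.0]

on_boundary = [1.0, 1.0, 1.0]    # -3 + 1*1 + 2*1 = 0
positive_side = [1.0, 2.0, 2.0]  # -3 + 1*2 + 2*2 = 3 > 0
negative_side = [1.0, 0.0, 0.0]  # -3 < 0

p_boundary = lr_probability(theta, on_boundary)
p_positive = lr_probability(theta, positive_side)
p_negative = lr_probability(theta, negative_side)
```

The sign of the weighted sum alone decides the predicted class, which is why the boundary equation doesn't involve the exponential at all.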
3.2. The Shape of the Boundary in Two and Three Dimensions
If $n = 2$, the boundary equation becomes:
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$$
That’s a line in the $(x_1, x_2)$ plane. For example, if we use the iris dataset with $x_1$ and $x_2$ being the sepal length and width, and with versicolor and virginica classes blended into one, we’ll get a straight line.
We don’t have to consider the degenerate case where $\theta_1 = \theta_2 = 0$. That implies that no features of $x$ are used, so we won’t use such a model anyway. Let’s say $\theta_2 \neq 0$. The explicit boundary’s equation is then:
$$x_2 = -\frac{\theta_0}{\theta_2} - \frac{\theta_1}{\theta_2} x_1$$
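The explicit form for two features expresses the second feature as a linear function of the first, so the boundary is fully described by a slope and an intercept. The sketch below computes both from a parameter vector. The helper name `boundary_line` and the parameter values are illustrative assumptions, not part of any library:

```python
def boundary_line(theta):
    """Return (slope, intercept) of the line x2 = slope * x1 + intercept.

    theta is [theta_0, theta_1, theta_2]; theta_2 must be nonzero.
    """
    t0, t1, t2 = theta
    if t2 == 0:
        # The boundary is vertical in this case: solve for x1 instead.
        raise ValueError("theta_2 is zero; the line cannot be written as x2 = f(x1)")
    return -t1 / t2, -t0 / t2

# Hypothetical parameters, chosen only for illustration.
theta = [-3.0, 1.0, 2.0]
slope, intercept = boundary_line(theta)

# Any point computed from the explicit form must satisfy the implicit equation.
x1 = 1.0
x2 = slope * x1 + intercept
residual = theta[0] + theta[1] * x1 + theta[2] * x2
```

If the second coefficient is zero but the first isn't, the same idea applies with the roles of the two features swapped.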
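If $n = 3$, we have a plane given by:
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 = 0$$
Let $\theta_3 \neq 0$. Then, we can write the equation in the explicit form:
$$x_3 = -\frac{\theta_0}{\theta_3} - \frac{\theta_1}{\theta_3} x_1 - \frac{\theta_2}{\theta_3} x_2$$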
3.3. Algorithm
We can use any plotting tool to visualize lines and planes corresponding to these equations. If we have to do it from scratch, we can iterate over the independent features in small increments and calculate the dependent feature using the explicit forms.
The lower and upper limits we set for each independent feature determine the part of the boundary we want to focus on.
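The from-scratch procedure can be sketched as follows. This is one possible implementation under stated assumptions: the function names, the limit arguments, and the step size are hypothetical, and the coefficient of the dependent feature is assumed to be nonzero. The functions only generate the boundary points; the resulting lists can then be handed to any plotting tool.

```python
def line_points(theta, x1_min, x1_max, step):
    """Points (x1, x2) on the two-feature boundary, assuming theta[2] != 0."""
    t0, t1, t2 = theta
    n_steps = int(round((x1_max - x1_min) / step))
    points = []
    for k in range(n_steps + 1):
        x1 = x1_min + k * step
        points.append((x1, -(t0 + t1 * x1) / t2))
    return points

def plane_points(theta, x1_limits, x2_limits, step):
    """Points (x1, x2, x3) on the three-feature boundary, assuming theta[3] != 0."""
    t0, t1, t2, t3 = theta
    (a1, b1), (a2, b2) = x1_limits, x2_limits
    n1 = int(round((b1 - a1) / step))
    n2 = int(round((b2 - a2) / step))
    points = []
    for i in range(n1 + 1):
        x1 = a1 + i * step
        for j in range(n2 + 1):
            x2 = a2 + j * step
            points.append((x1, x2, -(t0 + t1 * x1 + t2 * x2) / t3))
    return points

# Hypothetical parameters and limits, chosen only for illustration.
line = line_points([-3.0, 1.0, 2.0], 0.0, 4.0, 0.5)
plane = plane_points([-6.0, 1.0, 2.0, 3.0], (0.0, 1.0), (0.0, 1.0), 0.5)
```

Computing each coordinate as the lower limit plus a multiple of the step, instead of adding the step repeatedly, avoids accumulating floating-point error over many iterations.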
3.4. The Limits of Visualization
We have two questions at this point:
- Can we visualize a boundary in multiple dimensions?
- Is a boundary always a line or a plane?
Let’s find out.
4. Multiple Dimensions
If our objects have more than three features, we can visualize only the boundary’s projections onto the planes and spaces defined by pairs and triplets of features.
One way we deal with this is to choose the features for visualization and keep the others at constant values, such as means or constants of interest we know from theory. Values that mean “the feature is absent or neutral” can also be helpful. In most, but not all cases, that would mean setting those other features to zeros.
4.1. Example
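With 10 features $x_1, x_2, \ldots, x_{10}$, we have $\binom{10}{2} = 45$ feature pairs. Let’s say we choose $x_1$ and $x_2$ for visualizing the boundary. In that case, we set $x_3, x_4, \ldots, x_{10}$ to some constant values. Let them be $c_3, c_4, \ldots, c_{10}$.
Then, $\sum_{i=3}^{10} \theta_i c_i$ is another constant. We add it to $\theta_0$ and proceed as if $x_1$ and $x_2$ are the only two features of interest:
$$\left(\theta_0 + \sum_{i=3}^{10} \theta_i c_i\right) + \theta_1 x_1 + \theta_2 x_2 = 0$$
This can work for any pair or triplet of features.
However, a disadvantage of this approach is that the boundary depends on the constants chosen for the remaining features.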
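To make the reduction concrete, the sketch below folds the fixed features into the intercept and returns a three-element parameter vector that can be plotted like an ordinary two-feature boundary. The helper name `reduce_to_pair`, the parameter values, and the chosen constants are all hypothetical:

```python
import math

def reduce_to_pair(theta, i, j, constants):
    """Fold all features except x_i and x_j into the intercept.

    theta: [theta_0, ..., theta_n]; i, j: 1-based indices of the plotted features;
    constants: dict mapping every other feature index to its fixed value.
    Returns [adjusted theta_0, theta_i, theta_j].
    """
    intercept = theta[0]
    for k in range(1, len(theta)):
        if k not in (i, j):
            intercept += theta[k] * constants[k]
    return [intercept, theta[i], theta[j]]

n_pairs = math.comb(10, 2)  # number of feature pairs we could choose from

# Hypothetical 10-feature model: theta_0 followed by theta_1..theta_10.
theta = [1.0, 2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]

# Fix x_3..x_10 at illustrative constants (zeros except x_3 and x_10).
constants = {k: 0.0 for k in range(3, 11)}
constants[3] = 2.0
constants[10] = -0.5

reduced = reduce_to_pair(theta, 1, 2, constants)
```

Rerunning the reduction with different constants shifts the intercept and therefore moves the plotted line, which is the dependence described above.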
5. Curvatures
We can introduce curvatures with feature engineering.
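Let’s say that our original features are $x_1$ and $x_2$. Before prepending $x_0 = 1$, we can add a nonlinear feature such as $x_3 = x_1^2$. Then, the decision boundary becomes:
$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 = 0$$
which is a curve in the original $(x_1, x_2)$ space. For instance, if $\theta_2 \neq 0$ and $\theta_3 \neq 0$, it’s a parabola.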
However, the same boundary is a plane in the augmented space.
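The sketch below illustrates both views, assuming the engineered feature is the square of the first original feature (one possible choice, used here only as an example) and that the coefficient of the second feature is nonzero. The function name and parameter values are hypothetical:

```python
def curved_boundary_points(theta, x1_min, x1_max, step):
    """Points (x1, x2) on the boundary for features [1, x1, x2, x1**2].

    theta is [theta_0, theta_1, theta_2, theta_3]; theta_2 must be nonzero.
    """
    t0, t1, t2, t3 = theta
    n_steps = int(round((x1_max - x1_min) / step))
    points = []
    for k in range(n_steps + 1):
        x1 = x1_min + k * step
        x2 = -(t0 + t1 * x1 + t3 * x1 ** 2) / t2
        points.append((x1, x2))
    return points

# Hypothetical parameters giving the parabola x2 = 1 + x1**2.
theta = [-1.0, 0.0, 1.0, -1.0]
curve = curved_boundary_points(theta, -2.0, 2.0, 1.0)

# In the augmented space (x1, x2, x3) with x3 = x1**2, the same points
# satisfy a linear equation, so they lie on a plane.
residuals = [
    theta[0] + theta[1] * x1 + theta[2] * x2 + theta[3] * (x1 ** 2)
    for x1, x2 in curve
]
```

The points trace a parabola in the original two-feature plane, while the residuals of the linear equation in the augmented space are all zero.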
6. Conclusion
In this article, we showed how to visualize the logistic regression’s decision boundary. Plotting it helps us understand how our logistic model works.