1. Overview
In this tutorial, we’ll study the definition of cross-entropy for machine learning.
We’ll first discuss the idea of entropy in information theory and its relationship to supervised learning.
Then, we’ll see how to derive cross-entropy in bivariate distributions from the definition of entropy in univariate distributions. This will give us a good understanding of how one generalizes over the other.
Finally, we’ll see how to use cross-entropy as a loss function, and how to optimize the parameters of a model through gradient descent over it.
2. Entropy
2.1. Entropy and Labels in Supervised Learning
In our article on the computer science definition of entropy, we discussed the idea that information entropy of a binary variable relates to the combinatorial entropy in a sequence of symbols.
Let’s first define $X$ as a randomly distributed binary variable. Then, we can compute its Shannon measure of entropy as the combinatorial entropy of the two symbols, the 0 and 1 bits, that the variable can assume. The formula for $H(X)$ is:

$$H(X) = -\sum_{x \in \{0, 1\}} P(X = x) \log_2 P(X = x)$$
When we work on classification problems in supervised machine learning, we try to learn a function that assigns one label, among a finite set of labels, to the features of an observation. The set of labels or classes $C$, then, comprises several distinct symbols that we can treat as the possible values assumed by the output of a model. It follows that we can compute a measure of entropy for the class labels output by a predictive model for classification.
2.2. Probabilistic Rather Than Deterministic Classification
There are two ways to transition to a probabilistic definition of entropy, which allows us to work with probabilities rather than with the discrete distribution of the labels. The first way is to interpret the relative frequency of occurrence of the classes as the probability of their occurrence. This means that we can define $p(c)$ as the number of times the class $c$ occurs in the distribution of classes, divided by the length of that distribution.
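As a quick illustration of this first approach, here is a minimal Python sketch (the labels and class names below are made up for the example) that turns a list of observed class labels into relative frequencies:

```python
from collections import Counter

# Hypothetical list of observed class labels
labels = ["cat", "dog", "cat", "bird", "cat", "dog"]

# Relative frequency of each class: count / total number of observations
counts = Counter(labels)
total = len(labels)
probabilities = {cls: count / total for cls, count in counts.items()}

print(probabilities)  # {'cat': 0.5, 'dog': 0.333..., 'bird': 0.166...}
```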
The second way relates to the fact that some classification models are intrinsically probabilistic and don’t output single-point predictions but, rather, probability distributions. This has to do with the activation function used in the output layer of a classification model. The most common probabilistic functions for the output layers of machine learning models are:
- the logistic function
- the softmax function
- the hyperbolic tangent function, if normalized to the interval $[0, 1]$
These functions output a value, or a set of values, between 0 and 1 that we can therefore interpret as a probability distribution over the class affiliations of the observations.
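For instance, here is a minimal NumPy sketch of the softmax function (the logit values below are arbitrary) showing how raw model outputs become a probability distribution that sums to 1:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])  # arbitrary raw outputs of a model
probs = softmax(logits)

print(probs)        # approximately [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```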
2.3. Entropy and Probabilistic Distributions of Labels
The softmax function in particular, rather than outputting a single class as the most likely label for a given input, returns a probability distribution over the whole set $C$. This distribution consists of the individual probabilities $p(c)$ assigned to each possible label $c \in C$.
We can subsequently use these probabilities to calculate the entropy for the distribution of class labels $c$ and their associated probabilities $p(c)$:

$$H(C) = -\sum_{c \in C} p(c) \log_2 p(c)$$
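A short sketch of this computation, reusing relative frequencies like the ones above (the class probabilities are again illustrative):

```python
import numpy as np

def entropy(probabilities):
    # Shannon entropy in bits: -sum p(c) * log2 p(c), ignoring zero-probability classes
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

class_probs = [0.5, 1/3, 1/6]  # illustrative probabilities for three classes
print(entropy(class_probs))    # about 1.46 bits
```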
2.4. Practical Example of Entropy in Classification
Let’s imagine, for example, that we’re conducting binary classification with logistic regression. The output of the logistic model is a value $\hat{y}$ between 0 and 1, which we normally interpret as the probability that the input belongs to the first class. This implies that the second possible class has a corresponding probability of $1 - \hat{y}$: there’s no third option (tertium non datur) in binary classification.
We can initially assume that the logistic model has a single input feature $x$, no bias term, and a weight for that input equal to 1. In this sense, the model corresponds exactly to the sigmoid function, $\hat{y} = \sigma(x) = \frac{1}{1 + e^{-x}}$.
We can then interpret the two probabilities $\hat{y}$ and $1 - \hat{y}$ as the probability distribution of a binary random variable, and compute the entropy measure accordingly:

$$H(\hat{y}) = -\left[\hat{y} \log_2 \hat{y} + (1 - \hat{y}) \log_2 (1 - \hat{y})\right]$$
Not surprisingly, the entropy of $\hat{y}$ is maximized when the output of the classifier is undecided. This happens when the probability assigned to each class is identical, i.e., when $\hat{y} = 1 - \hat{y} = 0.5$.
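The following sketch makes this concrete for the model $\hat{y} = \sigma(x)$: it evaluates the entropy of the prediction at a few sample inputs (chosen arbitrarily) and shows that the entropy peaks at $x = 0$, where $\hat{y} = 0.5$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_entropy(p):
    # Entropy in bits of a binary variable with probability p for the first class
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log2(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    y_hat = sigmoid(x)
    print(f"x={x:+.1f}  y_hat={y_hat:.3f}  H={binary_entropy(y_hat):.3f}")
# The entropy is highest (1 bit) at x = 0, where y_hat = 0.5
```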
2.5. Working With Multiple Probability Distributions
We can, however, also work with multiple probability distributions and their respective models. This is, for example, the case if we’re comparing the outputs of several logistic regression models, like the one we defined above.
Let’s imagine we want to compare the previous model, which we’ll call $m_1$ and whose output we’ll denote $\hat{y}_1$, with a second model $m_2$ whose output is $\hat{y}_2$. One way to conduct this comparison is to study the differences between the two probability distributions and their entropies.
If we imagine that the two parameters of $m_2$, its weight and its bias term, take values different from those of $m_1$, we then obtain a model with a different associated entropy.
The entropies of the two models, in fact, don’t correspond. This means that, as a general rule, different probability distributions have different entropies.
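As a sketch of such a comparison (the weight and bias chosen for $m_2$ below are arbitrary, picked only so that the two models differ), we can evaluate the entropies of $m_1$ and $m_2$ on the same inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

x = np.linspace(-5, 5, 5)
y1 = sigmoid(x)              # m1: weight 1, no bias
y2 = sigmoid(2.0 * x + 0.5)  # m2: arbitrary weight 2 and bias 0.5, for illustration only

print(binary_entropy(y1))  # entropy of m1's predictions at each x
print(binary_entropy(y2))  # entropy of m2's predictions differs point by point
```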
2.6. Some Entropies Are More Equal Than Others
Finally, if we compared the two entropies $H_1$ and $H_2$ with a third entropy $H_3$, originating from a third logistic model $m_3$ with yet another pair of parameters, we’d observe something interesting.
It intuitively appears that the first two probability distributions, associated with the classifiers $m_1$ and $m_2$, have entropies that are more similar to one another than they are to the entropy of the third classifier $m_3$.
This gives us the intuitive idea that, if we want to compare the predictions of probabilistic models, or even a probabilistic model against some known probability distribution, we need a single measure that compares the two probability distributions directly, rather than their entropies in isolation.
3. Cross-Entropy
3.1. The Definition of Cross-Entropy
On these bases, we can extend the idea of entropy in a univariate random distribution to that of cross-entropy for bivariate distributions. Or, if we use the probabilistic terminology, we can expand from the entropy of a probability distribution to a measure of cross-entropy for two distinct probability distributions.
The cross-entropy of two probability distributions $p$ and $q$ is defined by this formula:

$$H(p, q) = -\sum_{c \in C} p(c) \log_2 q(c)$$
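A minimal sketch of this formula for two discrete distributions over the same set of classes (the probability values are illustrative):

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_c p(c) * log2 q(c); q is clipped to avoid log2(0)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(np.clip(q, 1e-12, 1.0)))

p = [0.7, 0.2, 0.1]  # illustrative "true" distribution
q = [0.5, 0.3, 0.2]  # illustrative predicted distribution

print(cross_entropy(p, p))  # equals the entropy of p (about 1.16 bits)
print(cross_entropy(p, q))  # higher (about 1.28 bits), because q differs from p
```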
3.2. Cross-Entropy for Model Comparison
We can apply this formula to compare the outputs of the two models, $m_1$ and $m_2$, from the previous section:

$$H(\hat{y}_1, \hat{y}_2) = -\left[\hat{y}_1 \log_2 \hat{y}_2 + (1 - \hat{y}_1) \log_2 (1 - \hat{y}_2)\right]$$
If we plot the cross-entropy of these two particular models as a function of their input, we notice that it is generally (but not necessarily) higher than the entropies of the two individual probability distributions. An intuitive way to understand this phenomenon is to imagine the cross-entropy as some kind of total entropy of the two distributions. More accurately, though, the cross-entropy of two distributions moves farther away from their individual entropies the more the two distributions differ from one another; in fact, $H(p, q)$ is never lower than $H(p)$, and equals it only when the two distributions coincide.
3.3. Pair Ordering Matters
Notice also that the order in which we insert the terms into the $H(\cdot, \cdot)$ operator matters: the two quantities $H(p, q)$ and $H(q, p)$ are generally different. We can see this, for example, by computing the cross-entropy of the two logistic regression models with the terms swapped.
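A quick, self-contained check of this asymmetry with two illustrative discrete distributions (the values below are made up):

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_c p(c) * log2 q(c)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(np.clip(q, 1e-12, 1.0)))

p = [0.7, 0.2, 0.1]  # illustrative distribution
q = [0.5, 0.3, 0.2]  # illustrative distribution

print(cross_entropy(p, q))  # about 1.28 bits
print(cross_entropy(q, p))  # about 1.62 bits: swapping the arguments changes the result
```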
This is particularly important when we compute the cross-entropy between an observed probability distribution, say, the predictions of a classification model, and a target class distribution. In that case, the true probability distribution is always the first argument, $p$, and the predictions of the model are always the second, $q$.
4. Model Optimization Through Cross-Entropy
4.1. Cross-Entropy as a Loss Function
The most important application of cross-entropy in machine learning is its use as a loss function. In that context, minimizing the cross-entropy, i.e., minimizing the loss function, allows us to optimize the parameters of a model. For model optimization, we normally use the average cross-entropy between all training observations and the respective predictions.
Let’s use the logistic regression model $\hat{y} = \sigma(\theta x)$ defined above as our model for prediction. Then, the cross-entropy loss function over $N$ training observations is:

$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\right]$$

where $y_i$ is the true label of the $i$-th observation and $\hat{y}_i$ is the corresponding prediction. Here we use the natural logarithm: changing the base only rescales the loss, and the natural logarithm keeps the derivatives simple.
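A minimal sketch of this loss for a batch of labels and predictions (the arrays below are made up):

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred):
    # Average binary cross-entropy over all observations, using the natural log
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1, 0])            # made-up ground-truth labels
y_pred = np.array([0.9, 0.2, 0.7, 0.6, 0.1])  # made-up model predictions

print(cross_entropy_loss(y_true, y_pred))  # about 0.26
```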
4.2. Algorithmic Minimization of Cross-Entropy
We can then minimize the loss function by optimizing the parameters $\theta$ that determine the model’s predictions. The typical algorithmic way to do so is gradient descent over the parameter space spanned by $\theta$.
We discussed above how to compute the predictions of a logistic model. Specifically, we stated that predictions are computed as the logistic function of a linear combination of inputs and parameters:

$$\hat{y}_i = \sigma(\theta^T x_i) = \frac{1}{1 + e^{-\theta^T x_i}}$$
We also know that the derivative of the logistic function is:

$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$
From this, we can derive the gradient of a prediction with respect to the parameters as:

$$\nabla_\theta\, \hat{y}_i = \hat{y}_i\,(1 - \hat{y}_i)\, x_i$$
And finally, by applying the chain rule, we can calculate the gradient of the loss function as:

$$\nabla_\theta\, L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat{y}_i - y_i\bigr)\, x_i$$
This, lastly, lets us optimize the model through gradient descent.
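Putting the pieces together, here is a compact sketch of gradient descent on the cross-entropy loss for a single-feature logistic model with no bias term, as in our earlier example (the data, learning rate, and number of iterations are all arbitrary choices for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up one-dimensional training data: positive x tends to mean class 1
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

theta = 0.0           # single parameter, no bias term
learning_rate = 0.5

for _ in range(1000):
    y_hat = sigmoid(theta * x)
    gradient = np.mean((y_hat - y) * x)  # gradient of the average cross-entropy loss
    theta -= learning_rate * gradient

print(theta)  # the weight grows positive, separating the two classes
```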
5. Conclusion
In this article, we studied the definition of cross-entropy. We started from the formalization of entropy for univariate probability distributions. Then, we generalized to bivariate probability distributions and their comparison.
Further, we analyzed the role of cross-entropy as a loss function for classification models.
In relation to that, we also studied the problem of its minimization through gradient descent for parameter optimization.