1. Introduction
In this tutorial, we’ll describe information gain. We’ll explain it in terms of entropy, a concept from information theory that has found applications in many scientific and engineering fields, including machine learning. Then, we’ll show how to use it to fit a decision tree.
2. Information
Intuitively, a piece of information is anything that increases our knowledge of a system, process, or phenomenon. But what is knowledge?
Putting aside philosophical details, we could say that it’s the mirror image of uncertainty. The more we’re sure about something, the greater our knowledge is. The converse is also true, and it’s the crux of information theory: the more uncertain we are, the less we know. So, information is anything that decreases our uncertainty.
The next question is how to quantify uncertainty and information. Let’s work through an example.
2.1. Example: Uncertainty and Information
Let’s consider an opaque box containing 1 red and 2 white balls. Anyone drawing a ball randomly from it has a $\frac{1}{3}$ chance of getting the red ball and a $\frac{2}{3}$ chance of drawing a white one. Although we can’t be certain about the color, we can describe the outcomes and how likely they are with the following random variable:
$$X = \begin{cases} \text{red}, & \text{with probability } \frac{1}{3} \\ \text{white}, & \text{with probability } \frac{2}{3} \end{cases} \qquad (1)$$
Now, if we find out that someone drew the red ball, there will be no uncertainty about the color we can get:
$$X' = \begin{cases} \text{white}, & \text{with probability } 1 \end{cases} \qquad (2)$$
The quantity of information we received by learning that someone drew the red ball is the difference between the probabilistic models (2) and (1). More specifically, it’s the difference in uncertainty, which we measure with entropy.
2.2. Entropy
In information theory, entropy $H(p_1, p_2, \ldots, p_n)$ is a function of the outcome probabilities $p_1, p_2, \ldots, p_n$. It satisfies three conditions we’d expect a measure of uncertainty to fulfill:
$H$ is continuous in the $p_i$. This condition ensures that the uncertainty doesn’t fluctuate a lot if any $p_i$ undergoes a small change.
If the probabilities are all equal ($p_i = \frac{1}{n}$ for each $i$), then the uncertainty should increase as $n$ grows. Mathematically, $H\left(\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}\right)$ should be monotonically increasing in $n$. This is intuitive: the more outcomes there are, the more uncertain we are of any of them occurring.
Finally, any way of decomposing $H(p_1, p_2, \ldots, p_n)$ as a (weighted) sum of the uncertainties over subgroups of the outcomes should give the same result.
Shannon showed that the only function satisfying these conditions is:
$$H(p_1, p_2, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \log p_i \qquad (3)$$
There, $K$ is a positive constant, and all the logarithms have the same base. Usually, we take binary logarithms, so:
$$H(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (4)$$
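To make formula (4) concrete, here’s a minimal Python sketch; the helper name `entropy` and the list-of-probabilities interface are our own choices rather than anything prescribed by the text. The last two lines also check the decomposition condition on the classic case $H(\frac{1}{2}, \frac{1}{3}, \frac{1}{6}) = H(\frac{1}{2}, \frac{1}{2}) + \frac{1}{2} H(\frac{2}{3}, \frac{1}{3})$.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy (in bits) of a discrete distribution,
    given as a sequence of probabilities that sum to 1."""
    # Terms with p == 0 are skipped since p * log2(p) -> 0 as p -> 0.
    return -sum(p * log2(p) for p in probabilities if p > 0)

# The box with 1 red and 2 white balls from the example above:
print(entropy([1/3, 2/3]))                               # ~0.918 bits

# Checking the decomposition condition:
print(entropy([1/2, 1/3, 1/6]))                          # ~1.459
print(entropy([1/2, 1/2]) + 0.5 * entropy([2/3, 1/3]))   # ~1.459, same value
```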
2.3. The Amount of Information Is the Difference in Entropy
Now that we know how to compute the entropy, we can measure the amount of information. We define it as the change in uncertainty.
Formally, let’s say that $X$ is the random variable modeling our belief system before learning the information $I$. Let’s denote with $X'$ our updated state of belief (after $I$) and stretch the notation so that $H$ can act on $X$ and $X'$ in addition to the probabilities.
The amount of information of $I$ is then:
$$\mathrm{Amount}(I) = H(X) - H(X') \qquad (5)$$
2.4. Example: Calculating the Amount of Information
Let’s calculate the amount of information we gain by finding out that someone has drawn the red ball from a box that initially held one red and two white balls.
The entropy before the information is:

$$H(X) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} \approx 0.918$$
The entropy after is:

$$H(X') = -1 \cdot \log_2 1 = 0$$
So, the amount of information is $H(X) - H(X') \approx 0.918$ bits. In this case, we received the maximum amount since the information eliminated our uncertainty.
What if someone told us they drew a white ball instead of the red one? The probabilities would shift to:
$$X' = \begin{cases} \text{red}, & \text{with probability } \frac{1}{2} \\ \text{white}, & \text{with probability } \frac{1}{2} \end{cases} \qquad (6)$$
So, our new entropy will be:

$$H(X') = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$$
The amount is now $0.918 - 1 = -0.082$ bits, a negative number. But wait, didn’t we say that information should decrease uncertainty? Ideally, that would be the case. We’d receive only the pieces of information that reduce our uncertainty. However, not all information is like that. Some can make us more uncertain, and the amount of such information will be negative.
That’s why we defined the amount as the difference in entropy before and after the information, not the other way around. That way, positive numbers get to represent an increase in knowledge (i.e., a decrease in uncertainty), which is more intuitive.
From there, we see that information is anything that affects our probabilistic model about some process or phenomenon. Its contribution is positive if it decreases uncertainty, and negative otherwise.
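As a quick check of both amounts from the box example, here is a small sketch reusing an entropy helper like the one above (our own code, not part of the text):

```python
from math import log2

def entropy(probabilities):
    # Shannon entropy in bits; zero-probability terms contribute nothing.
    return -sum(p * log2(p) for p in probabilities if p > 0)

h_before = entropy([1/3, 2/3])       # uncertainty before any draw, ~0.918

# Case 1: we learn the red ball was drawn, so only white balls remain.
h_after_red = entropy([1.0])         # 0.0
print(h_before - h_after_red)        # ~0.918 bits gained

# Case 2: we learn a white ball was drawn, so 1 red and 1 white remain.
h_after_white = entropy([1/2, 1/2])  # 1.0
print(h_before - h_after_white)      # ~-0.082 bits: uncertainty increased
```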
3. The Information Gain
Now that we know how to calculate the entropy and the amount of information, we can define the information gain. We’ll do it in the context of decision trees.
3.1. Decision Trees
At each internal node in a decision tree, we inspect a feature of an object. Depending on its value, we visit one of the node’s sub-trees.
We repeat the step until we hit a leaf. The leaf nodes don’t check any feature, and each contains a subset of the training data. Once we get to a leaf, we assign the object to the majority class in the leaf’s data subset.
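To illustrate this traversal, here’s a minimal sketch with an assumed node structure; the field names `feature`, `children`, and `majority_class` are hypothetical rather than taken from any particular library.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    # Internal nodes set `feature` and `children`; leaves set `majority_class`.
    feature: Optional[str] = None
    children: dict = field(default_factory=dict)  # feature value -> child Node
    majority_class: Optional[str] = None

def predict(node: Node, obj: dict):
    """Walk from the root to a leaf, inspecting one feature per internal node,
    then return the leaf's majority class."""
    while node.feature is not None:
        node = node.children[obj[node.feature]]
    return node.majority_class

# A tiny hypothetical tree: test feature "A"; A=1 -> positive leaf, A=0 -> negative leaf.
root = Node(feature="A", children={
    1: Node(majority_class="+"),
    0: Node(majority_class="-"),
})
print(predict(root, {"A": 1, "B": 0}))  # "+"
```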
3.2. The Information Gain
To build a decision tree, we need to decide which feature to check at which node. For instance, let’s suppose that we have two unused binary features, $A$ and $B$. We also have a set $S$ of five objects, two of which are positive.
Which feature should we test to add a new node? The information gain can help us decide. It’s the expected amount of information we get by inspecting the feature. Intuitively, the feature with the largest expected amount is the best choice. That’s because it will reduce our uncertainty the most on average.
3.3. First, We Calculate the Entropies
In our example, the entropy before adding a new node is:

$$H(S) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} \approx 0.971$$
That’s the measure of our uncertainty about a random object’s class (assuming the previous checks got it to this point in the tree). If we choose $A$ as the new node’s test feature, we’ll get two sub-trees covering two subsets of $S$: $S_{A=0}$, with the objects for which $A=0$, and $S_{A=1}$, with those for which $A=1$. We compute their entropies $H(S_{A=0})$ and $H(S_{A=1})$ from the class frequencies within each subset, just as we did for $H(S)$.
3.4. Then, We Compute the Gain
The information gain is the expected amount of information we get by checking feature $A$:

$$Gain(S, A) = H(S) - p_0 \, H(S_{A=0}) - p_1 \, H(S_{A=1})$$

Here, $p_0$ and $p_1$ are the frequencies of $A=0$ and $A=1$ in $S$, respectively.
The same calculation gives us the gain for $B$, $Gain(S, B)$. We then compare the two and choose the feature with the larger gain to create the new node.
As we see, the information gain favors the splits that maximize the purity of the subsets so that each has its majority class’s fraction as high as possible.
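Here is a small Python sketch of this comparison. The split counts below are a hypothetical illustration of ours, chosen only to be consistent with five objects of which two are positive; the code simply applies the gain computation described above.

```python
from math import log2

def entropy(probabilities):
    # Shannon entropy in bits; zero-probability terms contribute nothing.
    return -sum(p * log2(p) for p in probabilities if p > 0)

h_s = entropy([2/5, 3/5])  # entropy of S: two positive, three negative objects (~0.971)

# Hypothetical split on A: A=1 -> 2 positives, A=0 -> 3 negatives (both subsets pure).
gain_a = h_s - 2/5 * entropy([1.0]) - 3/5 * entropy([1.0])

# Hypothetical split on B: B=1 -> 1 positive + 2 negatives, B=0 -> 1 positive + 1 negative.
gain_b = h_s - 3/5 * entropy([1/3, 2/3]) - 2/5 * entropy([1/2, 1/2])

print(gain_a)  # ~0.971: the purer split wins
print(gain_b)  # ~0.020
```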
3.5. Formula
The calculation of $Gain(S, A)$ shows us the formula for the gain in the general case. Let $S$ denote the dataset to be split by creating a new node. Let’s suppose that the attribute $A$ can take $m$ values: $a_1, a_2, \ldots, a_m$, and that $p_j$ is the fraction of the objects in $S$ for which $A = a_j$. Then, the information gain of $A$ is:
$$Gain(S, A) = H(S) - \sum_{j=1}^{m} p_j \, H(S_{A=a_j}) \qquad (7)$$
The gain cannot be negative even though individual pieces of information can have a negative contribution: by the concavity of entropy, the weighted average of the subsets’ entropies never exceeds $H(S)$.
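A general sketch of formula (7) in Python might look as follows; the dataset representation (a list of dicts whose class is stored under a "label" key) and the function names are our own assumptions, not a prescribed interface.

```python
from collections import Counter
from math import log2

def entropy_from_counts(counts: Counter) -> float:
    """Shannon entropy (bits) of a class distribution given as a Counter."""
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values() if c > 0)

def information_gain(dataset: list, attribute: str) -> float:
    """Formula (7): H(S) minus the weighted entropies of the subsets
    induced by each value of `attribute`."""
    n = len(dataset)
    gain = entropy_from_counts(Counter(obj["label"] for obj in dataset))
    for value in {obj[attribute] for obj in dataset}:
        subset = [obj for obj in dataset if obj[attribute] == value]
        weight = len(subset) / n
        gain -= weight * entropy_from_counts(Counter(obj["label"] for obj in subset))
    return gain

# The same hypothetical five-object example as above:
data = [
    {"A": 0, "B": 0, "label": "-"},
    {"A": 0, "B": 1, "label": "-"},
    {"A": 0, "B": 1, "label": "-"},
    {"A": 1, "B": 0, "label": "+"},
    {"A": 1, "B": 1, "label": "+"},
]
print(information_gain(data, "A"))  # ~0.971
print(information_gain(data, "B"))  # ~0.020
```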
4. Conclusion
In this article, we showed how to calculate the information gain. We also talked about entropy in information theory and its connection to uncertainty.