1. Introduction
In this tutorial, we’ll talk about node impurity in decision trees. A decision tree is a model we build greedily, split by split, and use for supervised machine learning tasks such as classification and regression.
2. Splitting in Decision Trees
Firstly, the decision tree nodes are split based on all the available variables: at each node, the training algorithm evaluates candidate splits over every attribute. During the training phase, the data are passed from the root node towards the leaves, and the algorithm decides whether to split a node into two or more sub-nodes. It chooses the partition that maximizes the purity of the split (i.e., minimizes the impurity). Informally, impurity is a measure of how homogeneous the labels at the node at hand are.
There are different ways to define impurity. In classification tasks, we frequently use the Gini impurity index and Entropy.
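For instance, in scikit-learn we can pick either measure when training a tree via the criterion parameter. Here’s a minimal sketch, assuming scikit-learn is installed and using its bundled Iris data purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Toy data, used only to illustrate the two impurity criteria
X, y = load_iris(return_X_y=True)

# The same learner with the two impurity measures discussed below
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# The resulting trees are usually similar but not necessarily identical
print(gini_tree.get_depth(), entropy_tree.get_depth())
```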
3. Gini Impurity
The Gini Index is related to the misclassification probability of a random sample: it equals the probability of misclassifying a randomly chosen element of the node if we labeled it randomly according to the node’s class distribution.
Let’s assume that a dataset $S$ contains examples from $n$ classes. Its Gini Index, $Gini(S)$, is defined as:

(1) $$Gini(S) = 1 - \sum_{i=1}^{n} p_i^2$$

where $p_i$ is the relative frequency of class $i$ in $S$, i.e., the probability that a randomly selected object belongs to class $i$.
As we can observe from the above equation, the Gini Index takes values in the interval $[0, 0.5]$ when there are two classes. The minimum value of zero corresponds to a node whose elements all belong to the same class; when this occurs, the node is called pure. The maximum value of 0.5 corresponds to the highest impurity of a node.
3.1. Example: Calculating Gini Impurity
In this example, we’ll compute the Gini Indices for 3 different cases of a set with 4 balls of two different colors, red and blue:
- 4 red & 0 blue balls: $Gini = 1 - \left( \left(\frac{4}{4}\right)^2 + \left(\frac{0}{4}\right)^2 \right) = 0$
- 2 red & 2 blue balls: $Gini = 1 - \left( \left(\frac{2}{4}\right)^2 + \left(\frac{2}{4}\right)^2 \right) = 0.5$
- 3 red & 1 blue balls: $Gini = 1 - \left( \left(\frac{3}{4}\right)^2 + \left(\frac{1}{4}\right)^2 \right) = 0.375$
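As a quick sanity check of the arithmetic above, here’s a small Python sketch; the helper gini is ours, not a library function, and it takes the class counts at a node:

```python
def gini(counts):
    """Gini Index of a node, given the number of elements of each class."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# The three sets of four balls from the example above
print(gini([4, 0]))  # 0.0   -> pure node
print(gini([2, 2]))  # 0.5   -> maximal impurity for two classes
print(gini([3, 1]))  # 0.375
```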
4. Entropy
In statistics, entropy is a measure of information or, equivalently, of uncertainty: the higher a node’s entropy, the less certain we are about the class of a randomly chosen object in it.
Let’s assume that a dataset $S$ associated with a node contains examples from $n$ classes. Then, its entropy is:

(2) $$Entropy(S) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
where $p_i$ is the relative frequency of class $i$ in $S$. In the two-class case, entropy takes values from $[0, 1]$. As is the case with the Gini Index, a node is pure when $Entropy(S)$ takes its minimum value, zero, and most impure when it takes its highest value, 1.
4.1. Example: Calculating Entropy
Let’s compute the entropies for the same examples as above:
- 4 red & 0 blue balls: $Entropy = -\left( \frac{4}{4}\log_2\frac{4}{4} + \frac{0}{4}\log_2\frac{0}{4} \right) = 0$ (treating $0 \log_2 0$ as 0)
- 2 red & 2 blue balls: $Entropy = -\left( \frac{2}{4}\log_2\frac{2}{4} + \frac{2}{4}\log_2\frac{2}{4} \right) = 1$
- 3 red & 1 blue balls: $Entropy = -\left( \frac{3}{4}\log_2\frac{3}{4} + \frac{1}{4}\log_2\frac{1}{4} \right) \approx 0.811$
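The same kind of check works for entropy; again, the entropy helper below is ours, with the usual convention that $0 \log_2 0 = 0$:

```python
import math

def entropy(counts):
    """Entropy of a node, given the number of elements of each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# The three sets of four balls from the example above
print(entropy([4, 0]))  # 0.0
print(entropy([2, 2]))  # 1.0
print(entropy([3, 1]))  # ~0.811
```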
5. Splitting the Data
The quality of a split is very important during the training phase. When splitting a node, we choose the attribute whose partition results in the smallest (weighted) impurity of the new nodes.
We’ll show how to split the data using entropy and the Gini index.
5.1. Information Gain
The information gain is the difference between a parent node’s entropy and the weighted sum of its child node entropies.
Let’s assume a dataset $S$ with $N$ objects is partitioned into two datasets, $S_1$ and $S_2$, of sizes $N_1$ and $N_2$. Then, the split’s Information Gain ($Gain_{split}$) is:

(3) $$Gain_{split} = Entropy(S) - \left( \frac{N_1}{N} Entropy(S_1) + \frac{N_2}{N} Entropy(S_2) \right)$$

In general, if splitting $S$ into $m$ subsets $S_1, S_2, \ldots, S_m$ with $N_1, N_2, \ldots, N_m$ objects, respectively, the split’s Information Gain ($Gain_{split}$) is:

(4) $$Gain_{split} = Entropy(S) - \sum_{i=1}^{m} \frac{N_i}{N} Entropy(S_i)$$
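To make formula (4) concrete, here’s a short sketch; information_gain is a name we chose, and the entropy helper is the same one sketched in the previous section:

```python
import math

def entropy(counts):
    """Entropy of a node, given the class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the weighted sum of the children's entropies."""
    n = sum(parent_counts)
    weighted = sum((sum(child) / n) * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Splitting a [9+, 5-] node into three children (the Outlook split we analyze below)
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # ~0.247
```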
5.2. Example: Splitting by Information Gain
Let’s consider dataset $S$, which shows if we’re going to play tennis or not based on the weather:

| Day | Outlook | Temperature | Humidity | Wind | Play Tennis? |
|-----|----------|------|--------|--------|-----|
| D1  | Sunny    | Hot  | High   | Weak   | No  |
| D2  | Sunny    | Hot  | High   | Strong | No  |
| D3  | Overcast | Hot  | High   | Weak   | Yes |
| D4  | Rain     | Mild | High   | Weak   | Yes |
| D5  | Rain     | Cool | Normal | Weak   | Yes |
| D6  | Rain     | Cool | Normal | Strong | No  |
| D7  | Overcast | Cool | Normal | Weak   | Yes |
| D8  | Sunny    | Mild | High   | Weak   | No  |
| D9  | Sunny    | Cool | Normal | Weak   | Yes |
| D10 | Rain     | Mild | Normal | Strong | Yes |
| D11 | Sunny    | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High   | Strong | Yes |
| D13 | Overcast | Hot  | Normal | Weak   | Yes |
| D14 | Rain     | Mild | High   | Strong | No  |
It contains 9 positive outcomes (Yes) and 5 negative ones (No), so the starting structure is $[9+, 5-]$ and the entropy of the whole set is $Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940$. We want to determine the attribute that offers the highest Information Gain.
Let’s see what happens if we split by Outlook. It has three different values: Sunny, Overcast, and Rain, so the possible sub-nodes are $S_{sunny} = [2+, 3-]$, $S_{overcast} = [4+, 0-]$, and $S_{rain} = [3+, 2-]$.

Firstly, we calculate the sub-nodes’ entropies. They are $Entropy(S_{sunny}) \approx 0.971$, $Entropy(S_{overcast}) = 0$, and $Entropy(S_{rain}) \approx 0.971$. Therefore, the Information Gain based on Outlook, $Gain_{outlook}$, is:

(5) $$Gain_{outlook} = 0.940 - \frac{5}{14} \cdot 0.971 - \frac{4}{14} \cdot 0 - \frac{5}{14} \cdot 0.971 \approx 0.246$$
Next, we split the data based on the Wind feature. It can take two values, Weak and Strong, with the possible sub-nodes $S_{weak} = [6+, 2-]$ and $S_{strong} = [3+, 3-]$.

The corresponding entropies are $Entropy(S_{weak}) \approx 0.811$ and $Entropy(S_{strong}) = 1$. Therefore, $Gain_{wind}$ is:

(6) $$Gain_{wind} = 0.940 - \frac{8}{14} \cdot 0.811 - \frac{6}{14} \cdot 1 \approx 0.048$$
Following the same idea, we calculate the gains for Humidity and Temperature as well:

(7) $$Gain_{humidity} = 0.940 - \frac{7}{14} \cdot 0.985 - \frac{7}{14} \cdot 0.592 \approx 0.151$$

(8) $$Gain_{temperature} = 0.940 - \frac{4}{14} \cdot 1 - \frac{6}{14} \cdot 0.918 - \frac{4}{14} \cdot 0.811 \approx 0.029$$
We see that the feature with the maximum Information Gain is Outlook. Thus, we conclude that Outlook is the best attribute to split the data at the tree’s root.
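If we want to double-check these numbers programmatically, here’s a sketch that recomputes the gains for all four attributes directly from the table; the row tuples and helper names are ours, and the unrounded results differ from the hand calculations above only because of rounding:

```python
from collections import Counter, defaultdict
import math

# (Outlook, Temperature, Humidity, Wind, Play Tennis) rows, copied from the table above
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
features = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, feature_index):
    """Information Gain of splitting the rows by the given feature."""
    parent_labels = [row[-1] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_index]].append(row[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent_labels) - weighted

for index, name in enumerate(features):
    print(name, round(gain(data, index), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
```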
5.3. Example: Splitting by Gini Index
We can split the data by the Gini Index too. Let’s compute the required probabilities:

(9) $$P(sunny) = \frac{5}{14}, \quad P(overcast) = \frac{4}{14}, \quad P(rain) = \frac{5}{14}$$

Out of the 14 days in the above example, Sunny, Overcast, and Rain occur 5, 4, and 5 times, respectively. Then, we compute the probabilities of a Sunny day and playing tennis or not. Out of the 5 times when Outlook=Sunny, we played tennis on 2 and didn’t play it on 3 days:

$$P(yes \mid sunny) = \frac{2}{5}, \quad P(no \mid sunny) = \frac{3}{5}$$

Having calculated the required probabilities, we can compute the Gini Index of Sunny:

$$Gini(sunny) = 1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2 = 0.48$$

We follow the same steps for Overcast and Rain:

$$Gini(overcast) = 1 - \left(\frac{4}{4}\right)^2 - \left(\frac{0}{4}\right)^2 = 0, \quad Gini(rain) = 1 - \left(\frac{3}{5}\right)^2 - \left(\frac{2}{5}\right)^2 = 0.48$$

Therefore, the weighted sum of the above Gini Indices is:

(10) $$Gini_{outlook} = \frac{5}{14} \cdot 0.48 + \frac{4}{14} \cdot 0 + \frac{5}{14} \cdot 0.48 \approx 0.343$$
Similarly, we compute the weighted Gini Indices of Temperature, Humidity, and Wind:

(11) $$Gini_{temperature} \approx 0.440, \quad Gini_{humidity} \approx 0.367, \quad Gini_{wind} \approx 0.429$$
We conclude that Outlook has the lowest Gini Index. Hence, we’ll choose it as the root node.
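Here’s a similar check for the Gini-based choice, with the [Yes, No] counts per attribute value read off the table; the helpers gini and weighted_gini are ours:

```python
def gini(counts):
    """Gini Index of a node, given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children_counts):
    """Weighted Gini Index of a split, given the class counts per child node."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

# [Yes, No] counts per attribute value, read off the table above
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],  # Sunny, Overcast, Rain
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # Hot, Mild, Cool
    "Humidity":    [[3, 4], [6, 1]],          # High, Normal
    "Wind":        [[6, 2], [3, 3]],          # Weak, Strong
}
for name, children in splits.items():
    print(name, round(weighted_gini(children), 3))
# Outlook 0.343, Temperature 0.44, Humidity 0.367, Wind 0.429
```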
6. Conclusion
In this article, we talked about how to compute the impurity of a node while training a decision tree. In particular, we presented the Gini Index and entropy as common measures of impurity. By splitting the data so that the impurity of the resulting nodes is minimized, we greedily grow an accurate tree.