The decision tree is a supervised learning algorithm that uses a pre-defined target variable to make decisions based on the training dataset. It works well for both classification and regression problems, and can handle categorical as well as continuous input and output data.
This is why the algorithm is widely used for understanding and managing consumer behaviour, detecting fraudulent financial statements, forecasting energy consumption and diagnosing faults in the engineering domain, among much else.
Decision trees are actually fairly easy to implement and execute. But, as with most things, this is only once you’ve developed an understanding of how they work. In this article, we’ll cover the basics, take you through a detailed walkthrough of the decision tree algorithm and run through its advantages and disadvantages.
Decision Trees Algorithm: The Basics
Before delving too deep, let us quickly understand the construct of trees in the context of computer science. In computer science terminology, a tree represents a data structure with nodes connected by edges.
1. Root node: The topmost node in the hierarchy is called the root node. It is the first node in the hierarchy, and finding out which attribute should become the root node is a big part of this algorithm.
2. Parent and child nodes: The parent-child analogy, which we use in human relationships, applies to nodes in the tree as well: a node that splits into further nodes is a parent, and the nodes it splits into are its children. These nodes hold the decision rules and determine how a record traverses the tree structure.
3. Internal node: A node with at least one child is called an internal node. Internal nodes actually become hubs where the decision rules are formed using the tree data structure.
4. Leaf node: A node with no children, which represents the last node in a branch, is called a leaf node (sometimes also called an external node). Leaf nodes are the end points of tree traversal and hold the final outcomes at the lowest level of the tree. A minimal code sketch of such a node follows below.
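To make these terms concrete, here is a minimal sketch in Python of how a tree node might be represented. The class and field names are purely illustrative assumptions, not part of any library; the root node is simply the `Node` at the top of the structure that no other node points to.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the attribute used to split (internal nodes)
    threshold: Optional[float] = None  # split point for that attribute
    left: Optional["Node"] = None      # child for records below the threshold
    right: Optional["Node"] = None     # child for records at or above the threshold
    label: Optional[str] = None        # predicted class (leaf nodes only)

    def is_leaf(self) -> bool:
        return self.label is not None
```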
The terms mentioned above may be confusing at first, but they give you a glimpse of what a decision tree looks like and what its main components are. Doesn’t the structure resemble how we make decisions all day long?
In fact, our brain creates a number of decision trees daily. Each time we run one of these decision trees in our brains, we learn from the situations (which options to consider, which options not to consider, etc).
A model based on the decision tree algorithm similarly learns and trains itself from the dataset it is fed, and can forecast the output for new circumstances as and when it is asked.
As an aside, if you want to quickly get into this algorithm, you would really benefit from knowing a programming language like Python or R, as you will have easy access to plenty of resources to develop your own decision tree models.
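If you do want a quick taste before diving into the theory, here is a minimal sketch in Python using scikit-learn (assuming the library is installed, e.g. via `pip install scikit-learn`); the dataset and parameters are chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small built-in dataset and fit a shallow tree using entropy as the criterion.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)

# Print the learned decision rules as plain text.
print(export_text(clf, feature_names=load_iris().feature_names))
```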
Advantages of Using Decision Trees
1. Easy to understand: Decision trees are simple to understand and interpret. They can be represented graphically, so a brief explanation alongside the diagram is usually enough for a layperson to follow. The algorithm also mirrors human decision-making more closely than most other approaches.
2. Numerical and categorical data: Most machine learning and data analysis algorithms handle either qualitative data or quantitative data. Decision trees can handle both numerical and categorical data, and they work with small as well as large datasets.
3. Little data preparation: Decision trees need little to no data preparation. Many other techniques require the data to be normalised or dummy variables to be created beforehand. It is also possible to validate a decision tree model using statistical tests, and the approach makes no strong assumptions about the training data or the predictions.
4. Feature selection: Decision trees have an inherent ability to perform feature selection, so irrelevant features tend to be cleaned up in subsequent runs. They are also relatively robust to common machine learning issues such as collinearity, particularly when combined with boosting.
Disadvantages of Using Decision Trees
Accuracy: Decision trees may not produce accurate results in some cases, especially when compared to more sophisticated machine learning approaches. Also, small changes in the training data can produce a very different tree, which is a problem for a continuously changing dataset.
Overfitting: Decision trees are built using greedy, divide-and-conquer strategies that keep splitting until the training data is fit as closely as possible. This can lead to overfitting, where the tree captures noise in the training data and generalises poorly to new data.
Bias: When it comes to categorical data, variables with a greater number of levels tend to be favoured when building the tree. This biases the results and can hurt performance on unseen (untrained) data.
Decision Trees Algorithm: The Approach
Decision trees follow a recursive approach to process the dataset through some basic steps. Here’s the gist of the approach:
- Make the best attribute of the dataset the root node of the tree, after making the necessary calculations.
- Create subsets of the data, based on the attribute you’ve selected in step 1. Each subset should contain data with the same value for that attribute.
- Repeat step 1 and step 2 on each internal node until you reach a leaf node in every branch of the tree.
- The most important step in the entire decision tree algorithm is identifying the best attribute to split on at each node, as this is the key element around which the entire tree is built. The most popular methods for performing this operation are information gain and gini index. (A bare-bones sketch of this recursive procedure follows below.)
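To tie the steps above together, here is a bare-bones sketch of the recursion in Python. It reuses the `Node` class from earlier, and the `best_split` helper simply scans every attribute and threshold using gini impurity (covered later in this article). Treat all of it as an illustrative assumption rather than a reference implementation.

```python
from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels (explained in detail later on).
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    # Step 1: try every attribute and every observed value as a threshold,
    # and keep the split with the lowest weighted impurity.
    best = (None, None, float("inf"))
    for feature in range(len(rows[0])):
        for threshold in {row[feature] for row in rows}:
            left = [labels[i] for i, row in enumerate(rows) if row[feature] < threshold]
            right = [labels[i] for i, row in enumerate(rows) if row[feature] >= threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best[2]:
                best = (feature, threshold, score)
    return best[0], best[1]

def build_tree(rows, labels, depth=0, max_depth=5):
    feature, threshold = best_split(rows, labels)
    # Stop when the node is pure, no useful split exists, or the depth limit is
    # reached: create a leaf holding the majority class.
    if len(set(labels)) == 1 or feature is None or depth >= max_depth:
        return Node(label=Counter(labels).most_common(1)[0][0])
    # Steps 2 and 3: split the data on the chosen attribute and recurse on each subset.
    left_labels = [labels[i] for i, r in enumerate(rows) if r[feature] < threshold]
    right_labels = [labels[i] for i, r in enumerate(rows) if r[feature] >= threshold]
    return Node(feature=feature, threshold=threshold,
                left=build_tree([r for r in rows if r[feature] < threshold], left_labels, depth + 1, max_depth),
                right=build_tree([r for r in rows if r[feature] >= threshold], right_labels, depth + 1, max_depth))
```

For example, once the weather table later in this article is loaded into Python lists, calling `build_tree(rows, labels)` would grow a small tree of `Node` objects.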
Let’s take an example to fully understand what we’re talking about:
Flight Data
We have a collection of weather data for a city in which a VIP is expected to land. Based on the weather data that we have collected, we can create a decision tree and predict whether the weather conditions are suitable for it to land or not.
Here’s the weather data that we have got:
Sr.No | Outlook | Temperature | Humidity | Wind | Should the flight land? |
1 | 4.8 | 3.4 | 1.9 | 0.2 | Yes |
2 | 5 | 3 | 1.6 | 0.2 | Yes |
3 | 5 | 3.4 | 1.6 | 0.4 | Yes |
4 | 6.3 | 3.3 | 4.7 | 1.6 | No |
5 | 5.2 | 3.4 | 1.4 | 0.2 | Yes |
6 | 4.7 | 3.2 | 1.6 | 0.2 | Yes |
7 | 6.9 | 3.1 | 4.9 | 1.5 | No |
8 | 5.4 | 3.4 | 1.5 | 0.4 | Yes |
9 | 7 | 3.2 | 4.7 | 1.4 | No |
10 | 6.4 | 3.2 | 4.5 | 1.5 | No |
11 | 4.8 | 3.1 | 1.6 | 0.2 | Yes |
12 | 5.5 | 2.3 | 4 | 1.3 | No |
13 | 6.5 | 2.8 | 4.6 | 1.3 | No |
14 | 5.7 | 2.8 | 4.5 | 1.3 | No |
15 | 5.2 | 3.5 | 1.5 | 0.2 | Yes |
16 | 4.9 | 2.4 | 3.3 | 1 | No |
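For readers who want to follow the calculations in code, here is the table above captured as plain Python lists (column order: outlook, temperature, humidity, wind; `labels` holds the “Should the flight land?” column). The variable names are just a convenience for the snippets that follow.

```python
# The 16 weather records from the table, in row order.
weather = [
    [4.8, 3.4, 1.9, 0.2], [5.0, 3.0, 1.6, 0.2], [5.0, 3.4, 1.6, 0.4],
    [6.3, 3.3, 4.7, 1.6], [5.2, 3.4, 1.4, 0.2], [4.7, 3.2, 1.6, 0.2],
    [6.9, 3.1, 4.9, 1.5], [5.4, 3.4, 1.5, 0.4], [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5], [4.8, 3.1, 1.6, 0.2], [5.5, 2.3, 4.0, 1.3],
    [6.5, 2.8, 4.6, 1.3], [5.7, 2.8, 4.5, 1.3], [5.2, 3.5, 1.5, 0.2],
    [4.9, 2.4, 3.3, 1.0],
]
labels = ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes",
          "No", "No", "Yes", "No", "No", "No", "Yes", "No"]
```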
So let’s walk through the process of predicting whether or not the flight will be allowed to land. Based on the data table, we have four weather attributes: outlook, temperature, humidity and wind. One of these will become the root node. We don’t yet know which, but we can find out using two methods: information gain and gini index.
Method #1: Calculating information gain
Information gain is a really important concept in decision trees, but before we get to it, we must ensure that we understand the term entropy. Entropy is a measure of the degree of uncertainty in a variable X. It tells us how mixed, or impure, a collection of examples is.
Given a collection of data S, which contains both positive and negative outcomes of some target concept, the entropy of S relative to this boolean classification is:

Entropy(S) = -p+ * log2(p+) - p- * log2(p-)

Here, p+ and p- are the proportions of positive and negative outcomes in S.

If all outcomes are positive (p+ = 1), then p- is 0, and

Entropy(S) = -1 * log2(1) - 0 * log2(0)
= -1 * 0 - 0
= 0

(by convention, the term 0 * log2(0) is treated as 0).
If the data contains an equal number of positive and negative outcomes, the entropy is 1. If the collection contains unequal numbers of positive and negative outcomes, the entropy is somewhere between 0 and 1.
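Here is the entropy definition above translated directly into a small Python helper; it works on any list of class labels, such as the `labels` list from the table. This is a minimal sketch rather than a library implementation.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)   # the 0 * log2(0) term is simply skipped
    return result

print(entropy(["Yes"] * 8))               # 0.0 -> all outcomes positive
print(entropy(["Yes"] * 8 + ["No"] * 8))  # 1.0 -> an even 50/50 split
```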
Information gain estimates the expected reduction in entropy. The attribute with the most entropy reduction is the best choice. This attribute then goes into the decision node.
The mathematical representation of this concept is as follows:

Gain(S, A) = Entropy(S) - Σv ( |Sv| / |S| ) * Entropy(Sv)

where:
v = each value in the set of all possible values of attribute A;
Sv = subset of S for which attribute A has value v;
|Sv| = number of elements in Sv;
|S| = number of elements in S
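Here is the gain formula expressed as a sketch in Python. Since our attributes are numeric, the subsets Sv are formed by a threshold (below versus at-or-above); the function reuses the `entropy()` helper above and is meant to be used with the `weather` and `labels` lists from the table.

```python
def information_gain(rows, labels, feature, threshold):
    # Partition the class labels according to the chosen attribute and threshold.
    left = [labels[i] for i, r in enumerate(rows) if r[feature] < threshold]
    right = [labels[i] for i, r in enumerate(rows) if r[feature] >= threshold]
    total = len(labels)
    # Expected entropy after the split, weighted by subset size.
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(labels) - weighted
```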
The formula might look a bit jarring at first glance. However, it’s very easy to apply. Let’s apply it to the weather dataset we discussed earlier. We’ll recall the attributes and their values from the table and, for the sake of this example, categorise them using simple threshold values.
I. Categorizing the values:
Given the values in the dataset, we can categorise them as follows:
outlook ( >=5 and <5 );
temperature ( >=3 and <3 );
humidity ( >=4.2 and <4.2 );
wind ( >=1.4 and <1.4 )
II. Objective:
The objective of this exercise is to determine which attribute will be the best attribute for the root node in our decision tree.
III. Calculating entropy:
We have the weather data for 16 days, based on which we’ll establish whether or not it’s possible for the flight to land. Out of 16 days, we have 8 outcomes that say ‘yes’, the flight can land, and 8 outcomes that say ‘no’, it cannot.
Based on these findings and the rules for the entropy that we have established earlier,
Entropy(S) = -1 * ( p(+ve)*log2(p(+ve)) + p(-ve)*log2(p(-ve)) )
= -1 * ( (8/16)*log2(8/16) + (8/16)*log2(8/16) )
= 1
IV. Information Gain for Outlook:
Outlook has a value >=5 for 12 of the 16 records and a value <5 for the remaining 4 records.
For Outlook >= 5 & class == positive: 5/12
For Outlook >= 5 & class == negative: 7/12
Entropy(5,7) = -1 * ( (5/12)*log2(5/12) + (7/12)*log2(7/12)) = 0.9799
For Outlook <5 & class == positive: 3/4
For Outlook <5 & class == negative: 1/4
Entropy(3,1) = -1 * ( (3/4)*log2(3/4) + (1/4)*log2(1/4)) = 0.81128
Entropy(Target, Outlook) = P(>=5) * E(5,7) + P(<5) * E(3,1)
= (12/16) * 0.9799 + (4/16) * 0.81128 = 0.937745
Information Gain(IG) = E(Target) – E(Target,Outlook) = 1 - 0.937745 = 0.062255
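As a sanity check, the snippet below reproduces the outlook numbers with the helpers defined earlier (`entropy()`, `information_gain()`, and the `weather`/`labels` lists; column 0 is outlook).

```python
outlook_ge5 = [labels[i] for i, r in enumerate(weather) if r[0] >= 5]
outlook_lt5 = [labels[i] for i, r in enumerate(weather) if r[0] < 5]

print(round(entropy(outlook_ge5), 4))                     # ~0.9799
print(round(entropy(outlook_lt5), 4))                     # ~0.8113
print(round(information_gain(weather, labels, 0, 5), 4))  # ~0.0623
```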
V. Information Gain for Temperature
Temperature has a value >=3 for 12 of the 16 records and a value <3 for the remaining 4 records.
For Temperature >= 3 & class == positive: 8/12
For Temperature >= 3 & class == negative: 4/12
Entropy(8,4) = -1 * ( (8/12)*log2(8/12) + (4/12)*log2(4/12)) = 0.39054
For Temperature <3 & class == positive: 0/4
For Temperature <3 & class == negative: 4/4
Entropy(0,4) = -1 * ( (0/4)*log2(0/4) + (4/4)*log2(4/4)) = 0
Entropy(Target, Temperature) = P(>=3) * E(8,4) + P(<3) * E(0,4)
= (12/16) * 0.39054 + (4/16) * 0 = 0.292905
Information Gain(IG) = E(Target) – E(Target,Temperature) = 1 - 0.292905 = 0.707095
VI. Information Gain for Humidity
Humidity has a value >=4.2 for 6 of the 16 records and a value <4.2 for the remaining 10 records.
For Humidity >= 4.2 & class == positive: 0/6
For Humidity >= 4.2 & class == negative: 6/6
Entropy(0,6) = 0
For Humidity < 4.2 & class == positive: 8/10
For Humidity < 4.2 & class == negative: 2/10
Entropy(8,2) = 0.72193
Entropy(Target, Humidity) = P(>=4.2) * E(0,6) + P(< 4.2) * E(8,2) = (6/16) * 0 + (10/16) * 0.72193 = 0.4512
Information Gain(IG) = E(Target) – E(Target,Humidity) = 1- 0.4512= 0.5488
VII. Information Gain for Wind
Wind has a value >=1.4 for 5 of the 16 records and a value <1.4 for the remaining 11 records.
For Wind >= 1.4 & class == positive: 0/5
For Wind >= 1.4 & class == negative: 5/5
Entropy(0,5) = 0
For Wind < 1.4 & class == positive: 8/11
For Wind < 1.4 & class == negative: 3/11
Entropy(8,3) = -1 * ( (8/11)*log2(8/11) + (3/11)*log2(3/11)) = 0.84532
Entropy(Target, Wind) = P(>=1.4) * E(0,5) + P(< 1.4) * E(8,3)
= 5/16 * 0 + (11/16) * 0.84532 = 0.5811575
Information Gain(IG) = E(Target) – E(Target,Wind) = 1 - 0.5811575 = 0.4188425
From the above calculations, let’s build the decision tree: the attribute with the highest information gain is positioned as the root, any branch with entropy 0 becomes a leaf node, and a branch with entropy greater than 0 is split further.
Now, how do we use the tree we have constructed?
It helps us predict whether the flight will be able to land. Let’s consider a new record with outlook 5, temperature 3.3, humidity 2.7 and wind 1.2. As temperature is >=3, we move to the humidity node. As humidity is <4.2, we move to the wind node. And as wind is <1.4, which also falls on the positive branch, the flight can land (according to this model, of course)!
Method #2: Calculating Gini Index
The gini index measures how often a randomly chosen record would be incorrectly classified if it were labelled at random according to the class distribution. Naturally, we’re looking for the attribute with the lowest gini index.
It is very important to note that the gini index should not be confused with the gini coefficient. Both can be applied in data analysis and machine learning applications, but the gini index (sometimes called gini impurity) can be applied to multiclass datasets, while the gini coefficient works well only with binary classification.
The mathematical expression for the gini index is much simpler than the one for information gain:

Gini(S) = 1 - Σ (pi)²

where pi is the proportion of records in S that belong to class i. Now, we’ll apply it to the weather data collection that we have.
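In code, the gini index and the weighted gini of a split can be sketched as follows, mirroring the entropy helpers used earlier; again, this is illustrative rather than a library implementation.

```python
from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels.
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, labels, feature, threshold):
    # Gini of a binary split on one attribute, weighted by subset size.
    left = [labels[i] for i, r in enumerate(rows) if r[feature] < threshold]
    right = [labels[i] for i, r in enumerate(rows) if r[feature] >= threshold]
    total = len(labels)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)
```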
I. Gini index for Outlook
Outlook has a value >=5 for 12 of the 16 records and a value <5 for the remaining 4 records.
For Outlook >= 5 & class == positive: 5/12
For Outlook >= 5 & class == negative: 7/12
gini(5,7) = 1 - ( (5/12)² + (7/12)² ) = 0.486
For Outlook <5 & class == positive: 3/4
For Outlook <5 & class == negative: 1/4
gini(3,1) = 1 - ( (3/4)² + (1/4)² ) = 0.375
gini(Target, Outlook) = (12/16) * (0.486) + (4/16) * (0.375) = 0.45825
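The snippet below reproduces these outlook figures with the helpers above and the `weather`/`labels` lists from the data table (column 0 is outlook); the small differences are just rounding.

```python
outlook_ge5 = [labels[i] for i, r in enumerate(weather) if r[0] >= 5]
outlook_lt5 = [labels[i] for i, r in enumerate(weather) if r[0] < 5]

print(round(gini(outlook_ge5), 4))                     # ~0.4861
print(round(gini(outlook_lt5), 4))                     # ~0.375
print(round(weighted_gini(weather, labels, 0, 5), 4))  # ~0.4583
```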
II. Gini Index for Temperature
Temperature has a value >=3 for 12 of the 16 records and a value <3 for the remaining 4 records.
For Temperature >= 3 & class == positive: 8/12
For Temperature >= 3 & class == negative: 4/12
gini(8,4) = 1 - ( (8/12)² + (4/12)² ) = 0.4444
For Temperature <3 & class == positive: 0/4
For Temperature <3 & class == negative: 4/4
gini(0,4) = 1 - ( (0/4)² + (4/4)² ) = 0
gini(Target, Temperature) = (12/16) * 0.4444 + (4/16) * 0 = 0.3333
III. Gini Index for Humidity
Humidity has a value >=4.2 for 6 of the 16 records and a value <4.2 for the remaining 10 records.
For Humidity >= 4.2 & class == positive: 0/6
For Humidity >= 4.2 & class == negative: 6/6
gini(0,6) = 1 - ( (0/6)² + (6/6)² ) = 0
For Humidity < 4.2 & class == positive: 8/10
For Humidity < 4.2 & class == negative: 2/10
gini(8,2) = 1 - ( (8/10)² + (2/10)² ) = 0.32
gini(Target, Humidity) = (6/16) * 0 + (10/16) * 0.32 = 0.2
IV. Gini Index for Wind
Wind has a value >=1.4 for 5 of the 16 records and a value <1.4 for the remaining 11 records.
For Wind >= 1.4 & class == positive: 0/5
For Wind >= 1.4 & class == negative: 5/5
gini(0,5) = 1 - ( (0/5)² + (5/5)² ) = 0
For Wind < 1.4 & class == positive: 8/11
For Wind < 1.4 & class == negative: 3/11
gini(8,3) = 1 - ( (8/11)² + (3/11)² ) = 0.397
gini(Target, Wind) = (5/16) * 0 + (11/16) * 0.397 = 0.273
As humidity has the lowest gini index, it becomes the root node. Now, let’s construct the decision tree.
Now, how do we use the tree that has been constructed? It helps us predict whether the flight will be able to land. Let’s consider the same record as in the previous example: outlook 5, temperature 3.3, humidity 2.7 and wind 1.2.
Here, as humidity is <4.2, we move to the wind node. As wind is <1.4, we move to temperature, which also falls on the positive branch. This indicates that the flight can land.
Notice how the root node and internal nodes change for the same data when the method used to determine the best attribute is changed.
Note: Why did we exclude the outlook attribute?
You’re probably wondering why there’s no sign of outlook in the decision tree. Well, we aren’t considering the outlook attribute because it has the least information gain value and the highest gini index. Also, if we try to fit the data further with outlook, we might end up overfitting the model to the training data.
With this, we have understood how to identify the best attribute for constructing a decision tree from the given data, and how to use the tree to potentially predict future outcomes.
Regression Trees
So far, we have constructed the decision trees based on the classification approach. When data is continuous and quantitative, however, this will not do. We would instead need to use regression trees.
Approach: Regression trees use a strategy called binary recursive partitioning. It is an iterative process that splits the data into partitions and continues splitting each partition into smaller groups as it moves down each branch.
Given a dataset, all the records are initially grouped into the same partition. Binary recursive partitioning then creates the first two partitions by considering every possible binary split on every attribute.
The split is selected using a method called ‘least squared deviation’, a very basic least-squares criterion that is widely used in machine learning applications. This approach is applied recursively to every partition until each node becomes a leaf node in the tree.
The criterion is the sum of squared deviations of the target values in a partition from the partition’s mean:

SSD(S) = Σ (yi - ȳ)²

where yi is the target value of each record in the partition and ȳ is the mean of those values. The split that minimises the combined SSD of the two resulting partitions is the one that is chosen.
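A minimal sketch of this idea in Python: each candidate split is scored by the summed squared deviation of the target values in the two partitions from their respective means, and the split with the smallest score wins. The function names are illustrative assumptions.

```python
def ssd(values):
    # Sum of squared deviations from the mean of this partition.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_regression_split(rows, targets):
    # Try every attribute and observed value as a threshold; keep the split
    # with the smallest combined SSD of the two partitions.
    best = None
    for feature in range(len(rows[0])):
        for threshold in {r[feature] for r in rows}:
            left = [targets[i] for i, r in enumerate(rows) if r[feature] < threshold]
            right = [targets[i] for i, r in enumerate(rows) if r[feature] >= threshold]
            if not left or not right:
                continue
            score = ssd(left) + ssd(right)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best  # (feature index, threshold, combined SSD)
```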
The Problem of Overfitting in Decision Trees
This is one of the biggest challenges for decision tree methodologies. The process of constructing decision trees takes place recursively and always tries to find an optimal solution, which leads to the problem of overfitting.
Overfitting, as the name suggests, is a phenomenon in which the decision tree fits the training dataset so tightly that it produces inaccurate outcomes for unseen data. This undermines the whole learning goal, as forecasts and predictions start deviating from the actual results.
This happens because the tree-building approach keeps creating ever more specific rules on branches where only sparse training data is available. The result is highly accurate on the available training data but fails when fed with unknown, unseen data.
Pruning in Decision Trees
Pruning is a popular approach to address overfitting in decision trees. The literal meaning of the word prune is to cut a branch off a tree, and that holds true in this context as well: we trim off branches in such a way that overall accuracy is not disturbed. Pruning is typically performed after the initial training is complete.
General procedure for pruning:
- In the simplest and most common approach, the available data is separated into a training set and a validation set. The training set is used to build the decision tree, and the validation set is used to estimate the impact of pruning once the tree is built (a small sketch of this validation-set approach appears after this list). This involves some trial and error, because choosing the training and validation sets from the given collection of data is fairly tricky upfront. This qualitative approach works well with smaller datasets, where we have enough visibility into the data.
- The second approach is quantitative in nature: we build the tree using the available dataset, then apply statistical checks to evaluate whether pruning or expanding a particular node is likely to produce any improvement beyond the training set. Those statistical tests are:
- Error estimation: In this test, we estimate the expected error at a node assuming we prune it, and compare it with the backed-up error estimated from its child nodes. If the estimated (static) error is less than or equal to the backed-up error, we prune the branch, making the node a leaf.
- Significance testing: It estimates goodness of fit, homogeneity and independence:
- The test for goodness of fit establishes whether data from the training data set significantly differs from the validation dataset.
- The test for homogeneity compares the distribution of counts for two or more groups of the same categorical variable. For example, the choices of high-school graduates (further study, employment, travel and so on) can be recorded over a period of time; this data helps us understand whether graduates are making different choices across different attributes over years or decades.
- The test for independence assesses whether unpaired record sets of two different variables are independent of each other. For example, polling responses from people of different countries can be used to see if one’s nationality is related to the response that they submit during the voting process.
- While building trees using these tests, it becomes easy to determine whether the data has reached a point where there is no possibility of major variation in its pattern. In such cases, we can confidently prune the tree.
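Here is a small, hedged sketch of the validation-set approach from the first bullet (often called reduced-error pruning). It reuses the `Node` class from earlier and routes records through the tree; everything here is an assumption for illustration, not a canonical implementation.

```python
from collections import Counter

def predict(node, row):
    # Route a record down the tree until a leaf is reached.
    while not node.is_leaf():
        node = node.left if row[node.feature] < node.threshold else node.right
    return node.label

def errors(node, rows, labels):
    # Number of records the (sub)tree classifies incorrectly.
    return sum(predict(node, row) != label for row, label in zip(rows, labels))

def prune(node, train_rows, train_labels, val_rows, val_labels):
    # Bottom-up: prune the children first, then replace this node with a leaf
    # if doing so does not hurt accuracy on the validation set.
    if node.is_leaf() or not train_labels or not val_labels:
        return node
    goes_left = lambda row: row[node.feature] < node.threshold
    node.left = prune(node.left,
                      [r for r in train_rows if goes_left(r)],
                      [y for r, y in zip(train_rows, train_labels) if goes_left(r)],
                      [r for r in val_rows if goes_left(r)],
                      [y for r, y in zip(val_rows, val_labels) if goes_left(r)])
    node.right = prune(node.right,
                       [r for r in train_rows if not goes_left(r)],
                       [y for r, y in zip(train_rows, train_labels) if not goes_left(r)],
                       [r for r in val_rows if not goes_left(r)],
                       [y for r, y in zip(val_rows, val_labels) if not goes_left(r)])
    # Candidate leaf: the majority training class reaching this node.
    leaf = Node(label=Counter(train_labels).most_common(1)[0][0])
    if errors(leaf, val_rows, val_labels) <= errors(node, val_rows, val_labels):
        return leaf
    return node
```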
Minimum Description Length Principle
This machine learning principle says that the data is best described by the model that compresses it the most. Sounds intriguing! In simple terms, learning a model for the data is about capturing the regularities in the data, and any regularity in the data can be used to compress it. Thus, the more we can compress the data, the more we have learnt about it and the better we can predict it. In the context of pruning, if we start finding repetition in the dataset while generating nodes, we can consider that we have reached a regularity in the data; it can be compressed by pruning, which in turn makes learning and prediction more efficient.
Pre-pruning and Post-pruning
Pre-pruning: In this approach, we stop building the tree before it perfectly classifies the training set; we never allow the algorithm to grow a full tree. Pre-pruning is also referred to as an “early stopping rule”. Tree-building stops if all the records under consideration belong to the same class or if the values of the different attributes are the same (which happens rarely). Pruning can also be triggered by predefined conditions set on the data before the algorithm runs. In most cases, however, this becomes fairly restrictive and inaccurate.
Pre-pruning is definitely simple to understand and execute. However, the downside of pre-pruning is that the algorithm has to make pruning decisions while the tree is being built. It cannot guarantee an optimal decision tree model. It works well when we are dealing with a smaller data set, while trying to avoid the problem of overfitting.
Post-pruning: This approach allows the tree to be fully constructed with the training dataset, after which the pruning mechanisms discussed earlier can be applied. Post-pruning works in a bottom-up fashion once the tree is built, and it has well-defined mathematical approaches, as discussed in the earlier sections of this article (error estimation, significance testing and the minimum description length principle).
Compared to pre-pruning, post-pruning works more reliably, as it does not have to make pruning decisions while the tree is still being built. It effectively reduces overfitting because it evaluates the fully grown tree and can judge precisely which branches to remove.
Post-pruning works best when we are dealing with a large dataset whose tree has an overfitting problem, and it generally handles overfitting better than pre-pruning. However, using post-pruning on a small dataset, which may not produce a complex decision tree in the first place, may not be worthwhile.
It is important to understand that the decision to apply pre-pruning or post-pruning is dependent on the data set that you are dealing with.
Popular Decision Tree Algorithms
1. ID3
Iterative Dichotomiser 3 (ID3) is one of the earliest decision tree algorithms. It uses a greedy approach, selecting the best attribute to split the dataset on at each iteration. The main issues with ID3 are that it does not guarantee an optimal solution and that it does not handle continuous data efficiently.
2. C4.5
C4.5 is considered a successor to ID3 and is seen as a workhorse in machine learning applications. The major improvements in C4.5 from ID3 are handling of both continuous and discrete data and handling of the data with missing values. Most importantly, it tries to find the optimal solution by pruning after the trees are formed.
3. CART
CART stands for Classification and Regression Trees. Interestingly, ID3 and CART were developed independently at around the same time and follow a similar approach to learning decision trees from a dataset. The gini index we used in the earlier example comes from CART, while information gain comes from the ID3 family. At the beginning of this article, we said that decision trees work for both classification and regression problems; CART justifies that statement by supporting both classification and regression trees.
4. CHAID
CHAID is an acronym for Chi-square Automatic Interaction Detection. It has a very specific set of applications and performs multi-level splits when computing classification trees. In practice, it is often used for understanding consumer behaviour and in medical and psychiatric research.
5. MARS
MARS stands for Multivariate Adaptive Regression Splines. It was developed to handle numerical data effectively and can be seen as an extension of linear regression models. It is widely used to forecast data in the biological, behavioural and social sciences.
Endnote
Decision trees follow a recursive approach to process a dataset. The most important step in the entire decision tree algorithm is identifying the best attribute to split on from the data attributes.
The most popular methods for performing this operation are information gain and gini index. Information gain estimates the expected reduction in entropy. The attribute with the most entropy reduction is the best choice.
The gini index measures how often a randomly chosen record would be incorrectly classified. The intention is to find the attribute with the lowest gini index. Regression trees are decision trees that work with continuous and quantitative data.
Pruning is a popular approach to address overfitting in decision trees.