A Complete Guide to Decision Tree Formation and Interpretation in Machine Learning

Decision Tree is one of the easiest algorithm to understand and interpret. If you are familiar with if-else statements in programming then that is pretty much all you need to know to understand how decision tree algorithm works. Even if you are not familiar with programming then also if you apply your common sense and relate with the real life examples, then you will easily understand the decision tree algorithm. For example if vehicle has fuel then it will start, if you try then you can fail or succeed, if it is rainy, match will not happen etc… So based on these conditional statement only decision tree helps us to build a model to predict something for future.

Decision Tree is supervised machine learning algorithm which is used for both types of problems regression (that is predicting the continuous value for future example house price, hours the match can be played given overcast condition etc…) and classification (that is classifying different objects into respective categories or classes for example given the overcast conditions match will be played or not, given image belongs to cat or dog etc…).

Table of Content

  • Important Definitions
  • Homogeneous and Heterogeneous Data
  • Entropy and Information Gain
  • Gini and Gini Impurity
  • Reduction in Standard Deviation
  • Different Nodes in Decision Tree
  • Different Algorithms used to build Decision Tree
  • Decision Tree for Classification
  • Building a decision tree for the given data using Entropy and Information Gain
  • Building a decision tree for the given data using Gini and Gini Index
  • Complete code to build decision tree in python
  • Decision Tree for Regression
  • Build Decision Tree using Standard Deviation Reduction strategy

Important Definitions

Homogeneous and Heterogeneous Data Nodes

Understanding homogeneous nature of data nodes is very important when it comes to decision tree. It is directly related to the predictive power of the tree build. If you pick a random data item from a fully homogeneous node then it wont be any brainier for you to tell which class it belongs to. Got the idea? if not then let me explain what is meant by homogeneous and non homogeneous data nodes.

A fully homogeneous node is one which has all the data items belonging to the same class. And as we keep on mixing the data items from different classes it becomes a non homogeneous or heterogeneous node. So a fully heterogeneous data node will be the one which has equal proportion of all classes data items. Lets understand the same by visualizing some plots as below:

Homogeneous and Heterogeneous Data Node in Decision Tree
Homogeneous and Heterogeneous Data node in Decision Tree | Source Website: www.ashutoshtripathi.com

Above pic is self explanatory, still i want to emphasize one thing that how mathematically you will measure the level of homogeneity in a node? Because this level of homogeneity helps you to decide on where to split the data and how to form the decision tree. That is where concepts of Entropy, Information Gain and Gini Impurity comes in picture.

Entropy and information Gain

Entropy is the measure of randomness in your data. More randomness means more entropy means harder to draw any conclusion or harder to classify. Lets relate it to the homogeneity concept explained above to make it easy to understand. If your data node is heterogeneous it will be hard to draw any conclusion about that data node which will lead to higher entropy.

High Entropy Means More Randomness means less Homogeneous (Heterogeneous)

Entropy of Homogeneous data nodes in decision tree
Source Website: www.ashutoshtripathi.com

So in decision tree each split is based on reduction in entropy so that we can reach towards the less heterogeneous or more homogeneous node. So we split in such a way that after every split we get more homogeneous node so that drawing conclusion becomes easy. And hence in each split there will be reduction in entropy. And this reduction in entropy is known as Information gain.

Entropy in Decision Tree
Entropy in Decision Tree | Source Website: www.ashutoshtripathi.com

in the above plot, it is visible that when we have the perfect outcome which mean probability value equal to either 1 or 0 then Entropy is 0. Mean there is no randomness involved. And when degree of randomness increases probability will be mapped in between 0 to 1. For probability = 0.5 (both events are equally likely) there will be maximum randomness and hence Entropy will be maximum that is 1.

Entropy Forumla
Entropy Formula

We will be using below data for calculation purpose (Even for the next definitions in the post)

Decision Tree Outlook Dataset | source website: www.ashutoshtripathi.com

Entropy Calculation

Entropy Calculation for different data nodes in Decision Tree | Source : www.ashutoshtripathi.com

Information Gain

Information gain is the measure of homogeneity we achieve after splitting a data node. In statistical terminology the amount of reduction in entropy after splitting the data node is known as information gain.

Information Gain Formula in Decision Tree | source: www.ashutoshtripathi.com

From above formula you can easily say that:

More Information Gain = Less Entropy(T,X) = More Homogeneous Node

This goes in align with our objective of getting the more homogeneous data node after the split. So we calculate the entropy of target with each independent variable and see where information gain is more means where we getting more homogeneous node. So the variable which has more information gain will be placed at root.

Suppose a data set contains two variables which mean two nodes. And in below image i tried to visualize that how splitting look like if we try to split by A and B separately. So if you see when we try to split with B then we get comparatively more homogeneous nodes. This means splitted nodes have more proportion of one class than the other class.

Information Gain Info graphic Decision Tree | Source: www.ashutoshtripathi.com

There is another criteria to split the node that is using Gini and Gini Impurity. Let’s understand more about Gini criteria.

Gini and Gini Impurity

Gini is calculated by using the formula — the sum of square of probability for success and failure (p²+q²)

Gini is directly proportional to homogeneity. Higher the value of gini higher the level of homogeneity after the split.

Gini Impurity is inversely proportional to the homogeneity. So while spliting we look for least gini impurity and the node which has least gini impurity is made as root node.

Gini Formula in Decision Tree | Source: www.ashutoshtripathi.com

Classification and Regression Tree (CART) algorithm uses gini score as the criteria to split the nodes in decision tree. And other important point is that Gini works only for binary split with categorical target as success or failure.

Lets take the football play example data and calculate gini for the outlook column.

Gini calculation for outlook column in decision tree | source: www.ashutoshtripathi.com

Here again similar to entropy of target with respect to independent variable, will calculate gini of target with respect to independent variable and it is also known as weighted gini.

Here also will calculate weighted gini with respect to rainy, overcast and sunny rows and sum them all to get the final gini value for outlook column.

Gini Impurity calculation example | source: www.ashutoshtripathi.com
weighted gini impurity in decision tree example | source: www.ashutoshtripathi.com

Similarly you can create the gini impurity for other independent variables and see which has least gini impurity. The one which has least gini impurity will be the root node.

Reduction in Variance or Standard Deviation Reduction — SDR

Entropy, Information Gain and Gini Impurity are used in classification problems. So what about regression? So similar to information gain that is reduction in entropy, we have reduction in variance or reduction in standard deviation which is used for regression type problem in decision tree.

outlook Dataset for decision tree | image source: google images
image source: google images

So we calculate weighted SDR with respect to independent variable and see where we get more SDR value and that will be the root node.

image source: www.ashutoshtripathi.com

To calculate standard deviation for individual attributes for example rainy outlook then consider only rainy rows data and apply the formula on hours played.

weighted standard deviation calculation in decision tree regression | source: www.ashutoshtripathi.com
Standard Deviation Reduction in regression decision tree | source: www.ashutoshtripathi.com

Standard Deviation Reduction is based on the decrease in standard deviation after the split. So the variable which gives highest SDR will be declared as the root node. And the highest SDR will corresponds to the most homogeneous nodes after the split.

Understand different nodes in Decision Trees

  1. Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets. If you consider Information gain then the node which has maximum information gain will become the root node.
  2. Decision Node: When a sub-node splits into further sub-nodes, then it is called the decision node. Basically it helps you to decide whether to choose (select or go) right sub tree or left sub tree
  3. Leaf / Terminal Node: Nodes do not split is called Leaf or Terminal node.
  4. Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
  5. Parent and Child Node: A node, which is divided into sub-nodes is called a parent node of sub-nodes whereas sub-nodes are the child of a parent node.
  6. Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say the opposite process of splitting. It is used in hyper parameter tuning in decision trees.

Different Algorithms used to build Decision Tree

Decision Tree is used for both types of problems Classification and Regression. So based on problem types it uses different algorithms.

Algorithms are also categorized based on the different splitting criteria like entropy, information gain, gini and gini index, and standard deviation reduction.

Let us look at some algorithms used in Decision Trees:

ID3 → (Iterative Dichotomiser 3, uses entropy and information for classification problems and standard deviation reduction for regression problems)
→ (successor of ID3 and used for classification)
CART → (Classification And Regression Tree. used for both classification and regression)
CHAID → (Chi-square automatic interaction detection Performs multi-level splits when computing classification trees)
MARS → (multivariate adaptive regression splines)

Decision Tree for Classification Problems

Building a decision tree for the given data using Entropy and Information

Please watch the video explaining decision tree formation and split using entropy and information gain.

Decision Tree formation step by step explanation

Building a decision tree for the given data using Gini and Gini Index

Please watch the video which explains decision tree formation using gini and gini index.

Complete Code to build Decision Tree for classification using Python

Please refer my post on step by step guide to build decision tree using python.


Decision Tree for Regression

Please refer my post on step by step guide to build decision tree for Regression


Originally published at http://ashutoshtripathi.com on March 29, 2022.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store