Introduction
Classification in machine learning is a two-step process: a learning step and a prediction step. In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data. The Decision Tree is one of the easiest and most popular classification algorithms to understand and interpret.
The Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike many other supervised learning algorithms, it can be used for solving both regression and classification problems.
In a Decision Tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches. A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
In Decision Trees, to predict a class label for a record we start from the root of the tree. We compare the value of the root attribute with the record's attribute value. On the basis of the comparison, we follow the branch corresponding to that value and jump to the next node, repeating the process until we reach a leaf node that gives the predicted class.
In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
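As a concrete illustration, here is a minimal sketch of the two steps (learning and prediction) using scikit-learn's DecisionTreeClassifier, which implements an optimized CART-style tree. The Iris dataset and the max_depth value are illustrative assumptions, not part of the algorithm itself.

```python
# Minimal sketch: fitting and querying a CART-style tree with scikit-learn.
# The Iris dataset and the max_depth value are illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# scikit-learn's DecisionTreeClassifier implements an optimized CART algorithm.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)           # learning step: build the tree from training data

print(clf.predict(X_test[:5]))      # prediction step: walk each record from root to a leaf
print(clf.score(X_test, y_test))    # accuracy on held-out data
```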
Important Terminology related to Decision Trees
- Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
- Splitting: It is a process of dividing a node into two or more sub-nodes.
- Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
- Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
- Pruning: When we remove sub-nodes of a decision node, the process is called pruning. You can say it is the opposite of splitting.
- Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
- Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
Assumptions
- In the beginning, the whole training set is considered as the root.
- Feature values are preferred to be categorical. If the values are continuous, they are discretized prior to building the model (a minimal discretization sketch follows this list).
- Records are distributed recursively on the basis of attribute values.
- The order of placing attributes as the root or internal nodes of the tree is decided by using some statistical approach.
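As one hedged illustration of the discretization assumption above, the sketch below bins a continuous column with scikit-learn's KBinsDiscretizer; the toy values and the bin count of 4 are arbitrary choices for illustration. Note that CART-style implementations can also split continuous features directly by thresholds.

```python
# Minimal sketch: discretizing a continuous feature before tree building.
# KBinsDiscretizer and the bin count of 4 are illustrative assumptions.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[23.0], [35.5], [41.2], [52.8], [60.1], [18.4]])  # e.g. an "age" column

disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)    # each value replaced by its bin index (0..3)
print(X_binned.ravel())
```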
How Decision Tree works
The choice of where to make strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.
Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of each node increases with respect to the target variable. The decision tree evaluates splits on all available variables and then selects the split that results in the most homogeneous sub-nodes.
The algorithm selection is also based on the type of target variables.
Below are some algorithms used in Decision Trees:
- ID3 → (Iterative Dichotomiser 3)
- C4.5 → (successor of ID3)
- CART → (Classification And Regression Tree)
- CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
- MARS → (Multivariate Adaptive Regression Splines)
The most widely used among them are ID3 and CART. Thus, we will discuss these two algorithms here.
ID3
Information gain measures the change in entropy after a dataset is segmented on an attribute. It calculates how much information a feature provides us about the class. According to the value of information gain, we split the node and build the decision tree. A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute with the highest information gain is split first. It can be calculated using the formula below (a short numeric sketch of both formulas follows the entropy definition): \[ Information \hspace{1.25mm} Gain = Entropy(S) - [(Weighted \hspace{1.25mm} Avg) * Entropy(each \hspace{1.25mm} feature)] \]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as: \[ Entropy(S) = -P(yes)\log_2 P(yes) - P(no) \log_2 P(no) \] Where,
- S= Total number of samples
- P(yes)= probability of yes
- P(no)= probability of no
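A small numeric sketch of the two formulas above, using made-up binary labels and a hypothetical split into two child nodes:

```python
# Sketch of the entropy and information-gain formulas above.
# The toy labels and the feature split are made-up illustrative data.
import numpy as np

def entropy(labels):
    """Entropy(S) = -P(yes)*log2 P(yes) - P(no)*log2 P(no), generalized to any labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, child_label_groups):
    """Entropy(S) minus the weighted average entropy of the child nodes."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        len(child) / n * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy

parent = np.array(["yes", "yes", "yes", "no", "no", "no", "yes", "no"])
# Splitting on a hypothetical feature produces two child nodes:
children = [parent[:5], parent[5:]]
print(entropy(parent))               # 1.0 for a 50/50 split
print(information_gain(parent, children))
```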
Gini Index
The Gini index is a measure of impurity (or purity) used while creating a decision tree in the CART (Classification and Regression Tree) algorithm. An attribute with a low Gini index should be preferred over one with a high Gini index. The CART algorithm uses the Gini index to create binary splits only. The Gini index can be calculated using the formula below: \[ Gini \hspace{1.25mm} Index = 1 - \sum_{j}P_j^2 \]
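A short sketch of the Gini formula, including the size-weighted Gini of a candidate binary split (the split with the lowest weighted value yields the purest sub-nodes); the toy labels are made up for illustration.

```python
# Sketch of the Gini index formula and of scoring a candidate binary split.
# The toy labels are made-up illustrative data.
import numpy as np

def gini_index(labels):
    """Gini = 1 - sum_j P_j^2 over the class proportions in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left_labels, right_labels):
    """Impurity of a binary split: size-weighted average of the children's Gini."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_index(left_labels) + \
           (len(right_labels) / n) * gini_index(right_labels)

node = np.array(["yes", "yes", "yes", "no", "no", "no", "yes", "no"])
print(gini_index(node))                      # 0.5 for a perfectly mixed node
print(weighted_gini(node[:5], node[5:]))     # lower values mean purer sub-nodes
```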
Pruning
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree.
A tree that is too large increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. Pruning is therefore a technique that decreases the size of the learned tree without reducing accuracy. There are mainly two types of tree pruning techniques used:
- Cost Complexity Pruning (sketched after this list)
- Reduced Error Pruning
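As a hedged illustration of cost complexity pruning, scikit-learn exposes it through the ccp_alpha parameter of DecisionTreeClassifier. In the sketch below, the breast-cancer dataset and the way alpha is picked are illustrative choices, not a recommended recipe.

```python
# Sketch of cost complexity pruning with scikit-learn's ccp_alpha parameter.
# The breast-cancer dataset and the way alpha is chosen here are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree, typically large and prone to overfitting.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# cost_complexity_pruning_path returns the effective alphas of the pruned subtrees.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # pick a mid-range alpha for illustration

pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)

print("nodes before pruning:", full_tree.tree_.node_count)
print("nodes after pruning: ", pruned_tree.tree_.node_count)
print("test accuracy (pruned):", pruned_tree.score(X_test, y_test))
```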
Advantages
- Simple to understand, interpret and visualize.
- Little effort is required for data preparation.
- It can handle both numerical and categorical data.
- Non-linear relationships between parameters do not affect its performance, i.e. even if the data doesn't fit a simple curve, we can still use it to build an effective decision tree.
Disadvantages
- Overfitting: Overfitting occurs when the algorithm captures noise in the data, so it starts modelling one specific case rather than learning a general solution (a short sketch follows this list).
- High Variance: The model can become unstable due to small variations in the data.
- Low-bias Tree: A highly complicated decision tree tends to have low bias, which makes it difficult for the model to generalize to new data.
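To make the overfitting and variance points concrete, the sketch referenced above compares a fully grown tree with a depth-limited one on the same data; the wine dataset and the max_depth value are illustrative assumptions.

```python
# Sketch: an unconstrained tree often memorizes the training data (overfitting),
# while limiting depth trades a little bias for lower variance.
# The wine dataset and the max_depth value are illustrative assumptions.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

deep = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

# A large gap between train and test accuracy is a symptom of overfitting.
print("deep tree    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("shallow tree train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```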