A39: Decision Trees and Random Forests — Theory

Decision trees, entropy, information gain, bootstrap, bagging — Bootstrap Aggregation, random forests, ensemble learning.

Junaid Qazi, PhD
9 min read · Feb 10, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, a Lecture Notes Book Series. (Click here to get your copy today!)

Click here for the previous article/lecture on “A38: Logistic regression vs KNN — breast cancer dataset.”

💐Click here to FOLLOW ME for new content💐

🧘🏻‍♂️ 👉🎯 — Peacefully on target to learn the most demanding skills!

🧘🏻‍♂️Welcome guys to the Decision Tree Learning lecture!🧘🏻‍♂️

A typical decision tree created using scikit-learn for the iris dataset!

Decision Trees — key concepts

>>(CART — Classification And Regression Tree)<<

Decision trees are a non-parametric supervised machine learning method widely used for classification problems (typically as part of the random forests ensemble learning approach). They can also be used for regression problems. In decision tree learning, we predict the target value by learning simple decision rules inferred from the data features.

So:

  • A decision tree is a type of model used for both classification and regression (CART).
  • By answering sequential questions, the tree sends us down a certain route to the outcome / result class.
  • A decision tree model behaves like a chain of “if this, then that” conditions, ultimately yielding a specific result.
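To make the “if this, then that” idea concrete, here is a minimal scikit-learn sketch of a CART tree fitted on the iris dataset shown in the figure above. The specific settings (max_depth=3, random_state=42, the train/test split) are illustrative assumptions, not taken from this lecture.

# Minimal sketch: fit a CART decision tree on the iris dataset and print its rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)  # depth capped for readability
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=iris.feature_names))  # the learned "if this, then that" rules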

Let’s understand decision trees for a simple classification problem where we want to make a certain decision by creating a situation.

My son wants to practice soccer. Before I decide to go out with him, I have a few questions about the weather.

A very simple and basic tree based on the data shown in the right corner, with three features (outlook, windy, humidity) and a target column “Playing: yes/no”.
  • Root Node: The starting point of the tree, e.g. “Outlook”
  • Branches: Lines (sometimes shown as arrows) connecting nodes, showing the flow from question to answer — segments of the trees that connect the nodes
  • Leaf Node: Terminal nodes that predict the outcomes “Playing/Not Playing”
  • Internal nodes: Nodes that split for the value of a certain feature / attribute >> “Windy” & “Humidity”

Let’s move on and draw the routes in the tree.

Three routes are shown: the light blue route reaches the outcome with 1 question, and the red routes with 2 questions.
  • The route between the root node and a leaf node contains the questions or criteria to be answered along the way.
  • The depth of the tree is important and represents how many questions are asked before we reach a leaf node (prediction class).
  • The overall tree depth is determined by its longest route.

In our example, the depth is 2 for both the Sunny and Rainy routes and 1 for the Overcast route, so our tree has a depth of 2.
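Written out as plain code, the tree above is just nested conditions. The node layout (Outlook at the root, Humidity and Windy as the internal nodes) follows the figure; the outcome on each branch is an assumption based on the classic play-outside example, so treat it as a sketch rather than the exact rules from the data table.

# The weather tree as hand-written "if this, then that" rules (branch outcomes assumed).
def play_soccer(outlook, humidity, windy):
    if outlook == "overcast":                     # depth 1: one question is enough
        return "Playing"
    if outlook == "sunny":                        # depth 2: second question is about humidity
        return "Not Playing" if humidity == "high" else "Playing"
    if outlook == "rainy":                        # depth 2: second question is about wind
        return "Not Playing" if windy else "Playing"
    raise ValueError("unknown outlook: " + outlook)

print(play_soccer("sunny", humidity="normal", windy=False))  # -> Playing
print(play_soccer("rainy", humidity="high", windy=True))     # -> Not Playing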

🧘🏻‍♂️>>Now, another thing that we can notice in the above tree, is Splitting at the nodes<<🧘🏻‍♂️

Well, the relevant and most important question is: “How do we decide on the split that is THE BEST under the given circumstances?”

Let’s move on and try to understand this with a simple example!

Decision Trees — Splitting at nodes

We can create a table with features A, B, C and outcome Target. Notice that B is the only feature which splits the target exactly into two classes — a clear separation, right? (see the split chart in the figure below)

Well, visually it looks great and very simple; however, there is some mathematics behind these splits, and it’s good to know it.

Let’s talk about the mathematical methods behind the split; they are simple:

  • Entropy & Information Gain
  • Gini Index

Entropy, H, is a common way to measure the level of impurity in a group. It comes from information theory*: the higher the entropy, the higher the information content.

* Link to the article published in 1948 “A Mathematical Theory of Communication” by Claude E. Shannon, the father of information theory.

A typical formula for entropy: H = − Σᵢ pᵢ log₂(pᵢ), where pᵢ is the fraction of samples belonging to class i at the node.

Information Gain, IG, is based on entropy. The information gain decides which feature to split on at each step in building the decision tree.

IG = H(parent node) − H(child nodes)

H(child nodes) is the weighted sum of the entropies of the children.

Well, knowing a formula is good, but applying it is even better. Let’s do some mathematics to see the splitting by calculating Entropy and IG for our example dataset. We can compute the information gain by splitting at each feature individually.

Computed entropies using the formula given above >> splitting at A: the parent node (2 samples of each class) has H(parent) = 1; child 1 holds 3 samples (a 2-to-1 mix) with H(child_1) ≈ 0.92, and child 2 holds the remaining sample with H(child_2) = 0.

Let’s compute the Information Gain (IG).

IG = H (parent) − {(3/4) H (child_1) + (1/4) H (child_2)}
IG = 1 − {(3/4) 0.92 + (1/4) 0}
IG = 1 − {0.69 + 0}
IG = 1 − 0.69
IG = 0.31

The Information Gain (IG) is 1 if we split at B.
The Information Gain (IG) is 0 if we split at C!

Let’s collect the computed information gain values in one place:

  • Split at A: IG = 0.31
  • Split at B: IG = 1.00
  • Split at C: IG = 0.00

So, B is the best feature to split on based on the IG values!
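The same arithmetic in a few lines of Python. The class counts at each node are inferred from the numbers quoted above (a 2/2 parent with H = 1, and for split A two children carrying 3/4 and 1/4 of the samples with entropies 0.92 and 0), so they are a reconstruction rather than the original table.

# Reproducing the information-gain calculation from class counts at each node.
from math import log2

def entropy(counts):
    # H = -sum(p_i * log2(p_i)) over the class proportions at a node
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    # IG = H(parent) - weighted sum of the children's entropies
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

parent = [2, 2]  # 4 samples, 2 of each class -> H(parent) = 1
print("IG, split at A:", round(information_gain(parent, [[2, 1], [0, 1]]), 2))  # 0.31
print("IG, split at B:", round(information_gain(parent, [[2, 0], [0, 2]]), 2))  # 1.0
print("IG, split at C:", round(information_gain(parent, [[1, 1], [1, 1]]), 2))  # 0.0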

Another way to decide on the feature to split on is the Gini Index. Let’s see how to compute the Gini Index and whether it suggests a different feature for the split.

So, Gini Index is also suggesting to split at B.
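The Gini Index can be checked the same way. Using the standard formula Gini = 1 − Σᵢ pᵢ² and the same assumed class counts as in the entropy sketch above, a split is scored by the weighted Gini of its children, and the lowest score wins.

# Scoring the same candidate splits with the Gini index (lower is better).
def gini(counts):
    # Gini = 1 - sum(p_i ** 2) over the class proportions at a node
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children_counts):
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

print("Gini, split at A:", round(weighted_gini([[2, 1], [0, 1]]), 2))  # 0.33
print("Gini, split at B:", round(weighted_gini([[2, 0], [0, 2]]), 2))  # 0.0  -> best split
print("Gini, split at C:", round(weighted_gini([[1, 1], [1, 1]]), 2))  # 0.5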

🧘🏻‍♂️>>Note: Split is decided based on highest value of Information Gain or lowest value of Gini Index<<🧘🏻‍♂️

Decision Trees — Advantages / disadvantages

Advantages:

  • A decision tree is a flowchart-like diagram that shows the various outcomes from a series of decisions. A primary advantage of using a decision tree is that it is easy to follow and understand. Decision trees are fast and can handle both numerical and categorical data in large datasets.

Disadvantages:

  • While decision trees are computationally cheap for predictions, training a decision tree can be computationally expensive.
  • Decision trees have high variance: different training subsets of the same dataset lead to different splits and therefore to very different trees. They also overfit very easily (overfitting means building over-complex trees that do not generalize well), and the result is poor performance on new, unseen data. This primary weakness limits their usage in predictive modelling. However, using ensemble methods, we can create a model that uses the underlying decision trees as a foundation to produce excellent results.

Well, it’s great to have an understanding of decision trees. Other than machine learning, they are commonly used as a decision-making tool, for research analysis, or for planning strategy.

Let’s move on and discuss:

  • The bootstrap
  • Bagging
  • Random Forests

The Bootstrap

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

Consider that we have a dataset with 100 values and we want to estimate the mean of the sample:

mean = sum of the values / number of values

For such a small sample, we expect error in the mean; however, using the bootstrap procedure, we can improve the estimate of our mean with the following steps:

  • Create many random sub-sets (say 500) of our dataset with replacement — selecting the same value multiple times.
  • Calculate mean of each subset.
  • Calculate the average of all of our collected means and use that as our estimate mean for the data.

For example, suppose we had 5 subsets which gave mean values of 2.5, 3.5, 5.5, 4.3 & 2.9. The estimated mean (the mean of means) in this case is (2.5 + 3.5 + 5.5 + 4.3 + 2.9) / 5 = 3.74.
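Here is a minimal NumPy sketch of the procedure. The data are synthetic (100 values drawn from an assumed normal distribution), and the 500 resamples simply mirror the number mentioned above.

# Bootstrap estimate of the mean: resample with replacement, then average the means.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=4.0, scale=2.0, size=100)   # hypothetical sample of 100 values

n_resamples = 500
boot_means = [rng.choice(data, size=data.size, replace=True).mean()
              for _ in range(n_resamples)]

print("plain sample mean      :", data.mean())
print("bootstrap mean of means:", np.mean(boot_means))
print("std. error of the mean :", np.std(boot_means))  # the uncertainty the bootstrap quantifies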

Bagging (Bootstrap aggregating)

Bagging (an application of the bootstrap) is a general-purpose procedure for reducing the high variance of a machine learning algorithm, typically applied to decision trees. This simple and very powerful ensemble method combines the predictions from multiple machine learning models to make more accurate predictions than any individual model.

Let’s consider that we have a sample with 5,000 instances or values and we want to use the decision tree algorithm. The bagging procedure works as follows in this situation:

  • Create many random sub-sets (say 500) of our dataset with replacement.
  • Train the model on each sub-set.
  • For new data, collect the prediction from each model and output the average (or the majority vote for classification).

For example, if we had 5 bagged decision trees making the predictions G, G, B, G, B, the most frequent value (the mode), G, would be the final prediction.
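As a minimal scikit-learn sketch (synthetic data; BaggingClassifier uses a decision tree as its default base estimator, and the 500 trees simply mirror the number of subsets used above), bagging can be compared directly against a single tree:

# Bagged decision trees vs. a single decision tree on a synthetic classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

single_tree = DecisionTreeClassifier(random_state=42)
bagged_trees = BaggingClassifier(n_estimators=500, random_state=42)  # bootstrap=True by default

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())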

🧘🏻‍♂️ <<Decision trees are greedy! >> They choose the variable (node) to split on using a greedy algorithm that minimizes error. Even with Bagging, the decision trees can have lots of structural similarities and in turn have high correlation in their predictions.
Combining predictions from multiple models in ensembles works better if the predictions from the sub-models are uncorrelated or at best weakly correlated.

This is where Random Forests come in as an improvement over bagged decision trees!

Random Forests

Random Forests is one of the most popular and most powerful machine learning algorithms, and it corrects the decision trees’ habit of overfitting to their training dataset. The essence of the method is to build multiple trees on randomly selected subsets at training time and to output the class that is the mode (the value that appears most often) of the individual trees’ classes (for a classification problem) or their mean prediction (for a regression problem).

Suppose one of the features in our dataset is very strong at predicting a certain class. With bagging, most of the trees will use that feature as their top split to minimize the error. This will result in an ensemble of very similar and highly correlated trees. Averaging highly correlated quantities does not significantly reduce variance, and this is what we don’t want.

Random Forest does the trick:
By randomly leaving out candidate features at each split, Random Forests “de-correlate” the trees and make them more independent of each other, so that the averaging process can reduce the variance of the resulting model and is no longer dominated by the features that are strong at predicting a certain class.

A new random sample of features is chosen for every single tree at every single split. The number of randomly selected features (m) that can be searched at each split point is specified as a parameter to the algorithm.

Typically, if p is the number of features in the full set:

  • For classification, a good default value is m = sqrt(p) (if we have 25 features, m will be 5 for a classification problem).
  • For regression, a good default value is m = p/3.

Where m is the number of randomly selected features that can be searched at a split point and p is the number of input variables (full set of features)
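In scikit-learn, m is exposed as the max_features parameter. Below is a minimal sketch with an assumed synthetic dataset of 25 features, where max_features="sqrt" gives m = 5, matching the classification rule of thumb above (for regression, RandomForestRegressor with max_features set to roughly a third of the features follows the same idea).

# Random forest with m = sqrt(p) features considered at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=25, random_state=42)

forest = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=42)
print("random forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())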

So, we have covered the key concepts that govern decision tree learning and random forests. Let’s move on and work with data in our next lecture. Hands-on is always a great way to learn!

*******************************************************************

💐Click here to FOLLOW ME for new content💐

🌹Keep practicing to brush-up & add new skills🌹

✅🌹💐💐💐🌹✅ Please clap and share >> you can help us to reach to someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅

Good luck!

See you in the next lecture on “A40: …………………….!!!”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

**************************************************************************************************************************************

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

**************************************************************************************************************************************

