Bagging Techniques (Quick Revision)

Oct 17, 2021

Random Forest Algorithm and Feature Extraction via Bagging

Welcome again! Today we will quickly go through a couple of widely used Bagging Techniques. This blog is #7 of the Revision Series. If you want to revise some of the ML concepts quickly, you can check out this list.

Random Forest is a popular bagging technique for ML tasks. It works pretty well on tabular data and can handle both regression and classification problems. Another bagging-based technique, used for feature selection and for estimating the importance of features, is the Extra-Tree Classifier. Let's dive into the what, why, and when of these techniques.

Random Forest

What?

Decision Trees do not perform well on validation data, as they overfit heavily. Random Forest tackles this by building many decision trees on random subsets of the rows and features and aggregating their outputs. We discussed this concept in the previous post on Bagging and Boosting.

Algorithm

STEP 1: Create a bootstrap dataset by selecting n random samples from the data, with duplicates allowed. (Row Sampling)
STEP 2: Create a decision tree using the bootstrapped data, but use a random subset of features at each step; typically √m features are selected. (Column Sampling)
STEP 3: Go back to STEP 1 and repeat.
STEP 4: Aggregate the results.

Where n is the total number of samples (examples) in the data and m is the total number of features (variables) in the data.
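Here is a minimal from-scratch sketch of these four steps in Python (the function names are my own, and it assumes integer-encoded class labels). It leans on scikit-learn's DecisionTreeClassifier for the individual trees; its max_features="sqrt" option performs the per-split column sampling of STEP 2:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, random_state=0):
    """Toy random forest: bootstrap rows, sample sqrt(m) features per split."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]  # total number of samples
    trees = []
    for _ in range(n_trees):
        # STEP 1: bootstrap dataset: n rows drawn with replacement (row sampling)
        idx = rng.integers(0, n, size=n)
        # STEP 2: grow a tree; max_features="sqrt" picks a random subset of
        # roughly sqrt(m) features at every split (column sampling)
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)  # STEP 3: go back and repeat
    return trees

def random_forest_predict(trees, X):
    # STEP 4: aggregate the results by majority vote over all trees
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)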

Diagrammatically, Random Forest can be shown as a block diagram: several bootstrapped datasets each feed an individual decision tree, and the trees' outputs are aggregated into a single prediction.

Since the rows are sampled with replacement, some samples get chosen again and again while others are left out of the training process entirely. The left-out samples are called Out-of-Bag samples; typically about one third (⅓) of the data is not seen by a given tree during training. These unused samples can therefore serve as a validation set, and the resulting accuracy is known as the Out-of-Bag score (OOB score). This can be very handy to verify the model performance.
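In scikit-learn this check comes almost for free: passing oob_score=True to RandomForestClassifier scores every sample using only the trees that never saw it during training. A small sketch (the synthetic dataset is just for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# oob_score=True evaluates each sample on the trees whose bootstrap
# sample left it out (roughly the one third mentioned above)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"Out-of-Bag score: {rf.oob_score_:.3f}")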

Why?

Random Forest overcomes the overfitting problem of Decision Trees. Since it is an ensemble-based model, it typically gives good results: each individual tree can only learn up to some extent, but combining them enables the model to learn even better.
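A quick way to see this variance reduction is to cross-validate a single tree against a forest on the same data. A sketch on a synthetic dataset (exact numbers will vary with the data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A lone, fully grown tree overfits; an ensemble of such trees generalizes better
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
forest_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print(f"Single Decision Tree: {tree_acc:.3f}")
print(f"Random Forest:        {forest_acc:.3f}")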

When?

Random Forest is best suited to tabular data, so whenever the data is tabular it is worth trying the Random Forest algorithm. It is also preferable when the data contains outliers, since tree splits are largely unaffected by them.

Advantages

  • No feature scaling is required
  • Decision Trees tend to highly overfit the data, whereas Random Forest reduces this high variance (overfitting)
  • Robust to outliers in the data
  • Can handle missing values
  • The Out-of-Bag score can be used to verify model performance
  • Works well on non-linear data

Disadvantages

  • Random Forests tend to be biased when dealing with categorical variables
  • For multiclass classification problems, the algorithm tends to be biased toward the classes with higher frequency
  • Imbalanced data hurts the performance
  • If the model is large, the computational time will be high

Feature Extraction via Bagging

A Decision Tree can score a feature's importance by computing its Information Gain or Gini Impurity, i.e., whether splitting on that feature separates the data with high information gain or not. Similarly, Random Forest can be used to calculate an importance score. It uses a proximity matrix for calculating similarity: if two or more samples end up in the same leaf node, they are considered similar. There is a special algorithm called the Extra-Tree Classifier, which performs well compared to Random Forests. The major difference between the Extra-Tree Classifier and Random Forests lies in its sampling and splitting method.

Algorithmically, the changes with the Extra-Tree Classifier are that samples are selected without replacement (duplicates are not allowed) and that split thresholds are chosen at random rather than searched for exhaustively, which is why it is also called Extremely Randomized Trees. The algorithm can be used for feature selection or extraction, and it can give a ranking of the features that matter most for the prediction. Its advantages and disadvantages are almost the same as Random Forest's, except that it has no Out-of-Bag score (since it does not bootstrap), and the extra randomization can make the combined learning generalize even better than RFs.
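As a sketch of using it for feature ranking (with scikit-learn's ExtraTreesClassifier and a synthetic dataset where only a few features are informative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data: only 5 of the 20 features actually carry signal
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

et = ExtraTreesClassifier(n_estimators=200, random_state=0)
et.fit(X, y)

# Impurity-based importance, one score per feature; rank from most to least important
ranking = np.argsort(et.feature_importances_)[::-1]
for i in ranking[:5]:
    print(f"feature {i}: importance {et.feature_importances_[i]:.3f}")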

Awesome! We have revised the concepts of Random Forests and Extra Tree Classifiers.

Thank you for your precious time. Let’s revise more concepts in the future. See you next time.

Written by Navaneeth Sharma
ML and Full Stack Developer | Love to Write