Outline of first video
- Introduction
- Tree-based methods are supervised learning algorithms that stratify the predictor space to make predictions.
- They are called decision trees because of their branching structure.
- The CART (Classification and Regression Trees) algorithm is a popular example of a tree-based method.
- Trees are simple to interpret but may not be as competitive as other methods in terms of prediction accuracy.
- Ensembles of trees, such as bagging, boosting, and random forests, can improve prediction performance.
- Decision Trees
- Example: Predicting baseball player salary based on years in the league and hits per season.
- Process:
- The data is split into regions based on predictor values.
- Each region is assigned a prediction, typically the average response value for observations in that region.
- Splits are chosen to minimize the residual sum of squares (RSS) of the observations within each region.
- Terminology:
- Internal nodes: Nodes that are further split.
- Terminal nodes (leaves): Nodes that are not further split.
- Interpretation: The tree structure itself represents the model.
- Tree Growing Algorithm
- Greedy approach:
- Starts with the full dataset.
- Finds the best split (predictor and cut point) that most reduces the within-region RSS; a sketch of this search follows the list.
- Repeats the process for each resulting region until a stopping criterion is met.
- Stopping criteria:
- Predefined number of nodes.
- Minimum number of observations in a node.
- Other criteria based on model complexity or performance.
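To make the greedy search concrete, here is a minimal sketch (my own illustration, not code from the lectures) of the best-split search for a regression tree: for every predictor and every candidate cut point, compute the RSS of the two resulting regions and keep the best split.

```python
import numpy as np

def best_split(X, y):
    """Greedy search: the (predictor, cut point) pair that most reduces RSS."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate cut points: midpoints between consecutive unique values,
        # so both child regions are guaranteed to be non-empty.
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # RSS of a region: squared deviations from its mean prediction.
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

# Toy version of the salary example: a jump in the response at years = 4.5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(200, 2))            # years, hits
y = np.where(X[:, 0] < 4.5, 5.1, 6.7) + rng.normal(0, 0.2, 200)
print(best_split(X, y))                          # recovers a cut near 4.5 on predictor 0
```

Growing a full tree just applies this search recursively to each resulting region until a stopping criterion is met.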
- Challenges and Limitations
- Overfitting: Trees can be prone to overfitting the training data, leading to poor generalization performance.
- Instability: Small changes in the training data can lead to large changes in the tree structure.
- Limited flexibility: Trees can struggle to capture complex relationships between predictors and the response.
- Ensembles of Trees
- Bagging: Creates multiple trees using bootstrap samples of the training data and averages their predictions.
- Boosting: Creates a sequence of trees, where each tree focuses on correcting the errors of the previous trees.
- Random forests: Like bagging, but only a random subset of predictors is considered at each split, which decorrelates the trees.
- Conclusion
- Tree-based methods are versatile and interpretable tools for supervised learning.
- Ensembles of trees can significantly improve prediction accuracy.
- Further research is ongoing to develop more robust and efficient tree-based methods.
Outline of second video
- Building a tree and making predictions
- How large should the tree be?
- Cost-complexity pruning
- Summary of the tree-growing algorithm
- Cross-validation results for choosing the tree size (a pruning sketch follows this list)
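A minimal sketch of cost-complexity pruning with scikit-learn, on synthetic data of my own choosing: `cost_complexity_pruning_path` gives the sequence of effective alpha values for a large tree, and cross-validation picks among them.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=500, random_state=0)

# Grow a large tree, then recover the nested sequence of pruned subtrees
# via their effective alpha values.
alphas = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas

# Use cross-validation to choose alpha (and therefore the tree size).
cv_mse = [
    -cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                     X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in alphas
]
best = alphas[int(np.argmin(cv_mse))]
pruned = DecisionTreeRegressor(ccp_alpha=best, random_state=0).fit(X, y)
print(best, pruned.get_n_leaves())
```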
Outline of third video
- Introduction
- Regression trees are used when the response is quantitative.
- Classification trees are used when the response is a categorical variable.
- The methodology is very similar, but the loss function used to grow the tree and the measure of good performance are different.
- Classification Trees
- Each observation is classified to the most commonly occurring class in the terminal node.
- The tree is grown in the same way as for regression trees.
- The criterion for making the splits is different.
- One criterion is the classification error rate.
- Another is the Gini index, a measure of total variability across the classes.
- The deviance (cross-entropy) is a third criterion that behaves much like the Gini index; all three are sketched in code below.
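A minimal sketch of the three node-impurity measures, written in terms of the class proportions p_k within a node (my own illustration):

```python
import numpy as np

def node_measures(labels):
    """Classification error, Gini index, and cross-entropy for one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # class proportions p_k in the node
    error = 1 - p.max()                  # 1 - max_k p_k
    gini = np.sum(p * (1 - p))           # sum_k p_k (1 - p_k)
    entropy = -np.sum(p * np.log(p))     # -sum_k p_k log p_k
    return error, gini, entropy

# A fairly pure node: both Gini and entropy are close to zero.
print(node_measures(np.array(["Yes"] * 9 + ["No"])))
```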
- Example: Heart Data
- The heart data has a binary response called HD.
- There are 303 patients.
- There are 13 predictors.
- A tree was grown and then pruned, with the size chosen by cross-validation.
- The full tree is quite bushy.
- The pruned tree has a size of 6.
- The cross-validated classification error rate of the pruned tree is about 25%.
- Comparison of Trees and Linear Models
- Trees are not always the best method.
- When the true relationship is close to linear, linear models do better.
- When the relationship is non-linear with complex interactions, trees tend to do better.
- Advantages and Disadvantages of Trees
- Advantages:
- Simple to understand.
- Can handle qualitative predictors without creating dummy variables.
- Disadvantages:
- Generally do not predict as well as other state-of-the-art supervised learning methods.
- Ensemble Methods
- Ensemble methods combine trees to improve prediction accuracy.
- One example of an ensemble method is random forests.
Outline of fourth video
- Introduction
- Bagging is a method that combines many trees to improve prediction performance.
- It involves creating multiple trees and averaging their predictions.
- The idea was proposed by Leo Breiman.
- Bagging Process
- Bootstrap samples are drawn from the training set with replacement.
- A tree is grown on each bootstrap sample.
- The predictions of all the trees are averaged to obtain the final prediction (the procedure is sketched below).
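The bagging procedure in a few lines, as a minimal sketch on synthetic data (scikit-learn also packages this as `BaggingRegressor`):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=400, random_state=0)
rng = np.random.default_rng(0)

# Grow one deep, unpruned tree per bootstrap sample.
trees = []
for _ in range(100):
    idx = rng.integers(0, len(y), size=len(y))   # draw with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# The bagged prediction averages the individual trees' predictions.
y_hat = np.mean([t.predict(X) for t in trees], axis=0)
```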
- Advantages of Bagging
- Reduces variance by averaging multiple trees.
- Can be applied to both regression and classification problems.
- No need to prune trees.
- Out-of-Bag Error
- An essentially free way to estimate the test error.
- Each observation is predicted using only the trees whose bootstrap samples did not include it.
- For a large number of trees, this approximates leave-one-out cross-validation (a sketch follows).
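A minimal out-of-bag sketch using scikit-learn's `BaggingRegressor` (my own example; `oob_score_` reports R² computed only from out-of-bag predictions):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=400, random_state=0)

bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=200,
                       oob_score=True, random_state=0).fit(X, y)
# Each observation is scored only by the trees that never saw it during fitting.
print(bag.oob_score_)
```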
- Random Forests
- An extension of bagging that further reduces correlation between trees.
- At each split, only a random subset of predictors is considered.
- Typically improves prediction accuracy over bagging (see the sketch below).
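In scikit-learn the decorrelation step corresponds to the `max_features` argument; a minimal sketch on the same synthetic data:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor

X, y = make_friedman1(n_samples=400, random_state=0)

# max_features sets the size of the random predictor subset at each split.
# Common rules of thumb are sqrt(p) for classification and p/3 for regression;
# max_features=1.0 (all p predictors) recovers plain bagging.
rf = RandomForestRegressor(n_estimators=200, max_features=0.33,
                           random_state=0).fit(X, y)
```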
- Example: Heart Data
- Compares the performance of bagging and random forests on the heart data.
- Shows that random forests can slightly improve prediction accuracy over bagging.
- Example: Gene Expression Data
- Applies random forests to a high-dimensional gene expression dataset.
- Demonstrates the effectiveness of random forests in classifying cancer types.
- Highlights the importance of pre-screening genes using variance.
- Conclusion
- Bagging and random forests are powerful methods for improving tree-based predictions.
- They are widely used in various applications.
- Out-of-bag error provides a convenient way to estimate test error.
Outline of fifth video
Statistical Learning: 8.5 Boosting
- Boosting
- Sequential method
- Builds on previous trees to improve performance
- Fits each new tree to the residuals of the current model
- Shrinks each tree's contribution to avoid overfitting
- Often effective even with very small trees (the fitting loop is sketched below)
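The boosting loop for regression in a few lines, a minimal sketch of the procedure just described (depth-1 trees, i.e. stumps, with shrinkage lambda; synthetic data of my own choosing):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=400, random_state=0)

B, lam, d = 1000, 0.01, 1        # number of trees, shrinkage, tree depth
f_hat = np.zeros(len(y))         # start from the zero function
trees = []
for _ in range(B):
    residual = y - f_hat                             # current residuals
    tree = DecisionTreeRegressor(max_depth=d).fit(X, residual)
    f_hat += lam * tree.predict(X)                   # add a shrunken tree
    trees.append(tree)
```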
- Tuning Parameters
- Depth of the tree (d)
- d=1: stump (single split)
- Larger d: allows interactions between predictors
- Number of trees (B)
- Boosting can overfit if B is too large, although overfitting tends to occur slowly
- Shrinkage parameter (lambda)
- Controls how much each tree contributes to the model (the parameter-to-code mapping is sketched after this list)
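The three tuning parameters map directly onto scikit-learn's `GradientBoostingRegressor` arguments; a minimal sketch:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=400, random_state=0)

boost = GradientBoostingRegressor(
    max_depth=1,          # d: interaction depth (stumps here)
    n_estimators=5000,    # B: number of trees
    learning_rate=0.001,  # lambda: shrinkage
).fit(X, y)
```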
- Variable Importance
- Measures the total drop in RSS attributable to splits over a given predictor, averaged over all B trees
- Provides a qualitative ranking of the predictors (sketched below)
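Fitted scikit-learn ensembles expose this ranking as `feature_importances_`; a minimal sketch (the Friedman data has five informative and five pure-noise predictors, so the ranking should separate them):

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_friedman1(n_samples=400, random_state=0)
boost = GradientBoostingRegressor(max_depth=2, n_estimators=500,
                                  learning_rate=0.01).fit(X, y)

# Normalized total impurity reduction per predictor, summed over all splits.
for j in np.argsort(boost.feature_importances_)[::-1]:
    print(f"x{j}: {boost.feature_importances_[j]:.3f}")
```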
- Summary
- Ensembles of trees can improve prediction accuracy
- Random forests and boosting are state-of-the-art techniques for supervised learning
- Interpretation can be more challenging with ensembles
Outline of sixth video
- Introduction
- What is BART?
- Ensemble method using decision trees as building blocks
- Similar to random forests and boosting
- BART Algorithm
- Number of trees (K): determines how many trees are in the ensemble.
- Number of iterations (B): controls how many times each tree is perturbed.
- Perturbations:
- Adding branches
- Deleting branches
- Changing predicted values at terminal nodes
- Prediction: average the predictions from all trees across the iterations kept after a burn-in period (a usage sketch follows this list).
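The book's Python labs ship a BART implementation in the ISLP package; the sketch below follows my recollection of that lab's interface, so treat the `burnin` and `ndraw` argument names and the float32 cast as assumptions and check the package documentation:

```python
import numpy as np
from ISLP.bart import BART                      # assumed import path (ISLP labs)
from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# burnin iterations are discarded; predictions average the ndraw kept draws.
bart = BART(burnin=100, ndraw=200, random_state=0)
bart.fit(X_train, y_train)
y_hat = bart.predict(np.asarray(X_test, dtype=np.float32))  # assumed dtype requirement
print(np.mean((y_test - y_hat) ** 2))
```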
- Comparison with Other Methods
- Bagging and random forests: similar in using random samples of the data to build trees.
- Boosting: similar in using residuals to guide tree construction.
- Differences: BART perturbs existing trees instead of growing new trees from scratch.
- BART can be viewed as a Bayesian approach: the perturbation procedure amounts to a Markov chain Monte Carlo (MCMC) algorithm for sampling trees.
- Example: Heart Data
- Compare BART with boosting and random forests
- BART's training error decreases more slowly, but its test error is competitive
- Less prone to overfitting than boosting
- Choosing Parameters
- Number of trees (K)
- Number of iterations (B)
- Burn-in period (L)
- Reasonable choices: K = 200, B = 1000, L = 100
- Advantages of BART
- Works well out-of-the-box without much tuning
- Provides uncertainty estimates (quantiles)
- Conclusion
- BART is a powerful ensemble method for regression
- Combines features of random forests and boosting
- Provides a flexible and robust approach to regression problems
Outline of seventh video
- Introduction:
- Overview of tree-based methods
- Focus on random forest and boosting
- Introduction to fitting a single tree before moving to ensembles
- Imports:
- Import necessary libraries
- Decision Tree Regression
- Fitting a regression decision tree
- Dataset: the Boston housing data
- Pre-processing: feature selection
- Validation: split data into training and test sets
- Visualizing the decision tree
- Evaluating accuracy with mean squared error (these steps are sketched below)
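A condensed sketch of those lab steps, assuming the ISLP package's `load_data('Boston')` helper and its `medv` response column:

```python
import matplotlib.pyplot as plt
from ISLP import load_data                       # assumed helper from the labs
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree

Boston = load_data('Boston')
X, y = Boston.drop(columns=['medv']), Boston['medv']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
plot_tree(tree, feature_names=list(X.columns), filled=True)  # visualize the fit
plt.show()
print(mean_squared_error(y_test, tree.predict(X_test)))
```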
- Ensemble Methods:
- Bagging and Random Forest
- Difference between bagging and random forest
- Fitting a random forest regressor
- Evaluating accuracy
- Increasing the number of trees
- Feature importance
- Boosting:
- Difference between boosting and random forest
- Fitting a boosting regressor
- Evaluating accuracy
- Plotting training and test error as trees are added (sketched below)
- Tuning parameters
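A minimal sketch of the train/test error plot using `staged_predict`, which replays predictions after each added tree without refitting (my own synthetic-data example):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

boost = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01,
                                  max_depth=2).fit(X_train, y_train)

# Error after each successive tree, on both the training and test sets.
train_err = [np.mean((y_train - p) ** 2) for p in boost.staged_predict(X_train)]
test_err = [np.mean((y_test - p) ** 2) for p in boost.staged_predict(X_test)]
plt.plot(train_err, label='train'); plt.plot(test_err, label='test')
plt.xlabel('number of trees'); plt.ylabel('MSE'); plt.legend(); plt.show()
```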
- Conclusion:
- Summary of tree-based methods
- Comparison of random forest and boosting