5. Decision Trees
Tree-Based Methods 1
- The text discusses tree-based methods for regression and classification.
- These involve stratifying or segmenting the predictor space into a number of simple regions.
- Other methods based on combining a large number of trees include:
  - Bagging
  - Random Forests
  - Boosting
- We choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model.
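A minimal sketch of this idea (not from the text), using scikit-learn's DecisionTreeRegressor on synthetic data: the fitted tree carves the predictor space into boxes, and the prediction inside each box is a constant, namely the mean response of the training observations that fall into it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely for illustration: two predictors, one response.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# A shallow tree so the boxes are easy to count; depth 3 is arbitrary.
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Each leaf is one box; the tree predicts one constant value per box.
print("number of boxes (leaves):", tree.get_n_leaves())
print("distinct predicted values:", np.unique(tree.predict(X)).size)
```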
Regression Trees
- Recursive binary splitting:
  - The method used to split the predictor space and build the tree.
  - It is a top-down, greedy approach.
  - Each split is made to minimize the residual sum of squares (RSS) of the two resulting branches.
- Tree pruning:
  - The recursive binary splitting procedure described above produces a very large tree.
  - Large trees are too complex and hard to interpret.
  - Trees with fewer branches have lower variance and better interpretability, but may suffer from higher bias.
  - Pruning is applied to reduce the size of the original tree and obtain a more useful subtree.
  - The goal is to select the subtree that leads to the lowest test error rate.
  - A cost complexity parameter α is used to control the size of the tree, in a process named weakest link pruning.
- Algorithm 8.1 Building a Regression Tree:
  1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
  2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
  3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, 2, ..., K:
     - Repeat Steps 1 and 2 on all but the kth fold of the training data.
     - Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
     - Average the results for each value of α, and pick α to minimize the average error.
  4. Return the subtree from Step 2 that corresponds to the chosen value of α.
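The sketch below mirrors Algorithm 8.1 with scikit-learn, where the cost complexity parameter α is exposed as ccp_alpha; the synthetic data, the thinning of the α grid, and the choice of K = 5 are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Illustrative data only.
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)

# Steps 1-2: grow a large tree and recover the sequence of candidate alphas;
# each alpha corresponds to one subtree in weakest link pruning.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[::5]  # thin the grid to keep the sketch quick

# Step 3: K-fold cross-validation (K = 5 here) over the candidate alphas.
cv_mse = [
    -cross_val_score(
        DecisionTreeRegressor(ccp_alpha=a, random_state=0),
        X, y, cv=5, scoring="neg_mean_squared_error",
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmin(cv_mse))]

# Step 4: refit on all training data with the chosen alpha.
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("chosen alpha:", best_alpha, "| leaves in pruned tree:", final_tree.get_n_leaves())
```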
Classification Trees
- A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.
- The predicted response for an observation is the most commonly occurring class of training observations in the region to which it belongs.
- The classification error rate is the fraction of the training observations in that region that do not belong to the most common class.
- The classification error rate could be used to decide on the best split, but in practice it is not sufficiently sensitive for tree-growing, so two other measures are preferred (all three are written out below):
  - Gini index: a measure of total variance across the K classes, also referred to as a measure of node purity; a small value indicates that a node contains predominantly observations from a single class.
  - Cross-entropy: behaves very similarly to the Gini index and likewise takes a small value when the node is pure.
- The classification error rate, the Gini index, and the cross-entropy can each be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
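For reference, the three measures can be written as follows (notation as in ISLR), where \(\hat{p}_{mk}\) is the proportion of training observations in region \(m\) that belong to class \(k\): the classification error rate \(E\), the Gini index \(G\), and the cross-entropy \(D\).

\[
E = 1 - \max_k \hat{p}_{mk}, \qquad
G = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}), \qquad
D = -\sum_{k=1}^{K} \hat{p}_{mk}\,\log \hat{p}_{mk}
\]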
8.2.1 Bagging
- Bagging is a general-purpose procedure for reducing the variance of a statistical learning method.
- Also known as bootstrap aggregation.
- Averaging a set of observations reduces variance.
- Ideally we would take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions; since we usually have only one training set, we take repeated bootstrap samples from it instead.
- The trees that are generated from different sets are not pruned, and are grown to the maximum size.
- Then we average the predictions of all of these individual trees.
- Out-of-bag (OOB) error estimate: the error rate for the bagged model is estimated by the average error for each observation, computed using only the trees that were not fit on that observation.
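A short bagging sketch with scikit-learn (assuming a recent version in which BaggingRegressor names its base learner parameter estimator); the data and number of trees are made up. With oob_score=True, the out-of-bag estimate described above is reported after fitting (as R² for a regressor).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Illustrative data only.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # deep, unpruned base trees
    n_estimators=200,                   # number of bootstrap samples / trees
    oob_score=True,                     # each point scored only by trees that never saw it
    random_state=0,
).fit(X, y)

print("OOB R^2 estimate:", bag.oob_score_)
```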
8.2.2 Random Forests
- Random forests are a slight variation of bagging that typically gives even better performance.
- As in bagging, we build a number of decision trees on bootstrapped training samples.
- But each time a split in a tree is considered, only a random sample of m predictors is chosen as split candidates from the full set of p predictors.
- A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p.
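A matching random forest sketch (illustrative data and settings): max_features="sqrt" implements the m ≈ √p rule, so only a random subset of predictors is considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data only: p = 16 predictors, so m ≈ 4 are tried per split.
X, y = make_classification(n_samples=500, n_features=16, n_informative=6,
                           random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # m ≈ √p split candidates at each split
    oob_score=True,       # the OOB error estimate applies here as well
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
```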
8.3 Boosting
- Boosting is a general approach that can be applied to many statistical learning methods for regression or classification.
- Boosting works quite differently from bagging and random forests.
- Bagging and random forests involve independently growing multiple trees in order to reduce the variance.
- Boosting involves growing trees sequentially.
- Each tree is grown using information from previously grown trees.
- Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
- The modification takes the form of weights on the observations.
- The weights on the observations indicate their importance.
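A minimal AdaBoost-style sketch of this reweighting idea (one common boosting scheme that matches the description above; the data, the number of rounds, and the use of depth-1 stumps are illustrative assumptions): each round fits a small tree to the weighted data, then increases the weights of the observations that tree got wrong so the next tree concentrates on them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; labels recoded to {-1, +1} for the weight updates below.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
y = np.where(y == 1, 1, -1)

n_rounds = 50
w = np.full(len(y), 1.0 / len(y))  # observation weights, initially uniform
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)            # grown using the current weights
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)   # weighted training error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))
    # Up-weight misclassified observations and down-weight correct ones,
    # so the next tree focuses on the hard cases.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: a weighted vote over the sequentially grown trees.
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(scores) == y))
```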
Entropy and Information Gain 2
- Entropy is a measure of uncertainty associated with a random variable or the impurity of an arbitrary collection of examples.
- Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute.
- Entropy is defined as \(Entropy(X) = -\sum_{x} p(x)\,\log_2 p(x)\), where \(p(x)\) is the probability of the occurrence of event \(x\).
- Each branch in the tree has its own entropy.
- Information gain is the difference between the entropy of the parent node and the weighted average of the entropy of the child nodes (a worked example follows this list).
- Rocchio Method:
  - Related to the k-nearest neighbour classifier, but instead of taking a majority vote among nearby training examples, it represents each class by the centroid (average) of its training examples.
  - A new observation is assigned to the class whose centroid is nearest.
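A tiny worked example of the entropy and information gain definitions above, on made-up binary labels and a single candidate binary split:

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Made-up parent node: 6 positive and 4 negative examples.
parent = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
# Candidate attribute to split on (also made up).
attribute = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Weighted average entropy of the child nodes created by the split.
children = sum(
    (attribute == v).mean() * entropy(parent[attribute == v])
    for v in np.unique(attribute)
)
info_gain = entropy(parent) - children

print("parent entropy:", round(entropy(parent), 3))   # ≈ 0.971
print("information gain:", round(info_gain, 3))       # ≈ 0.420
```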
References
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York, NY: Springer. Read Chapter 8: Tree-Based Methods, pp. 303-335. ↩
- Review the presentation, which details how to compute entropy and information gain as part of an algorithm to 'learn' a decision tree. Available from http://www.math.unipd.it/~aiolli/corsi/0708/IR/Lez12.pdf ↩