
5. Decision Trees

Tree-Based Methods 1

  • The text discusses tree-based methods for regression and classification.
  • These involve stratifying or segmenting the predictor space into a number of simple regions.
  • Other methods based on combining a large number of trees include:
    • Bagging
    • Random Forests
    • Boosting
  • We choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model.

Regression Trees

  • Recursive binary splitting:
    • The approach used to split the predictor space and build the tree.
    • It is a top-down, greedy approach.
    • Each split is chosen to minimize the residual sum of squares (RSS) of the two resulting regions.
  • Tree pruning:
    • The recursive binary splitting procedure described above produces a very large tree.
    • A very large tree is likely to overfit the training data and is hard to interpret.
    • A smaller tree with fewer splits has lower variance and is easier to interpret, at the cost of somewhat higher bias.
    • Pruning is applied to the large tree in order to obtain a smaller, more useful subtree.
    • The goal is to select the subtree that leads to the lowest test error rate.
    • A nonnegative tuning parameter α controls the trade-off between the subtree's complexity and its fit to the training data; this procedure is known as cost complexity pruning, or weakest link pruning.
  • Algorithm 8.1 Building a Regression Tree:
    1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
    2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
    3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, 2, . . . , K:
      • Repeat Steps 1 and 2 on all but the kth fold of the training data.
      • Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
    4. Average the results for each value of α, and pick α to minimize the average error.
    5. Return the subtree from Step 2 that corresponds to the chosen value of α. (A sketch of this procedure appears below.)
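
A minimal sketch of Algorithm 8.1, assuming scikit-learn (its ccp_alpha parameter implements cost complexity pruning); the diabetes data and the min_samples_leaf value are illustrative choices, not anything prescribed in the text.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Step 1: grow a large tree, stopping only when terminal nodes get small.
large_tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)

# Step 2: cost complexity pruning yields a sequence of candidate alphas (one per subtree).
alphas = large_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# Step 3: K-fold cross-validation of the pruned tree as a function of alpha.
cv_mse = [
    -cross_val_score(
        DecisionTreeRegressor(min_samples_leaf=5, ccp_alpha=a, random_state=0),
        X, y, cv=5, scoring="neg_mean_squared_error",
    ).mean()
    for a in alphas
]

# Steps 4-5: pick the alpha with the lowest average CV error and keep that subtree.
best_alpha = alphas[int(np.argmin(cv_mse))]
pruned_tree = DecisionTreeRegressor(
    min_samples_leaf=5, ccp_alpha=best_alpha, random_state=0
).fit(X, y)
print(f"chosen alpha = {best_alpha:.4f}, leaves = {pruned_tree.get_n_leaves()}")
```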

Classification Trees

  • A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.
  • The predicted response for an observation is the most commonly occurring class among the training observations in the region to which it belongs.
  • The classification error rate is the fraction of the training observations in that region that do not belong to the most common class.
  • The classification error rate could be used to decide on the splits, but in practice it is not sufficiently sensitive for tree-growing, so two other measures are preferred:
    • Gini index: a measure of total variance across the K classes, often described as a measure of node purity; a small value indicates that a node contains predominantly observations from a single class.
    • Cross-entropy: numerically very similar to the Gini index, and likewise small when the node is pure.
  • Any of the three measures can be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal. (A short illustration of the three measures follows this list.)
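
A small, hypothetical helper (not from the text) that evaluates the three impurity measures for a single node, given the proportions of each class among the observations in that node:

```python
import numpy as np

def node_impurities(class_proportions):
    """Classification error, Gini index, and cross-entropy for a single node,
    given the proportion of each class among the observations in that node."""
    p = np.asarray(class_proportions, dtype=float)
    error = 1.0 - p.max()                                    # fraction outside the most common class
    gini = float(np.sum(p * (1.0 - p)))                      # small when the node is nearly pure
    entropy = float(-np.sum(p[p > 0] * np.log2(p[p > 0])))   # likewise small when pure
    return error, gini, entropy

print(node_impurities([0.9, 0.1]))  # nearly pure node: all three measures are small
print(node_impurities([0.5, 0.5]))  # evenly mixed node: all three measures are maximal
```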

8.2.1 Bagging

  • Bagging is a general-purpose procedure for reducing the variance of a statistical learning method.
  • Also known as bootstrap aggregation.
  • Averaging a set of observations reduces variance.
  • Since separate training sets are not usually available, we take repeated bootstrap samples from the single training set, build a separate prediction model on each sample, and average the resulting predictions.
  • The trees that are generated from different sets are not pruned, and are grown to the maximum size.
  • Then we average the predictions of all of these individual trees (for classification, we take a majority vote).
  • Out-of-bag (OOB) error estimate: the test error of the bagged model is estimated by averaging, for each observation, the errors from only the trees that were not fit using that observation. (See the sketch below.)
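
A minimal bagging sketch, assuming scikit-learn's BaggingRegressor (whose default base learner is a deep, unpruned decision tree); the dataset and number of trees are illustrative choices. Setting oob_score=True gives the out-of-bag error estimate described above.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor

X, y = load_diabetes(return_X_y=True)

# The default base learner is a deep, unpruned decision tree grown on each bootstrap sample.
bagged = BaggingRegressor(
    n_estimators=200,  # number of bootstrap samples / trees whose predictions are averaged
    oob_score=True,    # estimate test error from the out-of-bag observations
    random_state=0,
).fit(X, y)

print(f"OOB R^2 estimate: {bagged.oob_score_:.3f}")
```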

8.2.2 Random Forests

  • Random forests are a slight variation on bagging that typically yields even better performance.
  • As in bagging, a number of decision trees are built on bootstrapped training samples.
  • But each time a split in a tree is considered, only a random sample of m predictors from the full set of p predictors is allowed as split candidates; this decorrelates the trees.
  • A fresh sample of m predictors is taken at each split, and typically we choose m ≈ √p. (See the sketch below.)
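
A minimal random forest sketch, again assuming scikit-learn; max_features="sqrt" implements the m ≈ √p rule, and the diabetes data is just a placeholder.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)

forest = RandomForestRegressor(
    n_estimators=500,     # number of bootstrapped trees
    max_features="sqrt",  # m ≈ sqrt(p) candidate predictors per split, decorrelating the trees
    oob_score=True,       # out-of-bag error estimate, as with bagging
    random_state=0,
).fit(X, y)

print(f"OOB R^2 estimate: {forest.oob_score_:.3f}")
```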

8.2.3 Boosting

  • Boosting is a general approach that can be applied to many statistical learning methods for regression or classification.
  • Boosting works quite differently from bagging and random forests.
  • Bagging and random forests involve independently growing multiple trees in order to reduce the variance.
  • Boosting involves growing trees sequentially.
  • Each tree is grown using information from previously grown trees.
  • Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.
  • The modification gives more weight, or attention, to the observations that the current ensemble handles poorly.
  • In practice this is done either by fitting each new tree to the current residuals (as in gradient boosting for regression) or by explicitly reweighting the observations (as in AdaBoost). (See the sketch below.)
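
A minimal boosting sketch, assuming scikit-learn's GradientBoostingRegressor and the diabetes data as stand-ins; this implementation fits each new tree to the current residuals rather than maintaining explicit observation weights, but the sequential idea is the same. The three tuning parameters correspond to the number of trees, the shrinkage rate, and the depth of each tree.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

boosted = GradientBoostingRegressor(
    n_estimators=1000,   # number of trees grown sequentially
    learning_rate=0.01,  # shrinkage: slows learning so later trees can correct earlier ones
    max_depth=2,         # small trees (a few splits each) usually suffice
    random_state=0,
)

cv_mse = -cross_val_score(boosted, X, y, cv=5, scoring="neg_mean_squared_error").mean()
print(f"5-fold CV mean squared error: {cv_mse:.1f}")
```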

Entropy and Information Gain 2

  • Entropy is a measure of uncertainty associated with a random variable or the impurity of an arbitrary collection of examples.
  • Information gain is the expected reduction in entropy caused by partitioning the examples according to a given attribute.
  • Entropy is defined as \(\mathrm{Entropy}(X) = -\sum_{x} p(x)\,\log_2 p(x)\), where p(x) is the probability of the occurrence of event x.
  • Each branch in the tree has its own entropy.
  • Information gain is the difference between the entropy of the parent node and the weighted average of the entropies of the child nodes. (A worked example follows this list.)
  • Rocchio method:
    • Related to nearest-neighbor classification, but instead of voting among the k nearest neighbors, each class is represented by the centroid (average) of its training examples.
    • A new example is assigned to the class whose centroid is closest.
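
A short worked example of the entropy and information-gain definitions above; the labels and the two-way split are made up for illustration, and numpy is assumed.

```python
import numpy as np

def entropy(labels):
    """Entropy(X) = -sum_x p(x) log2 p(x), computed from the class proportions in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent node minus the weighted average entropy of the child nodes."""
    n = len(parent_labels)
    weighted_children = sum(len(c) / n * entropy(c) for c in child_label_groups)
    return entropy(parent_labels) - weighted_children

parent = ["yes"] * 9 + ["no"] * 5                               # entropy ≈ 0.940
left, right = ["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3
print(round(information_gain(parent, [left, right]), 3))        # ≈ 0.048
```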

References


  1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York, NY: Springer. Read Chapter 8: Tree-Based Methods, pp. 303-335.

  2. Review the presentation, which details how to compute entropy and information gain as part of an algorithm to ‘learn’ a decision tree. Available from http://www.math.unipd.it/~aiolli/corsi/0708/IR/Lez12.pdf