This is the first article about tree-based methods using R. The Carseats data from the Chapter 8 lab of ISLR is used to perform a classification analysis. Unlike the lab example, the rpart package is used to fit the CART model and the caret package is used to tune the pruning parameter (cp).
The bold-faced sections of the tutorial are covered in this article.
- Miscellaneous Model Functions
- Model Training and Tuning
- Using Custom Models
- Feature Selection: RFE, Filters, GA, SA
The pruning parameter in the rpart package is scaled so that its values range from 0 to 1. Specifically, the cost-complexity measure is

$$R_{cp}(T) \equiv R(T) + cp \times |T| \times R(T_1)$$

where $T_1$ is the tree with no splits, $|T|$ is the number of splits for a tree, and $R$ is the risk. Due to the inclusion of $R(T_1)$, the tree has no splits when cp = 1, while it is not pruned when cp = 0. On the other hand, in the original setup without the $R(T_1)$ term, the pruning parameter ($\alpha$) can range from 0 to infinity.
Let’s get started.
The following packages are used.
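The exact package list is not reproduced here; given the steps below, a likely set is the following (rpart.plot is an assumption for the tree plot later on).

```r
library(ISLR)       # Carseats data
library(caret)      # createDataPartition(), trainControl(), train(), confusionMatrix()
library(rpart)      # CART implementation
library(rpart.plot) # tree plotting (assumed; any rpart-compatible plotting function works)
```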
The Carseats data is created as follows, with the response (Sales) converted into a binary variable.
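A minimal sketch of this step, assuming the same cutoff (Sales of 8) as the ISLR lab:

```r
data(Carseats, package = "ISLR")
# Recode the numeric response as a binary factor; the cutoff of 8 follows the ISLR lab
Carseats$Sales <- factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))
```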
The train and test data sets are split using createDataPartition().
Stratified sampling is performed by default.
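A sketch of the split; the seed and the 80/20 proportion are assumptions (80% is consistent with the training-set size of 321 mentioned below).

```r
set.seed(1237)  # arbitrary seed; the original value is not shown
trainIndex <- createDataPartition(Carseats$Sales, p = 0.8, list = FALSE)
trainData  <- Carseats[trainIndex, ]   # 321 observations
testData   <- Carseats[-trainIndex, ]  # 79 observations
```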
The following resampling strategies are considered: cross-validation, repeated cross-validation and bootstrap.
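The corresponding trainControl() objects might look like this; the fold, repeat and bootstrap counts are assumptions.

```r
trControl.cv   <- trainControl(method = "cv", number = 10)
trControl.rcv  <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
trControl.boot <- trainControl(method = "boot", number = 50)
```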
There are two relevant methods in the caret package: rpart and rpart2. The first tunes the pruning parameter (cp), while the second tunes the maximum tree depth (maxdepth). The tune grid is not set up explicitly; it is adjusted by tuneLength, with equally spaced cp values created from 0 to 0.3 in the package.
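A sketch of the tuning step; the tuneLength value of 20 and the object names are assumptions.

```r
set.seed(1237)
fit.cv   <- train(Sales ~ ., data = trainData, method = "rpart",
                  tuneLength = 20, trControl = trControl.cv)
set.seed(1237)
fit.rcv  <- train(Sales ~ ., data = trainData, method = "rpart",
                  tuneLength = 20, trControl = trControl.rcv)
set.seed(1237)
fit.boot <- train(Sales ~ ., data = trainData, method = "rpart",
                  tuneLength = 20, trControl = trControl.boot)
```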
Repeated cross-validation and bootstrap produce the same best tuned cp value while cross-validation returns a higher value.
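The selected values can be compared directly from the bestTune element of each fit (object names from the sketch above).

```r
# cp selected under each resampling strategy
rbind(cv = fit.cv$bestTune, repeatedcv = fit.rcv$bestTune, boot = fit.boot$bestTune)
```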
The cp value from repeated cross-validation is used to fit the model to the entire training data.
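One way to refit with that value, assuming the objects defined above:

```r
cp.best <- fit.rcv$bestTune$cp
fit.all <- rpart(Sales ~ ., data = trainData,
                 control = rpart.control(cp = cp.best))
```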
Updated on Feb 10, 2015
When a cp value is passed to rpart(), the function grows the tree only up to that complexity, so the result is already a pruned tree.
If cp is not set, or is set to a low value (e.g. 0), pruning can be done afterwards with the prune() function.
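A sketch of that approach, reusing cp.best from above:

```r
# Grow a full tree first, then prune it back to the selected complexity
fit.full   <- rpart(Sales ~ ., data = trainData, control = rpart.control(cp = 0))
fit.pruned <- prune(fit.full, cp = cp.best)
```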
The resulting tree is shown below. The plot displays the expected loss and the node probability in each terminal node. For example, the leftmost node has

- an expected loss of 0.13 (= 8/60 ≈ 0.1333)
- a node probability of 19% (= 60/321 ≈ 0.187)
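The plotting function used in the original article is not shown; one option is rpart.plot(), which displays the class probabilities (from which the expected loss can be read) and the percentage of observations in each node.

```r
rpart.plot(fit.pruned, extra = 104)  # class probabilities plus node percentages
```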
The fitted model is applied to the test data.
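A sketch of the evaluation step, assuming the pruned tree and the test set defined above:

```r
pred.test <- predict(fit.pruned, newdata = testData, type = "class")
confusionMatrix(pred.test, testData$Sales)
```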
More models will be implemented and compared in subsequent articles.