# 2015-02-01-Tree-Based-Methods-Part-I

This is the first article about tree-based methods using R. The *Carseats* data from the Chapter 8 lab of ISLR is used to perform classification analysis. Unlike the lab example, the **rpart** package is used to fit the CART model to the data and the **caret** package is used for tuning the pruning parameter (`cp`).

The bold-cased sections of the tutorial are covered in this article.

- Visualizations
- Pre-Processing
- **Data Splitting**
- Miscellaneous Model Functions
- **Model Training and Tuning**
- Using Custom Models
- Variable Importance
- Feature Selection: RFE, Filters, GA, SA
- Other Functions
- Parallel Processing
- Adaptive Resampling

The pruning parameter in the **rpart** package is scaled so that its values range from 0 to 1. Specifically the formula is

$$R_{cp}(T) \equiv R(T) + cp \times |T| \times R(T_1)$$

where $T_1$ is the tree with no splits, $|T|$ is the number of splits of a tree and $R$ is the risk.

Due to the inclusion of $R(T_1)$, the tree results in no splits when *cp = 1*, while it is not pruned when *cp = 0*. On the other hand, in the original setup without the $R(T_1)$ term, the pruning parameter ($\alpha$) can range from 0 to infinity.

Let's get started.

The following packages are used.
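Loading the packages named in this article might look like the following sketch:

```r
# Packages used in this article: ISLR for the Carseats data,
# rpart for the CART model and caret for data splitting and tuning
library(ISLR)
library(rpart)
library(caret)
```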

*Carseats* data is created as following while the response (*Sales*) is converted into a binary variable.
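A minimal sketch of the data preparation, assuming the ISLR lab's convention of splitting *Sales* at 8; the cutoff and the variable name `High` follow the book's lab, not this article:

```r
# Create a binary response High from Sales, following the ISLR Chapter 8 lab
data(Carseats, package = "ISLR")
Carseats$High <- factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))
Carseats$Sales <- NULL  # drop the numeric response so it cannot leak into the model
```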

The train and test data sets are split using `createDataPartition()`. Stratified sampling is performed by default.
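A sketch of the split; the seed and the 80/20 proportion are assumptions (an 80% split is consistent with the 321 training records mentioned later):

```r
set.seed(1237)  # illustrative seed
# createDataPartition() samples within each level of High, so the
# class proportions are preserved (stratified sampling by default)
trainIndex <- createDataPartition(Carseats$High, p = 0.8, list = FALSE)
trainData <- Carseats[trainIndex, ]
testData  <- Carseats[-trainIndex, ]
```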

The following resampling strategies are considered: *cross-validation*, *repeated cross-validation* and *bootstrap*.
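These strategies can be expressed as `trainControl()` objects; the fold, repeat and resample counts below are illustrative, not taken from the article:

```r
cv         <- trainControl(method = "cv", number = 10)                       # cross-validation
repeatedCV <- trainControl(method = "repeatedcv", number = 10, repeats = 5)  # repeated CV
bootstrap  <- trainControl(method = "boot", number = 25)                     # bootstrap
```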

There are two CART methods in the **caret** package: `rpart` and `rpart2`. The first method allows the pruning parameter to be tuned. The tune grid is not set up explicitly; instead it is adjusted by `tuneLength` - equally spaced **cp** values are created from 0 to 0.3 by the package.
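A hedged sketch of the tuning call, assuming the training split and a repeated cross-validation `trainControl()` object; variable names such as `trainData` and `repeatedCV` are placeholders:

```r
set.seed(12357)  # illustrative seed
fitRepeatedCV <- train(High ~ ., data = trainData,
                       method = "rpart",       # tunes the pruning parameter cp
                       tuneLength = 20,        # grid size is illustrative
                       trControl = repeatedCV)
fitRepeatedCV$bestTune  # best cp value found by resampling
```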

Repeated cross-validation and bootstrap produce the same best tuned **cp** value while cross-validation returns a higher value.

The one from repeated cross-validation is used to fit the model to the entire training data.
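Note that `train()` already refits the best model on the full training set (available as `finalModel`); an equivalent manual fit with `rpart()` might look as follows, with all object names as placeholders:

```r
bestCp <- fitRepeatedCV$bestTune$cp   # cp selected by repeated cross-validation
finalTree <- rpart(High ~ ., data = trainData,
                   control = rpart.control(cp = bestCp))
```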

**Updated on Feb 10, 2015**

- When a value of *cp* is entered in `rpart()`, the function fits the model up to that value and takes the result. Therefore it produces a pruned tree.
- If *cp* is not set or is set to a low value (e.g., 0), pruning can be done using the `prune()` function.
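The two points above can be sketched as follows; `bestCp` is a placeholder for the tuned value:

```r
# Grow an unpruned tree by setting cp to 0, then prune back afterwards
fullTree   <- rpart(High ~ ., data = trainData, control = rpart.control(cp = 0))
prunedTree <- prune(fullTree, cp = bestCp)  # equivalent to fitting with cp = bestCp directly
```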

The resulting tree is shown below. The plot shows expected losses and node probabilities in the terminal nodes. For example, the leftmost node has

- an expected loss of 0.13 (= 8/60)
- a node probability of 19% (= 60/321)
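One possible way to draw such a tree is with the **rpart.plot** package; the `extra` setting below (class probabilities plus node percentages) is an assumption about how the original figure was produced:

```r
library(rpart.plot)
rpart.plot(finalTree, extra = 104)  # show class probabilities and observation percentages
```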

The fitted model is applied to the test data.
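A sketch of scoring the test set, again assuming the placeholder names used above (`finalTree`, `testData`):

```r
pred <- predict(finalTree, newdata = testData, type = "class")
confusionMatrix(pred, testData$High)  # caret's confusion matrix and accuracy measures
```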

More models are going to be implemented/compared in the subsequent articles.