# 2nd Trial of Turning Analysis into an S3 Object (2015-03-03)

In order to avoid errors caused by conflicting variable names, and to keep variables organised more effectively, a trial of turning an analysis into an S3 object was made in a previous article. A second trial has been made recently, and this article introduces a class (*rpartDT*) that extends *rpartExt*. In line with the first trial, the base class keeps key outcomes of the CART model in a list, and the extended class includes outcomes of bagged trees as well as those of the base class. As the main purpose of this class is to evaluate the performance of an individual tree, its bagging implementation is a bit different from the conventional one. Specifically, while conventional bagging fits unpruned trees repeatedly so that the bias-variance trade-off is improved mainly through lowered variance, this implementation fits trees with the *cp* value set at the lowest *xerror* and by the 1-SE rule. Other control variables are left untouched (e.g. *minbucket* at its default of 20).

The class is constructed so as to produce the following outcomes.

**Error (mean misclassification error or root mean squared error)**

- Error distribution of each bagged tree (individual error)
  - As Hastie et al. (2008) illustrate, the non-parametric bootstrap is a computer implementation of non-parametric maximum likelihood estimation and of Bayesian analysis with a non-informative prior. It is therefore helpful to see where a single tree's error sits in the distribution of the individual bagged trees' errors.

- Majority vote or average (cumulative error)
  - By averaging the overfit and thus unstable outcomes of single trees, bagging can provide better results, so a comparison between the two is worthwhile.

- Out-of-bag (oob) error
  - With sampling with replacement, the probability that a record is not taken in any of n draws is (1 - 1/n)^n, which converges to e^(-1) as n goes to infinity (about 36.8% of records are not taken). These records can be used to produce errors if no test data is available.

- Test error (if test data exists)
  - It is even better if a single tree's error can be compared to the error on independent test data.
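The out-of-bag fraction quoted above can be checked numerically. A minimal sketch (the function name `oobProb()` is illustrative, not from the article's gist):

```r
# Probability that a given record is never drawn in n bootstrap draws:
# (1 - 1/n)^n, which converges to exp(-1) ~ 0.368 as n grows.
oobProb <- function(n) (1 - 1/n)^n

oobProb(10)    # ~0.349
oobProb(1000)  # ~0.368
exp(-1)        # 0.3678794

# Empirical check: fraction of records not selected in one bootstrap sample
set.seed(1)
n <- 10000
boot <- sample(n, n, replace = TRUE)
mean(!(seq_len(n) %in% boot))  # close to 0.368
```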

**Variable importance**

- This is to see whether a single tree's variable importance is far different from that of the bagged trees.
- The **rpart** package provides variable importance, and it is helpful to use cumulative variable importance for the comparison.
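A single tree's importance can be pulled straight from a fitted **rpart** object; the aggregation across bagged trees below is a sketch only (the names `impList` and `cumImp` are illustrative, not the class's code):

```r
library(rpart)

# single tree: rpart stores importance as a named numeric vector
mod <- rpart(Species ~ ., data = iris)
mod$variable.importance

# cumulative importance across bagged trees: sum by variable name, rescale to 100
impList <- list(mod$variable.importance, mod$variable.importance)  # stand-in for bagged fits
allVars <- unique(unlist(lapply(impList, names)))
cumImp <- sapply(allVars, function(v) {
  sum(sapply(impList, function(x) if (v %in% names(x)) x[[v]] else 0))
})
round(100 * cumImp / sum(cumImp), 2)
```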

Before getting started, note that the source of the classes can be found in this gist and, together with the relevant packages (see *tags*), it requires a utility function (`bestParam()`) that can be found here.

The bootstrap samples are created using the **mlr** package (`makeResampleInstance()`). Note that a sample is discarded if the *cp* values are not obtained: *cnt* is incremented only if the sum of the *cp* values is not 0, where 0 is assigned to the *cp* values when an error is encountered (see the *tryCatch* block).
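The sampling-and-guarding logic just described might be sketched as follows, using **mlr**'s `makeResampleDesc()`/`makeResampleInstance()`; the *cp* extraction shown here is a simplified stand-in for the class's `tryCatch` block, and the loop names are illustrative:

```r
library(mlr)
library(rpart)

task <- makeClassifTask(data = iris, target = "Species")
rdesc <- makeResampleDesc("Bootstrap", iters = 50)
rinst <- makeResampleInstance(rdesc, task = task)

cnt <- 0
for (i in seq_len(rinst$desc$iters)) {
  train <- iris[rinst$train.inds[[i]], ]
  cp <- tryCatch({
    fit <- rpart(Species ~ ., data = train)
    # the class takes cp at the least xerror and by the 1-SE rule from cptable
    fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  }, error = function(e) 0)  # 0 flags a failed sample
  if (sum(cp) != 0) cnt <- cnt + 1  # sample kept only when cp was obtained
}
cnt
```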

Data is split as usual.
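"As usual" here might look like the following; the 80/20 ratio, seed, and data set are assumptions for illustration:

```r
# simple random 80/20 train/test split
set.seed(1237)
idx <- sample(nrow(iris), nrow(iris) * 0.8)
trainData <- iris[idx, ]
testData <- iris[-idx, ]
```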

The class is instantiated after importing the constructors.

The naming rule is shown below.

- rpt - single tree
- lst - *cp* at the least *xerror*
- se - *cp* by the *1-SE rule*
- oob (test) - out-of-bag (test) data
- ind (cum) - individual (cumulative) values

The summary of the *cp* values of the bagged trees is shown below, followed by the single tree's *cp* value at the least *xerror*.

Selective individual and cumulative fitted values are shown below. Don't be confused by the first column: it holds the response values of the entire training data, while each fitted-value column has its own sample number.

Given a data frame of fitted values of the individual trees (*fit*), the fitted values are averaged depending on the class of the response: majority vote if *factor*, or average if *numeric*. For a *numeric* response, `average()` is applied column-wise in an expanding way, while a vectorized function (`retVote()`) is created for a *factor* response with the following rules, in order.

- if there is no fitted value (table length = 0), assign *NA*
- if there is a single fitted value (table length = 1), assign its *name*
- if there is a tie, assign *NA*
- finally, take the *name* of the level that occupies the most
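The four rules can be sketched as below; `retVote()` is the name used in the article, but this body is a reconstruction rather than the gist's code, and it assumes the fitted values arrive as characters (factor levels by name):

```r
# majority vote over one record's fitted values, following the rules above
retVote <- function(...) {
  votes <- c(...)
  tbl <- table(votes[!is.na(votes)])
  if (length(tbl) == 0) return(NA)          # no fitted value
  if (length(tbl) == 1) return(names(tbl))  # single fitted value
  top <- sort(tbl, decreasing = TRUE)
  if (top[1] == top[2]) return(NA)          # tie between top levels
  names(top)[1]                             # level that occupies most
}

retVote("a", "a", "b")  # "a"
retVote("a", "b")       # NA (tie)
retVote(NA, NA)         # NA (no fitted value)
```

Being written over `...`, it can be applied across fitted-value columns with `mapply()` in a vectorized, row-wise way.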

Note that, as the first column holds the response values, it is excluded. Although `retCum()` updates values in a way that is vectorized row-wise, it has to be looped column-wise with a *for* loop. By far this part is the biggest bottleneck and should be improved in the future: the way *factor* response variables are updated makes classification tasks take longer to perform.

Given a data frame of fitted values (*fit*), *mmce* or *rmse* is obtained depending on the class of the response. Note that, as the first column holds the response values, it is excluded.
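The per-column error computation described above can be sketched as follows (mmce = mean misclassification error, rmse = root mean squared error; the function name `colErr()` is illustrative):

```r
# error per fitted-value column; the response sits in the first column
colErr <- function(fit) {
  res <- fit[[1]]
  if (is.factor(res)) {
    sapply(fit[-1], function(f) mean(f != res, na.rm = TRUE))           # mmce
  } else {
    sapply(fit[-1], function(f) sqrt(mean((f - res)^2, na.rm = TRUE)))  # rmse
  }
}

fit <- data.frame(res = c(1, 2, 3), s1 = c(1, 2, 4), s2 = c(1, 2, 3))
colErr(fit)  # rmse per bagged tree: s1 ~0.577, s2 = 0
```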

Selective individual and cumulative errors of bagged trees are shown below.

The importance of each variable for the single and bagged trees is shown below.

In the next two articles, the CART analysis will be evaluated on the same data as regression and classification tasks.