To avoid errors caused by conflicting variable names and to keep variables organized, a first attempt at wrapping an analysis in an S3 object was made in a (previous article). This article describes a second attempt, which introduces a class (rpartDT) that extends rpartExt. As in the first attempt, the base class keeps the key outcomes of the CART model in a list, and the extended class adds the outcomes of bagged trees to those of the base class. As the main purpose of this class is to evaluate the performance of an individual tree, its bagging implementation differs a bit from the conventional one. Specifically, conventional bagging fits an unpruned tree to each bootstrap sample so that the bias-variance trade-off improves mainly through lower variance, whereas this class fits trees at the cp values given by the lowest xerror and by the 1-SE rule. Other control variables are left untouched (e.g. minsplit is 20 by default).
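As a rough illustration of the two cp choices, the sketch below reads them from the cptable of an rpart fit. The helper name select_cp() and the use of the built-in iris data are assumptions made for this example. It is not the article's own utility function.

```r
library(rpart)

# Hypothetical helper: pick two cp values from the cptable of an rpart fit.
select_cp <- function(fit) {
  cpt <- fit$cptable
  # row with the lowest cross-validated error
  i_min <- which.min(cpt[, "xerror"])
  # 1-SE rule: smallest tree whose xerror is within one xstd of the minimum
  threshold <- cpt[i_min, "xerror"] + cpt[i_min, "xstd"]
  i_se <- min(which(cpt[, "xerror"] <= threshold))
  c(lst = unname(cpt[i_min, "CP"]), se = unname(cpt[i_se, "CP"]))
}

set.seed(1237)
# grow a large tree first (cp = 0); other controls stay at their defaults
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0))
cps <- select_cp(fit)

# prune back to each chosen cp
pruned_lst <- prune(fit, cp = cps[["lst"]])
pruned_se  <- prune(fit, cp = cps[["se"]])
```

The 1-SE tree is never larger than the tree at the lowest xerror, since rows of the cptable are ordered from the smallest tree to the largest.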

The class is constructed so as to produce the following outcomes.

Error (mean misclassification error or root mean squared error)

• Error distribution of each bagged tree (individual error)
  • As Hastie et al. (2008) illustrate, the non-parametric bootstrap can be seen as a computer implementation of non-parametric maximum likelihood estimation and of Bayesian analysis with a non-informative prior. It is therefore helpful to see where a single tree’s error falls in the distribution of the individual bagged trees’ errors.
• Majority vote or average (cumulative error)
  • By averaging the overfit, and thus unstable, outcomes of single trees, bagging can provide better results, so the two should be compared.
• Out-of-bag (oob) error
  • When sampling with replacement, the probability that a given record is not selected in a single draw is $1-\frac{1}{n}$, and the probability that it is not selected in any of the n draws is $\left(1-\frac{1}{n}\right)^n$, which approaches $e^{-1}$ as n goes to infinity (about 36.8% of records are left out). These records can be used to estimate errors if no test data is available.
• Test error (if exists)
  • It is even better if a single tree’s error can be compared to the error on independent test data.
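The out-of-bag share mentioned above can be checked numerically. The following is a small sketch in base R. The function name oob_prob() and the sample size are assumptions chosen for illustration.

```r
# probability that a given record is left out of a bootstrap sample of size n
oob_prob <- function(n) (1 - 1 / n)^n

# the value approaches exp(-1) as n grows
probs <- sapply(c(10, 100, 1000, 10000), oob_prob)
limit <- exp(-1)

# empirical check: share of records never drawn in one bootstrap sample
set.seed(1237)
n <- 10000
idx <- sample.int(n, size = n, replace = TRUE)
oob_share <- 1 - length(unique(idx)) / n
```

Both probs and oob_share should be close to limit for large n.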

Variable importance

• This is to see if a single tree’s variable importance is far different from that of bagged trees.
• The rpart package provides variable importance, and the cumulative variable importance of the bagged trees is useful for comparison.
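One way to build a cumulative variable importance is to take an expanding mean over the bagged trees. The sketch below assumes the built-in iris data, 25 trees and base R sampling. These choices, and the object names, are made for illustration only and are not taken from the class source.

```r
library(rpart)

set.seed(1237)
n_tree <- 25  # assumed number of bagged trees
n <- nrow(iris)
predictors <- setdiff(names(iris), "Species")

# one row per tree, one column per predictor; unused variables stay at 0
imp <- matrix(0, nrow = n_tree, ncol = length(predictors),
              dimnames = list(NULL, predictors))
for (b in seq_len(n_tree)) {
  idx <- sample.int(n, size = n, replace = TRUE)
  fit <- rpart(Species ~ ., data = iris[idx, ], method = "class")
  vi <- fit$variable.importance  # NULL when the tree has no split
  if (!is.null(vi)) imp[b, names(vi)] <- vi
}

# expanding mean of importance over the first 1, 2, ..., n_tree trees
cum_imp <- apply(imp, 2, cumsum) / seq_len(n_tree)

# single tree on the full data for comparison
single <- rpart(Species ~ ., data = iris, method = "class")$variable.importance
```

The last row of cum_imp is the importance averaged over all trees, which can be set against single.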

Before getting started, note that the source of the classes can be found in this gist and, together with the relevant packages (see tags), it requires a utility function (bestParam()) that can be found here.

The bootstrap samples are created using the mlr package (makeResampleInstance()). Note that a sample is discarded if its cp values cannot be obtained: cp values are set to 0 when an error is encountered (see the tryCatch block), and cnt is incremented by 1 only if the sum of the cp values is not 0.
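The counting logic can be sketched as follows. This is a minimal stand-in, not the class source: it uses base R sample.int() in place of the mlr resampling instance, the built-in iris data, and an assumed max_try guard so the loop cannot run forever.

```r
library(rpart)

set.seed(1237)
n_tree <- 10        # assumed number of bagged trees to keep
max_try <- 100      # assumed guard against an endless loop
n <- nrow(iris)
cnt <- 0
tries <- 0
kept <- list()

while (cnt < n_tree && tries < max_try) {
  tries <- tries + 1
  # base R stand-in for the mlr resampling instance
  idx <- sample.int(n, size = n, replace = TRUE)
  cp <- tryCatch({
    fit <- rpart(Species ~ ., data = iris[idx, ], method = "class")
    cpt <- fit$cptable
    i_min <- which.min(cpt[, "xerror"])
    i_se <- min(which(cpt[, "xerror"] <= cpt[i_min, "xerror"] + cpt[i_min, "xstd"]))
    c(lst = unname(cpt[i_min, "CP"]), se = unname(cpt[i_se, "CP"]))
  }, error = function(e) c(lst = 0, se = 0))  # 0 marks a failed sample
  if (sum(cp) != 0) {
    cnt <- cnt + 1  # count only samples with usable cp values
    kept[[cnt]] <- list(idx = idx, cp = cp)
  }
}
```

Because the default cp of 0.01 is kept, every cp value read from the cptable is positive, so a sum of 0 can only come from the error handler.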

Data is split as usual.

The class is instantiated after importing the constructors.

The naming rule is shown below.

• rpt - single tree
• lst - cp at the least xerror
• se - cp by the 1-SE Rule
• oob (test) - out-of-bag (test) data
• ind (cum) - individual (cumulative) values

The summary of the cp values of the bagged trees is shown below, followed by the single tree’s cp value at the least xerror.

Selected individual and cumulative fitted values are shown below. Do not be confused by the first column: it holds the response values of the entire training data, while each fitted value column has its own sample number.

Given a data frame of fitted values of individual trees (fit), the fitted values are combined depending on the class of the response - majority vote if it is a factor or average if it is numeric. For a numeric response, average() is applied column-wise in an expanding way, while a vectorized function (retVote()) is created for a factor response, applying the following rules in order.

1. if there is no fitted value (table length = 0), assign NA
2. if there is a single fitted value (table length = 1), assign its name
3. if there is a tie, assign NA
4. otherwise, take the name of the most frequent level

Note that the first column is excluded as it holds the response values and that, although retCum() updates values in a way that is vectorized row-wise, it has to be looped over column-wise. This part is by far the biggest bottleneck and should be improved in the future - the way factor response variables are updated makes classification tasks take longer.
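The four voting rules and the expanding column-wise update described above can be sketched as below. The functions vote() and cum_fit() are hypothetical stand-ins written for this illustration. They are not the source of retVote() or retCum(), and the small data frame is made up.

```r
# Hypothetical vote function following the four rules in order.
vote <- function(x) {
  tab <- table(x)                      # table() drops NA by default
  tab <- tab[tab > 0]                  # ignore unused factor levels
  if (length(tab) == 0) return(NA_character_)          # rule 1: no fitted value
  if (length(tab) == 1) return(names(tab))             # rule 2: single value
  if (sum(tab == max(tab)) > 1) return(NA_character_)  # rule 3: tie
  names(tab)[which.max(tab)]                           # rule 4: most frequent
}

# Expanding update: column j of the result uses fitted columns 1..j.
cum_fit <- function(fit) {
  res <- fit[, 1, drop = FALSE]        # keep the response column as is
  fitted <- fit[, -1, drop = FALSE]    # the response is excluded from the update
  for (j in seq_along(fitted)) {
    sub <- fitted[, seq_len(j), drop = FALSE]
    res[[paste0("cum", j)]] <- if (is.numeric(fit[[1]])) {
      rowMeans(sub, na.rm = TRUE)      # NaN when a row has no fitted value yet
    } else {
      factor(apply(sub, 1, vote), levels = levels(fit[[1]]))
    }
  }
  res
}

# made-up fitted values: NA marks records outside a bootstrap sample
fit <- data.frame(
  y  = factor(c("a", "b", "a", "b")),
  s1 = factor(c("a", NA,  "b", NA), levels = c("a", "b")),
  s2 = factor(c("a", "b", "a", NA), levels = c("a", "b")),
  s3 = factor(c("b", "b", "a", NA), levels = c("a", "b"))
)
cum <- cum_fit(fit)
```

The row-wise apply() inside a column-wise loop mirrors the structure described above and shows why the factor case is slow: every new column triggers a fresh vote over all previous columns.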

Given a data frame of fitted values (fit), either mmce or rmse is obtained depending on the class of the response. Note that the first column is excluded as it holds the response values.
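A minimal sketch of such an error function is given below. The name get_error(), the handling of missing fitted values and the example data are assumptions made for illustration.

```r
# Hypothetical error function: mmce for a factor response, rmse for a numeric one.
get_error <- function(fit) {
  y <- fit[[1]]                          # first column holds the response
  sapply(fit[, -1, drop = FALSE], function(f) {
    ok <- !is.na(f)                      # records without a fitted value are skipped
    if (!any(ok)) return(NA_real_)
    if (is.numeric(y)) {
      sqrt(mean((y[ok] - f[ok])^2))      # root mean squared error
    } else {
      mean(as.character(y[ok]) != as.character(f[ok]))  # mean misclassification error
    }
  })
}

# made-up examples for both response types
fit_num <- data.frame(y = c(1, 2, 3), s1 = c(1, 2, 5), s2 = c(NA, 2, 3))
fit_fac <- data.frame(y  = factor(c("a", "b", "a", "b")),
                      s1 = factor(c("a", "a", NA, "b"), levels = c("a", "b")))
err_num <- get_error(fit_num)
err_fac <- get_error(fit_fac)
```

The result is a named vector with one error per fitted value column, so the same function can be applied to individual and cumulative fitted values alike.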

Selected individual and cumulative errors of bagged trees are shown below.

Importance of each variable of the single and bagged trees is found below.

In the next two articles, the CART analysis will be evaluated using the same data for a regression task and a classification task.