When I think of a workflow, I think of the same or similar tasks being performed regularly (daily or weekly) in an automated way. Although those tasks can be executed in a script or a source()d script, separate scripts become hard to maintain as the number of tasks grows or when they have to run on different machines. In academia, reproducible research shares similar ideas, but the level of reproducibility introduced in Gandrud, 2013 may not suffice in a business environment because its focus is on documenting in a reproducible way. An R package, however, can be an effective tool, and it can be thought of as a portable class library in C# or Java. Like a class library, it can bundle a set of necessary tasks (usually as functions) and, being portable, its dependencies can be managed well - for example, dependent packages can be installed automatically if they are not installed already. Moreover, the benefit of creating an R package is significant if it has to be deployed on a production server, as it is a lot easier to convince a system admin with built-in unit tests, object documentation and package vignettes. This article illustrates an example of creating an R package.
Introduction to the treebgg package
This package extends the CART model of the rpart package (cartmore()) and implements bagging both sequentially (cartbgg()) and in parallel (cartbggs()). The sequential implementation returns the number of trees, variable importance, out-of-bag/test response and out-of-bag/test prediction of each tree in a list. The parallel implementation combines the outputs of the sequential implementation to obtain the variable importance measures and errors of the bagged trees.
Steps to create the treebgg package
The treebgg package is created following Wickham, 2015 - an overview of package development can also be found in Jeff Leek’s repo. With RStudio and the packages installed in the first step, it is quite straightforward as long as it is clear what is to be included. The specific steps are listed below.
- Install the necessary packages listed in the Getting Started section of Wickham, 2015.
install.packages(c("devtools", "roxygen2", "testthat", "knitr"))
- Create an R package in a new folder via RStudio (link).
  - Create README.md - just a new text file with the md file extension.
- Create a GitHub repo with the same name (treebgg).
  - Create it as an empty repo, without initializing it with a README.md.
- Push to the remote GitHub repository.
git add *
git commit -a -m "initial commit"
git remote add origin https://github.com/jaehyeon-kim/treebgg.git
git push -u origin master
  - If the push over HTTPS is rejected, switch the remote to the SSH URL.
git remote set-url origin git@github.com:jaehyeon-kim/treebgg.git
  - Otherwise the following error occurs.
    - error: The requested URL returned error: 403 Forbidden while accessing https://github.com/jaehyeon-kim/treebgg.git/info/refs
  - Or modify the URL directly.
    - Open .git/config and update url.
    - HTTP: url=https://github.com/jaehyeon-kim/treebgg.git
    - SSH: url=git@github.com:jaehyeon-kim/treebgg.git
- Update the R code, package metadata and object documentation.
- Complete unit tests using the testthat package.
  - /tests is created, and test files for the 5 constructors/functions are created in /tests/testthat.
  - Build -> Test Package or Ctrl + Shift + T
- Create a vignette document using the knitr package.
  - /vignettes is created with an Rmd file, DESCRIPTION is updated to suggest the knitr package, and VignetteBuilder is set to knitr.
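As a rough illustration of the documentation and testing steps, the sketch below pairs a roxygen2-style comment block with a testthat test. The helper to_percent() is hypothetical and is not part of treebgg; it only shows the shape such a function and its test might take, assuming testthat is installed as in the first step.

```r
library(testthat)

#' Convert importance scores to percentages (hypothetical helper)
#'
#' @param x A numeric vector of importance scores with a positive sum.
#' @return A numeric vector of the same length that sums to 100.
#' @export
to_percent <- function(x) {
  stopifnot(is.numeric(x), sum(x) > 0)
  100 * x / sum(x)
}

# In a package, this part would live in a file under tests/testthat.
test_that("to_percent rescales scores so that they sum to 100", {
  out <- to_percent(c(a = 2, b = 6))
  expect_equal(sum(out), 100)
  expect_equal(out[["b"]], 75)
  expect_error(to_percent("a"))
})
```

In a package the function and the test sit in separate files (R/ and tests/testthat/), and the roxygen2 comments are turned into help pages when the documentation is built.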
The devtools package is required because the package exists only in a GitHub repository. Windows users also need Rtools to build it from source.
The package can be installed from GitHub and then loaded. Note that the following packages will be installed if they are not installed already: rpart, foreach, doParallel, iterators.
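A minimal sketch of the installation, assuming the repository path used in the push step above. The wrapper install_treebgg() is a hypothetical convenience so that nothing is installed until it is called; devtools::install_github() does the actual work.

```r
# Hypothetical wrapper: installs devtools if it is missing, then installs
# the package from its GitHub repository.
install_treebgg <- function(repo = "jaehyeon-kim/treebgg") {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github(repo)
}

# Usage (needs network access):
# install_treebgg()
# library(treebgg)
```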
To extend the CART model, an S3 object (cartmore) is instantiated by cartmore(), which fits the model both at the cp value with the least xerror and at the one chosen by the 1-SE rule - both classification and regression trees can be extended. The signature of this constructor is shown below.
cartmore(formula, trainData, testData = NULL)
The object has 4 groups of elements. train/test refer to the train/test data sets, while lst/se refer to the cp values at the least xerror and by the 1-SE rule. The train elements keep the model (rpart object), complexity-related values (cp, xerror and xstd), fitted values (fitted), variable importance measure (var.imp) and error (mean misclassification error or root mean squared error) (error). The test elements hold only the fitted values and error if a test data set is specified, and are NULL if not.
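To make the two cp choices concrete, here is a minimal sketch that uses rpart directly rather than cartmore(). It is not the package's internal code: the iris data, the seed and the variable names are assumptions made for illustration only.

```r
library(rpart)

set.seed(1237)
# Grow a large classification tree so that the cp table has several rows.
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0))
cpt <- fit$cptable  # columns: CP, nsplit, rel error, xerror, xstd

# cp at the least cross-validated error
best <- which.min(cpt[, "xerror"])
cp_lst <- cpt[best, "CP"]

# 1-SE rule: the simplest tree whose xerror is within one xstd of the minimum
# (rows are ordered from the smallest tree to the largest)
threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]
cp_se <- cpt[which(cpt[, "xerror"] <= threshold)[1], "CP"]

fit_lst <- prune(fit, cp = cp_lst)
fit_se <- prune(fit, cp = cp_se)

# Mean misclassification error on the training data for each pruned tree
err_lst <- mean(predict(fit_lst, iris, type = "class") != iris$Species)
err_se <- mean(predict(fit_se, iris, type = "class") != iris$Species)
```

The 1-SE tree is never larger than the least-xerror tree, which is why it is often preferred when a simpler model is wanted.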
Sequential bagging implementation
For the sequential bagging implementation, cartbgg() instantiates the cartbgg object, which is a list of the number of trees (ntree), variable importance (var.imp), out-of-bag/test response (oob.res and test.res) and out-of-bag/test prediction of each tree (oob.pred and test.pred). The signature of cartbgg() is shown below.
cartbgg(formula, trainData, testData = NULL, ntree = 1L)
As can be seen in the signature, it has one extra argument that specifies the number of trees to generate - note that the type is restricted to integer, so the numeric value should be followed by L (for example, 10L).
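The difference between 10 and 10L can be checked in base R. The validator check_ntree() below is a hypothetical sketch of how such a restriction could be enforced; it is not taken from the package.

```r
is.integer(10)   # FALSE: a plain number is stored as double
is.integer(10L)  # TRUE: the L suffix makes it an integer

# Hypothetical validator for an argument such as ntree
check_ntree <- function(ntree) {
  if (!is.integer(ntree) || length(ntree) != 1L || is.na(ntree) || ntree < 1L) {
    stop("ntree must be a single positive integer such as 10L")
  }
  ntree
}

check_ntree(10L)
```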
The class and elements can be checked by names(). This object keeps the variable importance, response and prediction values in data frames for ease of merging. In fact, the values from each tree are merge()d. If testData is not specified, NULL is returned for the test elements.
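The idea of merging per-tree values can be sketched with rpart and base R. This is an illustration under assumptions (iris data, five trees, majority voting over out-of-bag predictions), not the code inside cartbgg().

```r
library(rpart)

set.seed(1237)
n <- nrow(iris)
ntree <- 5L

# One row per observation; each tree adds a column of out-of-bag predictions.
oob_pred <- data.frame(id = seq_len(n))
for (i in seq_len(ntree)) {
  in_bag <- sample(n, replace = TRUE)       # bootstrap sample
  oob <- setdiff(seq_len(n), in_bag)        # rows left out of the sample
  fit <- rpart(Species ~ ., data = iris[in_bag, ], method = "class")
  pred <- data.frame(
    id = oob,
    pred = as.character(predict(fit, iris[oob, ], type = "class")),
    stringsAsFactors = FALSE
  )
  names(pred)[2] <- paste0("tree", i)
  # Rows that were in the bag for this tree get NA in its column.
  oob_pred <- merge(oob_pred, pred, by = "id", all.x = TRUE)
}

# Majority vote over the trees for which a row was out of bag
vote <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0L) NA_character_ else names(which.max(table(x)))
}
bagged <- apply(oob_pred[, -1, drop = FALSE], 1, vote)
oob_error <- mean(bagged != as.character(iris$Species[oob_pred$id]), na.rm = TRUE)
```

Keeping an id column is what makes merge() convenient here, since each tree sees a different set of out-of-bag rows.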
Parallel bagging implementation
cartbggs() combines cartbgg objects in parallel, instantiating the cartbggs object. Parallel processing relies on the following packages: parallel, foreach, doParallel and iterators. The signature of cartbggs() is shown below.
cartbggs(formula, trainData, testData = NULL, eachTree = 1L, ncl = NULL, seed = NULL)
It has three extra arguments - eachTree, ncl and seed. Rather than the total number of trees to generate (ntree in cartbgg()), it requires the number of trees to generate in each cluster (eachTree). The number of clusters (ncl) can be either specified or left out - if it is left out, it is detected. The last argument specifies a random seed. Note that the type of these arguments is restricted to integer, so each numeric value should be followed by L.
The ntree value is the total number of trees, which is eachTree multiplied by ncl. The bagged trees’ variable importance measure is converted to percentages so that it sums to 100. Only the out-of-bag/test errors of the bagged trees are returned, and the test error is NULL if no data set is passed to testData.
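The relationship between eachTree, ncl and the percentage-scaled importance can be sketched as follows. For brevity this uses parLapply() from the base parallel package instead of foreach and doParallel, so it is a stand-in for the approach rather than the package's own code. The worker function grow_trees(), the iris data and the fixed ncl of 2 are assumptions.

```r
library(parallel)

# Hypothetical worker: grows `each_tree` trees on bootstrap samples and
# returns the summed rpart variable importance.
grow_trees <- function(worker_id, each_tree, data) {
  vars <- setdiff(names(data), "Species")
  total <- stats::setNames(numeric(length(vars)), vars)
  for (i in seq_len(each_tree)) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    fit <- rpart::rpart(Species ~ ., data = boot, method = "class")
    vi <- fit$variable.importance
    total[names(vi)] <- total[names(vi)] + vi
  }
  total
}

each_tree <- 3L
ncl <- 2L                        # detectCores() could be used when this is left out
cl <- makeCluster(ncl)
clusterSetRNGStream(cl, 1237L)   # reproducible random streams on the workers
imp_list <- parLapply(cl, seq_len(ncl), grow_trees,
                      each_tree = each_tree, data = iris)
stopCluster(cl)

ntree <- each_tree * ncl                     # total number of trees
total_imp <- Reduce(`+`, imp_list)           # combine the workers' results
imp_pct <- 100 * total_imp / sum(total_imp)  # percentages that sum to 100
```

Seeding the cluster once with clusterSetRNGStream() gives each worker its own random stream, which keeps the bootstrap samples different across workers while remaining reproducible.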
Roughly speaking, there are two ways of performing a task. One is easy to start but hard to maintain, and the other is hard to start but easy to maintain. Developing an R package takes more time up front, but its benefit is ongoing with little or no maintenance. Depending on the complexity of a task, it is worth considering turning an analysis into a package.