In the previous posts, two groups of ways to implement parallel processing on a single machine were introduced. The first group is provided by the snow and parallel packages, and its functions are extensions of
lapply() (link). The second group is based on an extension of the for construct (foreach, %dopar% and %:%). The foreach construct is provided by the foreach package, while clusters are made and registered by the parallel and doParallel packages respectively (link). To conclude this series, three practical examples are discussed for comparison in this article.
Let’s get started.
The following packages are loaded first. Note that the randomForest, rpart and ISLR packages are necessary for the second and third examples, and they are loaded later.
This example is from McCallum and Weston (2012). It was originally created using
clusterApply() in the snow package. First, a slight modification is made so that it can be used with
parLapplyLB() in the parallel package. A foreach construct is also created for comparison.
According to the document,
- the data given by x are clustered by the k-means method, which aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centres is minimized
At the minimum, all data points are nearest to their cluster centres. The number of centres is specified by centers (4 in this example). The distance value is kept in tot.withinss. As initial clusters are randomly assigned at the beginning, fitting is performed multiple times, and the number of repetitions is determined by nstart.
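As a serial baseline, the fit described above can be sketched as follows (a minimal sketch; the seed and nstart value are illustrative assumptions):

```r
# Serial k-means on the Boston data with 4 centres.
# nstart controls how many random initialisations kmeans() itself
# performs before returning the best fit.
library(MASS)  # provides the Boston data set

set.seed(1237)                                    # illustrative seed
fit <- kmeans(Boston, centers = 4, nstart = 25)   # illustrative nstart
fit$tot.withinss  # total within-cluster sum of squares to be minimised
```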
The clusters are initialized by
clusterEvalQ(), as the Boston data is available in the MASS package. A list of outputs is returned by parLapplyLB() and tot.withinss is extracted by
sapply(). The final outcome is the one that gives the minimum tot.withinss.
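A minimal sketch of this version: each task runs kmeans() once, tot.withinss is pulled out with sapply(), and the fit with the smallest value wins. The split of the work into 8 tasks on 2 workers is an assumption for illustration, not the original code.

```r
library(parallel)

cl <- makeCluster(2)
clusterSetRNGStream(cl, 1237)        # reproducible parallel RNG streams
clusterEvalQ(cl, library(MASS))      # make the Boston data available on each worker

# one kmeans() run per task; parLapplyLB() balances the tasks over workers
results <- parLapplyLB(cl, rep(1, 8), function(nstart) {
  kmeans(Boston, centers = 4, nstart = nstart)
})
stopCluster(cl)

# keep the fit with the minimum tot.withinss
i <- which.min(sapply(results, function(x) x$tot.withinss))
best <- results[[i]]
```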
The corresponding implementation using the foreach package is shown below. An iterator object is created to repeat the individual nstart value for the number of clusters (iters). A function to combine the outcomes is created (
comb()), which just keeps the outcome that gives the minimum tot.withinss - as .combine doesn't seem to allow extra arguments, this kind of modification is necessary.
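A sketch of the foreach counterpart: comb() keeps whichever fit has the smaller tot.withinss. Repeating nstart = 1 four times with rep() is a stand-in for the iterator the text describes, an assumption for illustration.

```r
library(doParallel)

# pairwise combiner: keep the fit with the smaller tot.withinss
comb <- function(x, y) if (x$tot.withinss < y$tot.withinss) x else y

cl <- makeCluster(2)
registerDoParallel(cl)
best <- foreach(i = rep(1, 4), .combine = comb, .packages = "MASS") %dopar% {
  kmeans(Boston, centers = 4, nstart = i)
}
stopCluster(cl)
```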
This example is from the foreach package's vignette.
According to the package document,
- randomForest implements Breiman’s random forest algorithm (based on Breiman and Cutler’s original Fortran code) for classification and regression.
x and y keep the predictors and the response. A function (
rf()) is created to implement the algorithm. If data has to be sent to each worker, it can be sent either by
clusterCall() or by a function. If
clusterApplyLB() is used, the former should be used to reduce I/O time, while it is fine to send data by a function if
parLapplyLB() is used - there is a single I/O operation for each task split (for details, see the first article). As the randomForest package provides a function to combine the fitted objects (
combine()), it is used in
do.call(). Finally a confusion table is created.
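This version can be sketched as follows. The iris data as a stand-in for x and y, and the split of 1000 trees over 4 tasks, are assumptions for illustration; the data is sent by a function, as the text recommends for parLapplyLB().

```r
library(parallel)
library(randomForest)

x <- iris[, -5]; y <- iris$Species   # stand-in data, an assumption

# grow a chunk of the forest on a worker; data arrive as arguments
rf <- function(ntree, x, y) {
  library(randomForest)
  randomForest(x, y, ntree = ntree)
}

cl <- makeCluster(2)
clusterSetRNGStream(cl, 1237)
fits <- parLapplyLB(cl, rep(250, 4), rf, x = x, y = y)
stopCluster(cl)

# merge the partial forests and build a confusion table
fit <- do.call(randomForest::combine, fits)
table(predict(fit, x), y)
```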
randomForest() is directly used in the foreach construct and the returned outcomes are combined by
combine() (.combine="combine"). The fitting function has to be available in each worker, and this is set by .packages="randomForest". As there are multiple arguments in the combine function, it is necessary to set the multi-combine option to TRUE (.multicombine=TRUE) - this option is discussed further in the next example. As in the above example, a confusion matrix is created at the end - both results should be the same, as the workers are set to generate the same streams of random numbers.
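A sketch of the foreach version: each iteration grows part of the forest and .combine="combine" merges the pieces. The stand-in data and the tree split are assumptions for illustration.

```r
library(doParallel)
library(randomForest)   # combine() must be visible where results are merged

x <- iris[, -5]; y <- iris$Species   # stand-in data, an assumption

cl <- makeCluster(2)
registerDoParallel(cl)
fit <- foreach(ntree = rep(250, 4), .combine = "combine",
               .multicombine = TRUE, .packages = "randomForest") %dopar% {
  randomForest(x, y, ntree = ntree)
}
stopCluster(cl)

table(predict(fit, x), y)   # confusion table
```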
Although bagging can be implemented using the randomForest package, another quick implementation is tried using the rpart package for illustration (
cartBGG()). Specifically, bootstrap samples are created for the number of trees specified by ntree. To simplify the discussion, only the variable importance values are kept - an rpart object keeps these details in variable.importance. Therefore
cartBGG() returns a list whose only element is a data frame where the number of rows equals the number of predictors and the number of columns equals the number of trees. In fact,
cartBGG() is a constructor that generates an S3 object (rpartbgg).
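The constructor can be sketched as below; the exact argument names and the `varImp` element name are assumptions about the original code. Each tree is fitted on a bootstrap sample and its variable.importance is aligned over all predictors (zero for predictors a tree never uses), giving a (predictors x ntree) data frame.

```r
library(rpart)

cartBGG <- function(formula, data, ntree = 10, ...) {
  vars <- attr(terms(formula, data = data), "term.labels")  # predictor names
  imp <- sapply(seq_len(ntree), function(i) {
    idx <- sample(nrow(data), replace = TRUE)               # bootstrap sample
    fit <- rpart(formula, data = data[idx, ], ...)
    out <- setNames(numeric(length(vars)), vars)            # zero for unused predictors
    vi <- fit$variable.importance
    if (!is.null(vi)) out[names(vi)] <- vi
    out
  })
  # a list whose only element is the (predictors x ntree) data frame,
  # classed as an S3 object per the text
  structure(list(varImp = as.data.frame(imp)), class = "rpartbgg")
}
```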
If the total number of trees is split across clusters (e.g. into 4 clusters), there will be 4 lists and it is possible to combine them. Below is an example of such a function (
comBGG()) - it just sums the individual variable importance values.
- the arguments shouldn't be predetermined (...), as the number of clusters - and thus the number of lists - can vary
- a list is created that binds the variable number of arguments (list(...))
- only the elements that keep the variable importance values are extracted into another list by lapply()
- the new list of variable importance values can be restructured into a single data frame using cbind() in do.call()
- finally the values in each row are added up by apply() with sum
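The steps above can be sketched as follows; `varImp` as the element name is an assumption about the rpartbgg structure.

```r
comBGG <- function(...) {
  bgg <- list(...)                          # bind the variable number of lists
  bgg <- lapply(bgg, function(x) x$varImp)  # extract the variable importance frames
  bgg <- do.call(cbind, bgg)                # restructure into one data frame
  apply(bgg, 1, sum)                        # add up the values in each row
}
```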
Similar to the above example, bagged trees are generated across clusters using
cartBGG(). Then the results are combined by comBGG().
This example is similar to the random forest example. Note that .multicombine determines how many arguments are combined at once. For performance, the default maximum number is 100 if the value is TRUE and 2 if FALSE (.maxcombine = if (.multicombine) 100 else 2). As there are more than 2 lists in this example, it should be set to TRUE so that all lists generated by the clusters can be combined.
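The full call can be sketched as below. cartBGG() and comBGG() are repeated in compact form so the example is self-contained, and the split of 100 trees over 4 tasks is an illustrative assumption.

```r
library(doParallel)
library(rpart)

# compact versions of the two functions sketched earlier in the text;
# argument and element names are assumptions about the original code
cartBGG <- function(formula, data, ntree = 10, ...) {
  vars <- attr(terms(formula, data = data), "term.labels")
  imp <- sapply(seq_len(ntree), function(i) {
    idx <- sample(nrow(data), replace = TRUE)
    fit <- rpart(formula, data = data[idx, ], ...)
    out <- setNames(numeric(length(vars)), vars)
    vi <- fit$variable.importance
    if (!is.null(vi)) out[names(vi)] <- vi
    out
  })
  structure(list(varImp = as.data.frame(imp)), class = "rpartbgg")
}

comBGG <- function(...) {
  bgg <- lapply(list(...), function(x) x$varImp)
  apply(do.call(cbind, bgg), 1, sum)
}

cl <- makeCluster(2)
registerDoParallel(cl)
# .multicombine = TRUE so all 4 rpartbgg lists reach comBGG() together
varImp <- foreach(ntree = rep(25, 4), .combine = comBGG, .multicombine = TRUE,
                  .packages = "rpart", .export = "cartBGG") %dopar% {
  cartBGG(medv ~ ., data = MASS::Boston, ntree = ntree)
}
stopCluster(cl)
```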
Three practical examples of implementing parallel processing on a single machine have been discussed in this post. They are relatively easy to implement, and existing packages can be used directly. I hope this series of posts is useful.