A short while ago I had a chance to perform an analysis using the caret package. One of the requirements was to run it in parallel on both Windows and Linux. This can be met with the parallel and doParallel packages, as the caret package trains a model via the foreach package when clusters are registered by the doParallel package - further details on how to implement parallel processing on a single machine can be found in earlier posts (Link 1, Link 2 and Link 3). While it is relatively straightforward to train a model across multiple clusters using the caret package, setting up random seeds may be a bit tricky. As random seeds make an analysis more reproducible, this post illustrates a way of setting them up using a simple function.
The following packages are used.
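The original package list is not shown here; the following set would cover what the post relies on (caret for training, parallel and doParallel for cluster registration, and randomForest for the random forest learner):

```r
library(caret)
library(parallel)
library(doParallel)
library(randomForest)
```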
In the caret package, random seeds are set via the seeds argument of trainControl(), and the package documentation describes it as follows.
seeds - an optional set of integers that will be used to set the seed at each resampling iteration. This is useful when the models are run in parallel. A value of NA will stop the seed from being set within the worker processes while a value of NULL will set the seeds using a random set of integers. Alternatively, a list can be used. The list should have B+1 elements where B is the number of resamples. The first B elements of the list should be vectors of integers of length M where M is the number of models being evaluated. The last element of the list only needs to be a single integer (for the final model). See the Examples section below and the Details section.
Setting seeds to either NA or NULL doesn't guarantee full control of resampling, so a custom list is necessary. Here setSeeds() creates the custom list; it handles only (repeated) cross validation and returns NA if a different resampling method is specified - this function is based on the source code of the caret package. Specifically, B is determined by the number of folds (numbers) or that multiplied by the number of repeats (numbers x repeats). Then a list of B elements is generated, where each element is an integer vector of length M. M is the sum of the number of folds (numbers) and the length of the tune grid (tunes). Finally an integer vector of length 1 is added to the list (for the final model).
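Based on the description above, a sketch of setSeeds() might look as follows (the default seed value and the sampling range are assumptions):

```r
setSeeds <- function(method = "cv", numbers = 1, repeats = 1,
                     tunes = NULL, seed = 1237) {
  # B is the number of resamples: numbers for cv, numbers x repeats for repeatedcv
  B <- if (method == "cv") numbers
       else if (method == "repeatedcv") numbers * repeats
       else NULL
  if (is.null(B)) return(NA)  # only (repeated) cross validation is handled

  set.seed(seed)
  # M: the number of folds plus the length of the tune grid
  M <- if (is.null(tunes)) numbers else numbers + tunes

  seeds <- vector(mode = "list", length = B + 1)
  # the first B elements are integer vectors of length M
  seeds[1:B] <- lapply(1:B, function(i) sample.int(n = 1000000, size = M))
  # the last element is a single integer for the final model
  seeds[[B + 1]] <- sample.int(n = 1000000, size = 1)
  seeds
}
```

For example, setSeeds(method = "cv", numbers = 3, tunes = 4) returns a list of 4 elements where the first 3 are integer vectors of length 7.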
Shown below are the control variables of the resampling methods used in this post: k-fold cross validation and repeated k-fold cross validation. Here 3-fold cross validation (and 5 repeats of it) is chosen. Also a grid is set up to tune mtry of randomForest() (cvTunes), while rcvTunes is for tuning the number of nearest neighbours of knn().
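The exact assignments are not reproduced here; the following values are consistent with the figures quoted in the next paragraphs (cvTunes and rcvTunes are taken as the lengths of the respective tune grids):

```r
# control variables for the resampling methods
numbers <- 3    # number of folds
repeats <- 5    # number of repeats for repeated cv
cvTunes <- 4    # number of candidate values for mtry of randomForest()
rcvTunes <- 12  # number of candidate values for k of knn()
seed <- 1237
```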
Random seeds for 3-fold cross validation are shown below - B + 1 and M are equal to 4 (3 + 1) and 7 (3 + 4) respectively.
Random seeds for 5 repeats of 3-fold cross validation are shown below - B + 1 and M are equal to 16 (3 x 5 + 1) and 15 (3 + 12) respectively.
Given the random seeds, train controls are set up as shown below.
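Assuming the seed lists from setSeeds() are stored in cvSeeds and rcvSeeds (names assumed), the train controls might be set up along these lines:

```r
# 3-fold cross validation
cvCtrl <- trainControl(method = "cv", number = numbers,
                       seeds = cvSeeds)

# 5 repeats of 3-fold cross validation
rcvCtrl <- trainControl(method = "repeatedcv", number = numbers,
                        repeats = repeats, seeds = rcvSeeds)
```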
They are tested by comparing the two learners: knn and randomForest. Each pair of model objects is identical except for the times element, which is not related to model reproducibility.
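A minimal sketch of such a comparison (the cluster size, the dataset and the tune length are assumptions, not the post's actual setup):

```r
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# train the same model twice with identical seeds
fit1 <- train(Species ~ ., data = iris, method = "knn",
              tuneLength = 12, trControl = rcvCtrl)
fit2 <- train(Species ~ ., data = iris, method = "knn",
              tuneLength = 12, trControl = rcvCtrl)

stopCluster(cl)

# identical apart from the times element
all.equal(fit1$results, fit2$results)
identical(fit1$bestTune, fit2$bestTune)
```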
I hope this post helps improve the reproducibility of analyses performed with the caret package.