In the previous post, applying Hyperlink-Induced Topic Search (HITS) to association rule mining was discussed as a way to build a CRAN package recommender. The link analysis algorithm gives more weight to transactions where strong cross-selling effects exist, so that more relevant association rules can be mined for recommendation. Not all packages, however, are likely to be included in those rules, and a way to complement them is necessary. Kaminskas et al. (2015) discuss a recommender system for small retailers. In the paper, a combination of association rules and text-based similarity is utilized, which can be a good fit for the CRAN recommender. Before actual development, the relevant data has to be downloaded and processed, and that is the topic of this post.
Here is the snippet for initialization. utils.R can be found here.
Some functions take quite a long time and they are executed in parallel. For example, it is a lot quicker if files are read on multiple nodes and combined later. A unified function that executes a function in parallel is created and named process(). In the following example, read_files() reads and binds multiple files, returning a data frame. This function can be executed on multiple nodes with process().
Each individual function has a common items argument, which can be files to read/download or package names to scrape. Each function can also have a different set of arguments, and they are captured in the ... argument. get_args() is just a convenience function to grab a specific argument from ....
process() runs an individual function itself if cores equals 1. When cores is greater than 1, items are split across the number of cores and the function is executed on multiple nodes. Each function may need a specific initialization (eg library(readr)) and it is captured in init_str. Note that clusterEvalQ() accepts an expression, which is not evaluated. Therefore the expression is set as a string (init_str) and exported in an environment that is created by set_env() - see clusterExport(). In this way init_str, if it exists, can be evaluated on each node. Finally, the results are combined by the function set by combine. (See this post for how parLapplyLB() works.)
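The mechanics described above can be sketched as follows. This is a minimal sketch rather than the actual utils.R; the argument names (items, cores, init_str, combine) mirror the text, and the exact set_env() mechanics are simplified to a direct clusterExport() of the string.

```r
# Minimal sketch of a process()-style wrapper: split items across cores,
# evaluate an optional init string on each node, run the function on every
# chunk and combine the results.
library(parallel)

process <- function(fun, items, cores = 1, init_str = NULL,
                    combine = rbind, ...) {
  if (cores == 1) return(fun(items, ...))
  cl <- makeCluster(cores)
  on.exit(stopCluster(cl), add = TRUE)
  if (!is.null(init_str)) {
    # export the string and evaluate it on each node (cf. set_env())
    clusterExport(cl, "init_str", envir = environment())
    clusterEvalQ(cl, eval(parse(text = init_str)))
  }
  # split items into as many chunks as there are cores
  chunks <- split(items, cut(seq_along(items), cores, labels = FALSE))
  out <- parLapplyLB(cl, chunks, fun, ...)
  do.call(combine, out)
}

# usage: sum the squares of 1..10 on 2 nodes
res <- process(function(x) sum(x^2), 1:10, cores = 2, combine = c)
sum(res)  # 385
```

A load-balancing apply (parLapplyLB()) is used so that a slow chunk does not hold up the idle nodes.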
Downloading CRAN Logs
The log files from 2017-04-01 to 2017-04-30 are downloaded from this page. As discussed earlier, they are downloaded in parallel by download_log() wrapped in process().
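A possible shape of download_log() is sketched below. The URL pattern of cran-logs.rstudio.com (a year folder containing gzipped daily CSV files) is an assumption here, as are the dest_dir argument and the log_url() helper.

```r
# Hypothetical helper: build the daily log URL from a yyyy-mm-dd date string
log_url <- function(d) {
  sprintf("http://cran-logs.rstudio.com/%s/%s.csv.gz", substr(d, 1, 4), d)
}

# Hedged sketch of download_log(): fetch each day's file unless it already
# exists locally, so interrupted runs can be resumed.
download_log <- function(dates, dest_dir = "logs", ...) {
  dir.create(dest_dir, showWarnings = FALSE)
  for (d in as.character(dates)) {
    dest <- file.path(dest_dir, paste0(d, ".csv.gz"))
    if (!file.exists(dest)) download.file(log_url(d), dest, quiet = TRUE)
  }
  invisible(dates)
}

# e.g. wrapped in process() so the dates are split across nodes:
# dates <- seq(as.Date("2017-04-01"), as.Date("2017-04-30"), by = "day")
# process(download_log, dates, cores = 4)
```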
The log data is anonymized and transactions have to be identified. date and ip_id alone are not enough, as records with the same date and ip_id can indicate different r_version, r_arch and r_os. Also some records have quite a small size (eg 512 bytes) and they'd need to be filtered out.
The data is filtered and grouped by date, ip_id, r_version, r_arch and r_os, followed by adding the number of packages downloaded in each group (count). Initially 31,777,687 records are found and the number goes down to 22,056,121 after filtering.
Transactions can be identified from the filtered data as follows.
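The grouping idea can be illustrated in base R on a toy log. This is only a sketch of the logic (the post's filter_log()/add_group_idx() pipeline is not shown here): the five key columns define a transaction, and the package count per transaction follows from a group-wise tally.

```r
# Toy log: two downloads by the same user/session and one by another
log <- data.frame(
  date      = "2017-04-01",
  ip_id     = c(1, 1, 2),
  r_version = c("3.3.3", "3.3.3", "3.2.5"),
  r_arch    = "x86_64",
  r_os      = "mingw32",
  package   = c("dplyr", "ggplot2", "dplyr"),
  stringsAsFactors = FALSE
)

# a transaction is a unique combination of the five key columns
key <- c("date", "ip_id", "r_version", "r_arch", "r_os")
log$trans_id <- interaction(log[key], drop = TRUE)

# number of distinct packages downloaded per transaction
counts <- tapply(log$package, log$trans_id, function(p) length(unique(p)))
counts  # one transaction with 2 packages, one with 1
```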
More than 50% of transactions download a single package, and up to 1231 packages are found in a single transaction. As it is unrealistic that a user downloads such a large number of packages, the maximum number of packages is set to 20.
A total of 6,215,863 transactions are identified and the proportion of transactions by the number of downloaded packages is shown below.
It requires multiple steps to construct a transactions object of the arules package from the log data.
filter_log() - Data is filtered and grouped by date, ip_id, r_version, r_arch and r_os followed by adding count. If max_download is not NULL, data is further filtered by this number.
add_group_idx() - Each group is given a unique id and the id column is added to data.
keep_trans_cols() - Transaction ids are made up of date and (group) id. Only transaction id, package name and count columns are kept.
split_log() - The previous 3 functions are executed in order and the data is split after assigning a split group number (splt). See below for details.
construct_trans() - A transactions object is made from a matrix (as(mat, 'transactions')) and the matrix is created by dcast() of the data.table package, which returns a matrix of 0/1 elements. Note that the entire log data for even a single day can cause an error in dcast(), so the data is split by groups (splt) and a transactions object is constructed for each group. Transactions objects can be merged efficiently as discussed below.
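The shape of the matrix handed to as(mat, 'transactions') can be illustrated with base R. The post uses data.table::dcast() for this step; xtabs() below is only a stand-in to show the 0/1 incidence structure without requiring the package.

```r
# Toy data in the kept-columns form: transaction id plus package name
df <- data.frame(
  trans_id = c("t1", "t1", "t2"),
  package  = c("dplyr", "ggplot2", "dplyr")
)

# 0/1 incidence matrix (rows: transactions, columns: packages),
# equivalent in shape to what dcast() would return
mat <- xtabs(~ trans_id + package, df) > 0
storage.mode(mat) <- "integer"
mat
# this matrix is then coerced with as(mat, 'transactions') in arules
```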
Transactions objects are saved from the individual log data files as shown below. Note that dcast() consumes quite a large amount of memory, and process() is not recommended if the machine doesn't have enough memory.
An example of transaction objects is shown below.
The arules package has a function to merge transactions (merge()). However, it doesn't allow merging transactions that have different numbers of items. bind_trans(), which accepts multiple transactions objects, is created to overcome this limitation. First it collects information on the transactions objects and obtains all unique items across them. Then, for each of the transactions objects, a sparse matrix is created for the items that don't exist (get_sm()) and row-bound to the item matrix. Note that the last element is manually set to FALSE where it is TRUE by default. Finally the individual item matrices are column-bound, and a merged transactions object is returned.
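The padding idea can be sketched with the Matrix package alone (arules stores its item matrices in this sparse format). This is not the post's bind_trans() itself: the toy matrices and the simplified get_sm() below are illustrations, and the manual last-element fix mentioned above is omitted.

```r
library(Matrix)

# two item matrices (items x transactions) with different item sets
m1 <- Matrix(c(TRUE, FALSE, TRUE, TRUE), 2, 2, sparse = TRUE,
             dimnames = list(c("dplyr", "ggplot2"), c("t1", "t2")))
m2 <- Matrix(TRUE, 1, 1, sparse = TRUE,
             dimnames = list("data.table", "t3"))

# all unique items across the objects
all_items <- union(rownames(m1), rownames(m2))

# pad an item matrix with an all-FALSE sparse block for missing items
# (simplified version of the get_sm() idea), then reorder the rows
get_sm <- function(m, all_items) {
  missing <- setdiff(all_items, rownames(m))
  pad <- Matrix(FALSE, length(missing), ncol(m), sparse = TRUE,
                dimnames = list(missing, colnames(m)))
  rbind(m, pad)[all_items, , drop = FALSE]
}

# column-bind the padded matrices to merge the transactions
merged <- cbind(get_sm(m1, all_items), get_sm(m2, all_items))
merged  # 3 items x 3 transactions
```

In arules the merged item matrix would then be wrapped back into a transactions object.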
An example is shown below. Separate transactions objects are created and merged. The merged object is compared to the original transactions object and they match.
The entire set of transactions objects is merged and verified below. As can be seen, the total number of transactions is the same. Note that more than 50% of transactions have only a single package, and those transaction records need to be removed for association rule mining. On the other hand, the entire transaction records are necessary to execute HITS, so both objects are kept for the following analysis. (Recall that authority will be used for recommendation by keywords.)
Collecting Package Information
As indicated earlier, text-based similarity can be used to complement association rules. get_package_info() can be used within process() to collect relevant information.
An example of collecting package information is shown below.
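As a hypothetical sketch of what get_package_info() could look like: the post scrapes package pages, but the title and description fields can also be pulled from the CRAN package database via tools::CRAN_package_db() (R >= 3.4). The function name and columns kept here are assumptions for illustration.

```r
# Hypothetical sketch: fetch title/description text for the given packages
# from the CRAN package database (requires internet access when called)
get_package_info <- function(packages) {
  db <- tools::CRAN_package_db()
  db[db$Package %in% packages, c("Package", "Title", "Description")]
}

# e.g. get_package_info(c("arules", "data.table"))
```

The Title and Description fields are the natural inputs for the text-based similarity mentioned above.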
That is all for this post. In the following post, the transaction data will be analysed.