# 2017-05-04-CRAN-Package-Recommender-Part-I

One of the attractions of R is 10,000+ contrubuted packages that can be easily downloaded from the Comprehensive R Archive Network (CRAN). Due to the ever-growing number of packages, however, it is not easy to find interesting packages. CRAN Task Views is a good source but, at the time of wrting, it covers only about 2,700 packages. There are some websites that help search R related information such as RDocumentation, Rseek or rdrr-io. However they seem to rely on *full-text search* so that extra effort is required to identify packages of interest. In this regard, it can be quite useful if there is a *recommender system* so that a set of packages are recommended by a *package name* or general *keywords*.

Collaborative filtering is one of the most popular techniques used by recommender systems. However it is not applicable due to the anonymized structure of CRAN package download logs provided by RStudio. On the other hand, association rules may be mined so that those rules can be used to recommend a package when a user enters a package name. Also package documents (eg package description) can be searched given keywords (eg by fuzzy text search) and top packages that are ranked high according to the transactions data may be shown for recommendation.

This is the second series about *serverless data product development*. In this post, a link analysis algorithm called *Hyperlink-Induced Topic Search (HITS)* is introduced. This algorithm helps assign weights on individual transactions and those weights can be used for weighted association rule mining. In the current context, more relevant packages may be recommended by weighted association rules when a package name is entered. Also *HITS* allows to have weights on individual items (packages) so that they can be used to show recommended packages with certain keywords.

The initial series plan is listed below.

Development

- Introduction to HITS and weighted association rules mining - this post
- Downloading/Processing relevant data
- Analysing CRAN package download logs
- Identifying package similarity
*to be updated*

Deployment

*to be updated*

The following packages are used.

## Hyperlink-Induced Topic Search (HITS)

According to Wikipedia,

Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users direct to other authoritative pages. In other words, a good

hubrepresented a page that pointed to many other pages, and a goodauthorityrepresented a page that was linked by many different hubs.

As Sun and Bai (2008) discusses, *HITS* can be applied to analysis of transactions database. The key idea is

- transaction database can be represented as a bipartite graph and thus
- a link-based ranking model (i.e.
*HITS*) can be applied to analysis of transactions.

#### Example

It can be a lot easier to illustrate with an example. Below shows simple transaction data - it can also be found in the *arules* package (`data("SunBai")`

).

As can be seen above, transactions 200 and 500 have common items of C, F and G, which implies that a strong cross-selling effect exists among them. Although they should be evaluated high, it may not be captured enough if counting-based measurement is employed rather than link-based measurement. On the other hand, although item A has the highest support, it doesn’t appear with the valuable items so that it should be evaluated lower. In this circumstance, transactions analysis can be improved by assigning weights on individual transactions and *HITS* provides an algorithmic way of doing so.

#### Graph Representation

The transactions data can be constructed as a bipartite graph as shown below. Note the graph is a directed graph where edges are directed from transaction to individual items.

Adjacency matrix is a square matrix that is used to represent a finite graph. The adjacency matrix of the graph is shown below. In row 1, columns A, B, C, D and E are 1 as transaction 100 includes these items. Only the upper right-hand side of the matrix can have 0 or 1 as the edges are directed only from transaction to items - transaction to transaction and items to items are not possible.

#### Implementation

Practical Graph Mining With R covers *HITS* algorithm in a comprehensive as well as practical way. In Ch 5, *authority* and *hub* are defined as following.

- Authority - A vertex is considered an authority if it has many pages that link to it (i.e., it has a high indegree).
- Hub - A vertex is considered a hub if it points to many other vertices (i.e., it has a high outdegree).

For transactions database, individual items are candidates of authority while transactions are hub candidates.

In *HITS*, authority and hub scores for each vertex are updated iteratively as following.

- Initialize authority () and hub () vector scores
- Iteratively update scores
- Let be adjacency matrix
- Update and

- Normalize scores
- Let be Euclidean norm
- Normalize and

- Run until a convergent criterion is met

Note that the implementation is based on the mathematical definition. Note further that the *igraph* or Python’s NetworkX packages provide their own implementations but their outputs don’t seem to be directly applicable - the *arules* has *hits()* function that returns hub scores of a transactoins object but the values are quite different. `run_hits()`

executes *HITS* while `get_hits()`

collects relevant scores.

In order to obtain *authority* and *hub* scores, adjacency matrix is necessary. The *transactions* class of the *arules* package has the *data* slot and it is the upper right-hand side of adjacency matrix. Although it is possilbe to get complete adjacency matrix from the *igraph* package, due to the structure of the matrix, it can be simply obtained by creating the upper left-hand side and bottom matrices of 0’s and binding those to the *data* matrix. `get_adj()`

returns the adjacency matrix of a transactions object. Note that sparse matrices are created from the *Matrix* package so as to prevent potential *integer overflow error*.

The adjacency matrix of the transaction object can be obtained as following.

With this matrix, it is possible to obtain authority and hub scores. As mentioned, the *arules* package has `hits()`

that returns hub scores of a transaction object. It is shown that the hub scores match reasonably. (It is not necessary to obtain hub scores manually but authority scores. Therefore having a reliable function is necessary.)

The authority scores of C and G are higher than that of A although the support of A is higher. The scores can also be useful for the recommender system.

## Weighted Association Rules

In order to compare association rules with/without weights, the hub scores are added to *transactionInfo*.

The weighted support of an item is defined as the sum of weights where the item appears divided by the sum of all weights. For example, the weighted support of A is 0.5719366, which can be obtained by

`sum(info$weight[grepl('A', info$items)])/sum(info$weight)`

.

The weighted supports of items C and G are higher than that of A although their counting-based measurements are lower. With this link-based measurement, the resulting association rules may be more useful for the recommender system.

As expected, items C, F and G and their combinations are given more importance.

The resulting association rules are shown below.

In this post, *HITS* algorithm is introduced and its potential application to a recommender system. In the following posts, CRAN package download logs will be analysed.