One of the attractions of R is 10,000+ contrubuted packages that can be easily downloaded from the Comprehensive R Archive Network (CRAN). Due to the ever-growing number of packages, however, it is not easy to find interesting packages. CRAN Task Views is a good source but, at the time of wrting, it covers only about 2,700 packages. There are some websites that help search R related information such as RDocumentation, Rseek or rdrr-io. However they seem to rely on full-text search so that extra effort is required to identify packages of interest. In this regard, it can be quite useful if there is a recommender system so that a set of packages are recommended by a package name or general keywords.
Collaborative filtering is one of the most popular techniques used by recommender systems. However it is not applicable due to the anonymized structure of CRAN package download logs provided by RStudio. On the other hand, association rules may be mined so that those rules can be used to recommend a package when a user enters a package name. Also package documents (eg package description) can be searched given keywords (eg by fuzzy text search) and top packages that are ranked high according to the transactions data may be shown for recommendation.
This is the second series about serverless data product development. In this post, a link analysis algorithm called Hyperlink-Induced Topic Search (HITS) is introduced. This algorithm helps assign weights on individual transactions and those weights can be used for weighted association rule mining. In the current context, more relevant packages may be recommended by weighted association rules when a package name is entered. Also HITS allows to have weights on individual items (packages) so that they can be used to show recommended packages with certain keywords.
The initial series plan is listed below.
- Introduction to HITS and weighted association rules mining - this post
- Downloading/Processing relevant data
- Analysing CRAN package download logs
- Identifying package similarity
- to be updated
- to be updated
The following packages are used.
Hyperlink-Induced Topic Search (HITS)
According to Wikipedia,
Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users direct to other authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs.
As Sun and Bai (2008) discusses, HITS can be applied to analysis of transactions database. The key idea is
- transaction database can be represented as a bipartite graph and thus
- a link-based ranking model (i.e. HITS) can be applied to analysis of transactions.
It can be a lot easier to illustrate with an example. Below shows simple transaction data - it can also be found in the arules package (
As can be seen above, transactions 200 and 500 have common items of C, F and G, which implies that a strong cross-selling effect exists among them. Although they should be evaluated high, it may not be captured enough if counting-based measurement is employed rather than link-based measurement. On the other hand, although item A has the highest support, it doesn’t appear with the valuable items so that it should be evaluated lower. In this circumstance, transactions analysis can be improved by assigning weights on individual transactions and HITS provides an algorithmic way of doing so.
The transactions data can be constructed as a bipartite graph as shown below. Note the graph is a directed graph where edges are directed from transaction to individual items.
Adjacency matrix is a square matrix that is used to represent a finite graph. The adjacency matrix of the graph is shown below. In row 1, columns A, B, C, D and E are 1 as transaction 100 includes these items. Only the upper right-hand side of the matrix can have 0 or 1 as the edges are directed only from transaction to items - transaction to transaction and items to items are not possible.
Practical Graph Mining With R covers HITS algorithm in a comprehensive as well as practical way. In Ch 5, authority and hub are defined as following.
- Authority - A vertex is considered an authority if it has many pages that link to it (i.e., it has a high indegree).
- Hub - A vertex is considered a hub if it points to many other vertices (i.e., it has a high outdegree).
For transactions database, individual items are candidates of authority while transactions are hub candidates.
In HITS, authority and hub scores for each vertex are updated iteratively as following.
- Initialize authority () and hub () vector scores
- Iteratively update scores
- Let be adjacency matrix
- Update and
- Normalize scores
- Let be Euclidean norm
- Normalize and
- Run until a convergent criterion is met
Note that the implementation is based on the mathematical definition. Note further that the igraph or Python’s NetworkX packages provide their own implementations but their outputs don’t seem to be directly applicable - the arules has hits() function that returns hub scores of a transactoins object but the values are quite different.
run_hits() executes HITS while
get_hits() collects relevant scores.
In order to obtain authority and hub scores, adjacency matrix is necessary. The transactions class of the arules package has the data slot and it is the upper right-hand side of adjacency matrix. Although it is possilbe to get complete adjacency matrix from the igraph package, due to the structure of the matrix, it can be simply obtained by creating the upper left-hand side and bottom matrices of 0’s and binding those to the data matrix.
get_adj() returns the adjacency matrix of a transactions object. Note that sparse matrices are created from the Matrix package so as to prevent potential integer overflow error.
The adjacency matrix of the transaction object can be obtained as following.
With this matrix, it is possible to obtain authority and hub scores. As mentioned, the arules package has
hits() that returns hub scores of a transaction object. It is shown that the hub scores match reasonably. (It is not necessary to obtain hub scores manually but authority scores. Therefore having a reliable function is necessary.)
The authority scores of C and G are higher than that of A although the support of A is higher. The scores can also be useful for the recommender system.
Weighted Association Rules
In order to compare association rules with/without weights, the hub scores are added to transactionInfo.
The weighted support of an item is defined as the sum of weights where the item appears divided by the sum of all weights. For example, the weighted support of A is 0.5719366, which can be obtained by
The weighted supports of items C and G are higher than that of A although their counting-based measurements are lower. With this link-based measurement, the resulting association rules may be more useful for the recommender system.
As expected, items C, F and G and their combinations are given more importance.
The resulting association rules are shown below.
In this post, HITS algorithm is introduced and its potential application to a recommender system. In the following posts, CRAN package download logs will be analysed.