The first post of this series illustrated a function, uniqueCol(), that returns all combinations of columns that can uniquely identify the records of a data frame. It is built by combining recursive functions. This post compares popular packages, or combinations of packages, for data manipulation: the base package, plyr, dplyr, data.table, and dplyr together with data.table.
First, 10000 artificial records of name, territory, start date and end date are created. As the first goal is to assign unique ids to individuals, integer ids are mapped to unique names and then merged into the raw data (preDf).
Each of the packages is then used to add a new column, the number of days between the start and end dates (period). The column is either unconditional, computed row by row, or conditional, in which case it is the average period by id.
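The data preparation described above could look roughly like the sketch below. The seed, the pool of names, the territory labels, the date ranges and the helper names (namesPool, idDf, df) are all assumptions made for illustration, since the original values are not shown. Only the name preDf and the overall shape of the data come from the text.

```r
set.seed(1237)  # arbitrary seed, assumed only for reproducibility
n <- 10000

# hypothetical pool of names and start dates; not the original values
namesPool <- paste0("name", 1:200)
start <- as.Date("2013-01-01") + sample(0:364, n, replace = TRUE)

# raw data: name, territory, start date and end date
preDf <- data.frame(
  name = sample(namesPool, n, replace = TRUE),
  territory = sample(c("north", "south", "east", "west"), n, replace = TRUE),
  start = start,
  end = start + sample(1:100, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# map integer ids to unique names, then merge them into the raw data
idDf <- data.frame(name = unique(preDf$name), stringsAsFactors = FALSE)
idDf$id <- seq_len(nrow(idDf))
df <- merge(preDf, idDf, by = "name")
```

Keeping the id mapping in its own small data frame means the same name always receives the same id, however many records it appears in.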
With the base package, the unconditional period can be obtained with transform(), while the conditional values are obtained with aggregate() and then joined back with merge(). Compared to the other ways, this requires more lines of code when a new column depends on existing columns. However, there may be cases where a third-party package is not available or has to be avoided, and then this approach is quite useful.
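A minimal sketch of the base approach follows, assuming a tiny hypothetical data frame with id, start and end columns. The names baseDf, avgDf and avgPeriod are illustrative and not taken from the original code.

```r
# small hypothetical sample; the post itself uses 10000 records
df <- data.frame(
  id = c(1L, 1L, 2L, 2L, 3L),
  start = as.Date(c("2014-01-01", "2014-02-01", "2014-01-10",
                    "2014-03-01", "2014-01-05")),
  end = as.Date(c("2014-01-11", "2014-02-21", "2014-01-15",
                  "2014-03-16", "2014-01-12"))
)

# unconditional column: number of days between start and end for each row
baseDf <- transform(df, period = as.numeric(difftime(end, start, units = "days")))

# conditional column: average period by id, computed separately ...
avgDf <- aggregate(period ~ id, data = baseDf, FUN = mean)
names(avgDf)[2] <- "avgPeriod"

# ... and then merged back onto the row-level data
baseDf <- merge(baseDf, avgDf, by = "id")
```

The extra aggregate() and merge() steps are where the additional lines of code come from.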
The plyr package is the most flexible, as it supports not only data frames but also other data types such as lists and arrays. However, as shown later, it can be much slower than the other ways, and its use may not be compelling when a new column depends on existing columns and the number of records is large.
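One way the plyr version might be written is sketched below, assuming the same kind of small hypothetical data frame. ddply() splits the data frame by id, and plyr's mutate() evaluates its arguments in order, so avgPeriod can refer to the period column created just before it. The names plyrDf and avgPeriod are illustrative.

```r
library(plyr)

# small hypothetical sample; the post itself uses 10000 records
df <- data.frame(
  id = c(1L, 1L, 2L, 2L, 3L),
  start = as.Date(c("2014-01-01", "2014-02-01", "2014-01-10",
                    "2014-03-01", "2014-01-05")),
  end = as.Date(c("2014-01-11", "2014-02-21", "2014-01-15",
                  "2014-03-16", "2014-01-12"))
)

# split by id, add both columns within each piece, then recombine
plyrDf <- ddply(df, .(id), mutate,
                period = as.numeric(end - start),
                avgPeriod = mean(period))
```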
The dplyr package is a successor of the plyr package. Instead of supporting various data types, it supports only data frames (and data tables) in order to improve speed. It has some SQL-like functions (e.g. group_by()), so those who know SQL should find it easy to pick up. It also supports the pipe operator (%>%), which chains functions. F# has a similar operator, and people coming from other programming languages may find chaining easier to read than wrapping one or more functions inside another.
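The chained style described above could be sketched as follows, again on a small hypothetical data frame. The names dplyrDf and avgPeriod are illustrative.

```r
library(dplyr)

# small hypothetical sample; the post itself uses 10000 records
df <- data.frame(
  id = c(1L, 1L, 2L, 2L, 3L),
  start = as.Date(c("2014-01-01", "2014-02-01", "2014-01-10",
                    "2014-03-01", "2014-01-05")),
  end = as.Date(c("2014-01-11", "2014-02-21", "2014-01-15",
                  "2014-03-16", "2014-01-12"))
)

dplyrDf <- df %>%
  mutate(period = as.numeric(end - start)) %>%  # unconditional column
  group_by(id) %>%                              # SQL-like grouping
  mutate(avgPeriod = mean(period)) %>%          # conditional column by id
  ungroup()
```

Each step reads from top to bottom, whereas the nested equivalent would have to be read from the innermost call outwards.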
The data.table package is also popular and it seems to be the fastest. One or more columns can be indexed with setkey() to further improve speed. Its syntax seems closer to the functions in the base package than that of the previous two packages. I'm not fully sure, but a drawback of this package might be that it doesn't seem to provide an easy way to just add a new column. That is, the existing column names have to be listed in a list, which becomes tedious when there are many columns. There may be a way that I haven't found and, in the meantime, the next approach can be used for that.
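A minimal data.table sketch follows, assuming a small hypothetical data frame. It uses the package's `:=` operator, which adds a column by reference and so may avoid listing the existing columns by hand. The names dt and avgPeriod are illustrative.

```r
library(data.table)

# small hypothetical sample; the post itself uses 10000 records
df <- data.frame(
  id = c(1L, 1L, 2L, 2L, 3L),
  start = as.Date(c("2014-01-01", "2014-02-01", "2014-01-10",
                    "2014-03-01", "2014-01-05")),
  end = as.Date(c("2014-01-11", "2014-02-21", "2014-01-15",
                  "2014-03-16", "2014-01-12"))
)

dt <- as.data.table(df)
setkey(dt, id)  # index (and sort) by id

# unconditional column, added in place
dt[, period := as.numeric(end - start)]

# conditional column: average period within each id, added in place
dt[, avgPeriod := mean(period), by = id]
```

Because `:=` modifies dt in place, no reassignment is needed and the existing columns are left untouched.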
dplyr + data.table
The dplyr package also supports data tables, and there may be cases where using both packages together is necessary. One such case is mentioned above.
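A rough sketch of the combination is given below, assuming a small hypothetical data frame. It relies only on a data.table also being a data frame, so dplyr verbs can be applied to it. In current releases the dedicated data.table translation lives in the separate dtplyr package, which is not used here. The names dt, bothDf and avgPeriod are illustrative.

```r
library(data.table)
library(dplyr)

# small hypothetical sample; the post itself uses 10000 records
df <- data.frame(
  id = c(1L, 1L, 2L, 2L, 3L),
  start = as.Date(c("2014-01-01", "2014-02-01", "2014-01-10",
                    "2014-03-01", "2014-01-05")),
  end = as.Date(c("2014-01-11", "2014-02-21", "2014-01-15",
                  "2014-03-16", "2014-01-12"))
)

# keyed data table as the starting point
dt <- as.data.table(df)
setkey(dt, id)

# dplyr verbs add the new columns without listing the existing ones
bothDf <- dt %>%
  mutate(period = as.numeric(end - start)) %>%
  group_by(id) %>%
  mutate(avgPeriod = mean(period)) %>%
  ungroup()
```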
The system time taken to obtain the conditional column is recorded and plotted below. As the plyr package takes far longer than the others, it is omitted from the plot. In this example, the data.table package takes the least amount of time, followed by the approaches that use the dplyr package. The functions in the base package are slower, but not by much, and many data manipulation tasks will still rely on them. Finally, the plyr package takes too long, but it still has a reason not to be deprecated: flexibility.
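A comparison of this kind could be set up roughly as below. This is a sketch under stated assumptions and not the original benchmark: the data are simulated, the helper timeIt() and the functions baseWay(), dplyrWay() and dtWay() are hypothetical names, and dplyr and data.table are timed only if they are installed. Actual timings depend on the machine and the package versions.

```r
set.seed(1)  # arbitrary seed, assumed only for reproducibility
n <- 10000
df <- data.frame(
  id = sample(1:500, n, replace = TRUE),
  period = sample(1:100, n, replace = TRUE)
)

# median elapsed seconds over a few runs of a zero-argument function
timeIt <- function(f, times = 5) {
  median(replicate(times, system.time(f())[["elapsed"]]))
}

# base: aggregate by id, then merge the averages back
baseWay <- function() {
  avgDf <- aggregate(period ~ id, data = df, FUN = mean)
  names(avgDf)[2] <- "avgPeriod"
  merge(df, avgDf, by = "id")
}
elapsed <- c(base = timeIt(baseWay))

if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyrWay <- function() {
    dplyr::ungroup(dplyr::mutate(dplyr::group_by(df, id),
                                 avgPeriod = mean(period)))
  }
  elapsed <- c(elapsed, dplyr = timeIt(dplyrWay))
}

if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dtWay <- function() {
    dt <- as.data.table(df)  # conversion cost is included in the timing
    dt[, avgPeriod := mean(period), by = id]
  }
  elapsed <- c(elapsed, data.table = timeIt(dtWay))
}

print(elapsed)
barplot(elapsed, ylab = "median elapsed time (seconds)")
```

Taking the median over several runs reduces the effect of one-off delays, which matters when individual timings are only a few milliseconds.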