How to improve speed of R code by replacing complex and slow plyr steps with data.table or dplyr?

Question

I've been learning R by searching how others have done things on Stackoverflow and as a result, I've become acquainted with plyr syntax. I have the following four plyr calls with ddply that are the rate limiting steps to my code. My data approaches the hundreds of thousands to millions of records and my code mostly flies thanks to data.table and is only limited by four rate limiting but critical plyr steps. I would like to replace these with dplyr or data.table, but I've been struggling to replicate the syntax and would appreciate any help.

1. mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize, spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(cat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))

2. webuniq_test <- ddply(webuniq, c("df.activ.id"),summarise, strRM = paste(replicate(RMCount, "RM"), collapse = ","))

3. webactiv2 <- ddply(webactiv, .(VisitorID), summarize, VisitorPath = paste(Path, collapse = ","), RMpath = paste(strRM, collapse = ","), ConvTot=sum(Conv), Conv2Tot=sum(Conv2), Cov3Tot=sum(Conv3)) #check that nrow dec

4. MeltForSO3 <- ddply(MeltForSO2, c("VisitorID","ID"),summarise, SplitThis = paste(value, collapse = ","))

For (1) here is the benchmark:

#user  system elapsed 
#378.463   3.136 383.786

Here's what I'm trying to accomplish in these steps (they are similar):

They involve the aggregation of data by an ID field or ID field
Aggregation of granular character fields occurs with paste and collapse. For example, a field might be a driver's stops as he drops off packages where a "stops" field would have values 'a', 'b', 'c' for each stop. Plyr with stops_path = paste(stops, collapse = ",") will give aggregate those steps to a single line as "a,b,c"
Numeric data is sometimes summed up in the same aggregation step, for example ConvTot=sum(Conv)

I've been unsuccessful in my attempts to replicate this with either dplyr or data.table.

Is there an advantage to using one over the other for these types of aggregations? I've taken a look at this and it seems that data.table might be better for my very simple use case because of the cleaner syntax: data.table vs dplyr: can one do something well the other can't or does poorly?

Here's my unsuccessful attempt to replicate (1) above with data.table:

setkey(setDT(mergeddf2),df.activ.id, MarketingChannel)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ",")), by=list(df.activ.id,Channel)]

This threw an error: unused argument (by = list(df.activ.id, Channel))I wrote that from code off I researched on SO on how to incorporate paste into data.table. I took out the by argument just to see what would happen, and got another error with the line below:

mergeddf3test <- mergeddf2[, list(spotsids = paste(spotID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))]

The error was "Error in paste(spotID, collapse = ",") : object 'spotID' not found" which is weird because that field is definitely in the data. I thought this data.table line would properly aggregate the data with the by fields (df.activ.id and Channel) and combine the character fields as with the (a,b,c) example above.

Clearly I'll need to properly learn the syntax of either dplyr or data.table given the scale of data I'm working with so I've signed up for the data camp class for both packages. Still, I would appreciate any help on how to approach this in the immediate term.

Thanks!

When asking for help you should provide a reproducible example with sample input and desired output. If speed is an issue, be sure to include benchmarks you've performed so we can compare alternative solutions. Also, it's better to describe in words what you are trying to do rather than just asking for code to be translated from one package to another. — MrFlick
Thanks for the comment. You're right, I didn't give a reproducible sample (I normally do). I didn't because I figured that it was unnecessary since I'm just trying to figure out how to use paste and sum with another package. I do think I did a good job of explaining with words what I've tried to do and provided code for my un-successful attempts. I will update the post with the benchmarks. — leaRningR909
The data camp classes are a great place to start. Without a reproducible example my best guess is: for some reason, setDT is not actually setting as a data.table. You could check with str(mergeddf2). I can't imagine any other reason for by and column names like spotID not being found. Personally, I think everything before "Here's my attempt" in your question could go -- you should stick to a single programming problem on this site, as that's what works best for the Q&A format. — Frank

shadow shadow · Accepted Answer · 2015-05-21T08:46:36

Your replication with data.table works for me (except that channel is capitalized). Below are my attempts to replicate the first step of your list with dplyr and with data.table.

# required packages
require(plyr)
require(dplyr)
require(data.table)

sample data

mergeddf2 <- data.frame(df.activ.id = 1:5, 
                        channel = 1:8, 
                        mainID = 1:40, 
                        DateTime = Sys.Date() - 80:1, 
                        cat = letters[1:6], 
                        effrespflag = rnorm(240), 
                        othervar = 1, 
                        MarketingChannel = 2)

plyr solution

mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize, 
                   spotsids = paste(mainID, collapse = ","), 
                   spotsdt = paste(DateTime, collapse = ","), 
                   spotsinfos = paste(cat, collapse = ","), 
                   effrespflags = paste(effrespflag, collapse = ","))

dplyr solution

mergeddf3.dplyr <- 
  mergeddf2 %>% 
  group_by(df.activ.id, channel) %>%
  summarise_each(funs = funs(paste(., collapse = ",")), mainID, DateTime, cat, effrespflag) %>%
  magrittr::set_colnames(c("df.activ.id", "channel", "spotsids", "spotsdt", "spotsinfos", "effrespflags")) 
# check for equality
all.equal(mergeddf3, as.data.frame(mergeddf3.dplyr))
## [1] TRUE

data.table solution

setDT(mergeddf2)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","), 
                                  spotsdt = paste(DateTime, collapse = ","), 
                                  spotsinfos = paste(cat, collapse = ","), 
                                  effrespflags = paste(effrespflag, collapse = ",")),
                           by=list(df.activ.id,channel)] 
# check for equality
all.equal(mergeddf3, setDF(setkeyv(mergeddf3test, c("df.activ.id", "channel"))))
## [1] TRUE