I've been learning R by searching for how others have done things on Stack Overflow, and as a result I've become acquainted with plyr syntax. The following four ddply calls are the rate-limiting steps in my code. My data ranges from hundreds of thousands to millions of records, and the rest of my code mostly flies thanks to data.table. I would like to replace these four calls with dplyr or data.table, but I've been struggling to replicate the syntax and would appreciate any help.
1. mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize, spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(cat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))
2. webuniq_test <- ddply(webuniq, c("df.activ.id"),summarise, strRM = paste(replicate(RMCount, "RM"), collapse = ","))
3. webactiv2 <- ddply(webactiv, .(VisitorID), summarize, VisitorPath = paste(Path, collapse = ","), RMpath = paste(strRM, collapse = ","), ConvTot=sum(Conv), Conv2Tot=sum(Conv2), Cov3Tot=sum(Conv3)) #check that nrow dec
4. MeltForSO3 <- ddply(MeltForSO2, c("VisitorID","ID"),summarise, SplitThis = paste(value, collapse = ","))
For (1) here is the benchmark:
#user system elapsed
#378.463 3.136 383.786
Here's what I'm trying to accomplish in these steps (they are similar):
- They aggregate data by one or more ID fields
- Granular character fields are aggregated with paste and collapse. For example, a field might be a driver's stops as he drops off packages, where a "stops" field would have the values 'a', 'b', 'c' for each stop. Plyr with
stops_path = paste(stops, collapse = ",")
will aggregate those stops to a single line as "a,b,c"
- Numeric data is sometimes summed up in the same aggregation step, for example
ConvTot=sum(Conv)
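As a minimal sketch of the pattern described above, here is how the paste/collapse and sum aggregation might look in data.table. The table stops_dt and its columns driver_id, stops and Conv are hypothetical toy data made up for illustration, not the real tables from my code.

```r
library(data.table)

# Hypothetical toy data: one row per stop, several stops per driver.
stops_dt <- data.table(
  driver_id = c(1, 1, 1, 2, 2),
  stops     = c("a", "b", "c", "d", "e"),
  Conv      = c(1, 0, 2, 1, 1)
)

# j (second argument) builds the summary columns,
# by (third argument) defines the groups.
stops_agg <- stops_dt[, .(
  stops_path = paste(stops, collapse = ","),  # collapse character values per group
  ConvTot    = sum(Conv)                      # sum numeric values per group
), by = driver_id]

print(stops_agg)
```

Grouping by more than one field should only need by = .(field1, field2) in place of the single column name.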
I've been unsuccessful in my attempts to replicate this with either dplyr or data.table.
Is there an advantage to using one over the other for these types of aggregations? I've looked at the question "data.table vs dplyr: can one do something well the other can't or does poorly?", and it seems that data.table might be better for my very simple use case because of the cleaner syntax.
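For comparison, a rough dplyr sketch of this kind of aggregation could look like the following. The data frame visits is hypothetical toy data loosely modelled on call (3). One assumption worth flagging: if plyr is attached after dplyr, plyr's summarise masks dplyr's and the grouping is ignored, so the sketch calls dplyr::summarise explicitly.

```r
library(dplyr)

# Hypothetical toy data: one row per page view.
visits <- data.frame(
  VisitorID = c("v1", "v1", "v2"),
  Path      = c("home", "cart", "home"),
  Conv      = c(0, 1, 0),
  stringsAsFactors = FALSE
)

visit_summary <- visits %>%
  group_by(VisitorID) %>%
  dplyr::summarise(
    VisitorPath = paste(Path, collapse = ","),  # collapse character values per visitor
    ConvTot     = sum(Conv)                     # sum numeric values per visitor
  )

print(visit_summary)
```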
Here's my unsuccessful attempt to replicate (1) above with data.table:
setkey(setDT(mergeddf2),df.activ.id, MarketingChannel)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ",")), by=list(df.activ.id,Channel)]
This threw an error: unused argument (by = list(df.activ.id, Channel))
I based that on code I found on SO showing how to use paste inside data.table. I took out the by argument just to see what would happen, and the line below gave another error:
mergeddf3test <- mergeddf2[, list(spotsids = paste(spotID, collapse = ","), spotsdt = paste(DateTime, collapse = ","), spotsinfos = paste(tvcat, collapse = ","), effrespflags = paste(effrespflag, collapse = ","))]
The error was "Error in paste(spotID, collapse = ",") : object 'spotID' not found"
which is weird because that field is definitely in the data. I thought this data.table line would properly aggregate the data with the by fields (df.activ.id and Channel) and combine the character fields as with the (a,b,c) example above.
Clearly I'll need to properly learn the syntax of either dplyr or data.table given the scale of data I'm working with, so I've signed up for the DataCamp classes for both packages. Still, I would appreciate any help on how to approach this in the immediate term.
Thanks!
setDT is not actually setting as a data.table. You could check with str(mergeddf2). I can't imagine any other reason for by and column names like spotID not being found. Personally, I think everything before "Here's my attempt" in your question could go; you should stick to a single programming problem on this site, as that's what works best for the Q&A format. – Frank
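The check suggested in the comment can be sketched as below. The data frame df is a hypothetical stand-in for mergeddf2. On a plain data.frame the bracket operator has no by argument and does not look up column names inside j, which would be consistent with both errors in the question; this is an illustration under that assumption, not a confirmed diagnosis of the original data.

```r
library(data.table)

# Hypothetical stand-in for mergeddf2, still a plain data.frame.
df <- data.frame(
  id     = c(1, 1, 2),
  spotID = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

was_dt <- is.data.table(df)  # FALSE here: data.table syntax will not apply yet

# Capture the error message raised when data.table syntax hits a data.frame.
res <- tryCatch(
  df[, list(ids = paste(spotID, collapse = ",")), by = list(id)],
  error = function(e) conditionMessage(e)
)
print(res)

# Convert by reference, confirm the class, then aggregate.
setDT(df)
stopifnot(is.data.table(df))
out <- df[, .(ids = paste(spotID, collapse = ",")), by = id]
print(out)
```

Running class(mergeddf2) or is.data.table(mergeddf2) immediately before the failing line should show whether the object is still a data.table at that point.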