
I have been researching this for a while and can't find the cause. I use dplyr regularly, but all of a sudden I am getting odd output from the group_by/summarise combination.

I have a large dataset and I am trying to summarize it using the following:

dataAgg <- dataRed %>% group_by(ClmNbr, SnapshotDay, Pre2016) %>%
  filter(SnapshotDay == '30'| SnapshotDay == '90') %>%
  summarise(
    NumFeat = sum(FeatureNbr),
    TotInc = sum(IncSnapshotDay),
    TotDelta = sum(InctoFinal),
    TotPaid = sum(FinalPaid)
  )

The setup of the data frame is below:

'data.frame':   123819 obs. of  8 variables:
 $ ClmNbr        : Factor w/ 33617 levels "14-00765132",..: 2162 2163 2163 2164 1842 2287 27 27 27 28 ...
 $ SnapshotDay   : Factor w/ 3 levels "7","30","90": 1 1 1 1 1 1 1 1 1 1 ...
 $ Pre2016       : Factor w/ 2 levels "Post2016","Pre2016": 2 2 2 2 2 2 2 2 2 2 ...
 $ FeatureNbr    : int  6 2 3 3 6 2 4 5 6 5 ...
 $ IncSnapshotDay: num  5000 77 5000 4500 77 2200 1800 1100 1800 25000 ...
 $ FinalPaid     : num  442 0 15000 5000 0 ...
 $ InctoFinal    : num  -4558 -77 10000 500 -77 ...
 $ TimeDelta     : num  25.833 2.833 2.833 0.833 1.833 ...

When I execute the code, I get 1 obs. of 4 variables; there is no grouping applied.

'data.frame':   1 obs. of  4 variables:
 $ NumFeat : int 287071
 $ TotInc  : num NA
 $ TotDelta: num NA
 $ TotPaid : num 924636433

I used to do this all the time without problems.

I could use aggregate, but I sometimes mix and match functions depending on the column, so it does not always work.

What am I doing wrong?

Could you have loaded plyr before dplyr? - aosmith
@aosmith: you meant plyr "after" dplyr, right? - Tung
@Tung Yes, "after" is what might cause the problem. :-D - aosmith
No, I am not using plyr at all, unless it somehow gets cached, but I frequently clean the global environment. - Bryan Butler
What does sessionInfo() show? When asking for help, you should include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. A str() isn't as helpful as a dput() for testing. - MrFlick

1 Answer


After a bit of research and some experimentation, it turns out that the order in which the libraries are loaded matters. The original order was the following:

library(RODBC)
library(dplyr)
library(DT)
library(reshape2)
library(ggplot2)
library(scales)
library(caret)
library(markovchain)
library(knitr)
library(Metrics)
library(RColorBrewer)

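However, ggplot2 loads in plyr as a dependency, and when plyr ends up attached after dplyr, plyr's summarise masks dplyr's. The plyr version ignores the grouping, which is why the result collapses to a single row. To avoid this, the order should be revised so that dplyr is loaded last, which is what I used to do.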

library(RODBC)
library(DT)
library(reshape2)
library(ggplot2)
library(scales)
library(caret)
library(markovchain)
library(knitr)
library(Metrics)
library(RColorBrewer)
library(dplyr)
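
A quick way to check which package a bare summarise call resolves to is sketched below. It is only a diagnostic under the assumption that dplyr is installed; environmentName(), environment() and find() are base/utils functions, and nothing here comes from the original script.

```r
library(dplyr)

# Name of the package whose `summarise` is found first on the search path.
# If this is anything other than "dplyr", another package is masking it.
owner <- environmentName(environment(summarise))
print(owner)

# Every attached package that defines `summarise`, in search-path order.
# The first entry is the one a bare `summarise(...)` call will use.
print(find("summarise"))
```

Running this right after the library() calls shows whether the load order has left plyr's version in front of dplyr's.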

Alternatively, as in Python, you can name the package explicitly when calling a function. In Python, a library is imported with the following syntax:

import numpy as np

Any numpy command is then referenced with the np. prefix, as in np.array(). The R equivalent is the package::function syntax.

Adding dplyr:: to the commands fixes the problem as shown below.

dataAgg <- dataRed %>% dplyr::group_by(ClmNbr, SnapshotDay, Pre2016) %>%
  dplyr::filter(SnapshotDay == '30'| SnapshotDay == '90') %>%
  dplyr::summarise(
    NumFeat = sum(FeatureNbr),
    TotInc = sum(IncSnapshotDay),
    TotDelta = sum(InctoFinal),
    TotPaid = sum(FinalPaid)
  )
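
The difference between the two summarise functions can be reproduced on a small scale. The sketch below uses a made-up data frame (toy, with invented values) in place of dataRed and assumes both dplyr and plyr are installed. plyr is called through :: only, so it is never attached.

```r
library(dplyr)

# Hypothetical toy data standing in for dataRed (values are invented)
toy <- data.frame(
  ClmNbr      = c("A", "A", "B", "B"),
  SnapshotDay = factor(c("30", "90", "30", "7"), levels = c("7", "30", "90")),
  FinalPaid   = c(100, 200, 50, 10)
)

grouped <- toy %>%
  dplyr::filter(SnapshotDay %in% c("30", "90")) %>%
  dplyr::group_by(ClmNbr, SnapshotDay)

# dplyr's summarise respects the grouping: one row per group
by_group <- grouped %>% dplyr::summarise(TotPaid = sum(FinalPaid))
nrow(by_group)   # 3

# plyr's summarise ignores the grouping: everything collapses to one row
collapsed <- grouped %>% plyr::summarise(TotPaid = sum(FinalPaid))
nrow(collapsed)  # 1
```

The single-row result from plyr::summarise has the same shape as the 1 obs. output in the question, which is consistent with a masked summarise being the cause.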