1
votes

I have a data table and I get duplicated rows when using the summarise function combined with group_by.

I will give a simplified example of my problem. First, I am using group_by and mutate to add the sum of 'value' for each id.

dt <- data.table(id = rep(1:5, each=10), cpc = rep((0.1*seq(5)), each=2), value = 1:50)

dt2 <- dt %>%
  group_by(id) %>%
  mutate(SumValue = sum(value))


Source: local data table [50 x 4]

   id cpc value SumValue
1   1 0.1     1       55
2   1 0.1     2       55
3   1 0.2     3       55
4   1 0.2     4       55
5   1 0.3     5       55
6   1 0.3     6       55
7   1 0.4     7       55
8   1 0.4     8       55
9   1 0.5     9       55
10  1 0.5    10       55
.. .. ...   ...      ...

So far, nothing wrong. But after that, when I do group_by for each id,cpc combination and use summarise, the output is not as I expected. The numbers are correct, but there are duplicated rows.

dt2 %>%
  group_by(id, cpc) %>%
  summarise(count = n(), SumValue = SumValue)

Source: local data table [50 x 4]
Groups: id

   id cpc count SumValue
1   1 0.1     2       55
2   1 0.1     2       55
3   1 0.2     2       55
4   1 0.2     2       55
5   1 0.3     2       55
6   1 0.3     2       55
7   1 0.4     2       55
8   1 0.4     2       55
9   1 0.5     2       55
10  1 0.5     2       55
.. .. ...   ...      ...

Using unique() gives the desired result, but I suppose that this is not necessary.

dt2 %>%
  group_by(id, cpc) %>%
  summarise(count = n(), SumValue = SumValue) %>%
  unique()

Source: local data table [25 x 4]
Groups: id

   id cpc count SumValue
1   1 0.1     2       55
2   1 0.2     2       55
3   1 0.3     2       55
4   1 0.4     2       55
5   1 0.5     2       55
6   2 0.1     2      155
7   2 0.2     2      155
8   2 0.3     2      155
9   2 0.4     2      155
10  2 0.5     2      155
.. .. ...   ...      ...

I thought that group_by sets group when add=FALSE, so I don't know why the duplicated rows are emerging.

2
Try dt2 %>% group_by(id, cpc) %>% summarise(count=n(), SumValue=SumValue[1L]) . In the dt2, you created SumValue using mutate, So there must be mutliple rows that have the same SumValue for each id - akrun
Thanks akrun, your solution seems to work without using unique(). Now I see that the problem is the SumValue=SumValue part. - Mizuhirato

2 Answers

2
votes

I'm not sure the exact problem, but when you assign a vector to a summarize variable, you usually need pick one element of the vector. So in this case you need to pick which value of "SumValue" to assign to a the summarised "SumValue". I usually just do this with variable[1], but first(variable) might be better. Does this give you the result you're looking for?

dt <- data_frame(id = rep(1:5, each=10), 
                 cpc = rep((0.1*seq(5)), each=10), 
                 value = 1:50)
dt2 <- dt %>%
  group_by(id) %>%
  mutate(SumValue = sum(value))
dt2

dt2 %>%
  group_by(id, cpc) %>%
  summarise(count = n(), SumValue = first(SumValue))
0
votes

Try using unique in this way. Not exactly sure if this will help.

R> df <- data.frame(a=c(1,2,3,4,3,2), b=c(4,5,6,6,4,3)) 
R> unique(unlist(df)) 
[1] 1 2 3 4 5 6