1
votes

I have a data table and I get duplicated rows when using the summarise function combined with group_by.

I will give a simplified example of my problem. First, I am using group_by and mutate to add the sum of 'value' for each id.

dt <- data.table(id = rep(1:5, each=10), cpc = rep((0.1*seq(5)), each=2), value = 1:50)

dt2 <- dt %>%
  group_by(id) %>%
  mutate(SumValue = sum(value))


Source: local data table [50 x 4]

   id cpc value SumValue
1   1 0.1     1       55
2   1 0.1     2       55
3   1 0.2     3       55
4   1 0.2     4       55
5   1 0.3     5       55
6   1 0.3     6       55
7   1 0.4     7       55
8   1 0.4     8       55
9   1 0.5     9       55
10  1 0.5    10       55
.. .. ...   ...      ...

So far, nothing wrong. But after that, when I do group_by for each id,cpc combination and use summarise, the output is not as I expected. The numbers are correct, but there are duplicated rows.

dt2 %>%
  group_by(id, cpc) %>%
  summarise(count = n(), SumValue = SumValue)

Source: local data table [50 x 4]
Groups: id

   id cpc count SumValue
1   1 0.1     2       55
2   1 0.1     2       55
3   1 0.2     2       55
4   1 0.2     2       55
5   1 0.3     2       55
6   1 0.3     2       55
7   1 0.4     2       55
8   1 0.4     2       55
9   1 0.5     2       55
10  1 0.5     2       55
.. .. ...   ...      ...

Using unique() gives the desired result, but I suppose that this is not necessary.

dt2 %>%
  group_by(id, cpc) %>%
  summarise(count = n(), SumValue = SumValue) %>%
  unique()

Source: local data table [25 x 4]
Groups: id

   id cpc count SumValue
1   1 0.1     2       55
2   1 0.2     2       55
3   1 0.3     2       55
4   1 0.4     2       55
5   1 0.5     2       55
6   2 0.1     2      155
7   2 0.2     2      155
8   2 0.3     2      155
9   2 0.4     2      155
10  2 0.5     2      155
.. .. ...   ...      ...

I thought that group_by sets group when add=FALSE, so I don't know why the duplicated rows are emerging.

2
Try dt2 %>% group_by(id, cpc) %>% summarise(count=n(), SumValue=SumValue[1L]) . In the dt2, you created SumValue using mutate, So there must be mutliple rows that have the same SumValue for each idakrun
Thanks akrun, your solution seems to work without using unique(). Now I see that the problem is the SumValue=SumValue part.Mizuhirato

2 Answers

2
votes

I'm not sure the exact problem, but when you assign a vector to a summarize variable, you usually need pick one element of the vector. So in this case you need to pick which value of "SumValue" to assign to a the summarised "SumValue". I usually just do this with variable[1], but first(variable) might be better. Does this give you the result you're looking for?

dt <- data_frame(id = rep(1:5, each=10), 
                 cpc = rep((0.1*seq(5)), each=10), 
                 value = 1:50)
dt2 <- dt %>%
  group_by(id) %>%
  mutate(SumValue = sum(value))
dt2

dt2 %>%
  group_by(id, cpc) %>%
  summarise(count = n(), SumValue = first(SumValue))
0
votes

Try using unique in this way. Not exactly sure if this will help.

R> df <- data.frame(a=c(1,2,3,4,3,2), b=c(4,5,6,6,4,3)) 
R> unique(unlist(df)) 
[1] 1 2 3 4 5 6