1
votes

I have a dataframe df :

    id year groupid
1   A1 2000      G1
2   A1 2000      G1
3   A1 2000      G1
4   A2 2000      G2
5   A1 2001      G1
6  A12 2001      G1
7  A13 2001      G1
8   A3 2001      G2
9  A33 2001      G2
10  A4 2001      G3
11  A4 2002      G3
12  A5 2002      G3
13  A5 2003      G2
14  A6 2003      G4

What I would like to do is to use setdiff() between values with the same groupid of two consecutive years.

Example :

For the year 2000, G1 has one id : A1. For the year 20001, G1 has three different id : A1, A12, A13. So when setdiff() is applied between those two, it will return 2. Between the year 2001 and 2002, since G1 is not present in the year 2002, the value 0 will be given. For the same groupid, if two years are not consecutive, the value given will be either 0 if the group is not present in the second year or the number of the different id.

Expected results :

   year groupid newid
1  2000    G1    1
2  2000    G2    1
3  2000    G3    0
4  2000    G4    0
5  2001    G1    2
6  2001    G2    2
7  2001    G3    1
8  2001    G4    0
9  2002    G1    0
10 2002    G2    0
11 2002    G3    2
12 2002    G4    0
13 2003    G1    0
14 2003    G2    1
15 2003    G3    0
16 2003    G4    1

I have done this with a for loop, some if() and dplyr functions but since I have a lot of rows, it took a bit too much time (around 5 minutes). So I'm searching to replace the loop by some dplyr functions or data.table functions to do this task if less time.

Data :

structure(list(id = c("A1", "A1", "A1", "A2", "A1", "A12", "A13", 
"A3", "A33", "A4", "A4", "A5", "A5", "A6"), year = c(2000, 2000, 
2000, 2000, 2001, 2001, 2001, 2001, 2001, 2001, 2002, 2002, 2003, 
2003), groupid = c("G1", "G1", "G1", "G2", "G1", "G1", "G1", 
"G2", "G2", "G3", "G3", "G3", "G2", "G4")), .Names = c("id", 
"year", "groupid"), row.names = c(NA, -14L), class = "data.frame")

EDIT : Modified the example

1
Can you post your solution as well please?Sotos
if 'G1_2001' would be c("A","B") and G1_2002 would be c("B","C") you want the function to return 1, right?Janna Maas
@JannaMaas yeah the order doesn't matter with setdiff()Omlere
I removed my answer since I misunderstood your question. Could you please change your example to illustrate such overlaps between consecutive years? Your current one doesn'tAurèle
@Aurèle I don't get what you're saying. What part did you not understand ?Omlere

1 Answers

4
votes

This solution is a combination of my initial one using tidyr and dplyr, and of @jogo's (now deleted) answer using base R aggregate(drop = FALSE):

df <- read.table(header = T, stringsAsFactors = F, text = 
"    id year groupid
1   A1 2000      G1
2   A1 2000      G1
3   A1 2000      G1
4   A2 2000      G2
5   A1 2001      G1
6  A12 2001      G1
7  A13 2001      G1
8   A3 2001      G2
9  A33 2001      G2
10  A4 2001      G3
11  A4 2002      G3
12  A5 2002      G3
13  A5 2003      G2
14  A6 2003      G4")

library(dplyr)
df %>% 
  aggregate(id ~ year + groupid, ., unique, drop = FALSE) %>% 
  group_by(groupid) %>% 
  arrange(year) %>% 
  mutate(new_ids = mapply(setdiff, id, lag(id)),
         newid = lapply(new_ids, length)) %>% 
  ungroup() %>% 
  arrange(year, groupid) %>% 
  as.data.frame()

#    year groupid           id  new_ids newid
# 1  2000      G1           A1       A1     1
# 2  2000      G2           A2       A2     1
# 3  2000      G3                           0
# 4  2000      G4                           0
# 5  2001      G1 A1, A12, A13 A12, A13     2
# 6  2001      G2      A3, A33  A3, A33     2
# 7  2001      G3           A4       A4     1
# 8  2001      G4                           0
# 9  2002      G1                           0
# 10 2002      G2                           0
# 11 2002      G3       A4, A5       A5     1
# 12 2002      G4                           0
# 13 2003      G1                           0
# 14 2003      G2           A5       A5     1
# 15 2003      G3                           0
# 16 2003      G4           A6       A6     1  

Edit: explanations, answer to comment:

  • The , ., part is to say df should be the second argument to aggregate(), not the first (see help("%>%")). But really it was to make it shorter, you'd be better off using a longer and more robust form with named arguments, like: aggregate(formula = id ~ year + groupid, data = ., FUN = unique, drop = FALSE).

  • The added group_by() is necessary to make the use of lag() consistent. Without groups, we'd take the risk that the ids of 2000, G2 are compared to those of 2003, G1 (consecutive lines after aggregate()). What we really want is setdiff(..., lag(... to happen inside a groupid group, with rows ordered by year (hence the probably unnecessary arrange(year)). The lack of this group_by() makes no difference on this example, but would probably make one on your real life data.