I have a dataframe df
:
id year groupid
1 A1 2000 G1
2 A1 2000 G1
3 A1 2000 G1
4 A2 2000 G2
5 A1 2001 G1
6 A12 2001 G1
7 A13 2001 G1
8 A3 2001 G2
9 A33 2001 G2
10 A4 2001 G3
11 A4 2002 G3
12 A5 2002 G3
13 A5 2003 G2
14 A6 2003 G4
What I would like to do is to use setdiff()
between values with the same groupid
of two consecutive years.
Example :
For the year 2000, G1
has one id
: A1
. For the year 20001, G1
has three different id
: A1
, A12
, A13
. So when setdiff()
is applied between those two, it will return 2
. Between the year 2001 and 2002, since G1
is not present in the year 2002, the value 0
will be given. For the same groupid
, if two years are not consecutive, the value given will be either 0
if the group is not present in the second year or the number of the different id
.
Expected results :
year groupid newid
1 2000 G1 1
2 2000 G2 1
3 2000 G3 0
4 2000 G4 0
5 2001 G1 2
6 2001 G2 2
7 2001 G3 1
8 2001 G4 0
9 2002 G1 0
10 2002 G2 0
11 2002 G3 2
12 2002 G4 0
13 2003 G1 0
14 2003 G2 1
15 2003 G3 0
16 2003 G4 1
I have done this with a for loop
, some if()
and dplyr
functions but since I have a lot of rows, it took a bit too much time (around 5 minutes). So I'm searching to replace the loop by some dplyr
functions or data.table
functions to do this task if less time.
Data :
structure(list(id = c("A1", "A1", "A1", "A2", "A1", "A12", "A13",
"A3", "A33", "A4", "A4", "A5", "A5", "A6"), year = c(2000, 2000,
2000, 2000, 2001, 2001, 2001, 2001, 2001, 2001, 2002, 2002, 2003,
2003), groupid = c("G1", "G1", "G1", "G2", "G1", "G1", "G1",
"G2", "G2", "G3", "G3", "G3", "G2", "G4")), .Names = c("id",
"year", "groupid"), row.names = c(NA, -14L), class = "data.frame")
EDIT : Modified the example
c("A","B")
andG1_2002
would bec("B","C")
you want the function to return 1, right? – Janna Maas