1
votes

I have a dataset of like the following one. With subjects that join groups over time


df <- data.frame("group" = c("A", "A", "A+1","A+1", "A+1", "B","B+1","B+1"), 
                 "id" = c("id1", "id2", "id1", "id2", "id3", "id5","id5","id1"), 
                 "time" = c(1,1,3,3,3,2,5,5),
                 "Val" = c(10,10,10,10,10,12,12,12),
                  "groupid" = c("A", "A", "A","A", "A", "B","B","B"))

My final aim is to add all of the values grouping by id.

My problem is that as a new subject join groups, for instance, group "A" becomes Group "A+1" when subject "id3" joins at time 3, all of my observations are duplicated ("A+1" include "id1" and "id2". Hence I cannot do my sum.

In sum, I would like to remove duplicated "ids" based on "Group id" identifiers.
the final df should look like this


final <- data.frame("group" = c("A", "A", "A+1", "B","B+1"), 
                 "id" = c("id1", "id2", "id3", "id5","id1"), 
                 "time" = c(1,1,3,2,5),
                 "Val" = c(10,10,10,12,12),
                 "groupid" = c("A", "A", "A", "B","B"))

I thank you a lot in advance for your help


Follow up question

I also have the opposite case whereby some subject leave the group over time. The dataset looks something like this:

df2 <- data.frame("group" = c("A", "A", "A","A_1", "A_1", "B","B","B_1"), 
                 "id" = c("id1", "id2", "id3", "id2", "id3", "id5","id1","id1"), 
                 "time" = c(1,1,1,3,3,2,2,5),
                 "Val" = c(10,10,10,10,10,12,12,12),
                 "groupid" = c("A", "A", "A","A", "A", "B","B","B"))

In this case, one subject "id1" leaves the "group A" that becomes group "A_1", where only "id2" and "id3" are members. Similarly "id5" leaves group B that becomes "B_1" with only id1 as a member.

What I would like to have in the final dataset is an opposite type of groups identification that should look something like this

final2 <- data.frame("group" = c("A", "A", "A","A_1", "B","B","B_1"), 
                     "id" = c("id1", "id2", "id3", "id1", "id5","id1","id5"), 
                     "time" = c(1,1,1,3,2,2,5),
                     "Val" = c(10,10,10,10,12,12,12),
                     "groupid" = c("A", "A", "A","A", "B","B","B"))

Whereby "A_1" and "B_1" only indicate the subjects, "id1" and "id5" respectively, that has left the group, rather than the remaining ones.

Any suggestions on this approach?

I thank you in advance for your help

1
Done, sorry. you are right. The issue is that the process goes both ways. i.e. i have cases were an individual (e.g. "id2", leaves the group (i.e. group "A-1") . Ideally I would also like to isolate those cases. - Alex
I posted a solution for it - akrun

1 Answers

2
votes

Here is one option. After grouping by 'id', remove the substring of 'group starting from the + followed by one or more numbers (\\d+) till the end ($) of the string, apply duplicated on the substring to create a logical vector inside filter for removing the rows that are duplicated

library(dplyr)
df %>% 
   group_by(id) %>% 
   filter(!duplicated(str_remove(group, "\\+\\d+$")))
# A tibble: 5 x 5
# Groups:   id [4]
#  group id     time   Val groupid
#  <fct> <fct> <dbl> <dbl> <fct>  
#1 A     id1       1    10 A      
#2 A     id2       1    10 A      
#3 A+1   id3       3    10 A      
#4 B     id5       2    12 B      
#5 B+1   id1       5    12 B