I have a df with longitudinal data. Each id has several measurements of seconds, but some of them are duplicates. I want to remove the duplicates per id in the seconds column, but I want to remove the duplicates that have NA in the measurement column, as the rows that have a value in the measurements value contain other information that I need to keep. So basically I need to identify the duplicates, and then remove the ones with NA in another column. Edit: But the problem is, that there are also duplicates where the measurement rows are not NA but e.g. 1 and 2. Then I need to remove the 2, as measurement 1 includes the relevant other information. Also sometimes there are duplicates that both have NA in measurement.
My df looks somewhat like this:
id | measurement | seconds | other relevant information |
---|---|---|---|
A | 1 | 5000 | blue |
A | NA | 5000 | NA |
A | 2 | 4000 | NA |
B | 1 | 3400 | red |
B | 2 | 5000 | NA |
B | NA | 5000 | NA |
C | NA | 3000 | NA |
C | 1 | 3000 | blue |
D | 1 | 2000 | green |
D | 2 | 2000 | NA |
D | 3 | 1000 | NA |
D | NA | 4000 | NA |
D | NA | 4000 | NA |
And I need it to be like this after:
id | measurement | seconds | ... |
---|---|---|---|
A | 1 | 5000 | blue |
A | 2 | 4000 | NA |
B | 1 | 3400 | red |
B | 2 | 5000 | NA |
C | 1 | 3000 | blue |
D | 1 | 2000 | green |
D | 3 | 1000 | NA |
D | NA | 4000 | NA |
I know there is probably a pretty straight forward dplyr solution, and I tried this so far:
df <- df %>% group_by(id, seconds) %>% filter(n() > 1)
and this:
df <- df %>%
group_by(id) %>%
filter(measurement==NA & duplicates(seconds)
So basically the if else command I need is:
if "seconds" = duplicate & one measurement row == na -> delete the na row
if "seconds" = duplicate & measurement == not na -> keep seconds = 1 row
if "seconds" = duplicate & both measurement == na -> delete randomly one
NA
inmeasurement
and it is not a duplicate? Can you include such condition in the example? – Ronak Shah