2
votes

I have a df with longitudinal data. Each id has several measurements of seconds, but some of them are duplicates. I want to remove the duplicates per id in the seconds column, but I want to remove the duplicates that have NA in the measurement column, as the rows that have a value in the measurements value contain other information that I need to keep. So basically I need to identify the duplicates, and then remove the ones with NA in another column. Edit: But the problem is, that there are also duplicates where the measurement rows are not NA but e.g. 1 and 2. Then I need to remove the 2, as measurement 1 includes the relevant other information. Also sometimes there are duplicates that both have NA in measurement.

My df looks somewhat like this:

id measurement seconds other relevant information
A 1 5000 blue
A NA 5000 NA
A 2 4000 NA
B 1 3400 red
B 2 5000 NA
B NA 5000 NA
C NA 3000 NA
C 1 3000 blue
D 1 2000 green
D 2 2000 NA
D 3 1000 NA
D NA 4000 NA
D NA 4000 NA

And I need it to be like this after:

id measurement seconds ...
A 1 5000 blue
A 2 4000 NA
B 1 3400 red
B 2 5000 NA
C 1 3000 blue
D 1 2000 green
D 3 1000 NA
D NA 4000 NA

I know there is probably a pretty straight forward dplyr solution, and I tried this so far:

df <- df %>% group_by(id, seconds) %>% filter(n() > 1)

and this:

df <- df %>%
   group_by(id) %>%
   filter(measurement==NA & duplicates(seconds)

So basically the if else command I need is:

if "seconds" = duplicate & one measurement row == na -> delete the  na row
if "seconds" = duplicate & measurement == not na -> keep seconds = 1 row
if "seconds" = duplicate & both measurement == na -> delete randomly one
2
What happens when you have NA in measurement and it is not a duplicate? Can you include such condition in the example?Ronak Shah
Thanks a lot for the help! I just realized that won't work for me though, as some duplicates are in rows were the measurement value is 1 or 2 instead of NA and I still need to remove them.Luca
Yes, that is one of the problems, then it excludes that row. I'll update the example above to show thatLuca

2 Answers

2
votes

measurement==NA always returns NA. Use is.na(measurement) instead. Also, you are looking for the command duplicated.

So:

df <- df %>%
   group_by(id) %>%
   filter(is.na(measurement) & duplicated(seconds)

or

df <- df %>%
   group_by(id) %>%
   filter(is.na(measurement), duplicated(seconds)
1
votes

I think I figured it out:

df <- df %>%
  arrange(measurement) %>%
   group_by(id) %>%
   filter(!duplicated(seconds))