Filter duplicates based on one row and exclude them based on value in other row

Question

I have a df with longitudinal data. Each id has several measurements of seconds, but some of them are duplicates. I want to remove the duplicates per id in the seconds column, but I want to remove the duplicates that have NA in the measurement column, as the rows that have a value in the measurements value contain other information that I need to keep. So basically I need to identify the duplicates, and then remove the ones with NA in another column. Edit: But the problem is, that there are also duplicates where the measurement rows are not NA but e.g. 1 and 2. Then I need to remove the 2, as measurement 1 includes the relevant other information. Also sometimes there are duplicates that both have NA in measurement.

My df looks somewhat like this:

id	measurement	seconds	other relevant information
A	1	5000	blue
A	NA	5000	NA
A	2	4000	NA
B	1	3400	red
B	2	5000	NA
B	NA	5000	NA
C	NA	3000	NA
C	1	3000	blue
D	1	2000	green
D	2	2000	NA
D	3	1000	NA
D	NA	4000	NA
D	NA	4000	NA

And I need it to be like this after:

id	measurement	seconds	...
A	1	5000	blue
A	2	4000	NA
B	1	3400	red
B	2	5000	NA
C	1	3000	blue
D	1	2000	green
D	3	1000	NA
D	NA	4000	NA

I know there is probably a pretty straight forward dplyr solution, and I tried this so far:

df <- df %>% group_by(id, seconds) %>% filter(n() > 1)

and this:

df <- df %>%
   group_by(id) %>%
   filter(measurement==NA & duplicates(seconds)

So basically the if else command I need is:

if "seconds" = duplicate & one measurement row == na -> delete the  na row
if "seconds" = duplicate & measurement == not na -> keep seconds = 1 row
if "seconds" = duplicate & both measurement == na -> delete randomly one

What happens when you have NA in measurement and it is not a duplicate? Can you include such condition in the example? — Ronak Shah
Thanks a lot for the help! I just realized that won't work for me though, as some duplicates are in rows were the measurement value is 1 or 2 instead of NA and I still need to remove them. — Luca
Yes, that is one of the problems, then it excludes that row. I'll update the example above to show that — Luca

JBGruber JBGruber · Accepted Answer · 2021-05-21T09:33:40

measurement==NA always returns NA. Use is.na(measurement) instead. Also, you are looking for the command duplicated.

So:

df <- df %>%
   group_by(id) %>%
   filter(is.na(measurement) & duplicated(seconds)

or

df <- df %>%
   group_by(id) %>%
   filter(is.na(measurement), duplicated(seconds)

Filter duplicates based on one row and exclude them based on value in other row

2 Answers