How do you remove rows from a tibble where the non-missing values match a subset of values in other rows?

Question

I am looking for an efficient way to remove rows of a tibble whose non-missing values are identical to the corresponding values in another row. Consider this fake example:

library(tidyverse)
phony_genes <- tribble(
  ~mouse_entrez, ~mgi_symbol, ~human_entrez, ~hgnc_symbol,
    1,             "a",          2       ,       "A",
    1,             "a",          2       ,        NA,
    1,              NA,          2       ,        "A",
    1,             "a",          3       ,        NA,
    4,             "b",          3       ,        NA,
    5,              NA,          2       ,        "A"
  )

Row 2 is a subset of row 1, because each non-missing value in row 2 is the same as in row 1. The same goes for row 3, although a different value is missing there. I am looking for a way that uses the tidyverse (or other packages) to filter out rows 2 and 3 but keep the other rows. I can't filter out the NA values in hgnc_symbol or mgi_symbol because in both cases I would lose rows that I want to keep. I can't group by mouse_entrez and filter away the NA values within the groups because I want to keep row 4. This simple example could of course be expanded to a huge tibble. I could probably do this by coding something myself, but I am wondering if anyone has an elegant solution.

There are two solutions so far, but both involve explicitly choosing grouping variables. A solution that I am looking for would work automatically without having to explicitly select grouping variables. — Jordan Mandel
Is every row matched only with the next row, or with every other row that has the same mouse_entrez? That is, is row 1 matched with row 2 and row 2 with row 3, or is row 1 matched with rows 2, 3 and 4? — Ronak Shah
There is no grouping or special ordering of the input tibble. The only rule is that rows that are a subset of another row are deleted. — Jordan Mandel
Why should rows 2 and 3 be removed? They have different non-missing values: row 2 has 1, 'a' and 2, whereas row 3 has 1, 2 and 'A'. a and A are different. Do you want to ignore case? — Ronak Shah
They are both subsets of row 1 so they should be removed. I could solve this with nested for-loops but was wondering if there is a tidyverse solution. — Jordan Mandel

r2evans · Accepted Answer · 2020-03-11T23:06:01

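library(dplyr)
phony_genes %>%
  # group on the columns that are expected to identify a row
  group_by(mouse_entrez, mgi_symbol, human_entrez) %>%
  # sort so that non-missing values come first (FALSE sorts before TRUE)
  arrange_all(~ is.na(.)) %>%
  # keep only the first row of each group
  slice(1)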
# # A tibble: 4 x 4
# # Groups:   mouse_entrez, mgi_symbol, human_entrez [4]
#   mouse_entrez mgi_symbol human_entrez hgnc_symbol
#          <dbl> <chr>             <dbl> <chr>      
# 1            1 a                     2 A          
# 2            1 a                     3 <NA>       
# 3            4 b                     3 <NA>       
# 4            5 c                     2 A
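
The grouped approach above depends on hand-picked grouping columns, which the asker wanted to avoid. As a rough sketch of a grouping-free alternative, the "subset" rule from the question can be checked pairwise across all columns. The helper name drop_subset_rows is hypothetical (it is not from any package), and the approach builds an n-by-n logical matrix, so it assumes the tibble is small enough for quadratic memory.

```r
library(tibble)

phony_genes <- tribble(
  ~mouse_entrez, ~mgi_symbol, ~human_entrez, ~hgnc_symbol,
  1, "a", 2, "A",
  1, "a", 2, NA,
  1, NA,  2, "A",
  1, "a", 3, NA,
  4, "b", 3, NA,
  5, NA,  2, "A"
)

# drop_subset_rows() is a hypothetical helper, not part of any package.
drop_subset_rows <- function(df) {
  # Remove exact duplicates first so identical rows cannot eliminate each other.
  df <- unique(df)
  n <- nrow(df)
  # covered[i, j] stays TRUE while every non-missing value seen so far in
  # row i equals the value in the same column of row j.
  covered <- matrix(TRUE, n, n)
  for (col in names(df)) {
    x <- df[[col]]
    same <- outer(x, x, `==`)                   # NA where either value is missing
    col_ok <- is.na(x) | (!is.na(same) & same)  # is.na(x) recycles down rows (index i)
    covered <- covered & col_ok
  }
  diag(covered) <- FALSE                        # a row does not count as covering itself
  # Keep only the rows that no other row covers.
  df[rowSums(covered) == 0, ]
}

result <- drop_subset_rows(phony_genes)
```

On the example data this keeps rows 1, 4, 5 and 6 of the input. For a very large tibble the pairwise matrix would likely be too expensive, and some form of blocking on a fully observed column would be needed.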
