2
votes

I am looking for an efficient way to remove rows of a tibble whose non-missing values are all identical to the corresponding values in another row, i.e. rows that differ from another row only by having missing values. Consider this fake example:

library(tidyverse)
phony_genes <- tribble(
  ~mouse_entrez, ~mgi_symbol, ~human_entrez, ~hgnc_symbol,
    1,             "a",          2       ,       "A",
    1,             "a",          2       ,        NA,
    1,              NA,          2       ,        "A",
    1,             "a",          3       ,        NA,
    4,             "b",          3       ,        NA,
    5,              NA,          2       ,        "A"
  )

Row 2 is a subset of row 1, because each non-missing value in row 2 is the same as in row 1. The same goes for row 3, but there a different value is missing. I am looking for a way that uses the tidyverse (or other packages) to filter out rows 2 and 3 but keep the other rows. I can't filter out the NA values in hgnc_symbol or mgi_symbol, because in both cases I would lose rows that I want to keep. I can't group by mouse_entrez and filter away the NA values within the groups, because I want to keep row 4. This simple example could of course be expanded to a huge tibble. I could probably do this by coding something myself, but I am wondering if anyone has an elegant solution.

There are two solutions so far, but both involve explicitly choosing grouping variables. The solution I am looking for would work automatically, without having to explicitly select grouping variables. - Jordan Mandel
Is every row matched only with the next row, or with every other row in mouse_entrez? So is row 1 matched with row 2 and row 2 with row 3, or is row 1 matched with rows 2, 3 and 4? - Ronak Shah
There is no grouping or special ordering of the input tibble. The only rule is that rows that are a subset of another row are deleted. - Jordan Mandel
Why should rows 2 and 3 be removed? They have different non-missing values. Row 2 has 1, 'a' and 2, whereas row 3 has 1, 2 and 'A'. a and A are different. Do you want to ignore the case? - Ronak Shah
They are both subsets of row 1, so they should be removed. I could solve this with nested for-loops but was wondering if there is a tidyverse solution. - Jordan Mandel

3 Answers

1
votes
library(dplyr)
phony_genes %>%
  group_by(mouse_entrez, mgi_symbol, human_entrez) %>%
  arrange_all(~ is.na(.)) %>%
  slice(1)
# # A tibble: 4 x 4
# # Groups:   mouse_entrez, mgi_symbol, human_entrez [4]
#   mouse_entrez mgi_symbol human_entrez hgnc_symbol
#          <dbl> <chr>             <dbl> <chr>      
# 1            1 a                     2 A          
# 2            1 a                     3 <NA>       
# 3            4 b                     3 <NA>       
# 4            5 c                     2 A          
1
votes

Here's a way to do it using the tidyverse:

library(dplyr)
library(purrr)

phony_genes %>%
   mutate(col = pmap(., ~na.omit(c(...)))) %>%
   filter(!map_lgl(seq_along(col), function(x) 
          any(map_lgl(col[-x], function(y) all(col[[x]] %in% y))))) %>%
   select(-col)

#  mouse_entrez mgi_symbol human_entrez hgnc_symbol
#         <dbl> <chr>             <dbl> <chr>      
#1            1 a                     2 A          
#2            1 a                     3 NA         
#3            4 b                     3 NA         
#4            5 NA                    2 A          

Using pmap, we collect the values of each row into a character vector and drop the NA values. Then, for each row, we check whether all of its remaining values also appear in some other row, and filter removes the rows for which that is the case.
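
Note that `%in%` compares values regardless of which column they came from, and that two exactly identical rows would each count as a subset of the other. As a hedged alternative, here is a minimal sketch of a column-by-column check; the helper name `covered_by_other` is hypothetical (it is not part of dplyr or purrr), and the sketch assumes exact duplicates should collapse to a single row, which is why `distinct()` is applied first.

```r
library(dplyr)
library(tibble)

phony_genes <- tribble(
  ~mouse_entrez, ~mgi_symbol, ~human_entrez, ~hgnc_symbol,
  1, "a", 2, "A",
  1, "a", 2, NA,
  1, NA,  2, "A",
  1, "a", 3, NA,
  4, "b", 3, NA,
  5, NA,  2, "A"
)

# Hypothetical helper: for each row i, TRUE if some other row j has the same
# value in every column where row i is not missing.
covered_by_other <- function(df) {
  n <- nrow(df)
  vapply(seq_len(n), function(i) {
    any(vapply(seq_len(n)[-i], function(j) {
      all(vapply(df, function(col) {
        is.na(col[i]) || identical(col[i], col[j])
      }, logical(1)))
    }, logical(1)))
  }, logical(1))
}

# Assumption: exact duplicates are collapsed first, so that two identical
# rows do not eliminate each other.
deduped <- distinct(phony_genes)
result <- deduped[!covered_by_other(deduped), ]
result
```

This compares every pair of rows, so the cost grows quadratically with the number of rows; for a very large tibble it may be too slow without first narrowing the candidate pairs.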

0
votes

You can group by the columns that have no missing values (here mouse_entrez and human_entrez) and then, within each group that has more than one row, remove the rows that have a missing value in any of the remaining columns, e.g.:

phony_genes %>%
  group_by(mouse_entrez, human_entrez) %>%
  filter_at(vars(2, 4), all_vars(!(is.na(.) & n() > 1)))

Output:

# A tibble: 4 x 4
# Groups:   mouse_entrez, human_entrez [4]
  mouse_entrez mgi_symbol human_entrez hgnc_symbol
         <dbl> <chr>             <dbl> <chr>      
1            1 a                     2 A          
2            1 a                     3 NA         
3            4 b                     3 NA         
4            5 NA                    2 A
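
The same grouped filter can also be sketched without the scoped verb `filter_at`, which is superseded in recent dplyr versions. This is an equivalent rewrite of the condition under the same assumption as above, namely that mouse_entrez and human_entrez are the key columns chosen by hand; it still does not detect subsets automatically.

```r
library(dplyr)
library(tibble)

phony_genes <- tribble(
  ~mouse_entrez, ~mgi_symbol, ~human_entrez, ~hgnc_symbol,
  1, "a", 2, "A",
  1, "a", 2, NA,
  1, NA,  2, "A",
  1, "a", 3, NA,
  4, "b", 3, NA,
  5, NA,  2, "A"
)

grouped_result <- phony_genes %>%
  group_by(mouse_entrez, human_entrez) %>%
  # Keep a row if it is alone in its group, or if neither symbol is missing.
  filter(n() == 1 | (!is.na(mgi_symbol) & !is.na(hgnc_symbol))) %>%
  ungroup()

grouped_result
```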