Efficiently removing `NAs` in repeated measures designs using `tidyverse`

Question

This is not so much a question about how to do something as about how to do it efficiently. In particular, I would like to drop NAs in a repeated measures design in such a way that every group is left with only complete observations.

In the bugs_long dataframe below, each participant takes part in four conditions and reports their desire to kill bugs in each condition. If I wanted to carry out a repeated measures analysis with this dataset, it typically wouldn't work in the long format, because each group ends up with a different number of observations once the rows with NAs are excluded. So the final dataframe should leave out the following five subjects entirely.

# setup
set.seed(123)
library(ipmisc)
library(tidyverse)

# looking at the NAs
dplyr::filter(bugs_long, is.na(desire)) 
#> # A tibble: 5 x 6
#>   subject gender region        education condition desire
#>     <int> <fct>  <fct>         <fct>     <chr>      <dbl>
#> 1       2 Female North America advance   LDHF          NA
#> 2      80 Female North America less      LDHF          NA
#> 3      42 Female North America high      HDLF          NA
#> 4      64 Female Europe        some      HDLF          NA
#> 5      10 Female Other         high      HDHF          NA

Here is the current roundabout way I am hacking this and getting it to work:

# figuring out the number of levels in the grouping factor
x_n_levels <- nlevels(as.factor(bugs_long$condition))

# removing observations that don't have all repeated values
df <-
  bugs_long %>%
  filter(!is.na(condition)) %>%
  group_by(condition) %>%
  mutate(id = dplyr::row_number()) %>%
  ungroup(.) %>%
  filter(!is.na(desire)) %>%
  group_by(id) %>%
  mutate(n = dplyr::n()) %>%
  ungroup(.) %>%
  filter(n == x_n_levels) %>%
  select(-n)

# did this work? yes
df %>%
  group_by(condition) %>%
  count()
#> # A tibble: 4 x 2
#> # Groups:   condition [4]
#>   condition     n
#>   <chr>     <int>
#> 1 HDHF         88
#> 2 HDLF         88
#> 3 LDHF         88
#> 4 LDLF         88

But I would be surprised if the tidyverse (dplyr + tidyr) doesn't have a more efficient way to achieve this, and I would really appreciate it if anyone has a better way to refactor this.

camille · Accepted Answer · 2020-02-17T16:22:02

You're actually making this much more complicated than it needs to be. Once you find the cases to exclude, it's just a simple task of removing rows in your data that match those subjects, i.e. an anti-join. Some useful discussions here and here.

set.seed(123)
library(ipmisc)
library(dplyr)

exclude <- filter(bugs_long, is.na(desire))
full_cases <- bugs_long %>%
  anti_join(exclude, by = "subject")

Or do the filtering and anti-joining in one go, similar to what you might do in SQL:

bugs_long %>%
  anti_join(filter(., is.na(desire)), by = "subject")

Either way, the number of cases kept checks out:

count(full_cases, condition)
#> # A tibble: 4 x 2
#>   condition     n
#>   <chr>     <int>
#> 1 HDHF         88
#> 2 HDLF         88
#> 3 LDHF         88
#> 4 LDLF         88
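
To try the idea without installing ipmisc, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from bugs_long). It runs the anti-join from above and, for comparison, a grouped filter that keeps only subjects with no missing `desire` values. The grouped filter is an alternative I am adding for illustration, not part of the original approach; on this toy data both give the same rows.

```r
library(dplyr)

# Hypothetical toy data: 3 subjects x 2 conditions; subject 2 has one NA
toy <- tibble(
  subject   = rep(1:3, each = 2),
  condition = rep(c("A", "B"), times = 3),
  desire    = c(5, 6, NA, 4, 7, 8)
)

# Anti-join: drop every row belonging to a subject with at least one NA
by_join <- toy %>%
  anti_join(filter(toy, is.na(desire)), by = "subject")

# Alternative: group by subject and keep only groups without any NA
by_group <- toy %>%
  group_by(subject) %>%
  filter(!anyNA(desire)) %>%
  ungroup()

# Both conditions should now have the same number of rows
count(by_group, condition)
```

Both versions assume that every subject has a row for every condition, with missing measurements stored as explicit NAs. A subject whose row for some condition is absent altogether would not be caught by either one.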
