Omitting NAs and empty data frames with map_if

Question

I'm working with a tibble like below:

ex <- structure(list(rowid = c(4L, 5L, 6L, 9L, 10L), timestamp = structure(c(1502480694.03336, 
1502480695.44736, 1502480696.03336, 1502480703.99836, 1502480706.19936
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), cat = c(32L, 
1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
    var2 = c(0, 50, 29.7, 51, 70.8), var3 = c(NA, 26.3, 24, 20.5, 
    12), order = c(NA, 1L, 1L, 1L, 1L), bfr = list(NA, structure(list(
        rowid = integer(0), timestamp = structure(numeric(0), class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = integer(0), var1 = structure(integer(0), .Label = "1", class = "factor"), 
        var2 = numeric(0), var3 = numeric(0), order = integer(0)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = integer(0)), structure(list(
        rowid = 5L, timestamp = structure(1502480695.44736, class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = 1L, var1 = structure(NA_integer_, .Label = "1", class = "factor"), 
        var2 = 50, var3 = 26.3, order = 1L), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -1L)), structure(list(
        rowid = 5:8, timestamp = structure(c(1502480695.44736, 
        1502480696.03336, 1502480699.03336, 1502480701.03336), class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = c(1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
        var2 = c(50, 29.7, 52.8, 44), var3 = c(26.3, 24, 8.9, 
        12.4), order = c(1L, 1L, 1L, 1L)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -4L)), structure(list(
        rowid = 5:9, timestamp = structure(c(1502480695.44736, 
        1502480696.03336, 1502480699.03336, 1502480701.03336, 
        1502480703.99836), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
        cat = c(1L, 1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
        var2 = c(50, 29.7, 52.8, 44, 51), var3 = c(26.3, 24, 
        8.9, 12.4, 20.5), order = c(1L, 1L, 1L, 1L, 1L)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -5L)))), row.names = c(4L, 
5L, 6L, 9L, 10L), class = "data.frame")

I want to summarise nested tibbles in column bfr with map. To omit unnecessary calculations, I want to go with map_if which would skip the row when bfr contains less than 2 rows with cat == 1. However due to presence of NAs and empty tibbles in bfr column, I'm struggling with writing appropriate predicate function. Here's my attempt:

more_than <- function(df){
  if (nrow(df) == 0 | is.na(df)) return(FALSE)

  n <- df %>% 
    summarise(sum(cat == 1)) %>% 
    as.numeric()

  return(n > 2)
}

ex %>% 
  mutate(mean_var2 = map_if(bfr, more_than, 
                            ~.x %>% summarise(mean_var2 = mean(var2))))

which results in:

Error in if (nrow(df) == 0 | is.na(df)) return(FALSE) : argument is of length zero

How can I deal with the presence of both NAs and empty tibbles to write appropriate predicate function?

The issue is with is.na(df), which does the NA check on the whole data while nrow is a single output — akrun
Also, in the more_than, you are doing some other calculation which you are not getting as output in the mean_var2 — akrun
Sorry, I don't get your first comment - could you possibly elaborate in your answer? more_than is just a predicate to avoid unneccesary calculations for some elements of bfr column. — jakes

akrun akrun · Accepted Answer · 2019-04-16T05:52:19

If the intention is to get the mean of 'var2' column, check the list elements are either data.frame or tibble (in this case it is a tibble) and then do the summarise

out <-  ex %>% 
           mutate(mean_var2 = map_if(bfr, is.tibble, ~ 
             .x %>% 
                summarise(mean_var2 = mean(var2, na.rm = TRUE))))

If we also need to check sum(cat ==1) > 2

more_than <- function(df){
i1 <- is_tibble(df)
if(i1) {
   n <- df %>% 
    summarise(v1 = sum(cat == 1))  %>%
    pull(v1) 
    }

    i1 && (n > 2)


}
ex %>%
  mutate(mean_var2 = map_if(bfr, more_than, ~
      .x %>%
         summarise(mean_var2 = mean(var2, na.rm = TRUE))))

The reason why is.na is not working is because it checks for each dataset and in some of them it is a tibble and this returns a logical matrix, while if/else expects a single TRUE/FALSE to return. For e.g.

(3 == 4) & (cbind(3:5, 1:3) == 3)

yields a different output

One option is to use &&, so that it evaluates the rhs condition only if the first condition is TRUE and thereby avoiding unncessary evaluation

(3 == 4) && (cbind(3:5, 1:3) == 3)
#[1] FALSE

In the OP's original function, if we replace the | with || it should work fine

more_than <- function(df){
  if (nrow(df) == 0 || is.na(df)) return(FALSE)

  n <- df %>% 
    summarise(sum(cat == 1)) %>% 
    as.numeric()

  return(n > 2)
}

Update

If we want to return NA for those cases that are not met

ex %>%
    mutate(mean_var2 = map_dbl(bfr, ~ 
    if(is_tibble(.x) && sum(.x$cat == 1) > 2) mean(.x$var2, na.rm = TRUE) else NA))

Or another option is to use possibly (similar to tryCatch)

posmean <- possibly(function(dat) if(sum(dat$cat == 1) > 2) 
     mean(dat$var2, na.rm  = TRUE) else NA_real_, otherwise = NA_real_)
ex %>% 
     mutate(mean_var2 = map_dbl(bfr, posmean))

Omitting NAs and empty data frames with map_if

2 Answers

Update