2
votes

I'm working with a tibble like below:

ex <- structure(list(rowid = c(4L, 5L, 6L, 9L, 10L), timestamp = structure(c(1502480694.03336, 
1502480695.44736, 1502480696.03336, 1502480703.99836, 1502480706.19936
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), cat = c(32L, 
1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
    var2 = c(0, 50, 29.7, 51, 70.8), var3 = c(NA, 26.3, 24, 20.5, 
    12), order = c(NA, 1L, 1L, 1L, 1L), bfr = list(NA, structure(list(
        rowid = integer(0), timestamp = structure(numeric(0), class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = integer(0), var1 = structure(integer(0), .Label = "1", class = "factor"), 
        var2 = numeric(0), var3 = numeric(0), order = integer(0)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = integer(0)), structure(list(
        rowid = 5L, timestamp = structure(1502480695.44736, class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = 1L, var1 = structure(NA_integer_, .Label = "1", class = "factor"), 
        var2 = 50, var3 = 26.3, order = 1L), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -1L)), structure(list(
        rowid = 5:8, timestamp = structure(c(1502480695.44736, 
        1502480696.03336, 1502480699.03336, 1502480701.03336), class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = c(1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
        var2 = c(50, 29.7, 52.8, 44), var3 = c(26.3, 24, 8.9, 
        12.4), order = c(1L, 1L, 1L, 1L)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -4L)), structure(list(
        rowid = 5:9, timestamp = structure(c(1502480695.44736, 
        1502480696.03336, 1502480699.03336, 1502480701.03336, 
        1502480703.99836), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
        cat = c(1L, 1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
        var2 = c(50, 29.7, 52.8, 44, 51), var3 = c(26.3, 24, 
        8.9, 12.4, 20.5), order = c(1L, 1L, 1L, 1L, 1L)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -5L)))), row.names = c(4L, 
5L, 6L, 9L, 10L), class = "data.frame")

I want to summarise nested tibbles in column bfr with map. To omit unnecessary calculations, I want to go with map_if which would skip the row when bfr contains less than 2 rows with cat == 1. However due to presence of NAs and empty tibbles in bfr column, I'm struggling with writing appropriate predicate function. Here's my attempt:

more_than <- function(df){
  if (nrow(df) == 0 | is.na(df)) return(FALSE)

  n <- df %>% 
    summarise(sum(cat == 1)) %>% 
    as.numeric()

  return(n > 2)
}

ex %>% 
  mutate(mean_var2 = map_if(bfr, more_than, 
                            ~.x %>% summarise(mean_var2 = mean(var2))))

which results in:

Error in if (nrow(df) == 0 | is.na(df)) return(FALSE) : argument is of length zero

How can I deal with the presence of both NAs and empty tibbles to write appropriate predicate function?

2
The issue is with is.na(df), which does the NA check on the whole data while nrow is a single output - akrun
Also, in the more_than, you are doing some other calculation which you are not getting as output in the mean_var2 - akrun
Sorry, I don't get your first comment - could you possibly elaborate in your answer? more_than is just a predicate to avoid unneccesary calculations for some elements of bfr column. - jakes

2 Answers

2
votes

If the intention is to get the mean of 'var2' column, check the list elements are either data.frame or tibble (in this case it is a tibble) and then do the summarise

out <-  ex %>% 
           mutate(mean_var2 = map_if(bfr, is.tibble, ~ 
             .x %>% 
                summarise(mean_var2 = mean(var2, na.rm = TRUE))))

If we also need to check sum(cat ==1) > 2

more_than <- function(df){
i1 <- is_tibble(df)
if(i1) {
   n <- df %>% 
    summarise(v1 = sum(cat == 1))  %>%
    pull(v1) 
    }

    i1 && (n > 2)


}
ex %>%
  mutate(mean_var2 = map_if(bfr, more_than, ~
      .x %>%
         summarise(mean_var2 = mean(var2, na.rm = TRUE))))

The reason why is.na is not working is because it checks for each dataset and in some of them it is a tibble and this returns a logical matrix, while if/else expects a single TRUE/FALSE to return. For e.g.

(3 == 4) & (cbind(3:5, 1:3) == 3)

yields a different output

One option is to use &&, so that it evaluates the rhs condition only if the first condition is TRUE and thereby avoiding unncessary evaluation

(3 == 4) && (cbind(3:5, 1:3) == 3)
#[1] FALSE

In the OP's original function, if we replace the | with || it should work fine

more_than <- function(df){
  if (nrow(df) == 0 || is.na(df)) return(FALSE)

  n <- df %>% 
    summarise(sum(cat == 1)) %>% 
    as.numeric()

  return(n > 2)
}

Update

If we want to return NA for those cases that are not met

ex %>%
    mutate(mean_var2 = map_dbl(bfr, ~ 
    if(is_tibble(.x) && sum(.x$cat == 1) > 2) mean(.x$var2, na.rm = TRUE) else NA))

Or another option is to use possibly (similar to tryCatch)

posmean <- possibly(function(dat) if(sum(dat$cat == 1) > 2) 
     mean(dat$var2, na.rm  = TRUE) else NA_real_, otherwise = NA_real_)
ex %>% 
     mutate(mean_var2 = map_dbl(bfr, posmean))
1
votes

first, we need to check for NA's with || "see the difference between | and || here" before we check nrow. Then we need .else which is:

.else A function applied to elements of .x for which .p returns FALSE.

when more_than returns FLASE

more_than <- function(df){
 # browser()
  if (all(is.na(df)) || nrow(df) == 0) return(FALSE)

     n <- df %>%
       summarise(sum(cat == 1)) %>%
       as.numeric()

     return(n > 2)
}

ex %>% 
mutate(mean_var2 = map_if(bfr, more_than, 
                          ~.x %>% summarise(mean_var2 = mean(var2,na.rm = TRUE)),
                         .else = ~return(NA))) %>% 
select(mean_var2)

   mean_var2
1        NA
2        NA
3        NA
4    44.125
5      45.5