1
votes

I am often working with the dplyr package and group_by/summarise/do functions. I often have big datasets and it takes 2 or 3 hours to compute my functions (maybe it can be optimized but this is not the subject).

It happens that after 1.5 hour of calculation, my do function gives an error, because I forgot to take into account one specific case in my code. The only problem is that I do not know which iteration gives this error and generally, I have to create loops to replace my group_by / summarise / do code in order to know what is the data giving a problem.

A really simple example to explain my problem... Cause generally I work with some complicated own-created functions with lots of groups.

require(dplyr)

FUN <- function(x) {
  for (i in 1:which(!is.na(x$value))[1])
  {
    print("TEST")
  }
}

df <- data.frame(ID = c(rep("ID1",10), rep("ID2", 20), rep("ID3", 5)),
                 value= c(sample(1:100, 10), rep(NA, 20), sample(0:50, 5)))

Result <- group_by(df, ID) %>%
  do(Res=FUN(.))

Here I will have an error for the second group (group by ID2) because all values are NA and the loop in FUN can't work. To know that my problem comes from ID2, I would do something like this :

for (j in 1:length(unique(df$ID)))
{
  Interm <- df[df$ID==unique(df$ID)[j],]
  Res <- FUN(Interm)
  print(j)
}

Thanks to this I know that my problem comes from j=2 so from ID2.

This is okay for simple calculation like this but it really takes a long time for my functions. For example, right know my code with group_by/do is giving an error after 45 minutes, I did a two-loop code to know what is the data giving the error and after 1.5 hour, it is still running... When I will find the error, I will just add one line to my function (FUN) to take into account this specific case, re-run the do code and maybe have another error 1h later...

Simple question : Is there a way to know from what data the code is giving an error with the group_by/do code ?

Thanks

2
Have you considered adding tryCatch or purrr::possibly to your function to try to catch your error and then have it return something else in that case? You don't have to know what the error is in advance.aosmith
No, I didn't know the existence of these functions. Will it replace the result of my do by what I want and continue with the next grouping ?Chika

2 Answers

2
votes

Frank's answer is by far the simplest, but here's a swatch of code I've worked on for mid-pipe debugging and such.

Caveat emptor:

  • this code is under-tested;
  • even if well-tested, there is no intention for this to be used in production or unattended use;
  • it has not been blessed or even reviewed by any authors or contributors to dplyr and related packages;
  • it currently works in R-3.4 and dplyr-0.7.4, but it is not taking advantage of many "goodnesses" that should be used, such as rlang and/or lazyeval;
  • it works for my uses, not tested for yours.

Bug reports welcome, if/when you find something amyss.

Mid-pipe message

This can include just about anything you want:

mtcars %>%
  group_by(cyl) %>%
  pipe_message(whichcyl = cyl[1], bestmpg = max(mpg)) %>%
  summarize(mpg=mean(mpg))
# Mid-pipe message (2018-05-01 09:39:26):
#  $ :List of 2
#   ..$ whichcyl: num 4
#   ..$ bestmpg : num 33.9
#  $ :List of 2
#   ..$ whichcyl: num 6
#   ..$ bestmpg : num 21.4
#  $ :List of 2
#   ..$ whichcyl: num 8
#   ..$ bestmpg : num 19.2
# # A tibble: 3 x 2
#     cyl   mpg
#   <dbl> <dbl>
# 1    4.  26.7
# 2    6.  19.7
# 3    8.  15.1

Mid-pipe assert

You can optionally just realize what's going on and look at the data quickly, allowing you to see the moment and then exit out of the pipe:

mtcars %>%
  group_by(cyl) %>%
  pipe_assert(all(mpg > 12), .debug=TRUE) %>%
  summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]> 
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
#     stop(.msg, call. = FALSE)
# } else {
#     .x[.indices[[.ind]], ] <- x
#     return(.x)
# }
# Browse[2]> 
x
# # A tibble: 14 x 11
# # Groups:   cyl [1]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  18.7    8.  360.  175.  3.15  3.44  17.0    0.    0.    3.    2.
#  2  14.3    8.  360.  245.  3.21  3.57  15.8    0.    0.    3.    4.
#  3  16.4    8.  276.  180.  3.07  4.07  17.4    0.    0.    3.    3.
#  4  17.3    8.  276.  180.  3.07  3.73  17.6    0.    0.    3.    3.
#  5  15.2    8.  276.  180.  3.07  3.78  18.0    0.    0.    3.    3.
#  6  10.4    8.  472.  205.  2.93  5.25  18.0    0.    0.    3.    4.
#  7  10.4    8.  460.  215.  3.00  5.42  17.8    0.    0.    3.    4.
#  8  14.7    8.  440.  230.  3.23  5.34  17.4    0.    0.    3.    4.
#  9  15.5    8.  318.  150.  2.76  3.52  16.9    0.    0.    3.    2.
# 10  15.2    8.  304.  150.  3.15  3.44  17.3    0.    0.    3.    2.
# 11  13.3    8.  350.  245.  3.73  3.84  15.4    0.    0.    3.    4.
# 12  19.2    8.  400.  175.  3.08  3.84  17.0    0.    0.    3.    2.
# 13  15.8    8.  351.  264.  4.22  3.17  14.5    0.    1.    5.    4.
# 14  15.0    8.  301.  335.  3.54  3.57  14.6    0.    1.    5.    8.
# Browse[2]> 
c
# Error: all(mpg > 12) is not TRUE ... in Group: cyl:8

or you can optionally update/change the data; realize that this modifies the data in the pipe, not the source, so is really only good in dev and/or one-off fixes:

mtcars %>%
  group_by(cyl) %>%
  pipe_assert(all(mpg > 12), .debug=TRUE) %>%
  summarize(mpg = mean(mpg))
# #
# # all(mpg > 12) is not TRUE ... in Group: cyl:8
# # 'x' is the current data that failed the assertion.
# #
# Called from: pipe_assert(., all(mpg > 12), .debug = TRUE)
# Browse[1]> 
# debug at c:/Users/r2/Projects/StackOverflow/pipe_funcs.R#81: if (identical(x, .x[.indices[[.ind]], ])) {
#     stop(.msg, call. = FALSE)
# } else {
#     .x[.indices[[.ind]], ] <- x
#     return(.x)
# }

(Ignore the current line of debugged code, if ..., that's my stuff and not beautiful.) I'm in the debugger now, I can look at and alter/fix the data:

# Browse[2]> 
x
# ...as before...
x$mpg <- x$mpg + 1000

If the data is changed, the pipe continues, otherwise it'll stop.

# Browse[2]> 
c
# # A tibble: 3 x 2
#     cyl    mpg
#   <dbl>  <dbl>
# 1    4.   26.7
# 2    6.   19.7
# 3    8. 1015. 

(The data can be changed but the labels cannot ... so if we had done x$cyl <- 99, it still would have shown 8 in rest of the pipe. This is a consequence of dplyr not allowing you to change grouping variables ... which is a good thing, IMO.)

There's also pipe_debug which always debugs, but it is less impressive. It also does not (currently) pass on changed data, so use pipe_assert for that (e.g., pipe_assert(FALSE,.debug=TRUE)).


Source, also available in my gist:

#' Mid-pipe assertions
#'
#' Test assertions mid-pipe. Each assertion is executed individually
#' on each group (if present) of the piped data. Any failures indicate
#' the group that caused the fail, terminating on the first failure.
#'
#' If `.debug`, then the interpreter enters the `browser()`, allowing
#' you to look at the specific data, stored as `x` (just the grouped
#' data if `is.grouped_df(.x)`, all data otherwise). If the data is
#' changed, then the altered data will be sent forward in the pipeline
#' (assuming you fixed the failed assertion), otherwise the assertion
#' will fail (as an assertion should).
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed expression(s), each must evaluate to a single
#'   'logical'; similar to [assertthat::assert_that()], rather than
#'   combining expressions with `&&`, separate them by commas so that
#'   better error messages can be generated.
#' @param .msg a custom error message to be printed if one of the
#'   conditions is false.
#' @param .debug logical, whether to invoke [browser()] if the
#'   assertion fails; if `TRUE`, then when the debugger begins on a
#'   fail, the grouped data will be in the variable `x`
#' @return data.frame (unchanged)
#' @export
#' @import assertthat
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#' library(assertthat)
#'
#' mtcars %>%
#'   group_by(cyl) %>%
#'   pipe_assert(
#'     all(cyl < 9),
#'     all(mpg > 10)
#'   ) %>%
#'   count()
#' # # A tibble: 3 x 2
#' #     cyl     n
#' #   <dbl> <int>
#' # 1     4    11
#' # 2     6     7
#' # 3     8    14
#' 
#' # note here that the "4" group is processed first and does not fail
#' mtcars %>%
#'   group_by(cyl, vs) %>%
#'   pipe_assert( all(cyl < 6) ) %>%
#'   count()
#' # Error: all(cyl < 6) is not TRUE ... in Group: cyl:6, vs:0
#'
#' }
pipe_assert <- function(.x, ..., .msg = NULL, .debug = FALSE) {
  if (is.grouped_df(.x)) {
    .indices <- lapply(attr(.x, "indices"), `+`, 1L)
    .labels <- attr(.x, "labels")
  } else {
    .indices <- list(seq_len(nrow(.x)))
  }
  for (assertion in eval(substitute(alist(...)))) {
    for (.ind in seq_along(.indices)) {
      .out <- assertthat::see_if(eval(assertion, .x[.indices[[.ind]],]))
      if (! .out) {
        x <- .x[.indices[[.ind]],]
        if (is.null(.msg)) .msg <- paste(deparse(assertion), "is not TRUE")
        if (is.grouped_df(.x)) {
          .msg <- paste(.msg,
                        paste("in Group:",
                              paste(sprintf("%s:%s", names(.labels),
                                            sapply(.labels, function(z) as.character(z[.ind]))),
                                    collapse = ", ")),
                        sep = " ... ")
        }
        if (.debug) {
          message("#\n", paste("#", .msg), "\n# 'x' is the current data that failed the assertion.\n#\n")
          browser()
        }
        if (identical(x, .x[.indices[[.ind]],])) {
          stop(.msg, call. = FALSE)
        } else {
          .x[.indices[[.ind]],] <- x
          return(.x)
        }
      }
    }
  }
  .x # "unmodified"
}

#' Mid-pipe debugging
#'
#' Mid-pipe peek at the data, named `x` within [browser()], but
#' *changes are not preserved*.
#'
#' @param .x data.frame, potentially grouped
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#'   group_by(cyl, vs) %>%
#'   pipe_debug() %>%
#'   count()
#'
#' }
pipe_debug <- function(.x) {
  if (is.grouped_df(.x)) {
    .indices <- lapply(attr(.x, "indices"), `+`, 1L)
    .labels <- attr(.x, "labels")
  } else {
    .indices <- list(seq_len(nrow(.x)))
  }
  # I used 'lapply' here instead of a 'for' loop because
  # browser-stepping after 'browser()' in a 'for' loop could continue
  # through all of *this* code, not really meaningful; in pipe_assert
  # above, since the next call after 'browser()' is 'stop()', there's
  # little risk of stepping in or out of this not-meaningful code
  .ign <- lapply(seq_along(.indices), function(.ind, .x) {
    x <- .x[.indices[[.ind]],]
    message("#",
            if (is.grouped_df(.x)) {
              paste("\n# in Group:",
                    paste(sprintf("%s:%s", names(.labels),
                                  sapply(.labels, function(z) as.character(z[.ind]))),
                          collapse = ", "),
                    "\n")
            },
            "# 'x' is the current data (grouped, if appropriate).\n#\n")
    browser()
    NULL
  }, .x = .x)
  .x # "unmodified"
}

#' Mid-pipe status messaging.
#'
#' @param .x data.frame, potentially grouped
#' @param ... unnamed or named expression(s) whose outputs will be
#'   captured, aggregated with [utils::str()], and displayed as a
#'   [base::message()]; if present, a '.' literal is replace with a
#'   reference to the `data.frame` (in its entirety, not grouped)
#' @param .FUN function, typically [message()] or [warning()] (for
#'   when messages are suppressed); note: if set to `warning`, the
#'   argument `call.=FALSE` is appended to the arguments
#' @param .timestamp logical, if 'TRUE' then a POSIXct timestamp is
#'   appended to the header of the `str`-like output (default 'TRUE')
#' @param .stropts optional list of options to pass to [utils::str()],
#'   for example `list(max.level=1)`
#' @return data.frame (unchanged)
#' @export
#' @md
#' @examples
#' \dontrun{
#'
#' library(dplyr)
#'
#' mtcars %>%
#'   pipe_message(           # unnamed
#'     "starting",
#'     group_size(.)
#'   ) %>%
#'   group_by(cyl) %>%
#'   pipe_message(           # named
#'     msg  = "grouped",
#'     grps = group_size(.)
#'   ) %>%
#'   count() %>%
#'   ungroup() %>%
#'   pipe_message(           # alternate function, for emphasis!
#'     msg = "done",
#'     .FUN = warning
#'   )
#'
#' head(mtcars) %>%
#'   pipe_message(
#'     list(a = list(aa=1, bb=2, cc=3))
#'   )
#' head(mtcars) %>%
#'   pipe_message(
#'     list(a = list(aa=1, bb=2, cc=3)),
#'     .stropts = list(max.level = 2)
#'   )
#'
#' }
pipe_message <- function(.x, ..., .FUN = message, .timestamp = TRUE, .stropts = NULL) {
  .expressions <- eval(substitute(alist(...)))
  if (is.grouped_df(.x)) {
    .indices <- lapply(attr(.x, "indices"), `+`, 1L)
    .labels <- attr(.x, "labels")
  } else {
    .indices <- list(seq_len(nrow(.x)))
    .labels <- ""
  }
  lst <- mapply(function(.ind, .lbl) {
    .x <- .x[.ind,,drop=FALSE]
    lapply(.expressions, function(.expr) {
      if (is.call(.expr)) .expr <- as.call(lapply(.expr, function(a) if (a == ".") as.symbol(".x") else a))
      eval(.expr, .x)
    })
  }, .indices, .labels, SIMPLIFY=FALSE)
  .out <- capture.output(
    do.call("str", c(list(lst), .stropts))
  )
  .out[1] <- sprintf("Mid-pipe message%s:",
                     if (.timestamp) paste(" (", Sys.time(), ")", sep = ""))
  do.call(.FUN, c(list(paste(.out, collapse = "\n")),
                  if (identical(.FUN, warning)) list(call. = FALSE)))
  .x # "unmodified"
}
1
votes

You can still do the printing thing here:

df %>% group_by(ID) %>% do({
  the_id = unique(.$ID)
  cat("Working on...", the_id, "which is...", match(the_id, unique(df$ID)), "/", n_distinct(df$ID), "\n")
  FUN(.)
})

which prints

Working on... 1 which is... 1 / 3 
[1] "TEST"
Working on... 2 which is... 2 / 3 
Error in 1:which(!is.na(x$value))[1] : NA/NaN argument

I routinely do this (using data.table not dplyr, but the same idea). I realize there are more sophisticated ways to debug, but it's worked well enough for me.