I've started to dip my toes into dplyr programming. I'm trying to write a function that accepts a data.frame, a target column, and any number of grouping columns (using bare names for all columns). The function bins the data based on the target column and counts the number of entries in each bin. I want to keep a full set of bins for every combination of the grouping variables present in my original data.frame, so I'm using the complete() and nesting() functions to do this. Here's an example of what I'm trying to do and the error I'm running into:

library(dplyr)
library(tidyr)

#Prepare test data
set.seed(42)
test_data =
    data.frame(Gene_ID = rep(paste0("Gene.", 1:10), times=4),
               Comparison = rep(c("WT_vs_Mut1", "WT_vs_Mut2"), each=10, times=2),
               Test_method = rep(c("T-test", "MannWhitney"), each=20),
               P_value = runif(40))

#Perform operation manually
test_data %>% 
    #Start by binning the data according to p-value
    mutate(Probability.bin = cut(P_value,
                                 breaks = c(-Inf, seq(0.1, 1, by=0.1), Inf),
                                 labels = c(seq(0.0, 1.0, by=0.1)),
                                 right = FALSE)) %>% 
    #Now summarize the results by bin.
    count(Comparison, Test_method, Probability.bin) %>% 
    #Fill in any missing bins with 0 counts
    complete(nesting(Comparison, Test_method), Probability.bin,
             fill=list(n = 0))

#Create function that accepts bare column names
bin_by_p_value <- function(df,
                           pvalue_col, #Bare name of p-value column
                           ...) {      #Bare names of grouping columns

    #"Quote" column names so they are ready for use below
    pvalue_col_name <- enquo(pvalue_col)
    group_by_cols <- quos(...)

    #Perform the operation
    df %>% 
        #Start by binning the data according to p-value
        mutate(Probability.bin = cut(UQ(pvalue_col_name),
                                     breaks = c(-Inf, seq(0.1, 1, by=0.1), Inf),
                                     labels = c(seq(0.0, 1.0, by=0.1)),
                                     right = FALSE)) %>% 
        #Now summarize the results by bin.
        count(UQS(group_by_cols), Probability.bin) %>% 
        #Fill in any missing bins with 0 counts
        complete(nesting(UQS(group_by_cols)), Probability.bin,
                 fill=list(n = 0))
}

#Use function to perform operation
test_data %>% 
    bin_by_p_value(P_value, Comparison, Test_method)

When I perform the operation manually, everything works fine. When I use the function, it fails with this error:

Error in overscope_eval_next(overscope, expr) : object 'Comparison' not found

I've narrowed down the problem to the following piece of code in the function:

complete(nesting(UQS(group_by_cols)), Probability.bin...

If I remove the call to nesting(), the code executes without the error. However, I want to keep the behavior where only the combinations of the grouping variables present in the original data are used, and those are then crossed with every bin, so I can fill in all of the missing bins. Based on the error message and where this is failing, my guess is that this is a scoping/environment issue: I probably need a different environment for the grouping variables in nesting(), since it's contained inside the call to complete(). However, I'm new enough to dplyr programming that I'm not sure how to do that.

I tried to work around this by uniting the grouping columns into a single column, and then using that united column as input into complete(). This lets me perform the complete() operation the way I want to, while avoiding the nesting() function. However, I ran into trouble when I wanted to separate back into the original grouping columns, since I don't know how to convert a list of quosures into a character vector (required for the "into" parameter of separate()). Here are code snippets to illustrate what I'm talking about:

        #Fill in any missing bins with 0 counts
        unite(Merged_grouping_cols, UQS(group_by_cols), sep="*") %>% 
        complete(Merged_grouping_cols, Probability.bin,
                 fill=list(n = 0)) %>%
        separate(Merged_grouping_cols, into=c("What goes here?"), sep="\\*")

Here's the pertinent version info: R version 3.4.2 (2017-09-28), tidyr_0.7.2, dplyr_0.7.4

I'd appreciate any workarounds, but I want to know what I'm doing that's rubbing complete() and nesting() the wrong way.

1 Answer

  • Use curly-curly ({{ }}) for pvalue_col instead of enquo() and UQ().
  • Pass the dots (...) directly to count().
  • In nesting(), capture the dots with ensyms() and splice them in with !!!.

bin_by_p_value <- function(df,
                           pvalue_col, #Bare name of p-value column
                           ...) {      #Bare names of grouping columns
  
  #Perform the operation
  df %>% 
    #Start by binning the data according to p-value
    mutate(Probability.bin = cut({{pvalue_col}},
                                 breaks = c(-Inf, seq(0.1, 1, by=0.1), Inf),
                                 labels = c(seq(0.0, 1.0, by=0.1)),
                                 right = FALSE)) %>% 
    #Now summarize the results by bin.
    count(..., Probability.bin) %>% 
    #Fill in any missing bins with 0 counts
    complete(nesting(!!!ensyms(...)), Probability.bin,
             fill=list(n = 0))
}

test_data %>% bin_by_p_value(P_value, Comparison, Test_method)

# A tibble: 44 x 4
#   Comparison Test_method Probability.bin     n
#   <chr>      <chr>       <fct>           <dbl>
# 1 WT_vs_Mut1 MannWhitney 0                   1
# 2 WT_vs_Mut1 MannWhitney 0.1                 1
# 3 WT_vs_Mut1 MannWhitney 0.2                 0
# 4 WT_vs_Mut1 MannWhitney 0.3                 1
# 5 WT_vs_Mut1 MannWhitney 0.4                 1
# 6 WT_vs_Mut1 MannWhitney 0.5                 1
# 7 WT_vs_Mut1 MannWhitney 0.6                 0
# 8 WT_vs_Mut1 MannWhitney 0.7                 0
# 9 WT_vs_Mut1 MannWhitney 0.8                 1
#10 WT_vs_Mut1 MannWhitney 0.9                 4
# … with 34 more rows
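
To illustrate the third point, here is a small sketch of what `!!!ensyms(...)` builds. The helper name `show_nesting_call` is made up for this demo and is not part of the solution. ensyms() captures the dots as plain symbols rather than quosures, so after splicing, the call looks just like the hand-written `nesting(Comparison, Test_method)`, which is presumably why it behaves like the manual version.

```r
library(rlang)

# Hypothetical helper for illustration only: return the nesting() call
# as an unevaluated expression so it can be inspected.
show_nesting_call <- function(...) {
  # ensyms() turns each bare name in ... into a symbol;
  # !!! splices that list of symbols in as separate arguments.
  expr(nesting(!!!ensyms(...)))
}

show_nesting_call(Comparison, Test_method)
```

Note that ensyms() only accepts bare names or strings, so expressions such as `toupper(Comparison)` cannot be passed as grouping columns with this approach.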

To check the result, store the output of the manual call in res and compare it with the function's output:

identical(res, test_data %>% bin_by_p_value(P_value, Comparison, Test_method))
#[1] TRUE
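
As for the unite()/separate() workaround from the question, the same ensyms() trick can supply the character vector for `into`. The following is only a sketch: the function name `bin_by_p_value_united` and the variable `group_names` are invented here, it assumes a tidyr version where unite() accepts `all_of()`, and it assumes that `*` never occurs in the values of the grouping columns.

```r
library(dplyr)
library(tidyr)

# Sketch of the unite()/separate() variant (names are hypothetical).
bin_by_p_value_united <- function(df, pvalue_col, ...) {
  # Convert the bare grouping column names into a plain character vector
  group_names <- unname(vapply(ensyms(...), rlang::as_string, character(1)))

  df %>%
    # Bin the data according to p-value
    mutate(Probability.bin = cut({{ pvalue_col }},
                                 breaks = c(-Inf, seq(0.1, 1, by = 0.1), Inf),
                                 labels = c(seq(0.0, 1.0, by = 0.1)),
                                 right = FALSE)) %>%
    # Summarize the results by bin
    count(..., Probability.bin) %>%
    # Merge the grouping columns so complete() sees only observed combinations
    unite("Merged_grouping_cols", all_of(group_names), sep = "*") %>%
    complete(Merged_grouping_cols, Probability.bin, fill = list(n = 0)) %>%
    # Split the merged column back into the original grouping columns
    separate(Merged_grouping_cols, into = group_names, sep = "\\*")
}
```

It is called the same way as before, e.g. `test_data %>% bin_by_p_value_united(P_value, Comparison, Test_method)`. The nesting() version is still the more direct route, since it avoids the round trip through a merged string column.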