I wish to pass user input variables to group_by() and summarize() functions.

A direct example of the data frame and code is below, with the column names hard-coded.

library(dplyr)
df <- data.frame('Category' = c('a','c','a','a','b','a','b','b'), 
             'Amt' = c(100,300,200,400,500,1000,350,250), 
             'Flag' = c(0,1,1,1,0,1,1,0))
rowCount <- nrow(df)
totalAmt <- sum(df$Amt)
g <- group_by(df, Category)
summ <- summarize(g,
                  Count = n(),
                  CountPercentage = n()*100/rowCount,
                  TotalAmt = sum(Amt),
                  AmtPercentage = sum(Amt)*100/totalAmt,
                  FlagSum = sum(Flag))
summ

The output (originally posted as an image) has one row per Category, with the columns Count, CountPercentage, TotalAmt, AmtPercentage and FlagSum.

In the application I am developing, the data frame, and hence the column names, will be user-defined. I will be reading the .csv file name, the column(s) to group on, and the columns to summarize from an Excel file.

After much searching, reading and experimenting, I found the solution shown below, which works for me. To make the steps clearer, I have not used piping.

#The data frame df is read from the .csv file name
#Variables read from the Excel file
groupby <- 'Category'
sumBy1 <- 'Amt'
sumBy2 <- 'Flag'

rowCount <- nrow(df)
totalAmt <- sum(df[sumBy1])

g <- group_by_(df, groupby) #group by variable  #grouping

summcount <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount) #summarize counts  #piece 1

summamt <- summarize_at(g, .vars = sumBy1, .funs=sum) #summarize by first variable
summamt <- summamt[-1] #remove first column to remove duplicate column
summamt$AmtPercentage <- summamt[sumBy1]*100/totalAmt  #piece 2

summflag <- summarize_at(g, .vars = sumBy2, .funs=sum) #summarize by second variable
summflag <- summflag[-1] #remove first column to remove duplicate column #piece 3

summ <- cbind(summcount, summamt, summflag)  #combine dataframes
summ

The result is the same as above. As you can see, I am creating the final data frame piecemeal and then binding the pieces together. The code is ugly. Also, how do I define the column headers in this syntax? I did consider summarize_all(), but that requires creating a subset of the data frame. I have already read the following questions, and they did not work for me:

Passing arguments to dplyr summarize function

Summarizing data in table by group for each variable in r

Can you recommend a simpler and more elegant way to do this?

Above I have 'hardcoded' two types of summarization, viz. count and sum. To add another level of complication, what if the user wants to also define the type of summarization (viz. sum, mean, count, etc.) required? In the Excel file, I can capture the type of summarization needed against each variable.

Thanks for any suggestions.

1 Answer

That sounds like a job for Superman! Or at least quasi-quotations.

You want to insert variables using the bang-bang operator, !!.

You can do it like this

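library(dplyr)

# Make a variable symbol from strings
make_var <- function(prefix, var, suffix) {
    as.symbol(paste0(prefix, var, suffix))
}

calc_summary <- function(df, groupby, sumBy1, sumBy2) {
    # Output column names, derived from the input column names
    totalSumBy1      <- make_var("Total", sumBy1, "")
    sumBy1Percentage <- make_var("", sumBy1, "Percentage")
    sumBy2Sum        <- make_var("", sumBy2, "Sum")
    # Input columns as symbols, so they can be unquoted with !!
    groupby          <- make_var("", groupby, "")
    sumBy1           <- make_var("", sumBy1, "")
    sumBy2           <- make_var("", sumBy2, "")

    # group_by_() is deprecated; unquote the grouping symbol instead
    group_by(df, !!groupby) %>%
        summarize(Count = n(),
                  # nrow(df) replaces the global rowCount; this entry fixes the
                  # column position, and mutate() below recomputes it from Count
                  CountPercentage = n() / nrow(df),
                  !!totalSumBy1 := sum(!!sumBy1),
                  !!sumBy2Sum := sum(!!sumBy2)) %>%
        mutate(CountPercentage = Count/sum(Count),
               !!sumBy1Percentage := 100 * !!totalSumBy1 / sum(!!totalSumBy1))
}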

When you use !! you are inserting the value of a variable, so this is how you can parameterise expressions given to dplyr functions. You need them as symbols, which is why I use the make_var function. It can be done more elegantly, but this will give you the variables you used in your example.
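As one possible reading of "more elegantly", here is a minimal sketch that skips the symbol helper and looks columns up by string through the .data pronoun. The function name calc_summary_data and its local variables are hypothetical, and the sketch assumes dplyr 1.0.0 or later (for across(), all_of() and the .groups argument). The columns come out in a different order than in calc_summary.

```r
library(dplyr)

# Hypothetical variant of calc_summary(): columns are looked up by string
# with .data[[...]] instead of being converted to symbols first.
calc_summary_data <- function(df, groupby, sumBy1, sumBy2) {
    total_name <- paste0("Total", sumBy1)       # e.g. "TotalAmt"
    pct_name   <- paste0(sumBy1, "Percentage")  # e.g. "AmtPercentage"
    sum2_name  <- paste0(sumBy2, "Sum")         # e.g. "FlagSum"

    df %>%
        group_by(across(all_of(groupby))) %>%
        summarize(Count = n(),
                  !!total_name := sum(.data[[sumBy1]]),
                  !!sum2_name  := sum(.data[[sumBy2]]),
                  .groups = "drop") %>%
        mutate(CountPercentage = Count / sum(Count),
               !!pct_name := 100 * .data[[total_name]] / sum(.data[[total_name]]))
}

# Same data as in the question
df <- data.frame(Category = c('a','c','a','a','b','a','b','b'),
                 Amt = c(100,300,200,400,500,1000,350,250),
                 Flag = c(0,1,1,1,0,1,1,0))
res <- calc_summary_data(df, "Category", "Amt", "Flag")
```

A string on the left of := still needs !!, but no conversion to a symbol.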

Notice that when the variables we assign to are dynamic we must use the := assignment instead of =. Otherwise, the parser complains.
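A small sketch of the difference, using a toy data frame and a hypothetical output name (toy and out_name are made up for illustration):

```r
library(dplyr)

toy <- data.frame(Amt = c(100, 300, 200))
out_name <- as.symbol("TotalAmt")  # hypothetical output column name

# Works: := accepts an unquoted expression on its left-hand side
res <- summarize(toy, !!out_name := sum(Amt))

# Does not parse: R only allows a plain name or a string before `=` in a call
# summarize(toy, !!out_name = sum(Amt))
```

The failure in the commented-out line happens at parse time, before dplyr ever sees the call, which is why a separate operator is needed.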

You can call the function like this:

> df %>% calc_summary("Category", "Amt", "Flag")
# A tibble: 3 x 6
  Category Count CountPercentage TotalAmt FlagSum AmtPercentage
  <fct>    <int>           <dbl>    <dbl>   <dbl>         <dbl>
1 a            4           0.500    1700.      3.         54.8 
2 b            3           0.375    1100.      1.         35.5 
3 c            1           0.125     300.      1.          9.68

The order of columns is not the same as in your example, but you can fix that using select. I cleaned up the percentage calculations a bit by moving those to a mutate after the summary. It removes the need for the rowCount variable. If you prefer, you can easily use that variable and avoid the mutate call. Then you can also get the columns in the order you want in the summarise call.
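For the reordering step, a minimal sketch: the data frame below is typed in by hand from the printed summary above, and the wanted vector is a hypothetical column order matching the question. It assumes a dplyr version that provides all_of() (1.0.0 or later).

```r
library(dplyr)

# Values copied by hand from the printed summary above
summ <- data.frame(
    Category        = c("a", "b", "c"),
    Count           = c(4L, 3L, 1L),
    CountPercentage = c(0.500, 0.375, 0.125),
    TotalAmt        = c(1700, 1100, 300),
    FlagSum         = c(3, 1, 1),
    AmtPercentage   = c(54.8, 35.5, 9.68)
)

# Column order used in the question; all_of() selects by character vector,
# so the order can itself be built from user input
wanted <- c("Category", "Count", "CountPercentage",
            "TotalAmt", "AmtPercentage", "FlagSum")
reordered <- select(summ, all_of(wanted))
```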

Anyway, the important point is that you want the bang-bang operator for what you are doing here.
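The question also asks about letting the user choose the kind of summary per column. The answer above does not cover that, so what follows is only a sketch under stated assumptions: the spec data frame stands in for whatever is read from the Excel file, summarize_by_spec and the allowed list are hypothetical, and dplyr 1.0.0 or later is assumed. Each row of the spec is turned into a call such as sum(.data[["Amt"]]), and the calls are spliced into summarize() with !!!.

```r
library(dplyr)

# Hypothetical spec, standing in for what would be read from the Excel file
spec <- data.frame(column = c("Amt", "Flag"),
                   fun    = c("sum", "mean"),
                   stringsAsFactors = FALSE)

summarize_by_spec <- function(df, groupby, spec) {
    # Only accept known function names rather than arbitrary user input
    allowed <- c("sum", "mean", "min", "max", "median")
    stopifnot(all(spec$fun %in% allowed))

    # Build one call per spec row, e.g. sum(.data[["Amt"]])
    exprs <- lapply(seq_len(nrow(spec)), function(i) {
        col <- spec$column[i]
        rlang::call2(spec$fun[i], rlang::expr(.data[[!!col]]))
    })
    # The list names become the output column names, e.g. "Amt_sum"
    names(exprs) <- paste0(spec$column, "_", spec$fun)

    df %>%
        group_by(across(all_of(groupby))) %>%
        summarize(Count = n(), !!!exprs, .groups = "drop")
}

# Same data as in the question
df <- data.frame(Category = c('a','c','a','a','b','a','b','b'),
                 Amt = c(100,300,200,400,500,1000,350,250),
                 Flag = c(0,1,1,1,0,1,1,0))
res <- summarize_by_spec(df, "Category", spec)
```

Counts are handled separately through n(), since a count does not take a column argument.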