I want to pass user-supplied column names to dplyr's group_by() and summarize() functions.
A direct example of the data frame and code is below. Here the column names are hard-coded.
library(dplyr)
df <- data.frame('Category' = c('a','c','a','a','b','a','b','b'),
'Amt' = c(100,300,200,400,500,1000,350,250),
'Flag' = c(0,1,1,1,0,1,1,0))
rowCount <- nrow(df)
totalAmt <- sum(df$Amt)
g <- group_by(df, Category)
summ <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount, TotalAmt = sum(Amt), AmtPercentage = sum(Amt)*100/totalAmt, FlagSum = sum(Flag))
summ
The output is below
In the application I am developing, the data frame, and hence the column names, will be user-defined. I will read the .csv file name, the column(s) to group on, and the columns to summarize from an Excel file.
After a lot of searching, reading and experimenting, I arrived at the solution shown below, which works for me. I have avoided piping to make the individual steps clearer.
#The data frame df is read from the .csv file name
#Variables read from the Excel file
groupby <- 'Category'
sumBy1 <- 'Amt'
sumBy2 <- 'Flag'
rowCount <- nrow(df)
totalAmt <- sum(df[sumBy1])
g <- group_by_(df, groupby) #group by variable #grouping
summcount <- summarize(g, Count = n(), CountPercentage = n()*100/rowCount) #summarize counts #piece 1
summamt <- summarize_at(g, .vars = sumBy1, .funs=sum) #summarize by first variable
summamt <- summamt[-1] #remove first column to remove duplicate column
summamt$AmtPercentage <- summamt[sumBy1]*100/totalAmt #piece 2
summflag <- summarize_at(g, .vars = sumBy2, .funs=sum) #summarize by second variable
summflag <- summflag[-1] #remove first column to remove duplicate column #piece 3
summ <- cbind(summcount, summamt, summflag) #combine dataframes
summ
The result is the same as above. As you can see, I am building the final data frame piecemeal and then binding the pieces together, which is ugly. Also, how do I set the column headers with this approach? I did consider summarize_all(), but that requires creating a subset of the data frame first. I have already read the following questions, and their answers did not work for me:
Passing arguments to dplyr summarize function
Summarizing data in table by group for each variable in r
Can you recommend a simpler and more elegant way to do this?
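For reference, here is a minimal sketch of the kind of single-call version I am hoping for. It assumes dplyr 1.0 or later (where group_by_() is deprecated) and uses across(all_of()) for the grouping column and the .data pronoun for the summarized columns. I have not tested it beyond this toy data, so please treat it as an illustration rather than a confirmed solution.

```r
library(dplyr)

df <- data.frame(Category = c('a','c','a','a','b','a','b','b'),
                 Amt = c(100,300,200,400,500,1000,350,250),
                 Flag = c(0,1,1,1,0,1,1,0))

# Column names as strings, standing in for values read from the Excel file
groupby <- 'Category'
sumBy1 <- 'Amt'
sumBy2 <- 'Flag'

rowCount <- nrow(df)
totalAmt <- sum(df[[sumBy1]])

# across(all_of()) accepts a character vector of grouping columns
g <- group_by(df, across(all_of(groupby)))

# .data[[name]] looks up a column by its string name inside summarize(),
# so all output columns can be named in a single call
summ <- summarize(g,
                  Count = n(),
                  CountPercentage = n() * 100 / rowCount,
                  TotalAmt = sum(.data[[sumBy1]]),
                  AmtPercentage = sum(.data[[sumBy1]]) * 100 / totalAmt,
                  FlagSum = sum(.data[[sumBy2]]))
summ
```

Because every output column is named on the left-hand side of its own argument, this would also address the column header problem, with no need for cbind() or dropping duplicate columns.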
Above, I have hard-coded two types of summarization, namely count and sum. To complicate things further, what if the user also wants to define the type of summarization (sum, mean, count, etc.) for each variable? In the Excel file, I can capture the type of summarization required against each variable.
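To make this second requirement concrete, here is a rough sketch of what I imagine. The spec data frame and its column names (column, fun) are hypothetical stand-ins for whatever I would read from the Excel file, and the allow-list of function names is my own assumption. The idea is to build one expression per row with rlang::call2() and splice the named list into summarize() with !!!. The count is handled separately with n(), since it does not apply to a particular column.

```r
library(dplyr)

df <- data.frame(Category = c('a','c','a','a','b','a','b','b'),
                 Amt = c(100,300,200,400,500,1000,350,250),
                 Flag = c(0,1,1,1,0,1,1,0))
groupby <- 'Category'

# Hypothetical spec: one row per (column, summary function) pair
spec <- data.frame(column = c('Amt', 'Flag', 'Amt'),
                   fun = c('sum', 'sum', 'mean'),
                   stringsAsFactors = FALSE)

# Only allow known function names, since they come from user input
allowed <- c('sum', 'mean', 'min', 'max', 'median')
stopifnot(all(spec$fun %in% allowed))

# Build calls such as sum(.data[["Amt"]]), one per row of the spec
exprs <- Map(function(col, fun) rlang::call2(fun, rlang::expr(.data[[!!col]])),
             spec$column, spec$fun)

# Output column names, e.g. Amt_sum, Flag_sum, Amt_mean
names(exprs) <- paste(spec$column, spec$fun, sep = '_')

summ2 <- df %>%
  group_by(across(all_of(groupby))) %>%
  summarize(Count = n(), !!!exprs)
summ2
```

The names given to the list control the output column headers, so a third column in the spec could presumably carry user-chosen header names instead of the pasted ones.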
Thanks for any suggestions.