How can I preserve one of the grouping names when using dply::summarise? Or, is there a better way to preserve one of the group names? I am probably going about this rather inefficiently.
I have a data.frame (df) as such:
dput(head(df, n = 20))
structure(list(file_src = c("CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx", "CBG_EFD.xlsx",
"CBG_EFD.xlsx"), AU = c("CBD", "CBD", "CBD", "CBD", "CBD", "CBD",
"CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD", "CBD",
"CBD", "CBD", "CBD", "CBD", "CBD"), BU = c("OAO", "Constr", "Retail",
"OAO", "Constr", "Retail", "OAO", "Constr", "Retail", "OAO",
"Constr", "Retail", "OAO", "Constr", "Retail", "OAO", "Constr",
"Retail", "OAO", "Constr"), CC = c("AUDIT", "AUDIT", "AUDIT",
"AUDIT", "AUDIT", "AUDIT", "CORC", "CORC", "CORC", "CORC", "CORC",
"CORC", "CORC", "CORC", "CORC", "CORC", "CORC", "CORC", "CORC",
"CORC"), CA_LVL = c("AUDIT01", "AUDIT01", "AUDIT01", "AUDIT02",
"AUDIT02", "AUDIT02", "CORC01", "CORC01", "CORC01", "CORC02",
"CORC02", "CORC02", "CORC03", "CORC03", "CORC03", "CORC04", "CORC04",
"CORC04", "CORC05", "CORC05"), Score = c(1, 1, 2, 1, 3, 3, 1,
3, 2, 2, 4, 2, 2, 3, 1, 4, 2, 3, 3, 2)), .Names = c("file_src",
"AU", "BU", "CC", "CA_LVL", "Score"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Defintions Where AU is a set of five (5) 'groups' and BU is a set of fifty-five (55) units all of which belong to one of the five AUs. Parent-child relationship. Score is a raw number 0-4. Control_Category is a variable of which there are six (string values).
Currently, my code is broken out such that the script performs two levels of grouping and aggregating of scores to give a simple average. I group first at the AU level to get simple averages across all of the units of a given group for their respective Categories (CC). In the end, I have five data.frames (cbg.au.stat.wide, cbd.au.stat.wide, etc). These dfs represent the average score for a given category across all units of a given group.
# Group1 assessment unit scores
cbg.au.stat.wide <- df %>%
group_by(AU, CC) %>%
filter(AU == "CBG") %>%
summarise(avg = mean(Score, na.rm = TRUE)) %>%
dcast(AU ~ CC, value.var = "avg") %>%
print() # end chain
Which produces:
cbg.au.stat.wide
AU AUDIT CORC GOV PPS TMSC TRAIN
1 CBG 3 2.733333 2.2 2.666667 1.583333 2.666667
Later, all of the 'AU level' data frames are combined using dplyr::bind_rows
au.avg.scores <- bind_rows(
bsa.au.stat.wide,bsg.au.stat.wide,cbd.au.stat.wide,
cbg.au.stat.wide,wmg.au.stat.wide)
au.avg.scores
AU AUDIT CORC GOV PPS TMSC TRAIN
1 BSA Admin 2.833333 2.000000 2.733333 2.000000 1.750000 2.333333
2 BSG 2.833333 0.000000 2.733333 2.000000 1.750000 2.333333
3 CBD 1.833333 2.533333 2.466667 2.000000 2.500000 2.166667
4 CBG 3.000000 2.733333 2.200000 2.666667 1.583333 2.666667
5 WMG 2.625000 1.816667 2.533333 2.166667 1.895833 2.375000
Then I perform a similar grouping and summarizing activity. Only this time instead of being at the AU level (parent) I do it at the BU level for each Category (CC). So, then for a given AU I know have a table of BU average scores across their Control Categories.
# Group1 business units by Control Category
cbg.bu.stat.wide <- df %>%
group_by(BU, CC) %>%
filter(AU == "CBG") %>%
summarise(avg = mean(Score, na.rm = TRUE)) %>%
dcast(BU ~ CC, value.var = "avg") %>%
print() # end chain
Which produces:
BU AUDIT CORC GOV PPS TMSC TRAIN
1 Capital Markets 3 3.2 1.6 4 1.00 3
2 EFD 4 2.6 1.6 3 1.75 3
3 Global Trade Solutions 3 2.4 3.4 1 2.00 2
4 Investigations 1 NA NA NA NA NA
What I think you'll notice here at the 'BU' level is that the 'AU' level has been dropped. Eventually, I'd like to combine all of these BUs into a big table that shows the BU and the AU from whence it originated
So that it would end up looking like this:
> bu.avg.scores
AU BU AUDIT CORC GOV PPS TMSC TRAIN
CBG Adherence 3.0 1.4 3.2 1 1.50 3.0
CBG CTR 2.0 2.8 2.0 4 1.50 2.5
CBG HRCU 3.5 1.8 3.0 1 2.25 1.5
CBD Investigations 2.0 NA NA NA NA NA
BSG ACH 2.0 0.0 2.0 4 1.50 2.5