0
votes

I'm trying to find a better and faster way to put together a table of summary statistics comprised of weighted averages. Using dplyr to summarise and then bind_rows I end up with a table like this. These numbers are simple averages The averages are calculated for each factor for each group.

Dataframe: au.scores

         AU    AUDIT     CORC      GOV      PPS     TMSC    TRAIN
1 Group1 2.833333 2.000000 2.733333 2.000000 1.750000 2.333333
2 Group2 2.833333 0.000000 2.733333 2.000000 1.750000 2.333333
3 Group3 1.833333 2.533333 2.466667 2.000000 2.500000 2.166667
4 Group4 3.000000 2.733333 2.200000 2.666667 1.583333 2.666667
5 Group5 2.625000 1.816667 2.533333 2.166667 1.895833 2.375000

Subsequent to this I need to derive a weighted score that combines the elements of each variable and groups 1 & 2 with 3, 4, 5. I.e., Overall.Group1 is Group1+Group4+Group5, Group2 is Group2+Group4+Group5 and Group3 is Group3+Group4+Group5 factors.

group1.overall <- data.frame(
  group1.gov = (au.scores[3, 4] * .30) * .33 + (au.scores[1, 4] * .30) * .33 +
    (au.scores[2, 4] * .30) * .33,
  group1.corc = (au.scores[3, 3] * .30) * .33 + (au.scores[1, 3] * .1) * .33 +
    (au.scores[2, 3] * .1) * .33,
  group1.tmsc = (au.scores[3, 6] * .30) * .33 + (au.scores[1, 6] * .30) * .33 +
    (au.scores[2, 6] * .30) * .33,
  group1.audit = (au.scores[3, 2] * .30) * .33 + (au.scores[1, 2] * .30) * .33 +
    (au.scores[2, 2] * .30) * .33,
  group1.pps = (au.scores[3, 5] * .30) * .33 + (au.scores[1, 5] * .30) * .33 +
    (au.scores[2, 5] * .30) * .33,
  group1.train = (au.scores[3, 7] * .30) * .33 + (au.scores[1, 7] * .30) * .33 +
    (au.scores[2, 7] * .30) * .33
)

Produces

  group1.gov group1.corc group1.tmsc group1.audit group1.pps group1.train
1  0.7854   0.3168    0.594    0.7425   0.594    0.6765

Question Is there a quicker way to create the data.frame of overall scores?

Something like

Group_Num / Gov / Corc / Tmsc / Audit / PPS / Train / Overall
Group1 / 0.78 / 0.31 / 0.59 / 0.74 / 0.59 / 0.67 / <- sum these 
Group2 / 0.66 / 0.23 / 0.44 / 0.66 / 0.22 / 0.43 / <- sum these
Group3 / 0.12 / 0.55 / 0.22 / 0.33 / 0.11 / 0.55 / <- sum these

etc

1
Have you tried data.table. It is pretty quick (much quicker than data.frame)Duy Bui

1 Answers

1
votes

Overall.Group1 is Group1+Group4+Group5, Group2 is Group2+Group4+Group5 and Group3 is Group3+Group4+Group5 factors.

Your description of how overall scores are calculated is different from your formula for group1.overall, which uses Group1 <- Group1+Group2+Group3 instead. In the approach below, I'm going by the description. You can tweak that if necessary:

library(dplyr); library(tidyr); library(tibble)

# read in au.scores data frame
au.scores <- read.table(text = "AU    AUDIT     CORC      GOV      PPS     TMSC    TRAIN
Group1 2.833333 2.000000 2.733333 2.000000 1.750000 2.333333
Group2 2.833333 0.000000 2.733333 2.000000 1.750000 2.333333
Group3 1.833333 2.533333 2.466667 2.000000 2.500000 2.166667
Group4 3.000000 2.733333 2.200000 2.666667 1.583333 2.666667
Group5 2.625000 1.816667 2.533333 2.166667 1.895833 2.375000", header = T)

# create table of weights (these are dummy weights since there's insufficient details in the question)
weight.table <- tribble(
  ~AU, ~GOV, ~CORC, ~TMSC, ~AUDIT, ~PPS, ~TRAIN,
  "Group1",.30,.10,.30,.30,.30,.30,
  "Group2",.30,.10,.30,.30,.30,.30,
  "Group3",.30,.10,.30,.30,.30,.30,
  "Group4",.30,.30,.30,.30,.30,.30,
  "Group5",.30,.10,.30,.30,.30,.30
)

# arrange columns in au.scores to match order of columns in weight.table
au.scores <- au.scores %>% arrange(AU, GOV, CORC, TMSC, AUDIT, PPS, TRAIN)

# calculate weighted scores
au.scores.weighted <- au.scores[,-1] * weight.table[,-1]
au.scores.weighted$AU <- au.scores$AU

# calculate scores for each group
au.scores.weighted <- au.scores.weighted %>%
  gather(category, weighted.score, -AU) %>%
  group_by(category) %>%
  arrange(AU) %>%
  summarise(group1 = weighted.mean(weighted.score, c(1,0,0,1,1)) * 3 * 0.33,
            group2 = weighted.mean(weighted.score, c(0,1,0,1,1)) * 3 * 0.33,
            group3 = weighted.mean(weighted.score, c(0,0,1,1,1)) * 3 * 0.33) %>%
  ungroup()

# rearrange result & calculate overall sum for each group
au.scores.weighted <- au.scores.weighted %>%
  gather(group, score, -category) %>%
  spread(category, score) %>%
  select(group, GOV, CORC, TMSC, AUDIT, PPS, TRAIN) %>%
  mutate(Overall = GOV + CORC + TMSC + AUDIT + PPS + TRAIN)

# A tibble: 3 × 8
   group       GOV    CORC      TMSC    AUDIT       PPS     TRAIN  Overall
   <chr>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
1 group1 0.7391999 0.39655 0.5176874 0.837375 0.6765001 0.7301250 3.897437
2 group2 0.7391999 0.33055 0.5176874 0.837375 0.6765001 0.7301250 3.831437
3 group3 0.7128000 0.41415 0.5919374 0.738375 0.6765001 0.7136251 3.847388

Edit to add explanation of code, based on OP's question:

What is the significance of the order of the vectors in the summarise function? c(1,0,0,1,1)) * 3 * 0.33 && c(0,1,0,1,1)) * 3 * 0.33 && c(0,0,1,1,1))?

The earlier step already arranged groups in order, within each category, so using weights c(1, 0, 0, 1, 1) in the weighted.mean function is equivalent to calculating the mean for groups 1, 4, & 5, without using groups 2 & 3 at all. Ditto c(0,1,0,1,1) = mean of groups 2, 4, & 5, `c(0,0,1,1,1) = mean of groups 3, 4, & 5. I find this easier to read / error check than specifying each group manually, which can quickly bury the group numbers in a pile of text.

The mean thus derived is equivalent to (sum of groups) / 3, or (sum of groups) * 0.3333333333333333... in the decimal system, since 1/3 is a recurring fraction. Since your original formula uses (sum of groups) * 0.33 (rounded off at 2 decimal places), multiplying the mean with * 3 * 0.33 would produce the same result. If you prefer the more precise result, you can leave out the * 3 * 0.33 part entirely.