3
votes

Here is the simple problem I'm trying to solve: I have a data.table like following table, and I'm trying to use dcast.data.table function to calculate number of advancement for each group, but also I'm interested to calculate median of grades in each group:

set.seed(10);
DT = data.table(GROUP = sample(c("a","b","c"),100,replace = T), 
                ADVANCED = sample(c("ADVANCED","DROP"),100,replace = T),
                GRADE = sample(1:10,100, replace=T))

     GROUP ADVANCED GRADE
  1:     b ADVANCED     3
  2:     a ADVANCED     6
  3:     b ADVANCED     7
  4:     c ADVANCED     9

 95:     b     DROP     6
 96:     c ADVANCED     5
 97:     a     DROP    10
 98:     b ADVANCED     1
 99:     c     DROP     6
100:     a     DROP     2
     GROUP ADVANCED GRADE

Essentially here is the result I'm looking for:

result = merge(
  dcast.data.table(DT,.Primitive("~")(GROUP,ADVANCED)),
  dcast.data.table(DT,.Primitive("~")(GROUP,.),
                   value.var="GRADE", 
                   fun.aggregate=median));

setnames(result,".","MEDIAN_GRADE")

   GROUP ADVANCED DROP MEDIAN_GRADE
1:     a       17   19            6
2:     b       20   21            7
3:     c       13   10            6

Now I'm wondering how can I do it without making two separate dcast tables and merge at the end. I'm dealing with many row and column in my tables and grouping by key is a bottleneck. I'm wondering is there a better way to calculate this?

** Since my first question was vague I edit completely (thanks to Frank and Akrun for their feedback).

2
You should probably use set.seed before making random examples. - Frank
There's reshape2::dcast(DT, GROUP ~ ADVANCED, margins="ADVANCED") but you lose access to the improvements made in data.table::dcast. It looks like margins never got implemented: github.com/Rdatatable/data.table/issues/1214 - Frank
This looks like a completely different question from the one you posted earlier - akrun
you are right, we could solve previous questions with many different ways . - Mahdi Jadaliha
I agree with you, please accept my apology. - Mahdi Jadaliha

2 Answers

4
votes

For the updated question

setnames(dcast(DT, GROUP~ADVANCED, length)[dcast(DT, GROUP~., median),
            on = "GROUP"], ".", "MEDIAN_GRADE")[]
#   GROUP ADVANCED DROP MEDIAN_GRADE
#1:     a       17   19            6
#2:     b       20   21            7
#3:     c       13   10            6

Or a faster approach would be to group by 'GROUP', get the median of 'GRADE' and then do the join on the dcast output

DT[,.(MEDIAN_GRADE = median(GRADE)) , .(GROUP)][
              dcast(DT, GROUP ~ ADVANCED, length), on = 'GROUP']
3
votes

Well, you could compute in long form and then reshape:

dcast(DT[, rbind(
  .SD[, .(v = .N), by=.(stat = paste0("n.",ADVANCED))],
  .(stat = "med", v = as.numeric(median(GRADE)))
), by=GROUP], GROUP ~ stat)

   GROUP med n.ADVANCED n.DROP
1:     a   6         17     19
2:     b   7         20     21
3:     c   6         13     10

Obviously, this still involves tons of manual fiddling. It also requires that your stats all be numeric (since they are stacked in the stat column together before the dcast). I think the approach in @akrun's answer -- like DT[, f(...), by=GROUP][dcast(DT, GROUP ~ x), on=GROUP] -- is much better, limiting the dcast to only those calls that need it.