4
votes

I have recently been work with much larger datasets and have started learning and migrating to data.table to improve performance of aggregation/grouping. I have been unable to get certain expressions or functions to group as expected. Here is an example of a basic group by operation that I am having trouble with.

library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)

If I want to simply calculate the mean for each group by category. This works easily enough.

dt[,mean(value),by="category"]

    category          V1
 1:        1 -0.67555478
 2:        2 -0.50438413
 3:        3  0.29093723
 4:        4 -0.41684790
 5:        5  0.33921764
 6:        6  0.01970997
 7:        7 -0.23684245
 8:        8 -0.04280998
 9:        9  0.01838804
10:       10  0.44295978

I run into problems if I try and use the scale function or even a simple expression subtracting the value from itself. The grouping is ignored and I get the function/expression applied to each row instead. The following returns all 100 rows instead of 10 group by categories.

dt[,scale(value),by="category"]


dt[,value-mean(value),by="category"]

I thought recreating scale as function that returns a numeric vector instead of a matrix might help.

zScore <- function(x) {
  z=(x-mean(x,na.rm=TRUE))/sd(x,na.rm = TRUE)
  return(z) 
}

dt[,zScore(value),by="category"]

  category          V1
  1:        1 -1.45114132
  2:        1 -0.35304528
  3:        1 -0.94075418
  4:        1  1.44454416
  5:        1  1.39448268
  6:        1  0.55366652
  ....
 97:       10 -0.43190602
 98:       10 -0.25409244
 99:       10  0.35496694
100:       10  0.57323480
     category          V1

This also returns the zScore function applied to all rows (N=100) and ignoring the grouping. What am I missing in order to get scale() or a custom function to use the grouping like it did above when using mean()?

2
mean returns 1 value. scale function returns a scaled value for each input. That is, scale(1:5) gives 5 values. mean(1:5) gives 1 value. Does this help understand your problem?Arun
the function you apply on grouped variables should return 1 value and not vector.CHP
This is exactly the problem. I was trying to use data.table the way I had used ddply. How would I use data.table to equivalently achieve the following. in ddply? ddply(df,"category",transform, zscorebycategory=zScore(value))Aaron
@Aaron dt[,zscorebycategory:=zScore(value),by=category] I don't understand where you see a problem?Roland
@Roland You are right. I didn't quite realize that := behaved a bit differently and updates the table directly. When I had run it before I was expecting to assign it manually to dt. It is working now. Thanks very much for helping me clear this up.Aaron

2 Answers

2
votes

You've clarified in the comments that you'd like the same behaviour as:

ddply(df,"category",transform, zscorebycategory=zScore(value))

which gives:

    category       value zscorebycategory
1          1  0.28860691       0.31565682
2          1  1.17473759       1.33282374
3          1  0.06395503       0.05778463
4          1  1.37825487       1.56643607
etc

The data table option you gave gives:

     category          V1
  1:        1  0.31565682
  2:        1  1.33282374
  3:        1  0.05778463
  4:        1  1.56643607
  etc

Which is exactly the same data. However you'd like to also repeat the value column in your result, and rename the V1 variable with something more descriptive. data.table gives you the grouping variable in the result, along with the result of the expression you provide. So lets modify that to give the rows you'd like:

Your

dt[,zScore(value),by="category"]

becomes:

dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]

Where the named items in the list become columns in the result.

plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt   = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
> TRUE

(note I converted your ddply data.frame result into a data.table, to allow the identical command to work).

1
votes

Your claim that data.table does not group is wrong:

library(data.table)
category <- rep(1:2, each=4)
value <- c(rep(c(1:2),each=2),rep(c(4,10),each=2))
dt <- data.table(category, value)

   category value
1:        1     1
2:        1     1
3:        1     2
4:        1     2
5:        2     4
6:        2     4
7:        2    10
8:        2    10

dt[,value-mean(value),by=category]
   category   V1
1:        1 -0.5
2:        1 -0.5
3:        1  0.5
4:        1  0.5
5:        2 -3.0
6:        2 -3.0
7:        2  3.0
8:        2  3.0

If you want to scale/transform this is exactly the behavior you want, because these operations by definition return an object of the same size as the input.