4
votes

There are many posts which discuss applying a function over many columns when using data.table. However I need to calculate a function which depends on many columns. As an example:

# Create a data table with 26 columns.  Variable names are var1, ..., var 26
data.mat = matrix(sample(letters, 26*26, replace=TRUE),ncol=26)
colnames(data.mat) = paste("var",1:26,sep="")
data.dt <- data.table(data.mat)

Now, say I would like to count the number of 'a's in columns 5,6,7 and 8. I cannot see how to do this with SDcols and end up doing:

data.dt[,numberOfAs := (var5=='a')+(var6=='a')+(var7=='a')+(var7=='a')]

Which is very tedious. Is there a more sensible way to do this?

Thanks

1
Seems like your data may be in the wrong shape (wide vs long). Are you sure this is the best structure? - MrFlick

1 Answers

9
votes

I really suggest going through the vignettes linked here. Section 2e from the Introduction to data.table vignette explains .SD and .SDcols.

.SD is just a data.table containing the data for current group. And .SDcols tells the columns .SD should have. A useful way is to use print to see the content.

# .SD contains cols 5:8
data.dt[, print(.SD), .SDcols=5:8]

Since there is no by here, .SD contains all the rows of data.dt, corresponding to the columns specified in .SDcols.


Once you understand this, the task reduces to your knowledge of base R really. You can accomplish this in more than one way.

data.dt[, numberOfAs := rowSums(.SD == "a"), .SDcols=5:8]

We return a logical matrix by comparing all the columns in .SD to "a". And then use rowSums to sum them up.

Another way using Reduce:

data.dt[, numberOfAs := Reduce(`+`, lapply(.SD, function(x) x == "a")), .SDcols=5:8]