0
votes

I would like to use aggregate to calculate simple summary metrics over multiple columns, based on certain conditions, e.g. only for those rows that are > 0. subsetting the data for values > 0 prior to using aggregate obviously won't work, as that will remove the entire row for all columns, even if only one zero occurs. See the following code for illustration:

idA <- c("A","A","A","A","A","B","B","B","B","B")
idB <- c("C","D","C","D","C","D","C","D","C","D")
colA <- c(0,2,3,0,0,3,9,5,6,1)
colB <- c(9,3,0,2,2,4,6,1,9,9)
colC <- c(0,0,5,7,3,9,8,1,2,3)

df <- data.frame(idA,idB,colA,colB,colC)

aggregate(.~idA+idB,df,FUN=NROW)

The aggregate command in this form, of course, makes no sense, as all the columns have an equal number of rows.

This is the result I'm looking for:

idA  idB  colA  colB  colC
A    C    1     2     2
B    C    2     2     2
A    D    1     2     1
B    D    3     3     3

Hence the need for a conditional statement, including only rows > 0. Alternatively, I'm sure there is a smart way of doing this via data.table. Any help would be much appreciated!

1
aggregate(.~ idA + idB, df , function(x) length(x[x > 0])) - David Arenburg
Thanks @DavidArenburg! If you post your comment as an answer, I'll tag it accordingly. - M.Teich
Someone did already and then deleted - David Arenburg

1 Answers

1
votes

In data.table, you can do the following:

setDT(df)
df[,lapply(.SD, function(x) sum(x > 0)),.(idA,idB), .SDcols = setdiff(names(df), c('idA','idB'))]