1
votes

I converted iris data set to data.table format.My aim was to take mean of all columns and group them by Species using data.table.

DT <-as.data.table(iris)

Below is my desired output

DT[,.(mean(Sepal.Length),mean(Sepal.Width),mean(Petal.Length),mean(Petal.Width)),by =.(Species)]
      Species    V1    V2    V3    V4
1:     setosa 5.006 3.428 1.462 0.246
2: versicolor 5.936 2.770 4.260 1.326
3:  virginica 6.588 2.974 5.552 2.026

But typing all the column names using the above manner when we have large number of column is time confusing. I tried this using the below command.But the output is arranged in a slightly different manner

DT[, .(vapply(DT[,!'Species',with=FALSE],mean,FUN.VALUE =1)),by = .(Species)]
               Species       V1
         1:     setosa 5.843333
         2:     setosa 3.057333
         3:     setosa 3.758000
         4:     setosa 1.199333
         5: versicolor 5.843333
         6: versicolor 3.057333
         7: versicolor 3.758000
         8: versicolor 1.199333
         9:  virginica 5.843333
        10:  virginica 3.057333
        11:  virginica 3.758000
        12:  virginica 1.199333

is there any way to use above kind of things and get rid of typing all the column names just to take out the mean and group them by species using data.table

Please Do not suggest to use 'with = FALSE' . I already knew that.

1
Think this might be a duplicate, but DT[, lapply(.SD,mean), by=Species, .SDcols=names(DT)[1:4]] or similar should sort you out.thelatemail
Possibly a duplicate of stackoverflow.com/questions/16783598/… until I can find an exact match.thelatemail
that wasn't helpful for me. Can you perform it ?user5433002
@learner - it gives exactly the same output as your desired output doesn't it? In fact it can be slightly simplified to DT[, lapply(.SD,mean), by=Species] if you just want to exclude the by= variable.thelatemail
Have you gone through the Introduction to data.table vignette? Specifically section 2e.Arun

1 Answers

0
votes
iris <- as.data.table(iris)
iris <- iris[,lapply(.SD,mean), by=Species]

Desired Output:-

structure(list(Species = structure(1:3, .Label = c("setosa", 
"versicolor", "virginica"), class = "factor"), Sepal.Length = c(5.006, 
5.936, 6.588), Sepal.Width = c(3.428, 2.77, 2.974), Petal.Length = c(1.462, 
4.26, 5.552), Petal.Width = c(0.246, 1.326, 2.026)), row.names = c(NA, 
-3L), class = c("data.table", "data.frame"), .Names = c("Species", 
"Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"), 
.internal.selfref = <pointer: 0x0000000003fd0788>)

Suppose you want mean of only "Sepal.Length" and "Sepal.Width" columns. You can do that by slight modification as follows:-

iris <- iris[,lapply(.SD,mean), by=Species, .SDcols = c("Sepal.Length", "Sepal.Width")]

In data.table, DT[i,j,by], j must return a list and lapply returns list