Filter out data.table columns based on summary statistics

Question

I often need to filter out columns with a low variance from a data.table. The column names are not known in advance.

dt = data.table(mtcars)

# calculate standard deviation with arbitrary max value of 1:
mask = dt[,lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]

# The columns with the FALSE values in row 1 need to be removed
mask.t = t(mask)
mask.t = which(mask.t)
dt[,mask.t,with=FALSE]

The approach above is clunky. Is there a more elegant way to filter out columns out of a data.table for which the column statistic evaluates to TRUE?

dt[, names(mask)[unlist(mask)], with=FALSE] maybe? Or dt[, names(which(unlist(mask))), with=FALSE] — Arun
Thanks Arun, they both work. It was the unlist part that I missed. — Henk

Henk Henk · Accepted Answer · 2014-03-21T17:29:10

These work:

dt[, .SD, .SDcols=unlist(mask)] 

dt[, .SD, .SDcols=which(unlist(mask))]

All together now:

variance.filter = function(df) {
  mask = df[,lapply(.SD, function(x) sd(x,na.rm = TRUE) > 1)]
  df = df[, .SD, .SDcols = unlist(mask)] 
}

EDIT in the current development version of `data.table` (1.12.9), `.SDcols` accepts a function filter for columns, so this would work:

variance.filter = function(df) {
  df[ , .SD, .SDcols = function(x) sd(x, na.rm = TRUE) > 1]
}

Filter out data.table columns based on summary statistics

1 Answers

EDIT in the current development version of data.table (1.12.9), .SDcols accepts a function filter for columns, so this would work:

EDIT in the current development version of `data.table` (1.12.9), `.SDcols` accepts a function filter for columns, so this would work: