r apply multiple conditions to multiple columns (vectors of function argument values)

Question

I am trying to apply multiple conditions to multiple columns of a data.frame where condition i should be applied to column i, i.e. the applied condition is dependent on the column I am in. I have a working solution but it has two major drawbacks, it's potentially slow on large data as it uses a for loop and it requires the two input vectors "columns the condition is applied to" and "condition to be applied" in the same order. I envisaged a solution that utilizes fast data wrangling package functions e.g. dplyr, data.table and is more flexible with respect to order of argument vector elements. an example should make it clear (here the condition is only a threshold test but in the bigger problem it may be a more complex boolean expression involving variables of the data set).

t <- structure(list(a = c(2L, 10L, 10L, 10L, 3L), 
                    b = c(5L, 10L, 20L, 20L, 20L), 
                    c = c(100L, 100L, 100L, 100L, 100L)), 
               .Names = c("a", "b", "c"), 
               class = "data.frame", 
               row.names = c(NA, -5L))

foo_threshold <- 
  function(data, cols, thresholds, condition_name){
    df <- data.frame(matrix(ncol = length(cols), nrow = nrow(data)))
    colnames(df) <- paste0(cols, "_", condition_name)

    for (i in 1:length(cols)){
      df[,i] <- ifelse(data[,i] > thresholds[i],T,F)
      }
    return(df)
    }

foo_threshold(data = t, cols = c("a", "b"), thresholds = c(5, 18), 
              condition_name = "bigger_threshold")

I have tried to solve it in a dplyr chain but I fail to pass the argument vectors correctly, i.e. how to make it clear that he should apply condition i to column i. below an illustration where I was going. it's not working and it misses some points but I think it illustrates what I am trying to achieve. note that here conditions are assumed to be in a data.frame where column variable holds the col names and threshold is extracted via a lookup (dplyr filer + select chain).

foo_threshold <- function(data, cols, thresholds, cond_name) {
  require(dplyr)
  # fun to evaluate boolean condition
  foo <- function(x) {
    threshold <- thresholds %>% filter(variable==x) %>% select(threshold)
    temp <- ifelse(x > threshold, T, F)
    return(temp)
    }

  vars <- setNames(cols, paste0(cols,"_",cond_name))

  df_out <-
    data %>%
    select_(.dots = cols) %>%
    mutate_(funs(foo(.)), vars) %>%
    select_(.dots = names(vars))

  return(df_out)
  }

# create threshold table
temp <- 
  data.frame(variable = c("a", "b"),
             threshold = c(5, 18),
             stringsAsFactors = F)

# call function (doesn't work)
foo_threshold(data = t, thresholds = temp, cond_name = "bigger_threshold")

Edit: @thepule data.frame of conditions may look like below where x is the column. so each condition is evaluated for each row of its corresponding column.

conditions <- 
  data.frame(variable = c("a", "b"),
             condition = c("x > 5 and x < 10", "!x %in% c("o", "p")"),
             stringsAsFactors = F)

I would caution against naming an object t as that is the transpose function — bouncyball
thanks for the hint. it was supposed to stand for temp and I was lazy. — Triamus
OK, it is definitely trickier with conditions in that format. Interesting problem though. — thepule
OK, it is definitely trickier with conditions in that format. Interesting problem though. What do "o" and "p" represent there? — thepule
Example was not perfect. Wanted to make clear that boolean does not neef to be applied only to numeric but also character variables. But of course it didn't make sense to compare against numeric and character values at the same time. — Triamus

gowerc gowerc · Accepted Answer · 2016-06-30T15:14:35

Final attempt at an answer. Tried to make the code more generic so that it can accept arbitrary functions. Nicely it also appears to run significantly faster than any of my previous answers. I am also quite tired atm so apologies if I have made a silly mistake.

temp <- structure(list(a = c(2L, 10L, 10L, 10L, 3L), 
                       b = c(5L, 10L, 20L, 20L, 20L), 
                       c = c(100L, 100L, 100L, 100L, 100L)), 
                  .Names = c("a", "b", "c"), 
                  class = "data.frame", 
                  row.names = c(NA, -5L))



condition <- c(function(x)  x> 5  , 
               function(x) x > 18 )


foo_threshold <- function ( data , cols , threshold , condition_name ) {
    dat <- data[0]
    for ( i in 1:length(condition))     dat[cols[i]] <- condition[[i]]( data[[cols[i]]] )
    names(dat) <- paste0( cols , "_" , condition_name)
    return(dat)
}


foo_threshold(data = temp, cols = c("a", "b"), threshold  = condition , 
           condition_name = "bigger_threshold")

r apply multiple conditions to multiple columns (vectors of function argument values)

4 Answers