1
votes

I am trying to apply multiple conditions to multiple columns of a data.frame where condition i should be applied to column i, i.e. the applied condition is dependent on the column I am in. I have a working solution but it has two major drawbacks, it's potentially slow on large data as it uses a for loop and it requires the two input vectors "columns the condition is applied to" and "condition to be applied" in the same order. I envisaged a solution that utilizes fast data wrangling package functions e.g. dplyr, data.table and is more flexible with respect to order of argument vector elements. an example should make it clear (here the condition is only a threshold test but in the bigger problem it may be a more complex boolean expression involving variables of the data set).

t <- structure(list(a = c(2L, 10L, 10L, 10L, 3L), 
                    b = c(5L, 10L, 20L, 20L, 20L), 
                    c = c(100L, 100L, 100L, 100L, 100L)), 
               .Names = c("a", "b", "c"), 
               class = "data.frame", 
               row.names = c(NA, -5L))

foo_threshold <- 
  function(data, cols, thresholds, condition_name){
    df <- data.frame(matrix(ncol = length(cols), nrow = nrow(data)))
    colnames(df) <- paste0(cols, "_", condition_name)

    for (i in 1:length(cols)){
      df[,i] <- ifelse(data[,i] > thresholds[i],T,F)
      }
    return(df)
    }

foo_threshold(data = t, cols = c("a", "b"), thresholds = c(5, 18), 
              condition_name = "bigger_threshold")

I have tried to solve it in a dplyr chain but I fail to pass the argument vectors correctly, i.e. how to make it clear that he should apply condition i to column i. below an illustration where I was going. it's not working and it misses some points but I think it illustrates what I am trying to achieve. note that here conditions are assumed to be in a data.frame where column variable holds the col names and threshold is extracted via a lookup (dplyr filer + select chain).

foo_threshold <- function(data, cols, thresholds, cond_name) {
  require(dplyr)
  # fun to evaluate boolean condition
  foo <- function(x) {
    threshold <- thresholds %>% filter(variable==x) %>% select(threshold)
    temp <- ifelse(x > threshold, T, F)
    return(temp)
    }

  vars <- setNames(cols, paste0(cols,"_",cond_name))

  df_out <-
    data %>%
    select_(.dots = cols) %>%
    mutate_(funs(foo(.)), vars) %>%
    select_(.dots = names(vars))

  return(df_out)
  }

# create threshold table
temp <- 
  data.frame(variable = c("a", "b"),
             threshold = c(5, 18),
             stringsAsFactors = F)

# call function (doesn't work)
foo_threshold(data = t, thresholds = temp, cond_name = "bigger_threshold")

Edit: @thepule data.frame of conditions may look like below where x is the column. so each condition is evaluated for each row of its corresponding column.

conditions <- 
  data.frame(variable = c("a", "b"),
             condition = c("x > 5 and x < 10", "!x %in% c("o", "p")"),
             stringsAsFactors = F)
4
I would caution against naming an object t as that is the transpose functionbouncyball
thanks for the hint. it was supposed to stand for temp and I was lazy.Triamus
OK, it is definitely trickier with conditions in that format. Interesting problem though.thepule
OK, it is definitely trickier with conditions in that format. Interesting problem though. What do "o" and "p" represent there?thepule
Example was not perfect. Wanted to make clear that boolean does not neef to be applied only to numeric but also character variables. But of course it didn't make sense to compare against numeric and character values at the same time.Triamus

4 Answers

1
votes

Final attempt at an answer. Tried to make the code more generic so that it can accept arbitrary functions. Nicely it also appears to run significantly faster than any of my previous answers. I am also quite tired atm so apologies if I have made a silly mistake.

temp <- structure(list(a = c(2L, 10L, 10L, 10L, 3L), 
                       b = c(5L, 10L, 20L, 20L, 20L), 
                       c = c(100L, 100L, 100L, 100L, 100L)), 
                  .Names = c("a", "b", "c"), 
                  class = "data.frame", 
                  row.names = c(NA, -5L))



condition <- c(function(x)  x> 5  , 
               function(x) x > 18 )


foo_threshold <- function ( data , cols , threshold , condition_name ) {
    dat <- data[0]
    for ( i in 1:length(condition))     dat[cols[i]] <- condition[[i]]( data[[cols[i]]] )
    names(dat) <- paste0( cols , "_" , condition_name)
    return(dat)
}


foo_threshold(data = temp, cols = c("a", "b"), threshold  = condition , 
           condition_name = "bigger_threshold")
2
votes

Made another attempt using sweep instead of mapply. Left the previous answer as it was as I feel it adds value showing how inefficient mapply is. This new answer appears to run at a little over twice as fast as the OP. I think it is a tiny bit slower than the current best rated answer but has slightly more concise code.

It runs even faster if you are willing to accept the result as a matrix instead of a data.frame.

library(dplyr) 


temp <- structure(list(a = c(2L, 10L, 10L, 10L, 3L), 
                    b = c(5L, 10L, 20L, 20L, 20L), 
                    c = c(100L, 100L, 100L, 100L, 100L)), 
                    .Names = c("a", "b", "c"), 
                    class = "data.frame", 
                    row.names = c(NA, -5L))

foo_threshold <- function(data , cols , thresholds , condition_name ) {
    dat <- sweep ( data [ cols ] , 2  ,  thresholds , ">" ) %>%  as.data.frame()
    names(dat) <- paste0(names(dat) , "_" , condition_name)
    return(dat)
} 

foo_threshold(data = temp, cols = c("a", "b"), thresholds = c(5, 18), 
              condition_name = "bigger_threshold")
1
votes

Try this:

library(dplyr)

 foo_threshold <-
     function(data, cols, thresholds, condition_name){
         temp <- rbind(data[,cols], thresholds) %>%
         lapply(function(x) x[1:length(x)-1] > last(x)) %>% data.frame()
         colnames(temp) <- paste0(cols, "_", condition_name)
         return(temp)

     }

 foo_threshold(data = t, cols = c("a", "b"), thresholds = c(5, 18), 
               condition_name = "bigger_threshold")

In order to test which is faster:

test <- data.frame(a = runif(10000000), b = runif(10000000), stringsAsFactors = F)

 lapply(list(foo_threshold_original, foo_threshold),
        function(x) system.time(x(data = test, cols = c("a", "b"), thresholds = c(0.5, 0.8), 
                      condition_name = "bigger_threshold")))

where foo_threshold_original is your initial version. The result is:

[[1]]
   user  system elapsed 
   3.95    0.64    4.58 

[[2]]
   user  system elapsed 
   1.73    0.24    1.96

So the new version is actually faster on bigger data frames.

0
votes

How about this ? Doesn't use dplyr (I loaded it anyway to use the pipes though)

library(dplyr)

foo_threshold <- function( data , cols , thresholds , condition_name){
    dat <- mapply( function(x , val)  x > val  , data[cols] , thresholds ) %>%  as.data.frame
    names(dat) <- paste0(names(dat) , "_" , condition_name)
    return(dat)
}

edit: simplified