I am trying to apply multiple conditions to multiple columns of a data.frame, where condition i should be applied to column i, i.e. the condition applied depends on the column I am in. I have a working solution, but it has two major drawbacks: it is potentially slow on large data because it uses a for loop, and it requires the two input vectors ("columns the condition is applied to" and "condition to be applied") to be in the same order. I envisage a solution that uses fast data-wrangling packages (e.g. dplyr, data.table) and is more flexible with respect to the order of the argument vector elements. An example should make it clear (here the condition is only a threshold test, but in the bigger problem it may be a more complex boolean expression involving variables of the data set).
t <- structure(list(a = c(2L, 10L, 10L, 10L, 3L),
b = c(5L, 10L, 20L, 20L, 20L),
c = c(100L, 100L, 100L, 100L, 100L)),
.Names = c("a", "b", "c"),
class = "data.frame",
row.names = c(NA, -5L))
foo_threshold <-
  function(data, cols, thresholds, condition_name) {
    # one logical output column per input column
    df <- data.frame(matrix(ncol = length(cols), nrow = nrow(data)))
    colnames(df) <- paste0(cols, "_", condition_name)
    for (i in seq_along(cols)) {
      # index by column name, so cols need not be the first columns of data
      df[, i] <- data[[cols[i]]] > thresholds[i]
    }
    return(df)
  }
foo_threshold(data = t, cols = c("a", "b"), thresholds = c(5, 18),
condition_name = "bigger_threshold")
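One way to drop both the loop and the ordering requirement is to carry the column names inside the threshold vector itself. The following is only a minimal base R sketch under that assumption: `threshold_by_name` and `dat` are hypothetical names (the data is the same as `t` above, renamed to avoid masking `base::t`), and the thresholds are passed as a named numeric vector.

```r
# Example data from the question, named `dat` here to avoid masking base::t
dat <- data.frame(a = c(2L, 10L, 10L, 10L, 3L),
                  b = c(5L, 10L, 20L, 20L, 20L),
                  c = rep(100L, 5))

# Hypothetical helper: `thresholds` is a NAMED numeric vector, so each
# threshold is matched to its column by name rather than by position.
threshold_by_name <- function(data, thresholds, condition_name) {
  cols <- names(thresholds)
  stopifnot(!is.null(cols), all(cols %in% names(data)))
  # Map() pairs each selected column with its threshold;
  # `>` is vectorised over the rows of that column
  out <- Map(function(x, th) x > th, data[cols], thresholds)
  names(out) <- paste0(cols, "_", condition_name)
  as.data.frame(out)
}

# The order of the thresholds no longer has to follow the column order
res <- threshold_by_name(dat, c(b = 18, a = 5), "bigger_threshold")
print(res)
```

Because the pairing is done by name, the output columns simply follow the order of the threshold vector.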
I have tried to solve it in a dplyr chain, but I fail to pass the argument vectors correctly, i.e. I don't know how to make it clear that it should apply condition i to column i. Below is an illustration of where I was going. It doesn't work and it misses some points, but I think it illustrates what I am trying to achieve. Note that here the conditions are assumed to be in a data.frame whose variable column holds the column names, and the threshold is extracted via a lookup (dplyr filter + select chain).
foo_threshold <- function(data, cols, thresholds, cond_name) {
require(dplyr)
# fun to evaluate boolean condition
foo <- function(x) {
threshold <- thresholds %>% filter(variable==x) %>% select(threshold)
temp <- ifelse(x > threshold, T, F)
return(temp)
}
vars <- setNames(cols, paste0(cols,"_",cond_name))
df_out <-
data %>%
select_(.dots = cols) %>%
mutate_(funs(foo(.)), vars) %>%
select_(.dots = names(vars))
return(df_out)
}
# create threshold table
temp <-
data.frame(variable = c("a", "b"),
threshold = c(5, 18),
stringsAsFactors = F)
# call function (doesn't work)
foo_threshold(data = t, thresholds = temp, cond_name = "bigger_threshold")
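For comparison, here is a hedged sketch of how the lookup-table idea might be expressed with current dplyr (version 1.0.0 or later is assumed), where `across()` together with `cur_column()` tells the function which column it is being applied to. The names `foo_threshold_across`, `dat` and `threshold_tbl` are hypothetical. The underscore verbs (`select_`, `mutate_`) and `funs()` used in the attempt above are deprecated in recent dplyr releases, so they are not used here.

```r
library(dplyr)

# Same data as in the question, named `dat` to avoid masking base::t
dat <- data.frame(a = c(2L, 10L, 10L, 10L, 3L),
                  b = c(5L, 10L, 20L, 20L, 20L),
                  c = rep(100L, 5))

# Threshold table; the row order is deliberately different from the column order
threshold_tbl <- data.frame(variable = c("b", "a"),
                            threshold = c(18, 5),
                            stringsAsFactors = FALSE)

# Hypothetical rewrite: the columns to test are taken from the table itself
foo_threshold_across <- function(data, thresholds, cond_name) {
  # named vector used as a lookup: column name -> threshold
  lookup <- setNames(thresholds$threshold, thresholds$variable)
  cols <- names(lookup)
  new_names <- paste0(cols, "_", cond_name)
  data %>%
    mutate(across(all_of(cols),
                  # cur_column() returns the name of the column being processed
                  ~ .x > lookup[[cur_column()]],
                  .names = paste0("{.col}_", cond_name))) %>%
    select(all_of(new_names))
}

res <- foo_threshold_across(dat, threshold_tbl, "bigger_threshold")
print(res)
```

Looking the threshold up by column name is what removes the dependency on argument order; whether this is faster than the loop on large data would need to be measured.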
Edit: @thepule The data.frame of conditions may look like the one below, where x is the column, so each condition is evaluated for each row of its corresponding column.
conditions <-
data.frame(variable = c("a", "b"),
condition = c("x > 5 & x < 10", "!x %in% c('o', 'p')"),
stringsAsFactors = F)
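If the conditions are stored as strings like this, one possible approach is to parse each string and evaluate it with x bound to the matching column. The sketch below assumes the conditions are written as valid R (`&` for a logical "and", single quotes inside the strings). `eval_conditions`, `dat` and `cond_tbl` are hypothetical names, and the example data is made up so that both conditions give mixed results.

```r
# Made-up example data: one numeric and one character column
dat <- data.frame(a = c(2L, 10L, 7L, 10L, 3L),
                  b = c("o", "q", "p", "r", "s"),
                  stringsAsFactors = FALSE)

# Condition table in the shape described above; x stands for the column
cond_tbl <- data.frame(
  variable  = c("a", "b"),
  condition = c("x > 5 & x < 10", "!x %in% c('o', 'p')"),
  stringsAsFactors = FALSE)

# Hypothetical helper: evaluates each condition string against its column
eval_conditions <- function(data, conditions, cond_name) {
  out <- Map(function(col, cond) {
    # parse the string into an expression, then evaluate it with
    # x bound to the column; other names resolve in the base environment
    eval(parse(text = cond),
         envir = list(x = data[[col]]),
         enclos = baseenv())
  }, conditions$variable, conditions$condition)
  names(out) <- paste0(conditions$variable, "_", cond_name)
  as.data.frame(out)
}

res <- eval_conditions(dat, cond_tbl, "check")
print(res)
```

Note that `eval(parse(text = ...))` runs whatever code the strings contain, so this is only reasonable when the condition table comes from a trusted source.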
t
as that is the transpose function – bouncyball