2
votes

I very often use logical vectors to subset other vectors, matrices and data frames (in the genomics field, it's very common). On such vector would be made like so:

condition <- myNucleotideVector == "G" 

then I work on subsets matching that condition. So I often end up with clumsy code like:

myNucleotideVector <- myNucleotideVector[condition]
object2 <- object2[condition]
dataframe1 <- dataframe1[conditon,]

or

result <- myNucleotideVector[condition] - object2[condition] + dataframe1[conditon,2]

which repeats [condition] many times. I could put all vectors of same size and a data frame and subset it, but I don't always want to do it (in particular given the length of vectors, which can be millions).

I'm looking for an efficient solution. For the first case quoted above, I thought about a function that takes any object provided as argument and subsets it. That would look like

subsetObjects <- function(..., condition, env = globalenv()) {          
    call <- as.character(match.call())
    names <- call[2:(length(call)-1)] #this isn't ideal as it doesn't consider the case where one would place 'condition' argument before the objects to subset
    for (name in names) {    
        value <- get(name, envir = env)
        assign(name, subset(value, subset = condition),envir = env)
    }
}

As you see in the comment, it isn't perfect. Maybe someone can suggest something more efficient.

For the second case, I'm looking for something similar to with(), in which every vector, matrix or data frame would be automatically subset according to the condition. That would look like

result <- withCondition(condition, expression)

If no such function exist, I could write my own, but I'm not sure how to do it.

Thanks

Jean

1
why not just use something like lapply(list_of_matrices, function(x) x[condition, ]) ?? or even easier: lapply(list_of_matrices, subset, vector == "C") , though I am not sure I unterstand 100 % what you want, one clear example from start to finish would help - grrgrrbla
Doing so would first require creating a list of objects and then retrieving objects from the list, and I'm not much used to working with lists. But I take note.n - jeanlain
I want to turn result <- myNucleotideVector[condition] - object2[condition] + dataframe1[conditon,2] into result <- withCondition(condition, myNucleotideVector - object2 + dataframe1[,2]) but I need an appropriate withCondition function(). - jeanlain

1 Answers

1
votes

Here's one idea, perhaps a little unusual: Instead of working directly with the underlying objects in your code, you could create a single "getter" function which would take only one argument: the name of the variable you want to instantiate at that point in the code. You could take it as a string, or, even better, use substitute() to allow an unquoted symbol to be used (actually, I ended up using as.character(substitute(var)), so both work). Inside the function you can look up a "global condition" to decide how to subset the variable, if it should be subsetted at all. For maximum flexibility, the lookup table can also map each variable to a specific condition for that variable. Here's how I envision it:

## lookup table and getter
cond.to.vec <- list();
COND.NAME.GLOBAL <- '!global';
var.to.cond <- list();
cond.register <- function(name,vec=NULL) {
    prev.vec <- cond.to.vec[[name]];
    cond.to.vec[[name]] <<- vec;
    invisible(prev.vec);
};
cond.is.registered <- function(name) !is.null(cond.to.vec[[name]]);
cond.map <- function(var.name,cond.name=NULL) {
    ## remove previous mapping
    prev.mapping <- var.to.cond[[var.name]];
    var.to.cond[[var.name]] <<- NULL;
    ## omit cond.name arg to just remove
    if (is.null(cond.name)) return(invisible(prev.mapping));
    ## ensure cond.name has been registered
    if (!cond.is.registered(cond.name)) stop(paste0(cond.name,' not registered'));
    ## now add new cond.name mapping for var.name
    var.to.cond[[var.name]] <<- cond.name;
    invisible(prev.mapping);
};
cond.set <- function(var,cond.vec=NULL,sym=T) {
    var.name <- if (sym) as.character(substitute(var)) else var;
    cond.register(var.name,cond.vec);
    cond.map(var.name,if (is.null(cond.vec)) NULL else var.name);
};
cond.set.global <- function(vec=NULL) cond.register(COND.NAME.GLOBAL,vec);
cond.look.up <- function(var.name) {
    ## 1: specific condition
    cond.name <- var.to.cond[[var.name]];
    if (!is.null(cond.name)) return(cond.to.vec[[cond.name]]);
    ## 2: global condition
    vec <- cond.to.vec[[COND.NAME.GLOBAL]];
    if (!is.null(vec)) return(vec);
    ## 3: no condition
    T;
};

ss <- function(var,sym=T) {
    ## whitelist subsettables
    if (!typeof(var)%in%sapply(list(as.raw(0),T,0L,0,0i,'',list(),expression()),typeof))
        return(var);
    var.name <- if (sym) as.character(substitute(var)) else var;
    vec <- cond.look.up(var.name);
    if (length(dim(var)) == 2L) var[vec,] else var[vec];
};

## test data
set.seed(1);
N <- 10;
myNucleotideVector <- sample(c('A','C','T','G'),N,replace=T);
myNucleotideVectorNum <- sample(100:200,N,replace=T);
object2 <- seq_len(N);
dataframe1 <- data.frame(base=sample(c('A','C','T','G'),N,replace=T),x=sample(1:100,N));

## global condition
cond.set.global(myNucleotideVector == 'G');

## main code, uses global condition
result <- ss(myNucleotideVectorNum) - ss(object2) + ss(dataframe1)$x;

## register separate condition for object2
cond.set(object2,object2%%3 == 0);
result2 <- ss(myNucleotideVectorNum) - ss(object2) + ss(dataframe1)$x;

## unset/unregister all conditions to work with the entire data set
cond.set.global();
cond.set(object2);
result3 <- ss(myNucleotideVectorNum) - ss(object2) + ss(dataframe1)$x;

result;
## [1] 153 208 240
result2;
## [1] 154 208 238
result3;
##  [1] 168 175 266 153 252 208 240 203 196 206

Now we can enhance the above code with some more functions to provide a less invasive means of applying subsetting conditions:

ss.all.sub <- function(pt) {
    if (typeof(pt) == 'symbol') ## wrap all symbols in ss()
        as.call(list(as.symbol('ss'),pt))
    else if (typeof(pt) == 'language' && length(pt) >= 2L) ## handle function args
        as.call(c(pt[[1]], ## pass function symbol untouched
            if (as.character(pt[[1]]) == '$') ## special case for $ operator
                list(ss.all.sub(pt[[2]]),pt[[3]]) ## pass RHS untouched
            else
                lapply(pt[-1],ss.all.sub) ## recurse on all args
        ))
    else pt; ## pass literals and nullary calls untouched
};

ss.all <- function(expr) eval(ss.all.sub(substitute(expr)));

ss.with <- function(cond.arg,expr) {
    if (is.list(cond.arg)) {
        prevs <- vector('list',length(cond.arg));
        for (i in seq_along(cond.arg)) {
            name <- names(cond.arg)[i];
            prevs[i] <- list(
                if (isTRUE(name != '' && name != COND.NAME.GLOBAL))
                    cond.set(name,cond.arg[[i]],sym=F)
                else
                    cond.set.global(cond.arg[[i]])
            );
        };
    } else prev <- cond.set.global(cond.arg);
    res <- eval(ss.all.sub(substitute(expr)));
    if (is.list(cond.arg)) {
        for (i in seq_along(cond.arg)) {
            name <- names(cond.arg)[i];
            if (isTRUE(name != '' && name != COND.NAME.GLOBAL))
                cond.set(name,prevs[[i]],sym=F)
            else
                cond.set.global(prevs[[i]]);
        };
    } else cond.set.global(prev);
    res;
};

## demo parse tree substitution
ss.all.sub(substitute(myNucleotideVectorNum - object2 + dataframe1$x));
## ss(myNucleotideVectorNum) - ss(object2) + ss(dataframe1)$x

## demo using ss.with() to apply an inline condition
ss.with(myNucleotideVector == 'G',myNucleotideVectorNum - object2 + dataframe1$x);
## [1] 153 208 240
ss.with(
    list(myNucleotideVector == 'G',object2=object2%%3 == 0),
    myNucleotideVectorNum - object2 + dataframe1$x
);
## [1] 154 208 238
ss.with(T,myNucleotideVectorNum - object2 + dataframe1$x);
##  [1] 168 175 266 153 252 208 240 203 196 206