1
votes

I would like to apply a function to an R data table object that compares values in two columns and returns a result. Here's the example, for data table X:

X <- as.data.table(list(POSITION=c(1,4,5,9,24,36,42,56),
   FIRST=c("A","BB","AA","B","AAA","B","A,B"),
   SECOND=c("B","AA","B","AAA","BBB","AB,ABB","B,A")))

   POSITION FIRST SECOND
1:        1     A      B
2:        4    BB     AA
3:        5    AA      B
4:        9     B    AAA
5:       24   AAA    BBB
6:       36     B AB,ABB
7:       42   A,B    B,A
8:       56     A      B

I would like to perform the following logical comparisons of the data in columns "FIRST" and "SECOND", to create a "RESULT" column:

 SAME = length of FIRST and SECOND are both one character
 BLOCK = Character length of FIRST and SECOND are the same,
         but greater than one, and not mixed (i.e. no comma)
 LESS = SECOND has fewer characters, but neither is mixed
 MORE = SECOND has more characters, but neither is mixed
 MIXED = either firs of second contains a comma

Thus, the desired result would look like:

POSITION FIRST SECOND RESULTS
1        A     B      SAME
4        BB    AA     BLOCK
5        A     B,A    MIXED    
9        AA    B      LESS
24       B     AAA    MORE
28       BBB   A,B    MIXED
36       AAA   BBB    BLOCK
42       B     AB,ABB MIXED
56       A,B   B,A    MIXED

So the following works, but is slow over a file with 4 million rows!

X[, RESULT := ifelse(nchar(FIRST)+nchar(SECOND)==2,"SAME",
    ifelse(grepl(",", FIRST) | grepl(",",SECOND), "MIXED",
       ifelse(nchar(FIRST) > nchar(SECOND), "LESS",
          ifelse(nchar(FIRST) < nchar(SECOND), "MORE","BLOCK")))]

But it does give thew desired result:

   POSITION FIRST SECOND RESULT
1:        1     A      B   SAME
2:        4    BB     AA  BLOCK
3:        5    AA      B   LESS
4:        9     B    AAA   MORE
5:       24   AAA    BBB  BLOCK
6:       36     B AB,ABB  MIXED
7:       42   A,B    B,A  MIXED
8:       56     A      B   SAME

I actually have several more conditions to test, and some of them get more complicated that just character counts. Rather than a long ifelse statement, is it possible to apply a function, taking the two columns as input? For example:

checkType <- function(x) {
  if(nchar(x$FIRST)+nchar(x$SECOND)==2) {
    type <- "SNP"
  } else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) > nchar(x$SECOND))) {
    type <- "LESS"
  } else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) < nchar(x$SECOND))) {
    type <- "MORE"
  } else if (!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) == nchar(x$SECOND)) & nchar(x$SECOND)>1) {
    type <-"BLOCK"
  } else {
    type <- "MIXED"
  }
  return(type)
}

> checkType(X[1,])
[1] "SAME"

for(i in 1:nrow(X)) X[i, RESULT := checkType(X[i,])]

So while the above works, it's obviously not the optimal way to run things with data.table. However, I tried lapply and apply, but neither work:

X[, RESULT3 := lapply(.SD, checkType)]
 Error in x$FIRST : $ operator is invalid for atomic vectors 
  nchar(x$FIRST) 
  FUN(X[[1L]], ...) 
  lapply(.SD, checkType) 
  eval(expr, envir, enclos) 
  eval(jsub, SDenv, parent.frame()) 
  `[.data.table`(X, , `:=`(RESULT3, lapply(.SD, checkType))) 
  X[, `:=`(RESULT3, lapply(.SD, checkType))] 

Same result with apply(.SD, 1, checkType). Is what I am trying to do possible by applying a function?

2
I'm guessing that if you did it in steps and avoided ifelse altogether it'd be faster - i.e. X[nchar(FIRST)+nchar(SECOND)==2, RESULT := "SAME"]; X[grepl(",", FIRST) | grepl(",",SECOND), RESULT := "MIXED"]; ...eddi
You probably shouldn't do it (because it's slow), but one way to use that function would be X[,RESULT3:=checkType(.SD),by=1:nrow(X)]Frank
This worked (unlike apply or lapply), but was 40 times slower than your answer below. I guess that using apply or lapply when comparing values in a subset of columns is not possible. Certainly, you can apply functions like "sum" on a subset of columns, but to have a list of conditions to apply to X$Col1 & X$Col2 and return different results accordingly, can't be done using 'lapply' or 'apply'.Pete

2 Answers

1
votes

Note that the data table produced by your code (first line below, pasted from your snippet above), is not the same as the data table shown in the "desired results" box below it.

Nevertheless, this might actually be faster, and would definitely be easier to understand. It produces a result which I think is consistent with your rules.

X <- as.data.table(list(POSITION=c(1,4,5,9,24,36,42,56),
                        FIRST=c("A","BB","AA","B","AAA","B","A,B"),
                        SECOND=c("B","AA","B","AAA","BBB","AB,ABB","B,A")))

X$mixed <- grepl(',',X$FIRST) | grepl(',',X$SECOND)
X$nf    <- nchar(X$FIRST)
X$ns    <- nchar(X$SECOND)
X$RESULT = ""

setkey(X,nf,ns)
X[J(1,1),RESULT:="SAME"]
X[!mixed & nf==ns & nf>1 & ns>1]$RESULT <- "BLOCK"
X[!mixed & nf > ns]$RESULT <- "LESS"
X[!mixed & nf < ns]$RESULT <- "MORE"
X[(mixed)]$RESULT <- "MIXED"
setkey(X,POSITION)

Your categories are not mutually exclusive, so I assume these rules apply in order (for example what about FIRST="," and SECOND=","?

Also, I think your definitions of MORE and LESS are the same.

1
votes

So both the answers from @Frank and @jlhoward give the desired result, and were much quicker than my initial attempt. From these answers however, this approach (createResult1) was about 4 times faster over a file with 1,000,000 rows:

createResult1 <- function(X) {
  X[,`:=`(
    cf=nchar(FIRST),
    cs=nchar(SECOND),
    mf=grepl(',',FIRST),
    ms=grepl(',',SECOND)
    )]
  X[cf==1&cs==1, RESULT:="SAME"]
  X[cf > cs, RESULT:="LESS"]
  X[cf < cs, RESULT:="MORE"]
  X[cf==cs & cs>1, RESULT:="BLOCK"]
  X[(mf)|(ms), RESULT:="MIXED"]
  X[,c('cf','cs','mf','ms'):=NULL]
  return(X)
}

createResult2 <- function(X) { #@Frank
  X[,`:=`(
    cf=nchar(FIRST),
    cs=nchar(SECOND),
    mf=grepl(',',FIRST),
    ms=grepl(',',SECOND)
  )][,RESULT:=ifelse(cf==1&cs==1,"SAME",
                     ifelse(mf | ms, "MIXED",
                            ifelse(cf > cs, "LESS",
                                   ifelse(cf < cs, "MORE","BLOCK"))))
     ][
       ,c('cf','cs','mf','ms'):=NULL
        ]
  return(X)
}

createResult3 <- function(X) { #@jlhoward
  X$mixed <- grepl(',',X$FIRST) | grepl(',',X$SECOND)
  X$nf    <- nchar(X$FIRST)
  X$ns    <- nchar(X$SECOND)
  X$RESULT = ""

  setkey(X,nf,ns)
  X[J(1,1),RESULT:="SAME"]
  X[!mixed & nf==ns & nf>1 & ns>1]$RESULT <- "BLOCK"
  X[!mixed & nf > ns]$RESULT <- "LESS"
  X[!mixed & nf < ns]$RESULT <- "MORE"
  X[(mixed)]$RESULT <- "MIXED"
  X[,c('nf','ns','mixed'):=NULL]
  setkey(X,POSITION)
  return(X)
}

Create the same data table as above, but with 1,000,000 rows

X <- as.data.table(list(POSITION=rep(c(1,4,5,9,24,36,42,56),1000000),
                        FIRST=rep(c("A","BB","AA","B","AAA","B","A,B"),1000000),
                        SECOND=rep(c("B","AA","B","AAA","BBB","AB,ABB","B,A"),1000000)))
Y <- copy(X)
Z <- copy(X)

Here are the results:

> system.time(X <- createResult1(X))
   user  system elapsed 
   4.06    0.05    4.12
> system.time(Y <- createResult2(Y))
   user  system elapsed 
  18.53    0.36   18.94 
> system.time(Z <- createResult2(Z))
   user  system elapsed 
  18.63    0.29   18.97 
> identical(X,Y)
[1] TRUE
> identical(X,Z)
[1] TRUE