I would like to apply a function to an R data table object that compares values in two columns and returns a result. Here's the example, for data table X:
X <- as.data.table(list(POSITION=c(1,4,5,9,24,36,42,56),
FIRST=c("A","BB","AA","B","AAA","B","A,B"),
SECOND=c("B","AA","B","AAA","BBB","AB,ABB","B,A")))
POSITION FIRST SECOND
1: 1 A B
2: 4 BB AA
3: 5 AA B
4: 9 B AAA
5: 24 AAA BBB
6: 36 B AB,ABB
7: 42 A,B B,A
8: 56 A B
I would like to perform the following logical comparisons of the data in columns "FIRST" and "SECOND", to create a "RESULT" column:
SAME = length of FIRST and SECOND are both one character
BLOCK = Character length of FIRST and SECOND are the same,
but greater than one, and not mixed (i.e. no comma)
LESS = SECOND has fewer characters, but neither is mixed
MORE = SECOND has more characters, but neither is mixed
MIXED = either firs of second contains a comma
Thus, the desired result would look like:
POSITION FIRST SECOND RESULTS
1 A B SAME
4 BB AA BLOCK
5 A B,A MIXED
9 AA B LESS
24 B AAA MORE
28 BBB A,B MIXED
36 AAA BBB BLOCK
42 B AB,ABB MIXED
56 A,B B,A MIXED
So the following works, but is slow over a file with 4 million rows!
X[, RESULT := ifelse(nchar(FIRST)+nchar(SECOND)==2,"SAME",
ifelse(grepl(",", FIRST) | grepl(",",SECOND), "MIXED",
ifelse(nchar(FIRST) > nchar(SECOND), "LESS",
ifelse(nchar(FIRST) < nchar(SECOND), "MORE","BLOCK")))]
But it does give thew desired result:
POSITION FIRST SECOND RESULT
1: 1 A B SAME
2: 4 BB AA BLOCK
3: 5 AA B LESS
4: 9 B AAA MORE
5: 24 AAA BBB BLOCK
6: 36 B AB,ABB MIXED
7: 42 A,B B,A MIXED
8: 56 A B SAME
I actually have several more conditions to test, and some of them get more complicated that just character counts. Rather than a long ifelse statement, is it possible to apply a function, taking the two columns as input? For example:
checkType <- function(x) {
if(nchar(x$FIRST)+nchar(x$SECOND)==2) {
type <- "SNP"
} else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) > nchar(x$SECOND))) {
type <- "LESS"
} else if(!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) < nchar(x$SECOND))) {
type <- "MORE"
} else if (!grepl(",", x$SECOND) & !grepl(",",x$FIRST) & (nchar(x$FIRST) == nchar(x$SECOND)) & nchar(x$SECOND)>1) {
type <-"BLOCK"
} else {
type <- "MIXED"
}
return(type)
}
> checkType(X[1,])
[1] "SAME"
for(i in 1:nrow(X)) X[i, RESULT := checkType(X[i,])]
So while the above works, it's obviously not the optimal way to run things with data.table. However, I tried lapply and apply, but neither work:
X[, RESULT3 := lapply(.SD, checkType)]
Error in x$FIRST : $ operator is invalid for atomic vectors
nchar(x$FIRST)
FUN(X[[1L]], ...)
lapply(.SD, checkType)
eval(expr, envir, enclos)
eval(jsub, SDenv, parent.frame())
`[.data.table`(X, , `:=`(RESULT3, lapply(.SD, checkType)))
X[, `:=`(RESULT3, lapply(.SD, checkType))]
Same result with apply(.SD, 1, checkType). Is what I am trying to do possible by applying a function?
ifelse
altogether it'd be faster - i.e.X[nchar(FIRST)+nchar(SECOND)==2, RESULT := "SAME"]; X[grepl(",", FIRST) | grepl(",",SECOND), RESULT := "MIXED"]; ...
– eddiX[,RESULT3:=checkType(.SD),by=1:nrow(X)]
– Frank