2
votes

I have a matrix with multiple individuals in rows and multiple nucleotides (values) in columns. It looks like this:

     [,1][,2][,3][,4] ...
ind1   a   c   a   a
ind2   a   c   t   t
ind3   a   g   g   c
ind4   a   g   g   g
.
.
.

Now I would like to ignore all columns where only one value occurs (as in the example above the first column) and convert every column with two, three and four (no more than 4 is possible!) different nucleotides (values) into binary format. In the end it should look like this:

     [,1] [,2]  [,3] ...
ind1  10   100   1000
ind2  10   010   0100
ind3  01   001   0010
ind4  01   001   0001
.
.
.

For me it is only important to get the same binary code for if there are two, three or four different values. I was already calculating how many different values in each column occur, but I am not sure how to change the values to binary format:

df <- apply(df, 2, function(x) length(unique(x)))

Can someone help me?

2
Not clear how you are getting the expected output of '01' '10' where the column values are just 'a' for the first columnakrun
Not clear about the expected output. May be library(pryr);apply(df[-1], 2, function(x) {n <- length(unique(x)); substr(pryr::bits(x), n, n + n-1)})akrun

2 Answers

2
votes

Here is something else to try. A custom function will take each column through apply. First, you can create a vector of numeric values corresponding to unique characters in the column (unique is used, as factor will otherwise alphabetize the order). A string of zeroes the length of the maximum number will be made, and the character position corresponding to each value will then be substituted with "1".

my_fun <- function(x) {
  vec <- as.numeric(factor(x, levels = unique(x)))
  vec_max <- max(na.omit(vec))
  lapply(vec, 
         function(y) ifelse(!is.na(y), 
                            sub(paste0("(.{", y - 1, "})."), 
                                "\\11", 
                                paste0(rep("0", vec_max), collapse = "")), 
                            NA))
}

m[] <- matrix(unlist(apply(m, 2, my_fun)))

Output

     [,1] [,2] [,3]  [,4]  
ind1 "1"  "10" "100" "1000"
ind2 "1"  "10" "010" "0100"
ind3 "1"  "01" "001" "0010"
ind4 "1"  "01" "001" "0001"

Data

m <- structure(c("a", "a", "a", "a", "c", "c", "g", "g", "a", "t", 
"g", "g", "a", "t", "c", "g"), .Dim = c(4L, 4L), .Dimnames = list(
    c("ind1", "ind2", "ind3", "ind4"), NULL))
1
votes

Here is my attempt:

r1 <- c("a","c","a","a")
r2 <- c("a","c","t","t")
r3 <- c("a","g","g","c")
r4 <- c("a","g","g","g")

n.mat <- rbind(r1,r2,r3,r4)

number_to_nucleotide_binary <- function(x,len) {
  out <- rep("0",len)
  out[x] <- "1"
  return(paste(out,collapse = ""))
}

nuc_to_binary <- function(x) {
  
  len <- length(unique(x))
  char <- sort(unique(x))
  out <- x
  
  if(len != 1) {
    pos <- match(x,char)
    out <- sapply(X = pos,FUN = function(x) {number_to_nucleotide_binary(x = x,len = len)})
  }
  
  return(out)
}

apply(X = n.mat,FUN = nuc_to_binary,MARGIN = 2)

Input:

   [,1] [,2] [,3] [,4]
r1 "a"  "c"  "a"  "a" 
r2 "a"  "c"  "t"  "t" 
r3 "a"  "g"  "g"  "c" 
r4 "a"  "g"  "g"  "g" 

Output:

     [,1] [,2] [,3]  [,4]  
[1,] "a"  "10" "100" "1000"
[2,] "a"  "10" "001" "0001"
[3,] "a"  "01" "010" "0100"
[4,] "a"  "01" "010" "0010"