1
votes

I have two data.frames that look like this:

df1
  Gene name   sample1    sample2    sample3     sample4     sample5  
   A             0          1         0           0           1 
   B             1          0         0           1           0
   C             0          0         1           1           1
   D             1          0         0           1           0



df2
  Gene name   sample1    sample2    sample3     sample4     sample5  
   A             1          1         1           0           0 
   B             0          1         0           0           0
   C             1          1         0           0           0
   D             1          1         0           0           0

Only the values "0" and "1" are present. I would like a single data.frame in which an entry that has the same value in df1 and df2 keeps that value (1 stays 1, 0 stays 0). When an entry is 1 in one data.frame (df1, for example) and 0 in the other (df2, for example), the entry should become 1. The two data.frames have the same number of rows and the same number of columns.

The desired output will be:

df1
  Gene name   sample1    sample2    sample3     sample4     sample5  
   A             1          1         1           0           1 
   B             1          1         0           1           0
   C             1          1         1           1           1
   D             1          1         0           1           0

Since I'm new to R, I would like to use for loops over the first and second data.frame, so that I learn how to loop over multiple data.frames. At the moment I'm not able to do this. Can anyone help me, please?

Best,

E.

3
Do both data frames have the same number of rows, one for each gene? - joran
Yes! The same number of rows and the same number of columns! I'll edit soon! - Elb

3 Answers

1
votes

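Short way: #df3 <- as.integer(df1+df2>0) #this was wrong

EDIT Short way: df3 <- apply(df1[,-1]+df2[,-1]>0, c(1,2), as.integer) #there might be a shorter way; [,-1] skips the gene-name column, so the result is an integer matrix of the sample columns only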

With loops etc:

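# Pre-allocate an empty result with the same shape and column names as df1
df3 <- as.data.frame(matrix(NA, nrow = nrow(df1), ncol = ncol(df1)))
names(df3) <- names(df1)

for (i in seq_len(ncol(df1))) {
  for (j in seq_len(nrow(df1))) {
    if (i == 1) {
      # First column holds the gene names: copy it over as text.
      # Note: this is dangerous because it assumes both data frames list the genes in the same order.
      df3[j, i] <- as.character(df1[j, i])
    } else {
      # 1 if either data frame has a 1 in this cell, otherwise 0
      df3[j, i] <- as.integer((df1[j, i] + df2[j, i]) > 0)
    }
  }
}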

Does that work?
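For anyone who wants to run the loop end to end, here is a minimal, self-contained sketch. The data frames are rebuilt by hand from the tables in the question, and the column name `Gene` is an assumption (the question shows "Gene name"). It starts from a copy of df1 instead of an empty data frame, so only the sample columns need to be visited.

```r
# Rebuild the question's data by hand; the column name "Gene" is an assumption.
df1 <- data.frame(
  Gene    = c("A", "B", "C", "D"),
  sample1 = c(0L, 1L, 0L, 1L),
  sample2 = c(1L, 0L, 0L, 0L),
  sample3 = c(0L, 0L, 1L, 0L),
  sample4 = c(0L, 1L, 1L, 1L),
  sample5 = c(1L, 0L, 1L, 0L)
)
df2 <- data.frame(
  Gene    = c("A", "B", "C", "D"),
  sample1 = c(1L, 0L, 1L, 1L),
  sample2 = c(1L, 1L, 1L, 1L),
  sample3 = c(1L, 0L, 0L, 0L),
  sample4 = c(0L, 0L, 0L, 0L),
  sample5 = c(0L, 0L, 0L, 0L)
)

# Guard against the "dangerous" assumption: genes must be in the same order.
stopifnot(identical(df1$Gene, df2$Gene))

# Start from a copy of df1 so the gene column is already in place.
df3 <- df1

# Loop over the sample columns by name, then over the rows.
for (col in names(df1)[-1]) {
  for (row in seq_len(nrow(df1))) {
    # 1 if either data frame has a 1 in this cell, otherwise 0
    df3[row, col] <- as.integer(df1[row, col] == 1 || df2[row, col] == 1)
  }
}

print(df3)
```

Looping over column names rather than positions avoids the special case for the first column.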

3
votes

The "R" way to do this sort of thing is to take advantage of vectorization:

> df3 <- df1
> df3[,-1] <- ((df1[,-1] + df2[,-1]) > 0) + 0
> df3
  Genename sample1 sample2 sample3 sample4 sample5
1        A       1       1       1       0       1
2        B       1       1       0       1       0
3        C       1       1       1       1       1
4        D       1       1       0       1       0

The loops are still happening, but under the hood, in much faster compiled code.

A brief explanation:

We can add the numeric portions of the two data frames in a vectorized fashion:

> (df1[,-1] + df2[,-1])
  sample1 sample2 sample3 sample4 sample5
1       1       2       1       0       1
2       1       1       0       1       0
3       1       1       1       1       1
4       2       1       0       1       0

Then, if we ask which values are greater than zero, we get the "right" answer, but as logical values (TRUE/FALSE) instead of 0's and 1's:

> (df1[,-1] + df2[,-1]) > 0
     sample1 sample2 sample3 sample4 sample5
[1,]    TRUE    TRUE    TRUE   FALSE    TRUE
[2,]    TRUE    TRUE   FALSE    TRUE   FALSE
[3,]    TRUE    TRUE    TRUE    TRUE    TRUE
[4,]    TRUE    TRUE   FALSE    TRUE   FALSE

Luckily, if we simply add 0, R will coerce the logical values back to numbers (TRUE becomes 1, FALSE becomes 0):

> ((df1[,-1] + df2[,-1]) > 0) + 0
     sample1 sample2 sample3 sample4 sample5
[1,]       1       1       1       0       1
[2,]       1       1       0       1       0
[3,]       1       1       1       1       1
[4,]       1       1       0       1       0
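
The steps above can be put together in one runnable sketch. The data is rebuilt from the question's tables, and the column name `Gene` is an assumption. As a variation, it uses the logical OR operator `|` on the comparisons instead of adding the two data frames, and adds `0L` so the result is stored as integers.

```r
# Sample column names and gene labels; "Gene" as a column name is an assumption.
samples <- paste0("sample", 1:5)
genes   <- c("A", "B", "C", "D")

# Fill the 0/1 values row by row, exactly as they appear in the question.
m1 <- matrix(c(0, 1, 0, 0, 1,
               1, 0, 0, 1, 0,
               0, 0, 1, 1, 1,
               1, 0, 0, 1, 0),
             nrow = 4, byrow = TRUE, dimnames = list(NULL, samples))
m2 <- matrix(c(1, 1, 1, 0, 0,
               0, 1, 0, 0, 0,
               1, 1, 0, 0, 0,
               1, 1, 0, 0, 0),
             nrow = 4, byrow = TRUE, dimnames = list(NULL, samples))

df1 <- data.frame(Gene = genes, m1)
df2 <- data.frame(Gene = genes, m2)

# Copy df1 to keep the gene column, then overwrite only the sample columns.
df3 <- df1
# TRUE where either data frame has a 1; adding 0L turns TRUE/FALSE into 1/0.
df3[, -1] <- (df1[, -1] == 1 | df2[, -1] == 1) + 0L

print(df3)
```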
3
votes

What you want is known as a bitwise OR operation: https://en.wikipedia.org/wiki/Bitwise_operation#OR

R has had functions for bitwise operations since version 3.0: bitwAnd, bitwNot, bitwOr, bitwShiftL, bitwShiftR and bitwXor (bitwOr is the one you are looking for).
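
As a small illustration of bitwOr, here is a sketch on toy data. The vectors `a` and `b`, the two-gene data frames and the column names `Gene`, `s1` and `s2` are made up for the example and are not from the question. It applies bitwOr column by column with Map, which is one possible way to keep the data frame shape.

```r
# bitwOr works element-wise on integer vectors.
a <- c(0L, 1L, 0L, 1L)
b <- c(0L, 0L, 1L, 1L)
bitwOr(a, b)  # 0 1 1 1

# Hypothetical two-gene data frames, just to show the data frame case.
df1 <- data.frame(Gene = c("A", "B"), s1 = c(0L, 1L), s2 = c(1L, 0L))
df2 <- data.frame(Gene = c("A", "B"), s1 = c(0L, 0L), s2 = c(1L, 1L))

# Copy df1 to keep the gene column, then OR the sample columns pairwise.
df3 <- df1
df3[-1] <- Map(bitwOr, df1[-1], df2[-1])

print(df3)
```

Map pairs up the sample columns of the two data frames in order, so this assumes both have the same columns in the same order.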

The answer joran gave works fine, but if you are running R 3.0 or later I would suggest using bitwise operations, since they ran faster in my timings:

 > system.time(for (i in 1:10000) {df3[,-1] <- ((df1[,-1] + df2[,-1]) > 0) + 0})
   user  system elapsed 
  13.58    0.00   13.59

 > system.time(for (i in 1:10000) {df3[,-1] = bitwOr(unlist(df1[,-1]), unlist(df2[,-1]))})
   user  system elapsed 
   5.44    0.00    5.45