4 votes

I have a matrix with 3 columns and 10,000 elements in total. The first and second columns are indexes and the third column is the score. I want to normalize the score column based on this formula:

Normalized_score_i_j = score_i_j / (sqrt(score_i_i) * sqrt(score_j_j))

score_i_j = the current score itself

score_i_i = take the current row's index from the first column, and find the score of the row whose first and second columns both equal that index

score_j_j = take the current row's index from the second column, and find the score of the row whose first and second columns both equal that index

For example, if df is as follows:

df <- read.table(text = "
First.Protein,Second.Protein,Score
1,1,25
1,2,90
1,3,82
1,4,19
2,1,90
2,2,99
2,3,76
2,4,79
3,1,82
3,2,76
3,3,91
3,4,33
4,1,28
4,2,11
4,3,99
4,4,50
", header = TRUE, sep = ",")

If we are normalizing this row:

First.Protein Second.Protein Score
4             3              99

The normalized score will be:

The score itself, divided by the product of the square root of the score whose First.Protein and Second.Protein are both 4, and the square root of the score whose First.Protein and Second.Protein are both 3.

Therefore:

Normalized =  99 / (sqrt(50) * sqrt(91)) = 1.467674
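In R, this single value can be checked directly:

99 / (sqrt(50) * sqrt(91))
# [1] 1.467674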

I have the code below, but it is behaving strangely and giving me values that are not normalized at all:

for(i in 1:nrow(Smith_Waterman_Scores))
{
  Smith_Waterman_Scores$Score[i] <- 
    Smith_Waterman_Scores$Score[i] / 
    (sqrt(Smith_Waterman_Scores$Score[which(Smith_Waterman_Scores$First.Protein==Smith_Waterman_Scores$First.Protein[i] & Smith_Waterman_Scores$Second.Protein==Smith_Waterman_Scores$First.Protein[i])])) *
    (sqrt(Smith_Waterman_Scores$Score[which(Smith_Waterman_Scores$First.Protein==Smith_Waterman_Scores$Second.Protein[i] & Smith_Waterman_Scores$Second.Protein==Smith_Waterman_Scores$Second.Protein[i])]))
}

5 Answers

5 votes

Here's a re-write of your original attempt. which() is not necessary; just use the logical vector for subsetting. with() allows you to refer to variables in the data frame without re-typing the name of the data frame -- easier to read, but also easier to make a mistake.

orig0 <- function(df) {
    for(i in 1:nrow(df)) {
        df$Score[i] <- with(df, {
            ii <- First.Protein == First.Protein[i] &
                Second.Protein == First.Protein[i]
            jj <- First.Protein == Second.Protein[i] &
                Second.Protein == Second.Protein[i]
            Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
        })
    }
    df$Score
}

The problem is that the loop overwrites df$Score as it goes, so Score[ii] and Score[jj] appear on the right-hand side both before and after they've been updated, and later rows end up divided by already-normalized values. Here's a revision where the original columns are treated as 'read-only':

orig1 <- function(df) {
    normalized <- numeric(nrow(df))     # pre-allocate
    for(i in 1:nrow(df)) {
        normalized[i] <- with(df, {
            ii <- First.Protein == First.Protein[i] &
                Second.Protein == First.Protein[i]
            jj <- First.Protein == Second.Protein[i] &
                Second.Protein == Second.Protein[i]
            Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
        })
    }
    normalized
}
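
As a quick illustration of the difference on the example data (a check added here, not part of the original answer): row 2 of df is (1, 2, 90), and by the time the loop reaches it, orig0() has already overwritten the (1, 1) score with 1.

orig0(df)[2]  # roughly 9.05: divides by sqrt(1) * sqrt(99), the (1,1) score having been overwritten
orig1(df)[2]  # 1.809068: divides by sqrt(25) * sqrt(99), using the original (1,1) score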

I think the results are now correct (see below). A better implementation would use sapply() (or vapply()) to avoid having to worry about allocating the return value:

orig2 <- function(df) {
    sapply(seq_len(nrow(df)), function(i) {
        with(df, {
            ii <- First.Protein == First.Protein[i] &
                Second.Protein == First.Protein[i]
            jj <- First.Protein == Second.Protein[i] &
                Second.Protein == Second.Protein[i]
            Score[i] / (sqrt(Score[ii]) * sqrt(Score[jj]))
        })
    })
}

Now that the results are correct, we can ask about performance. Your solution requires a scan of, e.g., First.Protein, each time through the loop. There are N=nrow(df) elements of First.Protein, and you're going through the loop N times, so you'll be making a multiple of N * N = N^2 comparisons -- if you increase the size of the data frame from 10 to 100 rows, the time taken will change from 10 * 10 = 100 units, to 100 * 100 = 10000 units of time.

Several of the answers attempt to avoid that quadratic scaling. My answer does this using match() on a vector of values; this scales approximately linearly in N (each look-up takes roughly constant time, and there are N look-ups), which is much better than quadratic.

Create a subset of data with identical first and second proteins

ii = df[df$First.Protein == df$Second.Protein,]

Here's the ijth score from the original data frame

s_ij = df$Score

Look up First.Protein of df in ii and record the score; likewise for Second.Protein

s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]

The normalized scores are then

> s_ij / (sqrt(s_ii) * sqrt(s_jj))
 [1] 1.0000000 1.8090681 1.7191871 0.5374012 1.8090681 1.0000000 0.8007101
 [8] 1.1228571 1.7191871 0.8007101 1.0000000 0.4892245 0.7919596 0.1563472
[15] 1.4676736 1.0000000

This will be fast, using a couple of calls to match() instead of many calls to which() inside a for loop or tests for identity inside an apply() -- both of the latter make on the order of N^2 comparisons and so scale very poorly.

I summarized some of the proposed solutions as

f0 <- function(df) {
    contingency = xtabs(Score ~ ., df)
    diagonals <- unname(diag(contingency))
    i <- df$First.Protein
    j <- df$Second.Protein
    idx <- matrix(c(i, j), ncol=2)
    contingency[idx] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
}

f1 <- function(df) {
    ii = df[df$First.Protein == df$Second.Protein,]
    s_ij = df$Score
    s_ii = ii[match(df$First.Protein, ii$First.Protein), "Score"]
    s_jj = ii[match(df$Second.Protein, ii$Second.Protein), "Score"]
    s_ij / (sqrt(s_ii) * sqrt(s_jj))
}

f2 <- function(dt) {
    dt.lookup <- dt[First.Protein == Second.Protein]
    setkey(dt,"First.Protein" )
    setkey(dt.lookup,"First.Protein" )
    colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
    dt <- dt[dt.lookup]
    setkey(dt,"Second.Protein" )
    setkey(dt.lookup,"Second.Protein")
    colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
    dt[dt.lookup][
      , Normalized :=  Score / (sqrt(Score1) * sqrt(Score2))][
      , .(First.Protein, Second.Protein, Normalized)]
}

f3 <- function(dt) {
    eq = dt[First.Protein == Second.Protein]
    dt[eq, Score_ii := i.Score, on = "First.Protein"]
    dt[eq, Score_jj := i.Score, on = "Second.Protein"]
    dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
    dt[, c("Score_ii", "Score_jj") := NULL]
}

I know how to programmatically check that the first two generate consistent results; I don't know data.table well enough to get the normalized result out of f2() in the same order as the input rows, so I can't compare it with the others (though the values look correct 'by eye'). f3() produces numerically similar, but not identical, results:

> identical(orig1(df), f0(df))
[1] TRUE
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), { f3(dt3); dt3[["Normalised"]] })  # pass by reference!
[1] FALSE
> all.equal(f0(df), { f3(dt3); dt3[["Normalised"]] })
[1] TRUE

There are performance differences

library(data.table)    
dt2 <- as.data.table(df)
dt3 <- as.data.table(df)

library(microbenchmark)
microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))

with

> microbenchmark(f0(df), f1(df), f2(dt2), f3(dt3))
Unit: microseconds
    expr      min        lq      mean    median       uq      max neval
  f0(df)  967.117  992.8365 1059.7076 1030.9710 1094.247 2384.360   100
  f1(df)  176.238  192.8610  210.4059  207.8865  219.687  333.260   100
 f2(dt2) 4884.922 4947.6650 5156.0985 5017.1785 5142.498 6785.975   100
 f3(dt3) 3281.185 3329.4440 3463.8073 3366.3825 3443.400 5144.430   100

The solutions f0 - f3 are likely to scale well (especially data.table) with real data; the fact that the times are in microseconds probably means that speed is not important (now that we are not implementing an N^2 algorithm).

On reflection, a more straightforward implementation of f1() just looks up the 'diagonal' elements:

f1a <- function(df) {
    ii = df[df$First.Protein == df$Second.Protein, ]
    ## square roots of the diagonal scores, ordered so that d[k] corresponds to protein k
    d = sqrt(ii$Score[order(ii$First.Protein)])
    df$Score / (d[df$First.Protein] * d[df$Second.Protein])
}
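
A quick consistency check against f1() (a sketch; like f1a() itself, it assumes the protein indexes are the consecutive integers 1..n, as in the example df):

all.equal(f1(df), f1a(df))  # should be TRUE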
4 votes

You may be doing this in a very roundabout manner. See if this works for you:

R> xx
    First Second Score
1      1      1    25
2      1      2    90
3      1      3    82
4      1      4    19
5      2      1    90
6      2      2    99
7      2      3    76
8      2      4    79
9      3      1    82
10     3      2    76
11     3      3    91
12     3      4    33
13     4      1    28
14     4      2    11
15     4      3    99
16     4      4    50
R> contingency = xtabs(Score ~ ., data=xx)
R> contingency
    Second
First  1  2  3  4
    1 25 90 82 19
    2 90 99 76 79
    3 82 76 91 33
    4 28 11 99 50
R> diagonals <- unname(diag(contingency))
R> diagonals
[1] 25 99 91 50

R> normalize <- function (i, j, contingencies, diagonals) {
+      contingencies[i, j] / (sqrt(diagonals[i]) * sqrt(diagonals[j]))
+  }

R> normalize(4, 3, contingency, diagonals)
[1] 1.467674
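
To normalize every row at once rather than a single pair, the same contingency matrix can be indexed with a two-column matrix of (i, j) pairs (a small extension of the above; it assumes the protein indexes are the consecutive integers 1..4, as in the example):

R> idx <- cbind(xx$First, xx$Second)
R> xx$ScoreNorm <- contingency[idx] / (sqrt(diagonals[xx$First]) * sqrt(diagonals[xx$Second]))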
3 votes

Here's how I'd approach this using data.table. Hopefully @MartinMorgan finds this easier to understand :-).

require(data.table) # v1.9.6+
dt = as.data.table(df) # or use setDT(df) to convert by reference
eq = dt[First.Protein == Second.Protein]

So far, I've just created a new data.table eq which contains all rows where both columns are equal.

dt[eq, Score_ii := i.Score, on = "First.Protein"]
dt[eq, Score_jj := i.Score, on = "Second.Protein"]

Here we add columns Score_ii and Score_jj while joining on columns First.Protein and Second.Protein respectively. That these are join operations should be clear from the on= argument. The i. prefix refers to the Score column of the data.table provided in the i-argument (here, eq's Score).

Note that we could use match() here as well, but that wouldn't work if you had to look up directly (and as efficiently) on more than one column. Using on=, we can extend this quite easily, and it is also much easier to read/understand.
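
For a single-column lookup like this one, the match()-based equivalent of the two joins above would be (shown only for comparison):

dt[, Score_ii := eq$Score[match(First.Protein,  eq$First.Protein)]]
dt[, Score_jj := eq$Score[match(Second.Protein, eq$Second.Protein)]]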

Once we have all the required columns, the task is just to compute the final Normalised column (and delete the intermediates if they're not needed).

dt[, Normalised := Score/sqrt(Score_ii * Score_jj)]
dt[, c("Score_ii", "Score_jj") := NULL] # delete if you don't want them

I'll leave out the micro- and millisecond benchmarks as I'm not interested in them.


PS: The columns Score_ii and Score_jj are added above on purpose under the assumption that you might need them. If you don't want them at all, you can also do:

Score_ii = eq[dt, Score, on = "First.Protein"] ## -- (1)
Score_jj = eq[dt, Score, on = "Second.Protein"]

(1) reads: for each row in dt, find the matching row in eq (matching on column First.Protein) and extract eq$Score for that matching row.

Then, we can directly add the Normalised column as:

dt[, Normalised := Score / sqrt(Score_ii * Score_jj)]
2 votes

You can implement this with joins; here is an example using data.table:

library(data.table)
dt <- data.table(df)

dt.lookup <- dt[First.Protein == Second.Protein]
setkey(dt,"First.Protein" )
setkey(dt.lookup,"First.Protein" )
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score1")
dt <- dt[dt.lookup]
setkey(dt,"Second.Protein" )
setkey(dt.lookup,"Second.Protein")
colnames(dt.lookup) <- c("First.Protein","Second.Protein","Score2")
dt <- dt[dt.lookup][
   , Normalized :=  Score / (sqrt(Score1) * sqrt(Score2))][
  , .(First.Protein, Second.Protein, Normalized)]
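
As a quick sanity check (not in the original answer), the (4, 3) row from the question's worked example should come out as approximately 1.467674:

dt[First.Protein == 4 & Second.Protein == 3]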

Just make sure you don't use for loops.

1 vote

Loop through rows using apply:

#compute: each row i is passed as a numeric vector, so i[1] = First.Protein, i[2] = Second.Protein, i[3] = Score
df$ScoreNorm <- 
  apply(df, 1, function(i){
    i[3] /
      (
        sqrt(df[ df$First.Protein == i[1] &
                   df$Second.Protein == i[1], "Score"]) *
          sqrt(df[ df$First.Protein == i[2] &
                     df$Second.Protein == i[2], "Score"])
      )
  })

#test output
df[15, ]
#    First.Protein Second.Protein Score ScoreNorm
# 15             4              3    99  1.467674