1
votes

Given a data frame like below:

set.seed(123)
df1 <- data.frame(V1=sample(c(0,1,2),100,replace=TRUE),
  V2=sample(c(2,3,4),100,replace=TRUE),
  V3=sample(c(4,5,6),100,replace=TRUE),
  V4=sample(c(6,7,8),100,replace=TRUE),
  V5=sample(c(6,7,8),100,replace=TRUE))

I want to sum each row, starting from the first column with a value >= 2 and ending with the first column with a value > 6. If no value in the row exceeds 6, the sum should run to the end of the row.

How would I do this in a vectorized fashion?

Update: This is not for any homework assignment. I just want more examples of vectorization code that I can study and learn from. I had to do something like the above before, but couldn't figure out the apply syntax for this particular task and resorted to for loops.

4
I don't understand the two close votes, but perhaps they relate to your last sentence, which asked for external resources (and which I deleted). I also suspect the problem is fundamentally not a task for which vectorization offers much promise. You really ought to describe the underlying task (at least if it's not just a CS HW problem). - IRTFM

4 Answers

3
votes

This appeared to be the most R-like approach, but I don't consider it "vectorized" in the R sense of the term:

apply( df1, 1, function(x) sum( x[which(x>=2)[1]: min(which(x>6)[1], 5, na.rm=TRUE)] ) )
#---------
  [1] 15 22 16 19 17 17 23 21 14 13 18 13 16 23 15 18 16 21 16 19 17 23 21 18
 [25] 21 24 15 20 15 18 17 24 19 18 19 15 18 17 15 17 14 21 13 19 15 15 15 15
 [49] 21 19 21 15 17 18 14 17 15 16 22 16 23 22 17 21 17 16 23 23 16 14 18 13
 [73] 18 15 17 17 17 20 20 16 17 16 16 16 14 16 20 23 23 24 14 18 16 17 22 23
 [97] 23 19 20 17
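
For readers studying the one-liner, here is the same logic split into named steps. This is only a sketch: the helper name row_window_sum is made up for illustration, and it assumes every row contains at least one value >= 2 (otherwise start would be NA and start:end would fail).

```r
# Hypothetical helper; same logic as the one-liner above, split into steps.
row_window_sum <- function(x) {
  # position of the first value >= 2
  start <- which(x >= 2)[1]
  # position of the first value > 6; which(...)[1] is NA when there is none,
  # and min(..., na.rm = TRUE) then falls back to the last position
  end <- min(which(x > 6)[1], length(x), na.rm = TRUE)
  sum(x[start:end])
}

r1 <- row_window_sum(c(1, 3, 5, 7, 8))  # 3 + 5 + 7 = 15
r2 <- row_window_sum(c(2, 2, 4, 6, 6))  # no value > 6, whole row: 20
```

A value > 6 is also >= 2, so end can never fall before start.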

2
votes

Due to your sampling structure, we can vectorize quite easily.

We know that only the first column can be less than 2, so it is the only column that can be excluded at the start. Columns V2, V3 and V4 must always be included: V2 and V3 never exceed 6, and V4 is the first column that can. Column V5 is excluded only if V4 is above 6.

So:

(df1$V1 == 2) * df1$V1 + df1$V2 + df1$V3 + df1$V4 + df1$V5 * !(df1$V4 > 6)

  [1] 15 22 16 19 17 17 23 21 14 13 18 13 16 23 15 18 16 21 16 19 17 23 21 18 21 24 15 20 15 18 17 24 19 18
 [35] 19 15 18 17 15 17 14 21 13 19 15 15 15 15 21 19 21 15 17 18 14 17 15 16 22 16 23 22 17 21 17 16 23 23
 [69] 16 14 18 13 18 15 17 17 17 20 20 16 17 16 16 16 14 16 20 23 23 24 14 18 16 17 22 23 23 19 20 17

is your vectorized calculation. This is obviously much less general than the other answers here, but fits your question.
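
As a sanity check, the shortcut can be compared against a general row-wise version on the question's data. This is a sketch; the names general, shortcut and agree are introduced here for illustration only.

```r
set.seed(123)
df1 <- data.frame(V1 = sample(c(0, 1, 2), 100, replace = TRUE),
                  V2 = sample(c(2, 3, 4), 100, replace = TRUE),
                  V3 = sample(c(4, 5, 6), 100, replace = TRUE),
                  V4 = sample(c(6, 7, 8), 100, replace = TRUE),
                  V5 = sample(c(6, 7, 8), 100, replace = TRUE))

# General version: works row by row and makes no assumption about column ranges
general <- apply(df1, 1, function(x) {
  end <- min(which(x > 6)[1], length(x), na.rm = TRUE)
  sum(x[which(x >= 2)[1]:end])
})

# Shortcut: relies on the known value range of each column
shortcut <- (df1$V1 == 2) * df1$V1 + df1$V2 + df1$V3 + df1$V4 +
  df1$V5 * !(df1$V4 > 6)

agree <- isTRUE(all.equal(unname(general), shortcut))
agree
```

Given these sampling ranges, agree should be TRUE for any seed. It would stop holding as soon as a column could take values outside its assumed range.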

1
votes

Using apply would be the most sensible solution. However, since we seem to be competing over who can answer this without using R-level loops, I humbly offer this:

m<-as.matrix(df1)
start<-max.col(m>=2,ties="first")
end<-max.col(`[<-`(m>6,,ncol(m),TRUE),ties="first")
i<-t(matrix(1:ncol(m),nrow=ncol(m),ncol=nrow(m)))
rowSums(m*(i>=start & i<=end))

The output is the same as in the other answers.
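
To see what each step does, here is a sketch of the same idea on a two-row toy matrix. The names past6 and result are introduced for illustration, and col(m) is swapped in for the transposed index matrix above because it builds the same column indices.

```r
m <- rbind(c(1, 3, 5, 7, 8),
           c(2, 2, 4, 6, 6))

# First column per row with a value >= 2 (first TRUE wins the tie)
start <- max.col(m >= 2, ties.method = "first")   # 2 1

# First column per row with a value > 6; forcing the last column to TRUE
# gives "end of row" as the fallback when no value exceeds 6
past6 <- m > 6
past6[, ncol(m)] <- TRUE
end <- max.col(past6, ties.method = "first")      # 4 5

# Column index of every cell; start and end recycle down the rows
i <- col(m)
result <- rowSums(m * (i >= start & i <= end))    # 15 20
```

Multiplying by the logical mask zeroes out every cell outside the start-to-end window, so rowSums adds only the wanted values.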

0
votes

I am sure that there is a more elegant way, but for a brute-force approach you can write a function and pass it to apply.

First, define your example data:

set.seed(123)  # same seed as in the question, so the results are comparable
df <- data.frame(V1=sample(c(0,1,2),100,replace=TRUE),
                 V2=sample(c(2,3,4),100,replace=TRUE),
                 V3=sample(c(4,5,6),100,replace=TRUE),
                 V4=sample(c(6,7,8),100,replace=TRUE),
                 V5=sample(c(6,7,8),100,replace=TRUE))

Write a function that implements the conditional logic. which returns the positions in the vector where a condition holds. For "start" we take the first position where the value is >= 2, hence the [1]. The end position has two possible outcomes, so I used an if statement: if no value meets the condition > 6, "end" is assigned the last position of the vector; otherwise it is the first position that meets the condition. Then it is just a matter of subsetting the vector from start to end and passing the result to sum.

sum.col <- function(x) {
  # position of the first value >= 2
  start <- which(x >= 2)[1]
  # positions of all values > 6
  end <- which(x > 6)
  if (length(end) == 0) {
    # no value > 6: sum to the end of the row
    end <- length(x)
  } else {
    # stop at the first value > 6
    end <- end[1]
  }
  return(sum(x[start:end]))
}

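Now we can pass the function to apply, which calls it on each row for us (MARGIN = 1). Note that apply loops over the rows internally, so this is convenient rather than vectorized in the strict sense.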

apply(df, FUN=sum.col, MARGIN = 1)
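
One edge case worth knowing about: if a row has no value >= 2, which(x >= 2)[1] is NA and start:end fails. With the sampling above this cannot happen, because V2 is always at least 2. For other data, a guarded variant might look like the sketch below. The name sum_window_safe, the toy data frame, and the choice to return 0 for such rows are all assumptions made for illustration.

```r
# Hypothetical guarded variant: returns 0 when no value in the row is >= 2
sum_window_safe <- function(x) {
  start <- which(x >= 2)[1]
  if (is.na(start)) return(0)
  # first value > 6, or the last position when there is none
  end <- which(x > 6)[1]
  if (is.na(end)) end <- length(x)
  sum(x[start:end])
}

toy <- data.frame(V1 = c(0, 1, 2),
                  V2 = c(1, 3, 7),
                  V3 = c(0, 8, 8))

# Row 1 has no value >= 2 -> 0; row 2 -> 3 + 8 = 11; row 3 -> 2 + 7 = 9
res <- apply(toy, MARGIN = 1, FUN = sum_window_safe)
```

Returning NA instead of 0 would be just as defensible; which is right depends on what a row without a starting value should mean in your data.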