10
votes

Using the following code:

  c <- NULL
  for (a in 1:4){
    b <- seq(from = a, to = a + 5)
    c <- rbind(c,b)
    }
  c <- rbind(c,c); rm(a,b)

results in this matrix:

> c
  [,1] [,2] [,3] [,4] [,5] [,6]
b    1    2    3    4    5    6
b    2    3    4    5    6    7
b    3    4    5    6    7    8
b    4    5    6    7    8    9
b    1    2    3    4    5    6
b    2    3    4    5    6    7
b    3    4    5    6    7    8
b    4    5    6    7    8    9

How can I return row indices for rows matching a specific input?

For example, with a search term of

z <- c(3,4,5,6,7,8)

I need the following returned:

[1] 3 7

This will be used on a fairly large data frame of test data with a time-step column, reducing the data by accumulating time steps for matching rows.
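For context, that kind of reduction can be sketched in base R with rowsum() over a composite key (a minimal illustration with made-up column names, not necessarily fast enough for millions of rows):

```r
# Toy data: column t holds time steps; columns a and b form the match key.
df <- data.frame(t = c(1, 2, 3, 4),
                 a = c(3, 5, 3, 5),
                 b = c(4, 6, 4, 6))

# Collapse the key columns into a single grouping factor,
# then sum the time steps within each group.
key <- interaction(df[, -1], drop = TRUE)
rowsum(df$t, key)
```

Here rows 1 and 3 share the key (3, 4) and rows 2 and 4 share (5, 6), so the result has one summed time per unique key.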


The question was answered well by others. Due to my dataset size (9.5M rows), I came up with an efficient approach that took a couple of steps.

1) Sort the big data frame dc (the time steps to accumulate are in column 1) by its match columns.

dc <- dc[order(dc[,2],dc[,3],dc[,4],dc[,5],dc[,6],dc[,7],dc[,8]),]

2) Create a new data frame with unique entries (excluding column 1).

dcU <- unique(dc[,2:8])

3) Write an Rcpp (C++) function that loops through the unique data frame, walking the original data frame and accumulating time while the rows are equal, then advancing to the next unique row once an unequal row is found.

  require(Rcpp)
  getTsrc <-
    '
  NumericVector getT(NumericMatrix dc, NumericMatrix dcU)
  {
  int k = 0;              // cursor into the sorted full matrix dc
  int n = dcU.nrow();
  NumericVector tU(n);    // accumulated time per unique row
  for (int i = 0; i < n; i++)
    {
    // advance through dc, summing column 0 (time) while the key columns match;
    // the k < dc.nrow() guard prevents reading past the last row of dc
    while (k < dc.nrow() &&
           (dcU(i,0)==dc(k,1))&&(dcU(i,1)==dc(k,2))&&(dcU(i,2)==dc(k,3))&&
           (dcU(i,3)==dc(k,4))&&(dcU(i,4)==dc(k,5))&&(dcU(i,5)==dc(k,6))&&
           (dcU(i,6)==dc(k,7)))
      {
      tU[i] += dc(k,0);
      k++;
      }
    }
  return tU;
  }
    '
  cppFunction(getTsrc)

4) Convert function inputs to matrices.

  dc1 <- as.matrix(dc)
  dcU1 <- as.matrix(dcU)

5) Run the function and time it (it returns a time vector aligned with the unique data frame).

  pt <- proc.time()
  t <- getT(dc1, dcU1)
  print(proc.time() - pt)

   user  system elapsed 
   0.18    0.03    0.20 

6) Self high-five and more coffee.

Are all numbers integer? If so, is there an upper bound to the integers? – cryo111
You could probably post the new Rcpp part of this as a new question to get more attention. – jeremycg

3 Answers

7
votes

The answer by @jeremycg will definitely work, and is fast if you have many columns and few rows. However, with many rows you might be able to go a bit faster by avoiding apply() on the row dimension.

Here's an alternative:

l <- unlist(apply(c, 2, list), recursive=FALSE)  # one list element per column
logic <- mapply(function(x, y) x == y, l, z)     # compare each column to its z element
which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))

[1] 3 7

It works by first turning each column into a list. Then, it takes each column-list and searches it for the corresponding element in z. In the last step, you find out which rows had all columns with the corresponding match in z. Even though the last step is a row-wise operation, by using .rowSums (mind the . at the front there) we can specify the dimensions of the matrix, and get a speed-up.
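A compact variant of the same idea (just a sketch; the matrix is named m here to avoid masking base::c, and despite the t() copy the comparison itself is fully vectorized):

```r
# Rebuild the example matrix from the question under the name m.
m0 <- NULL
for (a in 1:4) m0 <- rbind(m0, seq(a, a + 5))
m <- rbind(m0, m0)
z <- c(3, 4, 5, 6, 7, 8)

# t(m) == z recycles z down each column of t(m), i.e. along each row of m;
# a row matches when all length(z) comparisons hold.
which(colSums(t(m) == z) == length(z))
#> [1] 3 7
```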

Let's test the timings of the two approaches.

The functions

f1 <- function(){
    which(apply(c, 1, function(x) all(x == z)))
}

f2 <- function(){
    l <- unlist(apply(c, 2, list), recursive=F)
    logic <- mapply(function(x,y)x==y, l, z)
    which(.rowSums(logic, m=nrow(logic), n=ncol(logic)) == ncol(logic))
}

With 8 rows (dim in example):

> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
 expr    min      lq     mean  median     uq     max neval cld
 f1() 21.147 21.8375 22.86096 22.6845 23.326  30.443   100  a 
 f2() 42.310 43.1510 45.13735 43.7500 44.438 137.413   100   b

With 80 rows:

Unit: microseconds
 expr     min      lq     mean   median       uq     max neval cld
 f1() 101.046 103.859 108.7896 105.1695 108.3320 166.745   100   a
 f2()  93.631  96.204 104.6711  98.1245 104.7205 236.980   100   a

With 800 rows:

> time <- microbenchmark(f1(), f2())
> time
Unit: microseconds
 expr     min       lq      mean    median        uq       max neval cld
 f1() 920.146 1011.394 1372.3512 1042.1230 1066.7610 31290.593   100   b
 f2() 572.222  579.626  593.9211  584.5815  593.6455  1104.316   100  a 

Note that my timing assessment only had 100 replicates each, and although these results are representative, there's a bit of variability in the number of rows required before the two methods are equal.

Regardless, I think my approach would probably be faster once you have 100+ rows.

Also, note that you can't simply transpose c to make f1() faster. First, t() itself takes time; second, since you're comparing against z, you'd just end up doing the same comparison column-wise after the transpose, so nothing is gained at that point.

Finally, I'm sure there's an even faster way to do this. My answer was just the first thing that came to mind, and didn't require any packages to install. This could be a lot faster if you wanted to use data.table. Also, if you had a lot of columns, you might even be able to parallelize this procedure (although, to be worthwhile the dataset would have to be immense).

If these timings aren't tolerable for your data, you might consider reporting back with the dimensions of your data set.

7
votes

You can use apply.

Here we use apply on c across rows (the 1), applying function(x) all(x == z) to each row.

The which then pulls out the integer positions of the rows.

which(apply(c, 1, function(x) all(x == z)))
b b 
3 7

EDIT: If your real data is having problems with this and has only 9 columns (not too much typing), you could try a fully vectorized solution:

which((c[,1]==z[1] & c[,2]==z[2] & c[,3]==z[3] & c[,4]==z[4]& c[,5]==z[5]& c[,6]==z[6]))
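If typing one term per column gets error-prone, the same column-by-column AND can be built programmatically with Reduce() (a sketch on the example matrix, named m here to avoid masking base::c):

```r
# Rebuild the example matrix from the question under the name m.
m0 <- NULL
for (a in 1:4) m0 <- rbind(m0, seq(a, a + 5))
m <- rbind(m0, m0)
z <- c(3, 4, 5, 6, 7, 8)

# AND together one column-vs-element comparison per column of m,
# so the expression scales to any number of columns.
hits <- Reduce(`&`, lapply(seq_along(z), function(j) m[, j] == z[j]))
which(hits)
#> [1] 3 7
```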
-4
votes

In your code, c is not a data frame. Try transforming it into one:

c <- data.frame(c)