3
votes

I'm running a simulation where I need to repeatedly extract 1 column from a matrix and check each of its values against some condition (e.g. < 10). However, doing so with a matrix is 3 times slower than doing the same thing with a data.frame. Why is this the case?

I'd like to use matrices to store the simulation data because they are faster for some other operations (e.g. updating columns by adding/subtracting values). How can I extract columns / subset a matrix in a faster way?

Extract column from data.frame vs matrix:

df <- data.frame(a = 1:1e4)
m <- as.matrix(df)

library(microbenchmark)
microbenchmark(
  df$a, 
  m[ , "a"])

# Results; Unit: microseconds
#      expr    min      lq     mean median      uq     max neval cld
#      df$a  5.463  5.8315  8.03997  6.612  8.0275  57.637   100   a 
# m[ , "a"] 64.699 66.6265 72.43631 73.759 75.5595 117.922   100   b

Extract single value from data.frame vs matrix:

microbenchmark(
  df[1, 1],
  df$a[1],
  m[1, 1], 
  m[ , "a"][1])  

# Results; Unit: nanoseconds
#         expr   min      lq     mean  median      uq    max neval  cld
#     df[1, 1]  8248  8753.0 10198.56  9818.5 10689.5  48159   100    c 
#      df$a[1]  4072  4416.0  5247.67  5057.5  5754.5  17993   100    b  
#      m[1, 1]   517   708.5   828.04   810.0   920.5   2732   100    a   
# m[ , "a"][1] 45745 47884.0 51861.90 49100.5 54831.5 105323   100    d

I expected the matrix column extraction to be faster, but it was slower. However, extracting a single value from a matrix (i.e. m[1, 1]) was faster than both of the ways of doing so with a data.frame. I'm lost as to why this is.

Extract row vs column, data.frame vs matrix:

The above is only true for selecting columns. When selecting rows, matrices are much faster than data.frames. Still don't know why.

microbenchmark(
  df[1, ],
  m[1, ],
  df[ , 1],
  m[ , 1])

# Result: Unit: nanoseconds
#     expr   min      lq     mean  median      uq   max neval  cld
#  df[1, ] 16359 17243.5 18766.93 17860.5 19849.5 42973   100    c 
#   m[1, ]   718   999.5  1175.95  1181.0  1327.0  3595   100    a   
# df[ , 1]  7664  8687.5  9888.57  9301.0 10535.5 42312   100    b  
#  m[ , 1] 64874 66218.5 72074.93 73717.5 74084.5 97827   100    d
2
Note that m[1, "a"] is faster than subsetting a df. - Rui Barradas
It's an interesting theoretical question ... but is extracting columns actually a bottleneck in your simulation code? (Extracting a single column of a matrix takes on average 72,000 nanoseconds ... you may be doing this millions of times, but is it the slowest thing you're doing?) - Ben Bolker

2 Answers

3
votes

data.frame

Consider the built-in data frame BOD. Data frames are stored as a list of columns, and the inspect output below shows the address of each of BOD's two columns. We then assign its second column to BOD2. The address of BOD2 is the same memory location as the second column in the inspect output for BOD; that is, to create BOD2 all R did was point it at memory within BOD. No data was moved. Another way to see this is to compare the sizes of BOD, BOD2, and both together: both together take up the same amount of memory as BOD alone, so nothing can have been copied. (Continued after code.)

library(pryr)

BOD2 <- BOD[[2]]
inspect(BOD)
## <VECSXP 0x507c278>
##   <REALSXP 0x4f81f48>
##   <REALSXP 0x4f81ed8>  <--- compare this address to address shown below
## ...snip...

BOD2 <- BOD[,2]
address(BOD2)
## [1] "0x4f81ed8"

object_size(BOD)
## 1.18 kB
object_size(BOD2)
## 96 B
object_size(BOD, BOD2)    # same as object_size(BOD) above
## 1.18 kB

matrix

Matrices are stored as one long vector with a dimensions attribute rather than as a list of columns, so extracting a column works differently. If we look at the memory used by a matrix m, an extracted column m2, and both together, we see below that both together use the sum of the sizes of the individual objects, which shows that the data was copied.

set.seed(123)

n <- 10000L
m <- matrix(rnorm(2*n), n, 2)
m2 <- m[, 2]

object_size(m)
## 160 kB
object_size(m2)
## 80 kB
object_size(m, m2) 
## 240 kB  <-- unlike for data.frames this equals sum of above

what to do

If your program uses column extraction only up to a certain point, you could use a data frame for that portion, then do a one-time conversion to a matrix and work with the matrix for the rest.
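
As a rough sketch of that two-phase workflow (the names `sim_df`, `sim_m`, `below_10` and the threshold of 10 are made up for illustration and are not taken from the question's actual simulation):

```r
# Hypothetical simulation data: two numeric columns held in a data frame
set.seed(1)
sim_df <- data.frame(a = runif(1e4, 0, 20), b = runif(1e4, 0, 20))

# Phase 1: repeated column extraction and checks on the data frame,
# where pulling out a column does not copy the underlying data
below_10 <- sim_df$a < 10
n_below <- sum(below_10)

# Phase 2: convert once, then do the matrix-friendly work
sim_m <- as.matrix(sim_df)
sim_m[, "a"] <- sim_m[, "a"] + 1   # e.g. update a column by adding a value
```

Whether this pays off depends on how the two kinds of operations are interleaved; if column checks and column updates alternate on every iteration, the repeated conversions would likely cost more than they save.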

1
votes

I suppose it comes down to how R lays the data out in memory. A matrix in R is a 2-d array, which is stored the same way as a 1-d array. A variable points directly to that memory, so extracting a single value is very fast. Extracting a column from the matrix takes some computation, plus allocating new memory and storing the result there. A data frame, on the other hand, is actually a list of columns, so returning a column is faster. That's my guess; I'd welcome confirmation.
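
A small base-R sketch of that layout, with illustrative names `x`, `i`, `j` and `col_j`: a matrix is a plain vector with a `dim` attribute, stored column by column, so a single element sits at a computable offset, while a whole column comes back as a separate vector.

```r
x <- matrix(1:6, nrow = 3, ncol = 2)

# The only thing distinguishing this matrix from a vector is its dim attribute
attributes(x)

# Element [i, j] lives at linear position (j - 1) * nrow(x) + i
i <- 2; j <- 2
stopifnot(x[i, j] == x[(j - 1) * nrow(x) + i])

# Column j is the contiguous run of nrow(x) elements starting at that offset
col_j <- x[((j - 1) * nrow(x) + 1):(j * nrow(x))]
stopifnot(identical(col_j, x[, j]))
```

This only illustrates the storage order; it does not by itself measure where the extra time in `m[, "a"]` goes.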