data frame (matrix) performance: memory layout

Question

I am new to R. Assume the memory layout is the same for a data frame and a matrix.

In the following matrix

a=matrix(1:10000000,1000000,10)

It has 1M rows and 10 columns. Which is stored sequentially in physical memory, the rows or the columns? That is, does memory hold [1,1], [2,1], [3,1], ..., [1M,1], [1,2], ... (column by column) or [1,1], [1,2], ..., [1,10], [2,1], ... (row by row)?

Suppose the matrix with 10M elements is 100 MB in size and the L2 cache is 4 MB, so the L2 cache can't hold all 10M elements. If we process the data sequentially, we will have a lower L2 cache miss ratio. In our case, we need to process the data row by row and read several columns at the same time, such as columns A, B and C, and then compute some result. If memory stores the 10 items of the 1st row first, then the 10 items of the 2nd row, and so on, the performance might be better.

Is there any way to control the memory layout?

You could try comparing the performance of working with a vs. t(a) to see if rows/column have much of an effect. — Richie Cotton
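A minimal sketch of the comparison suggested in the comment above, assuming a smaller matrix than the one in the question so it runs quickly. The helper names `by_row` and `by_col` are made up for illustration, and the timings will vary by machine and R version, so no expected numbers are shown.

```r
# 100,000 rows x 10 columns, stored as doubles.
a <- matrix(as.numeric(1:1000000), nrow = 100000, ncol = 10)
ta <- t(a)  # 10 x 100,000: each original row becomes one column

# Sum each row by extracting x[i, ] (elements nrow(x) apart in memory).
by_row <- function(x) {
  out <- numeric(nrow(x))
  for (i in seq_len(nrow(x))) out[i] <- sum(x[i, ])
  out
}

# Sum each column by extracting x[, j] (elements adjacent in memory).
by_col <- function(x) {
  out <- numeric(ncol(x))
  for (j in seq_len(ncol(x))) out[j] <- sum(x[, j])
  out
}

t_row <- system.time(r1 <- by_row(a))
t_col <- system.time(r2 <- by_col(ta))

# Both approaches compute the same per-row sums of the original matrix.
stopifnot(isTRUE(all.equal(r1, r2)))
print(c(row_access = t_row[["elapsed"]], col_access = t_col[["elapsed"]]))
```

In a loop like this, R's interpreter overhead may well dominate any cache effect, so the difference is worth measuring before reorganizing data around it.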

Spacedman · Accepted Answer · 2011-01-19T10:25:54

Matrices are stored column-wise:

> m=matrix(1:12,nrow=3)
> m
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
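The column-wise storage can be checked directly. The sketch below flattens the same matrix and looks up one element by its linear position; the index formula `i + (j - 1) * nrow(m)` is the usual column-major mapping, and the variable names are only illustrative.

```r
m <- matrix(1:12, nrow = 3)

# as.vector() drops the dim attribute and returns the elements in storage order.
flat <- as.vector(m)
print(flat)

# With column-major storage, element [i, j] sits at position i + (j - 1) * nrow(m).
i <- 2
j <- 3
pos <- i + (j - 1) * nrow(m)
print(c(by_index = m[i, j], by_position = flat[pos]))

# byrow = TRUE only changes how the input is filled in; storage is still by column.
mr <- matrix(1:12, nrow = 3, byrow = TRUE)
print(as.vector(mr))
```

Note that `byrow = TRUE` does not give a row-major matrix; it only changes which value ends up in which cell.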

Data frames are just pretty lists, and a list is stored as a vector of its elements, with each column being a separate vector. I'm not even sure that list elements are guaranteed to be contiguous with each other in memory.
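A small sketch of the "pretty list" point, using made-up data with the column names A, B and C from the question. It only shows the list structure and what `as.matrix()` does to it; it says nothing about where the columns end up in memory.

```r
# Hypothetical example data; stringsAsFactors = FALSE keeps C as character
# on older R versions too.
df <- data.frame(A = 1:3, B = c(2.5, 3.5, 4.5), C = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

print(is.list(df))         # a data frame is a list underneath
print(sapply(df, typeof))  # each column is its own vector with its own type

# as.matrix() copies the columns into a single matrix, coercing every
# column to a common type (character here, because of column C).
mat <- as.matrix(df)
print(typeof(mat))
```

Because the columns can have different types, a data frame cannot share the single-block layout of a matrix unless all columns are first coerced to one type.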

Read the "Writing R Extensions" manual for more info on how memory is handled. As far as I know there's no way to control the memory layout. Don't worry about it until it becomes a problem.
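For the use case in the question (a per-row result built from columns A, B and C), one common R idiom is to work on whole columns at once rather than looping over rows. This is a sketch under assumptions: the formula `A + B * C` is invented for illustration, and the matrix is much smaller than the one in the question.

```r
# Smaller than the question's matrix so the sketch runs quickly.
a <- matrix(as.numeric(1:1000), nrow = 100, ncol = 10)
colnames(a) <- LETTERS[1:10]

# Hypothetical per-row result combining columns A, B and C.
# Each column of a column-major matrix is one contiguous run of memory,
# and the arithmetic is applied element-wise over whole columns.
result <- a[, "A"] + a[, "B"] * a[, "C"]

# The same computation written as an explicit row-by-row loop, for comparison.
loop_result <- numeric(nrow(a))
for (i in seq_len(nrow(a))) {
  loop_result[i] <- a[i, "A"] + a[i, "B"] * a[i, "C"]
}

print(all.equal(unname(result), loop_result))
```

Written this way, each column is read from start to end, which matches the column-wise storage described above.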
