Fastest Conversion of Row-Ordered data to Column-Ordered data

Question

I have an IplImage from OpenCV, which stores its data in row-ordered format.

The image data is stored in a one-dimensional array char *data; the element at position (x, y) is given by

elem(x,y) = data[y*width + x] // see note at end

I would like to convert this image as quickly as possible to and from a second image format that stores its data in column-ordered format; that is

elem(x,y) = data[x*height + y]

Obviously, one way to do this conversion is simply element-by-element through a double for loop.
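For reference, that double loop might look like the sketch below. It assumes single-byte elements and tightly packed rows, as in the formulas above; the function name and the use of unsigned char are illustrative choices, not part of OpenCV.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline conversion: read a row-ordered image (src[y*width + x]) and
   write a column-ordered copy (dst[x*height + y]).
   Hypothetical helper name; src and dst must be separate buffers of
   width*height bytes each. */
void row_to_col_naive(const unsigned char *src, unsigned char *dst,
                      size_t width, size_t height)
{
    assert(src != dst); /* this sketch is out-of-place only */
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            /* reads walk src sequentially; writes jump by `height` bytes */
            dst[x * height + y] = src[y * width + x];
        }
    }
}
```

The reads are sequential but every write lands `height` bytes away from the previous one, which is where the cache behaviour discussed in the answers comes in.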

Is there a faster way?

Note for OpenCV aficionados: the actual location of elem(x,y) is given by data + y*widthStep + x*sizeof(element). The simpler formula gives the general idea, and it is exact here: for char data sizeof(element) = 1, and we can make widthStep = width.
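If the rows are padded, so that the row stride is larger than the width, the same loop can take the stride as a parameter. The sketch below assumes single-byte elements and passes the stride in as a plain argument (`width_step`, a hypothetical parameter standing in for the image's row stride) so that it compiles without the OpenCV headers.

```c
#include <assert.h>
#include <stddef.h>

/* Row-ordered source with a row stride of width_step bytes (possibly
   padded) to a tightly packed column-ordered destination.
   Hypothetical helper; one byte per element is assumed. */
void row_to_col_strided(const unsigned char *src, size_t width_step,
                        unsigned char *dst, size_t width, size_t height)
{
    assert(width_step >= width); /* a row cannot be shorter than its pixels */
    for (size_t y = 0; y < height; ++y) {
        const unsigned char *row = src + y * width_step; /* start of row y */
        for (size_t x = 0; x < width; ++x) {
            dst[x * height + y] = row[x]; /* padding bytes are never read */
        }
    }
}
```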

The double for loop is O(N) in the number of elements, and you can't beat that because you have to copy them all. Within your for loops there may be some multiplications you don't need to perform each time, but removing them is highly unlikely to make any performance difference. You could of course split the process into multiple threads if the image is large. — CashCow
possible duplicate of A Cache Efficient Matrix Transpose Program? — Oliver Charlesworth
Do you absolutely have to copy it? If you must, there is no faster way. Your copy algorithm is optimal since every element indeed needs to be visited. If you don't copy it, consider just swapping the indices: whenever you need to index it, index it not with (i,j) but with (j,i). Can you do that? You can easily see that this needs O(1) time (and perhaps O(1) space). — Juho
@mrm: Big-O notation isn't really relevant here; the speed will be dominated by cache utilisation (i.e. memory-access pattern). — Oliver Charlesworth
@Oli Sorry for nitpicking, but I think it is. You are correct in the sense that the work complexity is optimal for both the "naive algorithm" and the cache-oblivious transposition (the cache utilization has a major impact, yes!). However, O-notation is used for analyzing cache-oblivious algorithms as well. And as I suggested, the O(1) time and space solution is even faster, if one is able to use it. — Juho
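The index-swapping idea from the comments can be sketched as a pair of accessors over the original row-ordered buffer. The struct and function names are hypothetical, and this only helps when the consuming code can go through such an accessor; anything that expects column-ordered memory still needs a real copy.

```c
#include <assert.h>
#include <stddef.h>

/* A read-only view of a row-ordered image with one byte per element. */
typedef struct {
    const unsigned char *data;
    size_t width;
    size_t height;
} RowMajorView;

/* Ordinary access: elem(x, y) = data[y*width + x]. */
static inline unsigned char elem_at(const RowMajorView *v, size_t x, size_t y)
{
    assert(x < v->width && y < v->height);
    return v->data[y * v->width + x];
}

/* Access to the transposed image without copying: element (x, y) of the
   transpose is element (y, x) of the original. */
static inline unsigned char elem_at_transposed(const RowMajorView *v,
                                               size_t x, size_t y)
{
    return elem_at(v, y, x);
}
```

No data moves, so the cost is constant, but a loop that walks the transposed view in its "natural" order still touches memory with a stride of `width` bytes.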

wildplasser · Accepted Answer · 2012-02-01T18:05:56

It is called "matrix transposition". Optimal methods try to minimise the number of cache misses by swapping small tiles, each the size of one or a few cache slots. For a multi-level cache this gets difficult. Start reading here

this one is a bit more advanced

BTW the URLs deal with "in place" transposition. Creating a transposed copy will be different (it uses twice as many cache slots, duh!)
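For the transposed-copy case, a tiled version of the double loop might look like the sketch below. It is not taken from the linked articles; it is a minimal illustration that assumes single-byte elements, tightly packed rows, and a tile edge of 32 (the `TILE` constant and the function name are illustrative, and the best tile size has to be found by measuring on the target machine).

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32 /* assumed tile edge in elements; tune by measurement */

/* Out-of-place transpose in TILE x TILE blocks, so that the source rows
   and destination columns touched at any one time stay small enough to
   remain in cache. Hypothetical helper; src and dst must not overlap. */
void row_to_col_tiled(const unsigned char *src, unsigned char *dst,
                      size_t width, size_t height)
{
    assert(src != dst);
    for (size_t y0 = 0; y0 < height; y0 += TILE) {
        size_t y1 = (y0 + TILE < height) ? y0 + TILE : height; /* clip at edge */
        for (size_t x0 = 0; x0 < width; x0 += TILE) {
            size_t x1 = (x0 + TILE < width) ? x0 + TILE : width;
            /* within a tile, the inner loop runs over y so that writes
               to dst are contiguous */
            for (size_t x = x0; x < x1; ++x) {
                for (size_t y = y0; y < y1; ++y) {
                    dst[x * height + y] = src[y * width + x];
                }
            }
        }
    }
}
```

The result is identical to the plain double loop; only the order in which elements are visited changes. Whether it is actually faster depends on the image size and the cache hierarchy, so it is worth timing both.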
