5
votes

say I want to multiply two matrices together, 50 by 50. I have 2 ways to arrange threads and blocks.

a) one thread to calculate each element of the result matrix. So I have a loop in thread multiplies one row and one column.

b) one thread to do each multiplication. Each element of the result matrix requires 50 threads. After multiplications are done, I can use binary reduction to sum the results.

I wasn't sure which way to take, so I took b. It wasn't ideal. In fact it was slow. Any idea why? My guess would be there are just too many threads and they are waiting for resource most of time, is that true?

4

4 Answers

5
votes

As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.

If you are using one thread do to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarthmic number of adds. That's three memory accesses for a mult and an add and a bit - the arithmatic intensity is very low. The good news is that there are many many threads worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the memory access to work ratio is poor.

The simple one thread doing one dot product approach has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction which doesn't scale as well and requires a lot of synchronization; the down side is there's way less threads now, which at least your (b) approach had working for you.

Now you know that there should be some way of doing more operations per memory access than this; for square NxN matricies, there's N^3 work to do the multiplication, but only 3xN^2 elements - so you should be able to find a way to do way more than 1 computation per 2ish memory accesses.

The approach taken in the CUDA SDK is the best way - the matricies are broken into tiles, and your (b) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling in entire little sub-matricies from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful approach in lots of applications, because getting data - whether it's over a network, or from main memory for a CPU, or off-chip access for a GPU - often takes much longer than processing the data.

There's documents in NVidia's CUDA pages (esp http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.

3
votes

Have you looked at the CUDA documentation: Cuda Programming Model

Also, sample source code: Matrix Multiplication

2
votes

Did you look at

$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul

i.e. the matrix multiplication example in the SDK?

0
votes

If you don't need to implement this yourself, just use a library -- CUBLAS, MAGMA, etc., provide tuned matrix multiplication implementations.