This link says cuBLAS-XT routines provide out-of-core operation: the size of the operand data is limited only by system memory size, not by GPU on-board memory size. Does this mean that as long as the input data fits in CPU (host) memory, we can use cuBLAS-XT functions even when the data, including the output, is larger than GPU memory?
On the other hand, this link says "In the case of very large problems, the cublasXt API offers the possibility to offload some of the computation to the Host CPU" and "Currently, only the routine cublasXtgemm() supports this feature." Is that feature meant for problems whose input size is greater than CPU memory size?
I don't see the difference between these two statements. I'd appreciate it if someone could help me understand how they relate.
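For context, my mental model of the out-of-core behavior described in the first link is a tiled multiply, where only one tile of each operand needs to be resident in GPU memory at a time. This is just a plain NumPy sketch of that idea, not actual cublasXt calls (the tile size and loop structure are my assumptions, not the library's):

```python
import numpy as np

def tiled_gemm(A, B, tile=256):
    """Conceptual out-of-core GEMM: C = A @ B computed tile by tile.

    In my understanding of cublasXt, only the small tiles (not the
    full matrices) would need to fit in GPU memory at any moment;
    the full operands stay in host memory.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Only these sub-blocks would be staged onto the GPU.
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
                )
    return C
```

If this picture is right, I don't understand what extra role the "offload some of the computation to the Host CPU" feature plays on top of it.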