How to speed up parallel loading (& unloading) of matrices onto multiple GPUs in Matlab

Question

I am trying to implement an algorithm involving large dense matrices in Matlab. I am using multi-GPU AWS instances for performance.

At each iteration, I have to work with two large M-by-n matrices of doubles, A and B, where M = 1600000 and n = 500. Because of the size of the matrices and the memory capacity of each GPU (~8 GB each), I decompose the problem by partitioning the matrices row-wise into K chunks, each with the same n columns but fewer rows (M/K).
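The arithmetic behind that partitioning can be sketched as follows, using the sizes stated in the question and assuming 8 bytes per double. The variable names are only illustrative.

```matlab
% Memory footprint of one full matrix and of one row-wise chunk.
M = 1600000;          % total number of rows
n = 500;              % number of columns
K = 4;                % number of chunks (one per GPU)
bytesPerDouble = 8;

bytesFull  = M * n * bytesPerDouble;        % one full M-by-n matrix
bytesChunk = (M / K) * n * bytesPerDouble;  % one (M/K)-by-n chunk

fprintf('one full matrix: %.1f GB\n', bytesFull / 1e9);
fprintf('one chunk:       %.1f GB\n', bytesChunk / 1e9);
```

A and B together need about 12.8 GB, which does not fit on a single ~8 GB GPU, while one chunk of each needs about 3.2 GB plus whatever temporaries the computation allocates.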

In theory, I can load each chunk of data onto the GPU one at a time, perform computations, and gather the data before repeating with the next chunk. However, since I have access to 4 GPUs, I would like to use all 4 GPUs in parallel to save time, and decompose the matrices into 4 chunks.

To achieve this, I tried using the parfor loop in Matlab (with the Parallel Computing Toolbox), following best practices such as slicing, so that each worker loads only the data it needs. For posterity, here is a complete code snippet. I have provided small, decomposed problems deeper down in this post.

M = 1600000;
K = 4;
m = M/K;
n = 500;
A = randn(K, m,n);
B = randn(K,m,n);
C = randn(n,2);
D = zeros(K,m,2);

%delete(gcp('nocreate'));
%p = parpool('local',K);

tic
toc_load = zeros(K,1);
toc_compute = zeros(K,1);
toc_unload = zeros(K,1);
parfor j = 1:K
    tic
    A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j)  = toc;
    tic
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;
    tic
    B(j,:,:) = gather(B_blk);
    toc_unload(j) = toc;
end
toc_all = toc;
fprintf('averaged over 4 workers, loading onto GPU took %f seconds \n', mean(toc_load));
fprintf('averaged over 4 workers, computation on GPU took %f seconds \n',mean(toc_compute));
fprintf('averaged over 4 workers, unloading from GPU took %f seconds \n', mean(toc_unload));
fprintf('the entire process took %f seconds \n', toc_all);

Using tic and toc (I run the code only after starting the parpool, so that pool startup is not included in the timings), I found that each worker takes on average:

6.33 seconds to load the data onto the GPU
0.18 seconds to run the computations on the GPU
4.91 seconds to unload the data from the GPU.

However, the entire process takes 158.57 seconds. So, the communication overhead (or something else?) took up a significant chunk of the running time.

I then tried a simple for loop without parallelization, see snippet below.

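%% for loop
tic  % single timer for the whole loop; it is never reset inside the loop
for j = 1:K
    % load chunk j onto the GPU
    A_blk = gpuArray(reshape(A(j,:,:),[m,n]));
    B_blk = gpuArray(reshape(B(j,:,:), [m,n]));
    C_blk = gpuArray(C);
    D_blk = gpuArray(reshape(D(j,:,:), [m,2]));
    toc_load(j)  = toc;     % cumulative time since the tic above, not per-chunk
    B_blk = D_blk * C_blk' + A_blk + B_blk;
    toc_compute(j) = toc;   % cumulative as well
    B(j,:,:) = gather(B_blk);  % copy the result back to host memory
end
toc_all = toc;
fprintf('the entire process took %f seconds \n', toc_all);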

This time, running the entire code took only 27.96 seconds. So running the code in serial significantly improved performance in this case. Nonetheless, given that I have 4 GPUs, it seems disappointing to not be able to gain a speedup by using all 4 at the same time.

From my experiments above, I have observed that the actual computational cost of the GPU working on the linear algebra tasks appears low. The key bottleneck appears to be the time taken in loading the data in parallel from CPU onto the multiple GPUs, and gathering the data from the multiple GPUs back to CPU, though it is also possible that there is some other factor in play.
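One caveat on the compute timing: as far as I know, GPU operations in MATLAB can be queued asynchronously, so a toc placed right after the arithmetic may return before the GPU has finished, and the remaining time then shows up in the gather. A minimal sketch of separating the two, assuming a supported GPU is selected and using scaled-down sizes rather than the ones above:

```matlab
% Scaled-down sizes so the sketch runs quickly; not the sizes from the post.
m = 4000; n = 500;
A_blk = randn(m, n, 'gpuArray');
B_blk = randn(m, n, 'gpuArray');
C_blk = randn(n, 2, 'gpuArray');
D_blk = zeros(m, 2, 'gpuArray');

dev = gpuDevice;      % handle to the currently selected GPU

tic
B_blk = D_blk * C_blk' + A_blk + B_blk;
wait(dev);            % block until the queued GPU work has finished
t_compute = toc;

tic
B_host = gather(B_blk);   % device-to-host copy only, compute is already done
t_unload = toc;

fprintf('compute: %f s, unload: %f s\n', t_compute, t_unload);
```

If the compute figure grows noticeably once wait is added, part of the "unload" time reported earlier was really computation.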

In light of this, I have the following questions:

What exactly is underlying the slowness of parfor? Why is the communication overhead (or whatever the underlying reason) so expensive?
How can I speed up the parallel loading and unloading of data from CPU to multiple GPUs and back in Matlab? Are there tricks involving parfor or spmd (or other things such as parfeval, which I have not tried) that I have neglected? Or have I reached some kind of fundamental speed limit in Matlab (assuming I keep my current CPU/GPU setup)?
If there is a fundamental limitation in how Matlab handles the data loading/unloading, would the only recourse be to rewrite this portion of the code in C++?

Thank you for any assistance!

As with all things like this, one has to consider where the true bottleneck is. It's often down at the microelectronic level. Queuing up transfers to 4 GPUs at once is likely to result in 4 separate chains of DMA transfers fighting each other. The memory subsystem on the CPU will be fetching a little bit from here, a little bit from there, and back again. Memory has some addressing latency, and this adds up. It all rather depends on the architecture of the machine on which the code is running; that's pretty well hidden from you on an AWS instance, I should think. — bazza
1/2 I agree that multiple workers accessing different parts of memory is liable to cause cache misses, and that it has the potential to contribute to the loss of performance. Additionally, I also agree that hardware limitations, such as the PCIe spec that connects the GPUs to the CPU on the AWS instance, are unknown, and may easily contribute to the decrease in performance. That being said, I don't think these hardware limitations are the primary contributor to the performance decrease. — The Obscure Question
2/2 I come from machine learning land, and one of the main reasons behind the original CPU to GPU paradigm shift is the ability to train on large batch sizes that saturate GPU memory. Loading/unloading batches onto a GPU for training is analogous to the issue we're facing, as in both cases, GPU memory may be saturated. Importantly, training on large batch sizes w/ deep learning frameworks (e.g. TF, pytorch, etc.) does not run into performance issues like the one I'm facing, despite commonly being run on AWS. If it did, training on GPUs would likely be quite infeasible. — The Obscure Question
This is what leads me to suspect that the issue is a limitation of (or our misuse of) Matlab. — The Obscure Question
Small question betraying my lack of knowledge of MatLab - how does that code target 4 separate GPUs? Is it hidden (or, supposed to be hidden) in the parfor? — bazza

Edric · Accepted Answer · 2019-04-29T09:02:51

Sending data to/from AWS instances to use with parfor is considerably slower than using workers on your local machine because (a) the machines are further away, and (b) there's additional overhead because all communication with AWS workers uses secure communication.

You can use ticBytes and tocBytes to see how much data is being transferred.
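A minimal sketch of that measurement, assuming a parallel pool is available. The sizes are scaled down from the question, and the per-chunk sum is only a placeholder workload.

```matlab
pool = gcp;                 % current pool (may start one, depending on preferences)
K = pool.NumWorkers;
m = 1000; n = 500;          % scaled down for illustration
A = randn(K, m, n);
s = zeros(K, 1);

ticBytes(pool);             % start counting bytes sent to and from the workers
parfor j = 1:K
    A_blk = reshape(A(j, :, :), [m, n]);   % A is sliced: one chunk per iteration
    s(j) = sum(A_blk(:));                  % placeholder computation
end
bytes = tocBytes(pool);     % numWorkers-by-2: [sent to worker, received from worker]

disp(bytes);
```

Called without an output argument, tocBytes prints the same information as a table. Comparing the totals against the size of the sliced inputs shows how much of the run time is likely to be data transfer.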

To improve the performance, I would suggest doing everything possible to avoid transferring large amounts of data between your client and the workers. It can often be more efficient to build data directly on the workers, even if this means building arrays redundantly multiple times.
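As a rough sketch of building the data on the workers, the loop below creates each chunk directly in GPU memory and sends only the small matrix C from the client. The sizes are scaled down, and the column-sum reduction is a hypothetical stand-in for whatever small result you actually need back; if the full B chunks must return to the client, that transfer cost remains.

```matlab
K = 4; m = 1000; n = 500;     % scaled down; the question uses m = 400000
C = randn(n, 2);              % small, so cheap to broadcast to every worker
colsum = zeros(K, n);         % small per-chunk result returned to the client

parfor j = 1:K
    % Large arrays are created on the worker's GPU, never on the client.
    A_blk = randn(m, n, 'gpuArray');
    B_blk = randn(m, n, 'gpuArray');
    D_blk = zeros(m, 2, 'gpuArray');
    C_blk = gpuArray(C);

    B_blk = D_blk * C_blk' + A_blk + B_blk;

    % Hypothetical reduction: bring back n numbers instead of m*n.
    colsum(j, :) = gather(sum(B_blk, 1));
end
```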

Precisely how you avoid data transfer is highly dependent on where your original fundamental data is coming from. If you have files on your client system... that's tough. In your example, you're using rand - which is easy to run on the cluster, but presumably not really representative.

Sometimes there's a middle ground where you have some small-ish fundamental data that can only be computed at the client, and large derived data that is needed on the workers. In that case, you might conceivably couple the computation with parallel.pool.Constant, or just do everything inside a single spmd block or something. (Your parfor loop as written could equally use spmd since you're arranging things to have one iteration per worker).
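A sketch of that middle ground, assuming a pool with one worker per GPU. C stands for the small client-side data, while A_const is built once on each worker and reused across iterations. The sizes, the iteration count, and the final scalar reduction are illustrative assumptions, not part of the original code.

```matlab
pool = gcp;
m = 1000; n = 500;            % scaled down for illustration
C = randn(n, 2);              % small data computed at the client

% Built once per worker from a function handle, so nothing large is sent.
A_const = parallel.pool.Constant(@() randn(m, n, 'gpuArray'));
% Copied once to each worker and reused afterwards.
C_const = parallel.pool.Constant(C);

spmd
    B_blk = randn(m, n, 'gpuArray');    % large derived data lives on the worker
    D_blk = zeros(m, 2, 'gpuArray');
    C_blk = gpuArray(C_const.Value);
    for iter = 1:3                      % repeated iterations reuse resident data
        B_blk = D_blk * C_blk' + A_const.Value + B_blk;
    end
    total = gather(sum(B_blk(:)));      % return only a scalar per worker
end

% total is a Composite on the client: one value per worker.
totals = zeros(1, pool.NumWorkers);
for w = 1:pool.NumWorkers
    totals(w) = total{w};
end
```

Variables created inside the spmd block stay on the workers between blocks, so a later spmd block can keep working on B_blk without reloading it.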
