
I have a Linux box with two GTX 590 cards (4 GPUs). With the CUDA 4.0 driver, I am able to invoke GPUDirect memory access and verify successful copies between ALL possible pairs of my 4 GPUs.

However, after I upgraded to the CUDA 4.1 driver (or any subsequent driver), only some GPU pairs can use GPUDirect access.

For example, peer-to-peer is enabled between the following pairs under CUDA 4.0:

GPU0 <-> GPU1

GPU0 <-> GPU2

GPU0 <-> GPU3

GPU1 <-> GPU2

GPU1 <-> GPU3

GPU2 <-> GPU3

But under CUDA 4.1 (or later) I am limited to access between only:

GPU0 <-> GPU1 (same card)

GPU2 <-> GPU3 (same card)

GPU1 <-> GPU3

Can anyone explain this, or does anyone know of a workaround when using the latest CUDA 5.x drivers?


$ lspci -tv (the interesting part) gives:

-[0000:00]-+-00.0  ATI Technologies Inc RD890 Northbridge only single slot PCI-e GFX Hydra part
       +-02.0-[0c-0f]----00.0-[0d-0f]--+-00.0-[0f]--+-00.0  nVidia Corporation Device 1088
       |                               |            \-00.1  nVidia Corporation GF110 High Definition Audio Controller
       |                               \-02.0-[0e]--+-00.0  nVidia Corporation Device 1088
       |                                            \-00.1  nVidia Corporation GF110 High Definition Audio Controller
       :
       +-0b.0-[04-07]----00.0-[05-07]--+-00.0-[07]--+-00.0  nVidia Corporation Device 1088
       |                               |            \-00.1  nVidia Corporation GF110 High Definition Audio Controller
       |                               \-02.0-[06]--+-00.0  nVidia Corporation Device 1088
       |                                            \-00.1  nVidia Corporation GF110 High Definition Audio Controller

To me it looks like all paths are physically available (a tree-like structure), and they do work with CUDA 4.0. With CUDA 4.1 and later, however, cudaDeviceCanAccessPeer() returns false for "cross card" communication. Note that ALL host-to-device paths are always available (of course).

1 Answer


Enabling CUDA peer to peer access is managed by the GPU driver, which inspects the system configuration to determine if Peer-to-Peer access is likely to work.

For example, peer access is not enabled when direct communication between two devices would have to travel over a QPI link, as referred to here.

In other words, the driver decides whether to enable peer access based on two things: whether it recognizes the system topology, and whether that topology passes the heuristics that predict peer-to-peer will work.

In your case, being able to communicate between devices on the same card simply means that, according to the driver's topology heuristics, peer-to-peer will work when the only intervening device is the PCIE switch on the card, so it is enabled (and cudaDeviceCanAccessPeer returns true).

In your case, I would say that if you can successfully enable Peer access between devices on the same card but not in any other scenario, then probably your system topology is falling into some sort of "unrecognized" scenario or possibly a blacklisted scenario. In other words, it's probably expected behavior.

If you can enable Peer access between devices on the same card, and also between some pairs of devices on different cards, but not other pairs of devices on different cards, that is probably a machine configuration issue or a bug.
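
To make the three cases above concrete, here is a rough sketch that sorts an observed peer matrix into "none", "same card only", "all pairs", or "mixed". It is host-only code with no CUDA calls. The names classify, fromPairs, and P2PPattern are made up for illustration, and the card assignment (GPU0/GPU1 on one card, GPU2/GPU3 on the other) is taken from the question rather than detected. The driver's real heuristics are not public, so this only classifies the outcome; it does not reproduce the driver's decision.

```cuda
// Host-only sketch (no kernels); builds with nvcc or any C++11 compiler.
#include <cstdio>
#include <utility>
#include <vector>

enum class P2PPattern { None, SameCardOnly, AllPairs, Mixed };

// Build a symmetric n x n matrix from a list of enabled GPU pairs.
std::vector<std::vector<bool>> fromPairs(
        int n, const std::vector<std::pair<int, int>>& pairs) {
    std::vector<std::vector<bool>> m(n, std::vector<bool>(n, false));
    for (const auto& p : pairs) {
        m[p.first][p.second] = true;
        m[p.second][p.first] = true;
    }
    return m;
}

// peer[i][j]: P2P is enabled from GPU i to GPU j.
// card[i]: index of the physical card hosting GPU i (assumed known).
P2PPattern classify(const std::vector<std::vector<bool>>& peer,
                    const std::vector<int>& card) {
    int sameOk = 0, sameTotal = 0, crossOk = 0, crossTotal = 0;
    const int n = static_cast<int>(peer.size());
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            const bool ok = peer[i][j] && peer[j][i];
            if (card[i] == card[j]) { ++sameTotal; if (ok) ++sameOk; }
            else                    { ++crossTotal; if (ok) ++crossOk; }
        }
    }
    if (sameOk == 0 && crossOk == 0) return P2PPattern::None;
    if (sameOk == sameTotal && crossOk == crossTotal) return P2PPattern::AllPairs;
    if (crossOk == 0) return P2PPattern::SameCardOnly;  // likely expected behavior
    return P2PPattern::Mixed;  // some cross-card pairs only: configuration issue or bug?
}

const char* name(P2PPattern p) {
    switch (p) {
        case P2PPattern::None:         return "none";
        case P2PPattern::SameCardOnly: return "same card only";
        case P2PPattern::AllPairs:     return "all pairs";
        default:                       return "mixed";
    }
}

int main() {
    const std::vector<int> card = {0, 0, 1, 1};  // two GTX 590 cards, per the question
    const auto cuda40 = fromPairs(4, {{0, 1}, {0, 2}, {0, 3}, {1, 2}, {1, 3}, {2, 3}});
    const auto cuda41 = fromPairs(4, {{0, 1}, {2, 3}, {1, 3}});
    std::printf("CUDA 4.0: %s\n", name(classify(cuda40, card)));
    std::printf("CUDA 4.1: %s\n", name(classify(cuda41, card)));
    return 0;
}
```

With the pairs listed in the question, the CUDA 4.0 matrix comes out as "all pairs" and the CUDA 4.1 matrix as "mixed", which is the case I would treat as unexpected.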

The management heuristics and whitelists and blacklists maintained by the driver may change from driver version to driver version, which explains why you are seeing the difference in behavior as you move from older to newer versions. (Yes, the heuristics can become more restrictive as you move to newer versions.)

For example, it might be the case that when the heuristics were originally defined in the 270.41.19 driver that shipped with CUDA 4.0, the RD890 chipset was considered "safe" for PCIE P2P. Later on, based on testing or customer reports, it might have been found that some motherboards with RD890 had some sort of problem with P2P, and P2P might therefore have been "shut off" in the driver for RD890-based systems. I don't know this to be true for RD890; I'm just giving an example of what might have occurred, to show why the heuristics could become more restrictive over time.

I offer the above not as a complete explanation of your case, because if you can enable P2P between some GPUs on different cards but not between other GPUs on different cards, then that sounds like unexpected behavior to me. The remainder of my description is just background information.

Your description is not entirely clear to me, because in the first instance you indicate that:

GPU0 <-> GPU1 (same card)

GPU2 <-> GPU3 (same card)

GPU1 <-> GPU3

are the successful paths. This appears to be unexpected behavior to me, assuming that GPU1 <-> GPU3 represents "cross card" communication.

Later, you indicated:

but when using cuda 4.1 and up cudaDeviceCanAccessPeer() gives false for "cross card" communications.

If that is true, it could simply be expected behavior resulting from a change to the enablement heuristics in the driver.

Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support is what the runtime reports when queried via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.
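
As a minimal sketch of such a query, the following host program asks the runtime about every ordered pair of visible devices and prints the result as a table. The helper name queryPeerMatrix is my own; cudaGetDeviceCount and cudaDeviceCanAccessPeer are the actual runtime calls. It only reports what the driver decided; it does not enable peer access or attempt any copies.

```cuda
// Host-side sketch using the CUDA runtime API (no kernels).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Returns an n x n matrix where m[i][j] is true if the runtime reports that
// device i can directly access memory on device j.
std::vector<std::vector<bool>> queryPeerMatrix(int n) {
    std::vector<std::vector<bool>> m(n, std::vector<bool>(n, false));
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            // A failed query is treated as "no access".
            if (cudaDeviceCanAccessPeer(&can, i, j) == cudaSuccess) {
                m[i][j] = (can != 0);
            }
        }
    }
    return m;
}

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n < 2) {
        std::printf("Fewer than two CUDA devices visible; nothing to query.\n");
        return 0;
    }
    const auto m = queryPeerMatrix(n);
    std::printf("     ");
    for (int j = 0; j < n; ++j) std::printf("GPU%d ", j);
    std::printf("\n");
    for (int i = 0; i < n; ++i) {
        std::printf("GPU%d ", i);
        for (int j = 0; j < n; ++j) {
            std::printf("%4s ", i == j ? "-" : (m[i][j] ? "yes" : "no"));
        }
        std::printf("\n");
    }
    return 0;
}
```

Running this under each driver version and comparing the tables is a quick way to see exactly which pairs changed.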