2
votes

I work with a cluster based on Tesla M2090 (Fermi) GPUs and another one based on K20Xm (Kepler) GPUs. My kernel runs 2.5 times faster on the Fermi cluster than on the Kepler one. The kernel was compiled for the Kepler cluster with the flags -arch=sm_35 --ptxas-options=-v; the output is

ptxas info    : Compiling entry function '_Z22_repack_one_thread_8_2ILb1EEviPtPPh' for 'sm_35'
ptxas info    : Function properties for _Z22_repack_one_thread_8_2ILb1EEviPtPPh
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 18 registers, 344 bytes cmem[0]

So using 1024 threads with 18 registers per thread and 0 bytes of shared memory, I have 100% multiprocessor occupancy.

What could be the reason for the slower performance of the Kepler-based node?

Thank you.

Voytsekh

UPDATE

My kernel

template <bool nocheck>
__global__ void _repack_one_thread_8_2 (int size, word *input, byte **outputs)
{
  int idx = blockDim.x * blockIdx.x + threadIdx.x;

  if (nocheck || idx * 8 < size)
  {
    word *ptr = input + idx * 4;
    byte bytes[8] = {0,0,0,0,0,0,0,0};
    int i, j;

    for (i = 0; i < 4; i++, ptr++)
    {
      word b = *ptr;

      for (j = 0; j < 8; j++)
        bytes[j] |= (((b >> (j * 2)) & 3) << (i * 2));
    }

    for (i = 0; i < 8; i++)
      outputs[i][idx] = bytes[i];
  }
}

Compile command for Kepler

nvcc  -arch=sm_35 --ptxas-options=-v  -c -O3 -I.. -o

Compile command for Fermi

nvcc  -arch=sm_20 --ptxas-options=-v  -c -O3 -I.. -o
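To rule out accidentally running a binary built for the wrong architecture on either cluster, one option (an assumption on my part, not from the original post; the file names repack.cu and repack.o are placeholders) is to build a single fat binary that embeds code for both targets:

```shell
# Fat binary: embeds SASS for both sm_20 (Fermi) and sm_35 (Kepler),
# plus sm_35 PTX for forward compatibility with newer GPUs
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_35,code=compute_35 \
     --ptxas-options=-v -c -O3 -I.. -o repack.o repack.cu
```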
Are you sure you should be compiling for a Tesla M2090, which has compute capability 2.0, with -arch=sm_35? And are you sure you are getting correct results on the Fermi node? – Vitality
Of course, I compiled the code for the Tesla M2090 with sm_20. Also, I get correct results on both the Fermi and Kepler nodes, but the execution time... – voitsekh
Interestingly, I have also observed this scenario with a C2075 and a K20c: the C2075 beats the K20c in certain cases, but I didn't investigate much. – Sagar Masuti
Your post still reports: "This kernel was compiled for Fermi cluster with keys -arch=sm_35". Should you fix it? Also, it would be interesting to know how you are launching the kernel on the two architectures. Generally speaking, it may be necessary to tune the launch parameters separately to maximize performance on Fermi and on Kepler. – Vitality
Take a close look at the generated microcode in both cases. I suspect that, for whatever reason, the loops are being handled differently by the compiler; they look like they could be completely unrolled. (Then again, this code also looks like it would benefit from the bit-field extract and insert intrinsics, which are available on both Fermi and Kepler.) – ArchaeaSoftware

2 Answers

1
votes

There are many possible reasons. Simply saying "my code runs faster on Fermi" with no details of your code isn't much to go on.

It's possible that your kernel is not even running in the Fermi case (code compiled with -arch=sm_35 will not run on a Fermi GPU). That would certainly make the Fermi case look better.

There are many other possibilities as well. A number of areas worth investigating are covered in the Kepler Tuning Guide.

You should also be doing proper CUDA error checking in your code, and try running both cases under cuda-memcheck to get additional information about any kernel execution problems.
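A minimal error-checking pattern, sketched here as a common CUDA idiom rather than the poster's actual code (the launch configuration shown in the comment is a placeholder):

```cuda
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// After every kernel launch, check both the launch itself and its completion:
//   _repack_one_thread_8_2<true><<<blocks, threads>>>(size, input, outputs);
//   CUDA_CHECK(cudaGetLastError());        // catches invalid launch configurations
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches faults during execution
```

An invalid-architecture launch (e.g. an sm_35-only binary on a Fermi card) shows up immediately at the cudaGetLastError() check instead of silently skipping the kernel.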

-1
votes

I cannot recall the detailed differences between Fermi and Kepler. Kepler may have fewer compute units per warp than Fermi but support more warps in flight. I will check after work.

Also, 1024 threads seems too few for a meaningful comparison.

Can you try 1024*1024 threads, if that is possible?

I then did some searching. It seems the K20Xm has fewer SMXs (14 vs. 16 SMs for the M2090), lower MAD throughput (384 vs. 1332.2 GFLOPS) and a lower clock rate (732 vs. 1301 MHz). By the way, can one SMX be treated as two SMs?

It seems strange.

Data from: wiki, "integer slower than float", and some white papers