I work with one cluster based on the Tesla M2090 (Fermi) and another based on the K20Xm (Kepler). My kernel runs 2.5 times faster on the Fermi cluster than on the Kepler one. For the Kepler cluster the kernel was compiled with the flags -arch=sm_35 --ptxas-options=-v, which gives:
ptxas info : Compiling entry function '_Z22_repack_one_thread_8_2ILb1EEviPtPPh' for 'sm_35'
ptxas info : Function properties for _Z22_repack_one_thread_8_2ILb1EEviPtPPh
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 344 bytes cmem[0]
So, with 1024 threads, 18 registers per thread, and 0 bytes of shared memory, I get 100% multiprocessor occupancy.
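That occupancy figure can be reproduced with a rough calculation. The sketch below is host-only C++ and assumes the per-SM limits commonly listed for compute capability 3.5 (2048 resident threads, 16 resident blocks, 65536 registers) and that the 1024 threads are the block size. The names SmLimits, kSm35 and estimate_occupancy are made up for illustration, and register allocation granularity and shared memory are ignored.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM resource limits. The sm_35 values below are an assumption taken
// from the compute capability tables; check them against your toolkit.
struct SmLimits {
    int max_threads;  // resident threads per SM
    int max_blocks;   // resident blocks per SM
    int registers;    // 32-bit registers per SM
};

constexpr SmLimits kSm35 = {2048, 16, 65536};

// Theoretical occupancy as a fraction of max_threads.
// Ignores register allocation granularity and shared memory usage.
double estimate_occupancy(const SmLimits& sm, int block_threads, int regs_per_thread) {
    assert(block_threads > 0 && regs_per_thread > 0);
    const int by_threads = sm.max_threads / block_threads;
    const int by_regs = sm.registers / (regs_per_thread * block_threads);
    const int blocks = std::min({by_threads, by_regs, sm.max_blocks});
    return static_cast<double>(blocks * block_threads) / sm.max_threads;
}
```

Calling estimate_occupancy(kSm35, 1024, 18) yields 1.0: two blocks of 1024 threads fit by thread count, and the registers would allow three. Full occupancy only says the SM can hold its maximum number of threads; it does not by itself say anything about memory throughput.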
What could be the reason for the slower performance on the Kepler-based node?
Thank you.
Voytsekh
UPDATE
My kernel
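// word and byte are typedefs not shown here; the mangled name (iPtPPh)
// implies word = unsigned short and byte = unsigned char.
template <bool nocheck>
__global__ void _repack_one_thread_8_2 (int size, word *input, byte **outputs)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (nocheck || idx * 8 < size)
    {
        // Each thread reads 4 consecutive words and produces 8 bytes.
        word *ptr = input + idx * 4;
        byte bytes[8] = {0,0,0,0,0,0,0,0};
        int i, j;
        for (i = 0; i < 4; i++, ptr++)
        {
            word b = *ptr;
            // The j-th 2-bit field of word i becomes bits 2i..2i+1 of byte j.
            for (j = 0; j < 8; j++)
                bytes[j] |= (((b >> (j * 2)) & 3) << (i * 2));
        }
        // One byte goes to each of the 8 output arrays.
        for (i = 0; i < 8; i++)
            outputs[i][idx] = bytes[i];
    }
}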
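To rule out a correctness problem on either node before comparing timings, the kernel's bit shuffling can be mirrored on the host. This is a minimal sketch, not part of the original code: repack_reference is a hypothetical helper, the typedefs are inferred from the mangled kernel name, and the input length is assumed to be a multiple of 4 words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::uint16_t word;  // assumed: 't' (unsigned short) in the mangled name
typedef std::uint8_t byte;   // assumed: 'h' (unsigned char) in the mangled name

// Host reference for the kernel: one outer loop iteration per GPU thread.
// Returns 8 output arrays, each holding one byte per group of 4 input words.
std::vector<std::vector<byte>> repack_reference(const std::vector<word>& input) {
    assert(input.size() % 4 == 0);
    const std::size_t groups = input.size() / 4;
    std::vector<std::vector<byte>> outputs(8, std::vector<byte>(groups, 0));
    for (std::size_t idx = 0; idx < groups; ++idx) {
        for (int i = 0; i < 4; ++i) {
            const word b = input[idx * 4 + i];
            // The j-th 2-bit field of word i becomes bits 2i..2i+1 of byte j.
            for (int j = 0; j < 8; ++j)
                outputs[j][idx] |= static_cast<byte>(((b >> (j * 2)) & 3) << (i * 2));
        }
    }
    return outputs;
}
```

Comparing these arrays byte by byte with the results copied back from each GPU would show whether both builds compute the same thing.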
Compile command for Kepler
nvcc -arch=sm_35 --ptxas-options=-v -c -O3 -I.. -o
Compile command for Fermi
nvcc -arch=sm_20 --ptxas-options=-v -c -O3 -I.. -o
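The launch code is not shown here, but the grid size follows from the kernel's bounds check (idx * 8 < size), which implies one thread per 8 units of size. The sketch below is an assumption about how such a launch could be set up, not the actual launch code; LaunchConfig and make_launch_config are hypothetical names.

```cpp
#include <cassert>

struct LaunchConfig {
    int blocks;    // grid size in blocks
    bool nocheck;  // true when every launched thread passes idx * 8 < size
};

// Derives the grid size for a given block size from the kernel's bounds check.
LaunchConfig make_launch_config(int size, int block_threads) {
    assert(size > 0 && block_threads > 0);
    const int threads = (size + 7) / 8;  // threads needed, rounded up
    const int blocks = (threads + block_threads - 1) / block_threads;
    // The check is redundant exactly when the threads fill whole blocks.
    const bool nocheck = (threads % block_threads == 0);
    return {blocks, nocheck};
}
```

Because nocheck is a template parameter, the caller would branch on cfg.nocheck and launch either _repack_one_thread_8_2<true> or _repack_one_thread_8_2<false> with cfg.blocks blocks. Trying several block sizes (for example 128, 256, 512, 1024) on each architecture is a cheap way to see whether the launch shape matters.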
-arch=sm_35? Are you sure you are getting correct results for the Fermi node? – Vitality
"-arch=sm_35". Should you fix it? Also, it would be interesting to know how you are launching the kernel on the two architectures. Generally speaking, it may be necessary to optimize the launch parameters separately to maximize performance on Fermi and on Kepler. – Vitality