1
votes

I am working on some CUDA program and I wanted to speed up computation using constant memory but it turned that using constant memory makes my code ~30% slower.

I know that constant memory is good at broadcasting reads to whole warps and I thought that my program could take an advantage of it.

Here is constant memory code:

__constant__ float4 constPlanes[MAX_PLANES_COUNT];

__global__ void faultsKernelConstantMem(const float3* vertices, unsigned int vertsCount, int* displacements, unsigned int planesCount) {

    unsigned int blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
    unsigned int vertexIndex = __mul24(blockId, blockDim.x) + threadIdx.x;

    if (vertexIndex >= vertsCount) {
        return;
    }

    float3 v = vertices[vertexIndex];
    int displacementSteps = displacements[vertexIndex];

    //__syncthreads();

    for (unsigned int planeIndex = 0; planeIndex < planesCount; ++planeIndex) {
        float4 plane = constPlanes[planeIndex];
        if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
            ++displacementSteps;
        }
        else {
            --displacementSteps;
        }
    }

    displacements[vertexIndex] = displacementSteps;
}

Global memory code is the same but it have one parameter more (with pointer to array of planes) and uses it instead of global array.

I thought that those first global memory reads

float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];

may cause "desynchronization" of threads and then they will not take an advantage of broadcasting of constant memory reads so I've tried to call __syncthreads(); before reading constant memory but it did not changed anything.

What is wrong? Thanks in advance!

System:

  • CUDA Driver Version: 5.0
  • CUDA Capability: 2.0

Parameters:

  • number of vertices: ~2.5 millions
  • number of planes: 1024

Results:

  • constant mem version: 46 ms
  • global mem version: 35 ms

EDIT:

So I've tried many things how to make the constant memory faster, such as:

1) Comment out the two global memory reads to see if they have any impact and they do not. Global memory was still faster.

2) Process more vertices per thread (from 8 to 64) to take advantage of CM caches. This was even slower then one vertex per thread.

2b) Use shared memory to store displacements and vertices - load all of them at beginning, process and save all displacements. Again, slower than shown CM example.

After this experience I really do not understand how the CM read broadcasting works and how can be "used" correctly in my code. This code probably can not be optimized with CM.

EDIT2:

Another day of tweaking, I've tried:

3) Process more vertices (8 to 64) per thread with memory coalescing (every thread goes with increment equal to total number of threads in system) -- this gives better results than increment equal to 1 but still no speedup

4) Replace this if statement

if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
    ++displacementSteps;
}
else {
    --displacementSteps;
}

which is giving 'unpredictable' results with little bit of math to avoid branching using this code:

float dist = v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w;
int distInt = (int)(dist * (1 << 29));  // distance is in range (0 - 2), stretch it to int range
int sign = 1 | (distInt >> (sizeof(int) * CHAR_BIT - 1));  // compute sign without using ifs
displacementSteps += sign;

Unfortunately this is a lot of slower (~30%) than using the if so ifs are not that big evil as I thought.

EDIT3:

I am concluding this question that this problem probably can not be improved by using constant memory, those are my results*:

Graph of global and constant memory performance

*Times reported as median from 15 independent measurements. When constant memory was not large enough for saving all planes (4096 and 8192), kernel was invoked multiple times.

1
__syncthreads() has different purpose. You use it when you want to synchronize block level threads, e.g. when you make use of shared memory. For this case is out of question. - KiaMorot

1 Answers

2
votes

Although a compute capability 2.0 chip has 64k of constant memory, each of the multi-processors has only 8k of constant-memory cache. Your code has each thread requiring access to all 16k of the constant memory, so you are losing performance through cache misses. To effectively use constant memory for the plane data, you will need to restructure your implementation.