I am working on some CUDA program and I wanted to speed up computation using constant memory but it turned that using constant memory makes my code ~30% slower.
I know that constant memory is good at broadcasting reads to whole warps and I thought that my program could take an advantage of it.
Here is constant memory code:
__constant__ float4 constPlanes[MAX_PLANES_COUNT];
__global__ void faultsKernelConstantMem(const float3* vertices, unsigned int vertsCount, int* displacements, unsigned int planesCount) {
unsigned int blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
unsigned int vertexIndex = __mul24(blockId, blockDim.x) + threadIdx.x;
if (vertexIndex >= vertsCount) {
return;
}
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
//__syncthreads();
for (unsigned int planeIndex = 0; planeIndex < planesCount; ++planeIndex) {
float4 plane = constPlanes[planeIndex];
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
}
displacements[vertexIndex] = displacementSteps;
}
Global memory code is the same but it have one parameter more (with pointer to array of planes) and uses it instead of global array.
I thought that those first global memory reads
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
may cause "desynchronization" of threads and then they will not take an advantage of broadcasting of constant memory reads so I've tried to call __syncthreads(); before reading constant memory but it did not changed anything.
What is wrong? Thanks in advance!
System:
- CUDA Driver Version: 5.0
- CUDA Capability: 2.0
Parameters:
- number of vertices: ~2.5 millions
- number of planes: 1024
Results:
- constant mem version: 46 ms
- global mem version: 35 ms
EDIT:
So I've tried many things how to make the constant memory faster, such as:
1) Comment out the two global memory reads to see if they have any impact and they do not. Global memory was still faster.
2) Process more vertices per thread (from 8 to 64) to take advantage of CM caches. This was even slower then one vertex per thread.
2b) Use shared memory to store displacements and vertices - load all of them at beginning, process and save all displacements. Again, slower than shown CM example.
After this experience I really do not understand how the CM read broadcasting works and how can be "used" correctly in my code. This code probably can not be optimized with CM.
EDIT2:
Another day of tweaking, I've tried:
3) Process more vertices (8 to 64) per thread with memory coalescing (every thread goes with increment equal to total number of threads in system) -- this gives better results than increment equal to 1 but still no speedup
4) Replace this if statement
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
which is giving 'unpredictable' results with little bit of math to avoid branching using this code:
float dist = v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w;
int distInt = (int)(dist * (1 << 29)); // distance is in range (0 - 2), stretch it to int range
int sign = 1 | (distInt >> (sizeof(int) * CHAR_BIT - 1)); // compute sign without using ifs
displacementSteps += sign;
Unfortunately this is a lot of slower (~30%) than using the if so ifs are not that big evil as I thought.
EDIT3:
I am concluding this question that this problem probably can not be improved by using constant memory, those are my results*:

*Times reported as median from 15 independent measurements. When constant memory was not large enough for saving all planes (4096 and 8192), kernel was invoked multiple times.
__syncthreads()has different purpose. You use it when you want to synchronize block level threads, e.g. when you make use of shared memory. For this case is out of question. - KiaMorot