Here is some sample code:
__kernel void my_kernel(__global float* src,
                        __global float* dst){
    // load four consecutive floats starting at src
    float4 a = vload4(0, src);
    // do something to a
    ...
    // store the four floats starting at dst
    vstore4(a, 0, dst);
}
According to the OpenCL 1.2 reference, the addresses of the global buffers src and dst must be 4-byte aligned (the size of a float) when using vloadn and vstoren; otherwise the results are undefined. My question is: does OpenCL automatically align the global device address of a buffer created with clCreateBuffer? If not, how can I ensure proper alignment? In addition, what about local memory objects?
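To make the requirement concrete, here is a minimal plain-C sketch of the check I have in mind. It only mimics the rule on ordinary host pointers and says nothing about what clCreateBuffer actually guarantees for device addresses. The helper names is_aligned and vload4_float_address_ok are hypothetical, made up for illustration, and not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an OpenCL API): returns nonzero if p is a
 * multiple of `alignment` bytes. `alignment` must be a power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: vload4(offset, src) on float data reads from
 * src + offset * 4. The rule quoted above requires that address to be
 * aligned to sizeof(float), i.e. 4 bytes on typical platforms.
 * The caller must keep src + offset * 4 inside the underlying array. */
static int vload4_float_address_ok(const float *src, size_t offset)
{
    return is_aligned(src + offset * 4, sizeof(float));
}
```

For an ordinary float array buf with at least 12 elements, vload4_float_address_ok(buf, 2) passes, while a pointer shifted by a single byte, such as (const char *)buf + 1, fails is_aligned(..., sizeof(float)).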