HAAR wavelet transform in CUDA

Question

I have tried to implement the HAAR wavelet transform in CUDA for a 1D array.

ALGORITHM

I have 8 indices in the input array.

With the condition if(x_index>=o_width/2 || y_index>=o_height/2) return; I will have 4 threads, which should start at input indices 0, 2, 4 and 6, and I plan to handle two input indices with each of them.

I calculate the average. E.g., if my thread id is 0, then avg is (input[0]+input[1])/2, and at the same time I get the difference, which is input[0]-avg, and so on for the rest of the threads.

Now the important thing is the placement of the output. I created a separate thread id for the output, because using indices 0, 2, 4, 6 made it difficult to place the output at the correct index.

My averages should be placed in the first 4 indices of the output, i.e. 0, 1, 2, 3, so o_thread_id should be 0, 1, 2, 3. Similarly, to place the differences at 4, 5, 6, 7, I have offset 0, 1, 2, 3 by 4, as shown in the code.
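For reference, the layout described above can be sketched on the host in plain C++. This is only a minimal sketch of one averaging/differencing level, not the kernel itself; the function name haar_level_reference is hypothetical, and the sketch divides by 2.0f on the assumption that fractional averages are wanted (dividing the int sum by the int 2 would truncate).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for one averaging/differencing level.
// The name haar_level_reference is hypothetical (not from the question).
std::vector<float> haar_level_reference(const std::vector<int>& input)
{
    assert(input.size() % 2 == 0);  // samples are consumed in pairs
    const std::size_t half = input.size() / 2;
    std::vector<float> output(input.size());

    for (std::size_t i = 0; i < half; ++i) {
        // Divide by 2.0f so the int sum is not truncated by integer division.
        const float avg  = (input[2 * i] + input[2 * i + 1]) / 2.0f;
        const float diff = input[2 * i] - avg;
        output[i]        = avg;   // averages fill the first half
        output[half + i] = diff;  // differences fill the second half
    }
    return output;
}
```

For the input {1,2,3,4,5,6,7,8} this yields the averages 1.5, 3.5, 5.5, 7.5 followed by four differences of -0.5, which is a handy result to compare the GPU output against.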

PROBLEM

My output comes out as all zeros! No matter what I change, I get the same result.

CODE

__global__ void cal_haar(int input[],float output [],int i_widthstep,int o_widthstep,int o_width,int o_height)
{

    int x_index=blockIdx.x*blockDim.x+threadIdx.x;
    int y_index=blockIdx.y*blockDim.y+threadIdx.y;

    if(x_index>=o_width/2 || y_index>=o_height/2) return;

    int i_thread_id=y_index*i_widthstep+(2*x_index);
    int o_thread_id=y_index*o_widthstep+x_index;

    float avg=(input[i_thread_id]+input[i_thread_id+1])/2;
    float diff=input[i_thread_id]-avg;
    output[o_thread_id]=avg;
    output[o_thread_id+4]=diff;

}

void haar(int input[],float output [],int i_widthstep,int o_widthstep,int o_width,int o_height)
{

    int * d_input;
    float * d_output;

    cudaMalloc(&d_input,i_widthstep*o_height);
    cudaMalloc(&d_output,o_widthstep*o_height);

    cudaMemcpy(d_input,input,i_widthstep*o_height,cudaMemcpyHostToDevice);

    dim3 blocksize(16,16);
    dim3 gridsize;
    gridsize.x=(o_width+blocksize.x-1)/blocksize.x;
    gridsize.y=(o_height+blocksize.y-1)/blocksize.y;

    cal_haar<<<gridsize,blocksize>>>(d_input,d_output,i_widthstep,o_widthstep,o_width,o_height);


    cudaMemcpy(output,d_output,o_widthstep*o_height,cudaMemcpyDeviceToHost);

    cudaFree(d_input);
    cudaFree(d_output);

}

The following is my main function:

void main()
{
    int in_arr[8]={1,2,3,4,5,6,7,8};
    float out_arr[8];
    int i_widthstep=8*sizeof(int);
    int o_widthstep=8*sizeof(float);
    haar(in_arr,out_arr,i_widthstep,o_widthstep,8,1);

    for(int c=0;c<=7;c++)
    {cout<<out_arr[c]<<endl;}
    cvWaitKey();

}

Can you tell me where I am going wrong that it gives me zeros as output? Thank you.

Sorry, I don't understand. Could you kindly point to the specific line of code? — Code_Jamer
For example, this signature: __global__ void cal_haar(int input[],float output [],int i_widthstep,float o_widthstep,int o_width,int o_height). Try to use pointers, like this: __global__ void cal_haar(int* input,float* output,int i_widthstep,float o_widthstep,int o_width,int o_height). — geek
Include additional relevant information in your question. It's very difficult to read and follow in a comment. — Bart
OK, so widthstep is in bytes, which takes care of malloc and memcpy. But in the kernel you use it to calculate an index into your float array, which means you'll be accessing memory which has not been allocated. And follow @talonmies advice and add some error checks. Also see the cuda-memcheck tool to help you find mistakes like this. — Peter

aland · Accepted Answer · 2012-05-23T20:35:45

The problem with your code is the following condition:

if(x_index>=o_width/2 || y_index>=o_height/2) return;

Given o_height = 1, we have o_height/2 = 0 (o_height is an int, so this is integer division, which rounds down), so every thread returns early and none of them performs any work. To achieve what you want, you can either do floating-point arithmetic here, or use (o_height+1)/2 and (o_width+1)/2, which performs the division with "arithmetic" rounding (rounding half up). You would then have ( x_index >= (8+1)/2 /*= 4*/ || y_index >= (1+1)/2 /*= 1*/ ).
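The effect of the rounding can be checked on the host without a GPU. The following is a minimal sketch in plain C++ that replays both versions of the guard over every thread of a launch; the helper names skipped_original, skipped_rounded_up and count_active_threads are hypothetical and only serve the illustration.

```cpp
#include <cassert>

// Early-return test exactly as written in the question.
bool skipped_original(int x_index, int y_index, int o_width, int o_height)
{
    return x_index >= o_width / 2 || y_index >= o_height / 2;
}

// Same test with both divisions rounded up.
bool skipped_rounded_up(int x_index, int y_index, int o_width, int o_height)
{
    return x_index >= (o_width + 1) / 2 || y_index >= (o_height + 1) / 2;
}

// Counts how many threads of a launch_x by launch_y launch pass the guard.
int count_active_threads(bool (*skipped)(int, int, int, int),
                         int launch_x, int launch_y,
                         int o_width, int o_height)
{
    assert(launch_x > 0 && launch_y > 0);
    int active = 0;
    for (int y = 0; y < launch_y; ++y)
        for (int x = 0; x < launch_x; ++x)
            if (!skipped(x, y, o_width, o_height))
                ++active;
    return active;
}
```

With o_width = 8, o_height = 1 and a single 16x16 block, count_active_threads returns 0 for skipped_original and 4 for skipped_rounded_up. Note that for an odd o_width the rounded-up bound also admits a thread whose second sample lies past the end of the row, so this guard assumes an even width.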

Besides, there is a problem with addressing when you have more than one thread in the Y dimension: your i_thread_id and o_thread_id calculations would then be wrong (_widthstep is a size in bytes, yet you use it as an array index).
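One possible way to handle the byte pitch is to convert it to a count of elements before using it in an index. The sketch below is plain C++ and assumes the pitch is an exact multiple of the element size; the helper name element_index is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Converts (row, column) into an element index when the row pitch is
// given in bytes. The name element_index is hypothetical.
std::size_t element_index(std::size_t row, std::size_t column,
                          std::size_t widthstep_bytes,
                          std::size_t element_size)
{
    // Assumption: each row holds a whole number of elements.
    assert(element_size > 0 && widthstep_bytes % element_size == 0);
    const std::size_t elements_per_row = widthstep_bytes / element_size;
    return row * elements_per_row + column;
}
```

For a row of 8 ints, element_index(1, 2, 8 * sizeof(int), sizeof(int)) gives 10, whereas multiplying the row by the byte pitch directly would give 1 * 8 * sizeof(int) + 2, which is well past the end of a two-row buffer. In the kernel the same idea would read roughly as y_index * (i_widthstep / sizeof(int)) + 2 * x_index for the input and y_index * (o_widthstep / sizeof(float)) + x_index for the output.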
