
I'm trying to use local memory inside a device-side enqueued kernel.

My assumption is that any locally-declared (__local) array is visible to all work items in the work-group. This holds when I use local memory in kernels enqueued from the host side, but I run into problems when I use a similar setup in device-side enqueued kernels.
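For concreteness, this is the kind of host-enqueued kernel where the assumption does hold for me (a minimal sketch; WG_SIZE is a hypothetical constant passed at build time with -D):

__kernel void reverse_in_group(__global int *data)
{
    __local int scratch[WG_SIZE];

    int lid = get_local_id(0);
    int gid = get_global_id(0);

    scratch[lid] = data[gid];                 // each work item fills its own slot
    work_group_barrier(CLK_LOCAL_MEM_FENCE);  // writes now visible group-wide

    // Reading a slot written by a different work item works because
    // scratch is __local and therefore shared across the work-group.
    data[gid] = scratch[get_local_size(0) - 1 - lid];
}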

Is there something wrong with my assumption?

Edit:
My kernel is below. My goal is to sort packets from the FIFO pipe into 3 bins. The problem is that each work item appears to see only its own copy of the bins, and I then need to write the finished bins into another pipe.

// lid, tid, local_size, down_pipe, etc. come from the enclosing kernel
// (e.g. lid would be get_local_id(0)); only the relevant fragment is shown.
int pivot;
int pipe_out;

// These arrays must be __local to be shared across the work-group;
// without the qualifier each work item gets its own private copy.
__local int in_pipe[BIN_SIZE];
__local int lt_bin[BIN_SIZE];
__local int gt_bin[BIN_SIZE];
__local int e_bin[BIN_SIZE];

reserve_id_t down_id = work_group_reserve_read_pipe(down_pipe, local_size);
//while ( is_valid_reserve_id(down_id) == false){
//  down_id = work_group_reserve_read_pipe(down_pipe, local_size);
//}
//in_bin[tid] = -5;
if (is_valid_reserve_id(down_id)) {
    // Each work item reads one packet from its own slot in the reservation.
    int status = read_pipe(down_pipe, down_id, lid, &pipe_out);
    work_group_commit_read_pipe(down_pipe, down_id);

    // Work item 0's packet becomes the pivot for the whole group.
    pivot = pipe_out;
    pivot = work_group_broadcast(pivot, 0);

    work_group_barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);
    in_pipe[tid] = pipe_out;
    //in_bin[lid] = in_pipe[tid];

    // Classify this work item's packet relative to the pivot.
    int e_count = 0;
    int gt_count = 0;
    int lt_count = 0;

    if(in_pipe[tid] == pivot){
        e_count = 1;
    }
    else if(in_pipe[tid] < pivot){
        lt_count = 1;
    }
    else if(in_pipe[tid] > pivot){
        gt_count = 1;
    }

    // The reductions give each bin's total (visible to every work item, so
    // the broadcasts are redundant but harmless); the exclusive scans give
    // each work item its write slot within its bin.
    int e_tot = work_group_reduce_add(e_count);
    e_tot = work_group_broadcast(e_tot, 0);
    int e_val = work_group_scan_exclusive_add(e_count);

    int gt_tot = work_group_reduce_add(gt_count);
    gt_tot = work_group_broadcast(gt_tot, 0);
    int gt_val = work_group_scan_exclusive_add(gt_count);

    int lt_tot = work_group_reduce_add(lt_count);
    lt_tot = work_group_broadcast(lt_tot, 0);
    int lt_val = work_group_scan_exclusive_add(lt_count);

    //in_bin[tid] = lt_val;
    work_group_barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE);

    // Scatter each packet into its bin at the slot computed by the scan.
    if (in_pipe[tid] == pivot) {
        e_bin[e_val] = in_pipe[tid];
        //in_bin[e_val] = e_bin[e_val];
        //e_bin[e_val] = work_group_broadcast(e_bin[e_val], lid);
    }
    if (in_pipe[tid] < pivot) {
        lt_bin[lt_val] = in_pipe[tid];
        //in_bin[lt_val] = lt_bin[lt_val];
    }
    if (in_pipe[tid] > pivot) {
        gt_bin[gt_val] = in_pipe[tid];
        //in_bin[gt_val] = gt_bin[gt_val];
    }
}
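The part that is still missing is the write-back: inside the same reservation-guarded block I would then push a bin into a second pipe, along these lines (a sketch; up_pipe is an assumed second host-created pipe passed to the kernel):

reserve_id_t up_id = work_group_reserve_write_pipe(up_pipe, lt_tot);
if (is_valid_reserve_id(up_id)) {
    // The first lt_tot work items each write one element of the bin.
    if (lid < lt_tot)
        write_pipe(up_pipe, up_id, lid, &lt_bin[lid]);
    work_group_commit_write_pipe(up_pipe, up_id);
}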

2 Answers


No, not wrong. Local (__local) variables can be declared and used across a whole work-group in device-side enqueued kernels too. They are not shared with the parent kernel, though.
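As a sanity check, a child kernel enqueued from the device can use work-group-shared memory through the local-buffer variant of enqueue_kernel. A minimal sketch (sizes and names are illustrative):

kernel void parent(global int *data)
{
    if (get_global_id(0) == 0) {
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_NO_WAIT,
                       ndrange_1D(64, 64),          // one 64-wide work-group (illustrative)
                       ^(local void *buf) {
                           local int *scratch = (local int *)buf;
                           int lid = get_local_id(0);
                           scratch[lid] = data[get_global_id(0)];
                           work_group_barrier(CLK_LOCAL_MEM_FENCE);
                           data[get_global_id(0)] = scratch[get_local_size(0) - 1 - lid];
                       },
                       (uint)(64 * sizeof(int)));   // bytes of local memory for buf
    }
}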

What exactly are you doing?


The resolution to my question:

Pipes cannot be created on the device side. What I was trying to build was a dynamic tree structure with branching. OpenCL pipes simply cannot support that: pipes are memory objects, and memory objects can only be created on the host side. The current specification provides no way to create memory objects from within a kernel.
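To illustrate, the only way to get a pipe is the host API (plain C; context setup and error handling omitted, names are placeholders):

#include <CL/cl.h>

/* A pipe is created like any other memory object and passed to the
   kernel as an argument; nothing equivalent exists in device code. */
cl_int err;
cl_mem pipe = clCreatePipe(ctx, CL_MEM_READ_WRITE,
                           sizeof(cl_int), /* packet size */
                           4096,           /* max packets (illustrative) */
                           NULL, &err);
err = clSetKernelArg(kern, 0, sizeof(cl_mem), &pipe);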

Pipes can, however, be used in a dynamically recursive fashion, as long as the recursion does not branch and proceeds linearly. For details, consult the sample code in the AMD APP SDK, specifically the Device Enqueue BFS example.
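The linear pattern looks roughly like this (my sketch, not the AMD sample itself; the remaining counter and the kernel shape are assumptions):

kernel void level(read_only pipe int in, global int *remaining)
{
    // ... drain and partition this level's packets here ...

    // Only one follow-up enqueue per level, and it waits on the current
    // level, so the recursion stays strictly linear; it can only reuse
    // the same host-created pipes, since none can be created here.
    if (get_global_id(0) == 0 && *remaining > 0) {
        *remaining -= 1;
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange_1D(get_global_size(0)),
                       ^{
                           // next level's work goes here
                       });
    }
}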