Array Length in CUDA thrust

Question

My CUDA kernel is using thrust, sort and reduce by key. when i use an array more than 460 it start showing incorrect results.

could any one explain this behavior? or it is something related to my machine ?

the sort is working correctly despite the size, however, the REDUCE_BY_KEY is not working good. and return improper results.

more details about the code, i have 4 arrays 1) input keys which is defined as wholeSequenceArray. 2) input values which is defined in the kernel with initial value of 1. 3) output keys is to save the distinct values of the input keys 4) output values is to save the sum of input values corresponding to the same input key .

for more description about the reduce_by_key please visit this page: https://thrust.github.io/doc/group__reductions.html#gad5623f203f9b3fdcab72481c3913f0e0

here is my code:

#include <cstdlib>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

using namespace std;
#define size 461

__global__ void calculateOccurances(unsigned int *input_keys,
            unsigned int *output_Values) {
    int tid = threadIdx.x;

    const int N = size;
    __shared__ unsigned int input_values[N];

    unsigned int outputKeys[N];

    int i = tid;
    while (i < N) {
            if (tid < N) {
                    input_values[tid] = 1;
            }
            i += blockDim.x;
    }
    __syncthreads();

    thrust::sort(thrust::device, input_keys, input_keys + N);

    thrust::reduce_by_key(thrust::device, input_keys, input_keys + N,
                    input_values, outputKeys, output_Values);

    if (tid == 0) {
            for (int i = 0; i < N; ++i) {
                    printf("%d,", output_Values[i]);
            }
    }

}

int main(int argc, char** argv) {

    unsigned int wholeSequenceArray[size] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                    11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                    10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
                    9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
                    8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6,
                    7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5,
                    6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4,
                    5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3,
                    4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2,
                    3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1,
                    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
                    20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
                    19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                    18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                    17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
                    16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                    15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                    14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                    13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                    12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                    11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                    10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
                    9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
                    8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,1 };

    cout << "wholeSequenceArray:" << endl;
    for (int i = 0; i < size; i++) {
            cout << wholeSequenceArray[i] << ",";
    }

    cout << "\nStart C++ Array New" << endl;
    cout << "Size of Input:" << size << endl;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);

    unsigned int counts[size];
    unsigned int *d_whole;
    unsigned int *d_counts;

    cudaMalloc((void**) &d_whole, size * sizeof(unsigned int));
    cudaMalloc((void**) &d_counts, size * sizeof(unsigned int));

    cudaMemcpy(d_whole, wholeSequenceArray, size * sizeof(unsigned int),
                    cudaMemcpyHostToDevice);

    calculateOccurances<<<1, size>>>(d_whole, d_counts);

    cudaMemcpy(counts, d_counts, size * sizeof(unsigned int),
                    cudaMemcpyDeviceToHost);

    cout << endl << "Counts" << endl << endl;
    for (int i = 0; i < size; ++i) {
            cout << counts[i] << ",";
    }
    cout << endl;

    cudaFree(d_whole);
}

No, it runs smoothly, i removed the cuda errors code just to make code smaller :) — Emad R
I don't think you understand how using thrust in device code works. You have 461 threads each of which is doing its own, separate sort on the same data, in the same place. That could not possibly be a useful algorithm. Those 461 threads will each be stepping on each other as they are moving the data around to be sorted. It's not clear to me that you need a CUDA kernel at all here. Your described algorithm could be accomplished just using thrust in an ordinary fashion (i.e. from host code). The work will still get done on the device. — Robert Crovella

Robert Crovella Robert Crovella · Accepted Answer · 2017-02-13T02:22:23

When you call a thrust algorithm in a kernel, that thrust algorithm is dispatched in its entirety from each CUDA thread. Therefore your code is performing 461 sort operations on the same data (once from each CUDA kernel thread) in the same place. This means each thread will be stepping on each other as they move the data about during the sorting operation.

If you just want to count occurrences of numbers (effectively histogramming) using the method you have outlined in your question, and you want to use thrust, you don't need to write a CUDA kernel at all.

If you really want to do this (correctly) from within a CUDA kernel, then it will be necessary for you to restrict the thrust operations (sort and reduce_by_key) to only act from a single thread. (and even this methodology will be restricted to a single block).

I don't really think the second (CUDA kernel) approach makes much sense, but for completeness I have modified your code to include a correct example of each method. Note that once you perform the reduction, there is no longer any point in printing out all 461 entries in each array, so I've restricted the printouts to the first 25 entries in each array for clarity:

$ cat t91.cu
#include <cstdlib>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <fstream>
#include <string>
#include <cuda.h>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/constant_iterator.h>

using namespace std;
#define size 461

__global__ void calculateOccurances(unsigned int *input_keys,
            unsigned int *output_Values) {
    int tid = threadIdx.x;

    const int N = size;
    __shared__ unsigned int input_values[N];

    unsigned int outputKeys[N];

    int i = tid;
    while (i < N) {
            if (tid < N) {
                    input_values[tid] = 1;
            }
            i += blockDim.x;
    }
    __syncthreads();
    if (tid == 0){
      thrust::sort(thrust::device, input_keys, input_keys + N);

      thrust::reduce_by_key(thrust::device, input_keys, input_keys + N,
                    input_values, outputKeys, output_Values);
      }

    if (tid == 0) {
    printf("from kernel:\n");
            for (int i = 0; i < 25; ++i) {
                    printf("%d,", output_Values[i]);
            }
    }

}

int main(int argc, char** argv) {

    unsigned int wholeSequenceArray[size] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                    11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                    10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
                    9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
                    8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6,
                    7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5,
                    6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4,
                    5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3,
                    4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2,
                    3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1,
                    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
                    20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
                    19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
                    18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                    17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
                    16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                    15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                    14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
                    13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                    12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                    11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                    10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8,
                    9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7,
                    8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,1 };

    cout << "wholeSequenceArray:" << endl;
    for (int i = 0; i < size; i++) {
            cout << wholeSequenceArray[i] << ",";
    }

    cout << "\nStart C++ Array New" << endl;
    cout << "Size of Input:" << size << endl;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block:  %d\n", prop.maxThreadsPerBlock);

//just using thrust

    thrust::device_vector<int> d_seq(wholeSequenceArray, wholeSequenceArray+size);
    thrust::device_vector<int> d_val_out(size);
    thrust::device_vector<int> d_key_out(size);

    thrust::sort(d_seq.begin(), d_seq.end());
    int rsize = thrust::get<0>(thrust::reduce_by_key(d_seq.begin(), d_seq.end(), thrust::constant_iterator<int>(1), d_key_out.begin(), d_val_out.begin())) - d_key_out.begin();
    std::cout << "rsize:" << rsize <<  std::endl;
    std::cout << "Thrust keys:" << std::endl;
    thrust::copy_n(d_key_out.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
    std::cout << std::endl << "Thrust vals:" << std::endl;
    thrust::copy_n(d_val_out.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
    std::cout << std::endl;


// in a cuda kernel


    unsigned int counts[size];
    unsigned int *d_whole;
    unsigned int *d_counts;

    cudaMalloc((void**) &d_whole, size * sizeof(unsigned int));
    cudaMalloc((void**) &d_counts, size * sizeof(unsigned int));

    cudaMemcpy(d_whole, wholeSequenceArray, size * sizeof(unsigned int),
                    cudaMemcpyHostToDevice);

    calculateOccurances<<<1, size>>>(d_whole, d_counts);

    cudaMemcpy(counts, d_counts, size * sizeof(unsigned int),
                    cudaMemcpyDeviceToHost);

    std::cout << "from Host:" << std::endl;
    cout << endl << "Counts" << endl << endl;
    for (int i = 0; i < 25; ++i) {
            cout << counts[i] << ",";
    }
    cout << endl;

    cudaFree(d_whole);
}
$ nvcc -arch=sm_61 -o t91 t91.cu
$ ./t91
wholeSequenceArray:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,1,
Start C++ Array New
Size of Input:461
Max threads per block:  1024
rsize:20
Thrust keys:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,
Thrust vals:
24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,
from kernel:
24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,526324,526325,526325,526327,526329,from Host:

Counts

24,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,526324,526325,526325,526327,526329,
$

Notes:

I've included a method in the thrust example so you can know exactly the size of the output arrays.
The thrust method should work fine independent of the size parameter - subject to the limits of your GPU (e.g. memory size). The CUDA kernel method really is just doing the thrust code from a single thread, so it's not really sensible to run more than 1 block.
You may wish to refer to this question/answer for more discussion around using thrust from CUDA kernels.

Array Length in CUDA thrust

1 Answers