I am trying to run a hello-world DPC++ sample from oneAPI that adds two 1-D arrays on both the CPU and the GPU and verifies the results. The code is shown below:

/*
DataParallel Addition of two Vectors
*/

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
using namespace sycl;

constexpr size_t array_size = 100000;
typedef std::array<int, array_size> IntArray;

// Initialize array with the same value as its index
void InitializeArray(IntArray& a) { for (size_t i = 0; i < a.size(); i++) a[i] = i; }

/*
Create an asynchronous Exception Handler for sycl
*/
static auto exception_handler = [](cl::sycl::exception_list eList) {
    for (std::exception_ptr const& e : eList) {
        try {
            std::rethrow_exception(e);
        }
        catch (std::exception const& e) {
            std::cout << "Failure" << std::endl;
            std::terminate();
        }
    }
};

void VectorAddParallel(queue &q, const IntArray& x, const IntArray& y, IntArray& parallel_sum) {
    range<1> num_items{ x.size() };
    
    buffer x_buf(x);
    buffer y_buf(y);
    buffer sum_buf(parallel_sum.data(), num_items);

    /*
    Submit a command group to the queue by a lambda
    which contains data access permissions and device computation
    */
    q.submit([&](handler& h) {

        auto xa = x_buf.get_access<access::mode::read>(h);
        auto ya = y_buf.get_access<access::mode::read>(h);
        auto sa = sum_buf.get_access<access::mode::write>(h);

        std::cout << "Adding on GPU (Parallel)\n";
        h.parallel_for(num_items, [=](id<1> i) { sa[i] = xa[i] + ya[i]; });
        std::cout << "Done on GPU (Parallel)\n";
    });

    /*
    queue runs the kernel asynchronously. Once beyond the scope,
    buffers' data is copied back to the host.
    */
}

int main() {
    default_selector d_selector;
    IntArray a, b, sequential, parallel;

    InitializeArray(a);
    InitializeArray(b);

    try {
        // Queue needs: Device and Exception handler
        queue q(d_selector, exception_handler);
        
        std::cout << "Accelerator: " 
                  << q.get_device().get_info<info::device::name>() << "\n";
        std::cout << "Vector size: " << a.size() << "\n";
        VectorAddParallel(q, a, b, parallel);
    }
    catch (std::exception const& e) {
        std::cout << "Exception while creating Queue. Terminating...\n";
        std::terminate();
    }
    
    /*
    Do the sequential, which is supposed to be slow
    */
    std::cout << "Adding on CPU (Scalar)\n";
    for (size_t i = 0; i < sequential.size(); i++) {
        sequential[i] = a[i] + b[i];
    }
    std::cout << "Done on CPU (Scalar)\n";
    
    /*
    Verify results, the old-school way
    */
    for (size_t i = 0; i < parallel.size(); i++) {
        if (parallel[i] != sequential[i]) {
            std::cout << "Fail: " << parallel[i] << " != " << sequential[i] << std::endl;
            std::cout << "Failed. Results do not match.\n";
            return -1;
        }
    }
    std::cout << "Success!\n";
    return 0;
}

With a relatively small array_size (I tested 100 to 50,000 elements), the computation works fine. Sample output:

Accelerator: Intel(R) Gen9
Vector size: 50000
Adding on GPU (Parallel)
Done on GPU (Parallel)
Adding on CPU (Scalar)
Done on CPU (Scalar)
Success!

The computation takes barely a second on both the CPU and the GPU. But when I increase array_size to, say, 100000, I get this cryptic error:

C:\Users\myuser\source\repos\dpcpp-iotas\x64\Debug\dpcpp-iotas.exe (process 24472) exited with code -1073741571.

I am not sure at what precise value the error starts occurring, but it seems to happen above roughly 70000 elements. I have no idea why this is happening. Any insights into what could be wrong?

You're likely trying to put too much data on the stack. Consider using a std::vector or dynamic allocation. - Retired Ninja
When you get a gonzo number like -1073741571, convert it to hex (C00000FD) and see if that's more recognizable. A quick google says that's Windows telling you you've probably had a stack overflow. - user4581301
Alright, I will try using dynamic allocation and update this thread. Thanks. - Karan Shah

2 Answers

1
votes

Turns out this is due to the stack size limit of executables built with Visual Studio. The std::array objects are local variables in main, so they live on the stack, and with enough elements they overflow it.
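To make the overflow concrete, here is a rough back-of-the-envelope sketch. It assumes a 4-byte int, counts only the four arrays declared in main (a, b, sequential, parallel), and takes the 1 MB default stack reserve as given; the names kDefaultStackReserve, StackBytes and FitsInDefaultStack are made up for this illustration. Everything else the program puts on the stack is ignored, so the real limit is somewhat lower.

```cpp
#include <cstddef>

// Assumed default stack reserve when /STACK is not set: 1 MB.
constexpr std::size_t kDefaultStackReserve = 1024 * 1024;

// Bytes needed on the stack for `count` local arrays of `n` ints each.
constexpr std::size_t StackBytes(std::size_t count, std::size_t n) {
    return count * n * sizeof(int);
}

// True if the four arrays from main() fit below the assumed reserve.
constexpr bool FitsInDefaultStack(std::size_t n) {
    return StackBytes(4, n) < kDefaultStackReserve;
}

// Checked at compile time: 50000 elements fit, 100000 do not.
static_assert(FitsInDefaultStack(50000), "4 x 50000 ints should fit in 1 MB");
static_assert(!FitsInDefaultStack(100000), "4 x 100000 ints should exceed 1 MB");
```

Under these assumptions the break-even point is about 65,536 elements per array, which is consistent with 50000 working and 100000 failing.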

As mentioned by @user4581301, the exit code -1073741571 is C00000FD in hex, which is the Windows status code for a stack overflow (STATUS_STACK_OVERFLOW), printed here as a signed 32-bit integer.
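The conversion can be done in any calculator, but a small sketch in C++ shows what is going on: the signed exit code is reinterpreted as an unsigned 32-bit value and printed in hex. The helper name ExitCodeToHex is hypothetical, chosen for this example only.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Reinterpret a signed 32-bit exit code as unsigned and format it as hex.
std::string ExitCodeToHex(std::int32_t exit_code) {
    std::ostringstream out;
    out << "0x" << std::uppercase << std::hex
        << static_cast<std::uint32_t>(exit_code);
    return out.str();
}
```

Calling ExitCodeToHex(-1073741571) yields the string "0xC00000FD", which is much easier to search for than the decimal value.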

Ways to fix this:

  1. Increase the /STACK reserve to something higher than 1 MB (the default) under Project Properties > Linker > System > Stack Reserve Size / Stack Commit Size.
  2. Use editbin.exe to change the stack reserve of an already-built executable (editbin /STACK:reserve); dumpbin.exe can be used to inspect the current value.
  3. Use std::vector instead, which allocates its elements dynamically on the heap (suggested by @Retired Ninja).
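For option 3, a minimal sketch of the host-side change might look like the following. The SYCL part is deliberately left out so that the sketch builds with any C++ compiler; the names IntVector, InitializeVector, VectorAddSequential and RunHostSide are hypothetical. The pointer-plus-range form already used in the question, buffer sum_buf(parallel_sum.data(), num_items), should presumably carry over to vectors via data(), but that is an assumption and is not tested here.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

using IntVector = std::vector<int>;

// Initialize the vector with the same value as its index: 0, 1, 2, ...
void InitializeVector(IntVector& a) { std::iota(a.begin(), a.end(), 0); }

// Host-side reference sum, same loop as in the question.
void VectorAddSequential(const IntVector& x, const IntVector& y, IntVector& sum) {
    for (std::size_t i = 0; i < sum.size(); i++) sum[i] = x[i] + y[i];
}

// Hypothetical driver: the elements live on the heap, so only the small
// vector objects themselves occupy stack space, whatever the value of n.
bool RunHostSide(std::size_t n) {
    IntVector a(n), b(n), sequential(n);
    InitializeVector(a);
    InitializeVector(b);
    VectorAddSequential(a, b, sequential);
    for (std::size_t i = 0; i < n; i++) {
        if (sequential[i] != static_cast<int>(2 * i)) return false;
    }
    return true;
}
```

With this layout, RunHostSide(100000) and much larger sizes no longer depend on the stack reserve, because only the vector headers sit on the stack.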

I couldn't find an option to change /STACK for oneAPI in the usual place under the Linker properties, shown here.

I decided to go with dynamic allocation.

Related: https://stackoverflow.com/a/26311584/9230398

0
votes

When I program big applications I always do a

ulimit -s unlimited

to explain to the shell that I am grown up and I really want some space on my stack.

This is bash syntax, but you can adapt it to other shells.
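To check from inside a program what limit the shell actually handed over, a small sketch using the POSIX getrlimit call could look like this. It only applies to POSIX systems (not to the Windows build in the question), and the helper names SoftStackLimit and StackIsUnlimited are made up for the example.

```cpp
#include <sys/resource.h>
#include <optional>

// Soft stack limit in bytes, or std::nullopt if the query fails.
// When the limit is "unlimited", the value equals RLIM_INFINITY.
std::optional<rlim_t> SoftStackLimit() {
    struct rlimit limits{};
    if (getrlimit(RLIMIT_STACK, &limits) != 0) return std::nullopt;
    return limits.rlim_cur;
}

// True if the soft limit was raised to unlimited, e.g. by ulimit -s unlimited.
bool StackIsUnlimited() {
    const auto limit = SoftStackLimit();
    return limit.has_value() && *limit == RLIM_INFINITY;
}
```

Printing the result of SoftStackLimit() before allocating large local arrays is a cheap way to confirm that the ulimit setting took effect in the current shell.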

I guess there might be an equivalent for non-UNIX OS?