pytorch recover from RuntimeError: CUDA error: device-side assert triggered without restarting script

Question

I suppose everybody who has worked with PyTorch knows the error
RuntimeError: CUDA error: device-side assert triggered
to some extent.
I'm generating a lot of data with GPU code in my script (200k+ long vectors), so it takes a while. I'm doing this in batches via a generator, as I do not have enough GPU memory to hold all the vectors at once. The generator has the following structure:

    for i in range(0, len(inputs), batch_size):
        try:
            <generate the vectors>
            yield 1, <the vectors>  # Here it was successful
        except RuntimeError:
            print(f'could not generate vectors {i} to {i + batch_size}')
            yield 0, (i, i + batch_size)  # Here the input was malformed

I know that some of the input is malformed to the point that generating vectors from it will fail with a runtime error and that's fine, it's not even 1% of my dataset. I want to get the indices and deal with it later.

Here's my problem
Once vector creation fails, the GPU is basically bricked and responds to every subsequent request with the aforementioned error. Validating all input beforehand would be cumbersome and slow, so I don't want to do that. I want to roll over all malformed inputs and deal with them later.

My question is
How can I recover the GPU from this bricked state as easily and quickly as possible? All the questions I have found so far ask about fixing the underlying error, which I do not need to do. I just want to get on with generating vectors from my dataset.

Georgi Georgiev · Accepted Answer · 2020-03-27T17:01:48

One way to do it might be to log your progress through the generated input vectors and restart the process/machine if the GPU gets bricked. If the percentage of malformed inputs is small enough, the cost of resetting the GPU/machine might be negligible. You can have a periodic job that checks whether the work is finished and restarts it if it is not. This is a crude way to approach the problem, but it should work.

For example:

    for i in range(0, len(inputs), batch_size):
        try:
            # Skip batches that a previous run already finished or failed on
            exist = check_if_current_index_has_succeded_or_failed()
            if exist:
                continue
            else:
                log_current_index()
            <generate the vectors>
            log_success()
            yield 1, <the vectors>  # Here it was successful
        except RuntimeError:
            log_failure()
            print(f'could not generate vectors {i} to {i + batch_size}')
            yield 0, (i, i + batch_size)  # Here the input was malformed
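
The logging helpers above are left abstract, so here is one way they might be filled in. This is only a sketch under a few assumptions: progress is kept in a JSON-lines file, `generate` stands in for whatever produces your vectors, `on_vectors` is a hypothetical callback that saves the result (for example, after moving it to the CPU), and all the function names below are made up for illustration. After a device-side assert the CUDA context of that process stays unusable, so `run_once` returns right after the first failure instead of carrying on, and the script is expected to exit and be started again.

```python
import json
import os


def load_status(log_path):
    """Return {batch_start: status} from a JSON-lines progress log."""
    status = {}
    if os.path.exists(log_path):
        with open(log_path) as fh:
            for line in fh:
                rec = json.loads(line)
                status[rec["start"]] = rec["status"]  # later lines win
    return status


def log_status(log_path, start, status):
    """Append one record and force it to disk so it survives a crash."""
    with open(log_path, "a") as fh:
        fh.write(json.dumps({"start": start, "status": status}) + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def run_once(inputs, batch_size, generate, log_path, on_vectors):
    """Process batches not yet in the log.

    Returns True once every batch has been visited, and False right after
    the first failure so the caller can exit and be restarted in a fresh
    process (and therefore with a fresh CUDA context).
    """
    seen = load_status(log_path)
    for i in range(0, len(inputs), batch_size):
        if i in seen:
            # "ok" and "failed" are final. "started" means an earlier run
            # died in the middle of this batch, so it is skipped as well.
            continue
        log_status(log_path, i, "started")
        try:
            vectors = generate(inputs[i:i + batch_size])
        except RuntimeError:
            log_status(log_path, i, "failed")
            return False
        on_vectors(i, vectors)
        log_status(log_path, i, "ok")
    return True


def failed_batches(log_path):
    """Start indices of batches that did not finish and need a look later."""
    return sorted(s for s, st in load_status(log_path).items() if st != "ok")
```

The script's entry point could then call `sys.exit(0 if run_once(...) else 1)`, and the periodic job (or a simple shell loop) reruns the script until it exits with 0. Each restart costs one process start-up and model load, which should be acceptable if well under 1% of the batches fail. Afterwards, `failed_batches(log_path)` gives the indices to deal with later.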
