I suppose everybody who has worked with PyTorch knows the error
RuntimeError: CUDA error: device-side assert triggered
to some extent.
I'm generating a lot of data with GPU code in my script (200k+ long vectors), so it takes a while.
I'm doing this in batches via a generator, as I do not have the memory to store all vectors on my GPU at once. The generator has the following structure:
    for i in range(0, len(inputs), batch_size):
        try:
            <generate the vectors>
            yield 1, <the vectors>  # Here it was successful
        except RuntimeError:
            print(f'could not generate vectors {i} to {i + batch_size}')
            yield 0, (i, i + batch_size)  # Here the input was malformed
I know that some of the input is malformed to the point that generating vectors from it will fail with a runtime error, and that's fine; it's not even 1% of my dataset. I want to collect the indices of the failed batches and deal with them later.
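For illustration, the consuming side could look roughly like the sketch below. The names `consume` and `fake_generator` are hypothetical, and `fake_generator` only stands in for the real GPU generator so that the example runs without CUDA; it is an assumption about how the flags and index ranges would be used, not the actual script.

```python
def consume(results):
    """Split generator output into successful vectors and failed index ranges."""
    vectors, failed_ranges = [], []
    for ok, payload in results:
        if ok:
            vectors.append(payload)        # payload holds the generated vectors
        else:
            failed_ranges.append(payload)  # payload is the (start, end) index range
    return vectors, failed_ranges


def fake_generator():
    # Hypothetical stand-in for the real generator: same (flag, payload) shape.
    yield 1, [0.1, 0.2]
    yield 0, (2, 4)
    yield 1, [0.5, 0.6]


vectors, failed_ranges = consume(fake_generator())
```

The failed ranges can then be inspected or retried separately once the main run is finished.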
Here's my problem
Once vector creation fails, the GPU is basically bricked and responds to every subsequent request with the aforementioned error. Validating all input beforehand would be cumbersome and slow, so I don't want to do it. I want to skip over all malformed inputs and deal with them later.
My question is
How can I recover the GPU from this bricked state as easily and quickly as possible? All the questions I have found so far ask about fixing the underlying error, which I do not need to do. I just want to get on with generating vectors from my dataset.