I have access to a Tesla K20c and I am running ResNet50 on the CIFAR10 dataset...
Then I get the following error:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
File "main.py", line 109, in <module>
train(loader_train, model, criterion, optimizer)
File "main.py", line 54, in train
optimizer.step()
File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
d_p.add_(weight_decay, p.data)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265
How can I resolve this error?
5 Answers
I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). Moving to the CPU changed the error message to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate the input sentences to a length of 512 tokens (BERT's maximum sequence length).
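As a rough sketch of that fix, assuming the Hugging Face transformers tokenizer (the original post does not show its tokenization code), the truncation can be requested at tokenization time:

# Sketch only: assumes the Hugging Face `transformers` tokenizer API.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# truncation=True with max_length=512 clips anything longer than BERT's
# maximum position embeddings, which is what produced the bad index.
encodings = tokenizer(
    ["a very long document ..."],   # placeholder input
    truncation=True,
    max_length=512,
    padding=True,
    return_tensors='pt',
)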
One way to raise the "CUDA error: device-side assert triggered" RuntimeError is by indexing into a GPU torch.Tensor using a list that contains out-of-range indices.
So this snippet raises an IndexError with the message "IndexError: index 3 is out of bounds for dimension 0 with size 3", not the CUDA error:
import torch

data = torch.randn((3, 10), device=torch.device("cuda"))
data[3, :]  # plain integer indexing is bounds-checked eagerly, so this raises IndexError
whereas this one raises the CUDA "device-side assert triggered" RuntimeError:
data = torch.randn((3, 10), device=torch.device("cuda"))
indices = [1, 3]
data[indices, :]  # advanced indexing runs on the GPU, so the bad index trips a device-side assert
This could mean that, in the case of class labels such as in the answer by @Rainy, it is the final class label (i.e. when label == num_classes) that causes the error when the labels start from 1 rather than 0 (a remapping sketch follows below).
Also, when the device is "cpu", the error thrown is an IndexError, such as the one thrown by the first snippet.
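A minimal sketch of the remapping, assuming a 1-based label tensor called labels (a name not taken from the answers above): shift everything down by one so the largest label becomes num_classes - 1.

import torch

num_classes = 10
# Hypothetical 1-based labels, e.g. 1..10 for a 10-class problem.
labels = torch.randint(1, num_classes + 1, (16,))

# Shift to 0-based so every label satisfies 0 <= label < num_classes,
# which is the range CrossEntropyLoss/NLLLoss assert on the GPU.
labels = labels - 1
assert labels.min().item() >= 0 and labels.max().item() < num_classes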
This error becomes much more informative if you switch to the CPU first. On the CPU it shows the exact error, which is most probably an indexing problem; in my case it was IndexError: Target 2 is out of bounds, and yours may be similar. The question to ask is: how many classes are you currently using, and what is the shape of your output? You can find the label range like this:
max(train_labels)
min(train_labels)
which in my case gave 2 and 0. The problem was caused by the missing label 1, so a quick hack is to replace all 2s with 1s, which can be done with this code:
train_ = train.copy()
train_['label'] = train_['label'].replace(2, 1)  # remap label 2 -> 1 so the labels form the contiguous range 0..1
Then run the same code again and check the results; it should work. For reference, the dataset wrapper I used looks like this:
class NDataset(torch.utils.data.Dataset):
    """Wraps tokenizer encodings and labels as a PyTorch dataset."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NDataset(train_encodings, train_labels)
val_dataset = NDataset(val_encodings, val_labels)
test_dataset = NDataset(test_encodings, test_labels)
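To make the "switch to the CPU first" step concrete, here is a self-contained sketch (the tiny model, loss, and fake loader are stand-ins, not the question's ResNet50): on the CPU the bad target surfaces as a plain IndexError instead of the device-side assert.

import torch
import torch.nn as nn

# Stand-ins for the question's model/criterion/loader; substitute your own.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
criterion = nn.CrossEntropyLoss()
bad_targets = torch.tensor([0, 3, 10, 2])            # 10 is out of range for 10 classes
loader_train = [(torch.randn(4, 3, 32, 32), bad_targets)]

device = torch.device("cpu")                          # debug on the CPU first
model = model.to(device)

inputs, targets = next(iter(loader_train))
outputs = model(inputs.to(device))
# On the CPU this line raises a readable "IndexError: Target 10 is out of bounds";
# on the GPU the same mistake surfaces as the opaque device-side assert.
loss = criterion(outputs, targets.to(device))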
Run
CUDA_LAUNCH_BLOCKING=1 python your_script.py
to get a more accurate stack trace. – McLawrence

I get: /opt/conda/.../THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed. This comes around 20 times, then the traceback follows: RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:116. How do I resolve this? – saichand

The failing assertion is t >= 0 && t < n_classes. Print your labels and make sure that they are positive and smaller than the number of outputs of your last layer. – McLawrence
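Putting both suggestions together, a hedged sketch (num_classes and train_labels are stand-ins for whatever your training script defines):

import os
# Must be set before CUDA is initialised so kernel launches become synchronous
# and the failing operation shows up in the Python traceback.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

num_classes = 10                            # stand-in: size of your last layer
train_labels = torch.tensor([0, 1, 9, 3])   # stand-in: your real label tensor

# The kernel asserts `t >= 0 && t < n_classes`, so check exactly that on the CPU.
assert int(train_labels.min()) >= 0, "labels must be non-negative"
assert int(train_labels.max()) < num_classes, "labels must be < number of model outputs"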