42
votes

I have access to a Tesla K20c and am running ResNet50 on the CIFAR10 dataset... Then I get this error:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 109, in <module>
    train(loader_train, model, criterion, optimizer)
  File "main.py", line 54, in train
    optimizer.step()
  File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
    d_p.add_(weight_decay, p.data)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265
How do I resolve this error?

5
Try running your script with CUDA_LAUNCH_BLOCKING=1 python your_script.py to get a more accurate stack trace. - McLawrence
After running with CUDA_LAUNC...=1, I get this error: /opt/conda/.../THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed. This is printed around 20 times, and then the traceback follows: RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:116 How do I resolve it? - saichand
This is an error with your target labels: t >= 0 && t < n_classes. Print your labels and make sure that they are non-negative and smaller than the number of outputs of your last layer. - McLawrence
n_classes should be the same as the number of outputs of the last layer. Is that right? - saichand
That's right. Your targets likely take on values that are too high. - McLawrence

5 Answers

39
votes

In general, when you encounter CUDA runtime errors, it is advisable to run your program again with the environment variable CUDA_LAUNCH_BLOCKING=1 set, to obtain an accurate stack trace.
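If setting the variable on the command line is inconvenient (in a notebook, for instance), it can also be set from inside Python. This is only a minimal sketch, and it assumes the assignment runs before any CUDA work starts; placing it before the torch import is the safest choice.

```python
import os

# Must be set before CUDA is initialized; doing it before `import torch`
# is the safest place. Kernel launches then run synchronously, so the
# Python traceback points at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set
```

Remember to remove it again once the bug is found, since synchronous launches slow training down.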

In your specific case, the targets in your data were too high (or too low) for the specified number of classes: every target must satisfy 0 <= t < n_classes.
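A quick way to confirm this before the loss is ever computed is to scan the labels for values outside the valid range. The helper below is a hypothetical sketch (the name find_invalid_labels is not part of PyTorch) and works on any iterable of integers; for a tensor of targets you could pass targets.tolist().

```python
def find_invalid_labels(labels, n_classes):
    """Return (position, label) pairs for labels outside [0, n_classes)."""
    return [(i, t) for i, t in enumerate(labels) if not 0 <= t < n_classes]


# Made-up labels for a 10-class problem such as CIFAR10.
labels = [0, 3, 9, 10, -1]
bad = find_invalid_labels(labels, n_classes=10)
print(bad)  # [(3, 10), (4, -1)]
```

An empty result means the labels fit the number of outputs of the last layer.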

52
votes

I have encountered this problem several times, and each time it turned out to be an index issue. For example, if your ground-truth labels start at 1 (target = [1,2,3,4,5]), subtract 1 from every label so that they become [0,1,2,3,4]. This has solved the problem for me every time.
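The shift described above can be sketched as follows. The function name to_zero_based and its first_label parameter are made up for illustration, and the sketch assumes the labels are consecutive integers starting at first_label.

```python
def to_zero_based(labels, first_label=1):
    """Shift integer labels so that first_label becomes 0."""
    shifted = [t - first_label for t in labels]
    if min(shifted) < 0:
        # A label below first_label means the assumed offset is wrong.
        raise ValueError("found a label smaller than first_label")
    return shifted


target = [1, 2, 3, 4, 5]
print(to_zero_based(target))  # [0, 1, 2, 3, 4]
```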

4
votes

I encountered this error when running BertModel.from_pretrained('bert-base-uncased'). I found the cause by moving to the CPU, where the error message changed to 'IndexError: index out of range in self', which led me to this post. The solution was to truncate sentences to a length of 512 tokens.
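To illustrate the truncation step, here is a library-free sketch that operates on a plain list of token ids. The helper truncate_ids and the example ids are hypothetical, and the sketch assumes the last id is an end-of-sequence marker worth keeping. In practice the Hugging Face tokenizer call accepts truncation=True and max_length=512, which is the more usual route.

```python
MAX_LEN = 512  # the sequence length limit mentioned in this answer


def truncate_ids(token_ids, max_len=MAX_LEN):
    """Cut a list of token ids to max_len, keeping the final token."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    # Keep the first max_len - 1 ids and re-append the last id, which is
    # assumed to be the end-of-sequence marker (e.g. [SEP]).
    return list(token_ids[: max_len - 1]) + [token_ids[-1]]


ids = [101] + [2000] * 600 + [102]  # made-up ids for illustration
short = truncate_ids(ids)
print(len(short), short[0], short[-1])  # 512 101 102
```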

3
votes

One way to raise the "CUDA error: device-side assert triggered" RuntimeError is to index into a GPU torch.Tensor with a list that contains out-of-bounds indices.

So this snippet raises an IndexError with the message "IndexError: index 3 is out of bounds for dimension 0 with size 3", not the CUDA error:

import torch

data = torch.randn((3,10), device=torch.device("cuda"))
data[3,:]

whereas this one raises the CUDA "device-side assert triggered" RuntimeError:

data = torch.randn((3,10), device=torch.device("cuda"))
indices = [1,3]
data[indices,:]

This suggests that with class labels, as in the answer by @Rainy, it is the final class label (i.e. label == num_classes) that causes the error when the labels start from 1 rather than 0.

Also, when the device is "cpu", the error thrown is an IndexError like the one raised by the first snippet.

1
votes

This error becomes more informative if you switch to the CPU first. On the CPU you see the exact error, which is most probably an indexing problem. In my case it was IndexError: Target 2 is out of bounds, and yours may be similar. The questions to ask are how many classes you are using and what the shape of your output is. You can find the range of your labels like this:

max(train_labels)
min(train_labels)

In my case this gave 2 and 0. The problem was caused by the missing label 1, so a quick hack is to replace all 2s with 1s, which can be done with this code:

train_=train.copy()
train_['label'] =train_['label'].replace(2,1)

Then run the same code again and check the results; it should work.

import torch

class NDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = NDataset(train_encodings, train_labels)
val_dataset = NDataset(val_encodings, val_labels)
test_dataset = NDataset(test_encodings, test_labels)
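The replace(2,1) hack above handles one specific gap. A more general sketch, assuming nothing beyond a list of integer labels, maps whatever distinct labels are present onto 0..K-1. The name remap_to_contiguous is made up for illustration.

```python
def remap_to_contiguous(labels):
    """Map the distinct labels onto 0..K-1, preserving their sorted order."""
    mapping = {old: new for new, old in enumerate(sorted(set(labels)))}
    return [mapping[t] for t in labels], mapping


train_labels = [0, 2, 2, 0, 2]  # made-up labels with a gap at 1
new_labels, mapping = remap_to_contiguous(train_labels)
print(new_labels)  # [0, 1, 1, 0, 1]
print(mapping)     # {0: 0, 2: 1}
```

Keep the returned mapping so that the same conversion can be applied to the validation and test labels, and so that predictions can be translated back to the original label values.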