【Posted】: 2019-01-12 11:09:39
【Question】:
I have access to a Tesla K20c and I am running ResNet50 on the CIFAR10 dataset. During training I get this error:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
File "main.py", line 109, in <module>
train(loader_train, model, criterion, optimizer)
File "main.py", line 54, in train
optimizer.step()
File "/usr/local/anaconda35/lib/python3.6/site-packages/torch/optim/sgd.py", line 93, in step
d_p.add_(weight_decay, p.data)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:265
How can I resolve this error?
【Comments】:
-
Try running the script with
CUDA_LAUNCH_BLOCKING=1 python your_script.py
to get a more accurate stack trace.
-
After running with CUDA_LAUNC...=1, I get this error:
/opt/conda/.../THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion t >= 0 && t < n_classes failed.
It appears about 20 times. The traceback then ends with:
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524580978845/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:116
How do I fix this?
-
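This is a problem with your target labels: the assertion t >= 0 && t < n_classes failed. Print your labels and make sure they are non-negative and less than the number of outputs of your last layer.
-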
n_classes should be the same as the number of outputs of the last layer, right?
-
That's right. Your targets probably contain a value that is too high.
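
A minimal sketch of the label check suggested in the comments, written in plain Python so it runs without a GPU. The helper name check_labels and the sample labels are hypothetical and not from the original post. The sample assumes 10 classes (as in CIFAR10) and labels that were stored as 1..10 instead of 0..9, which is one common way to hit this assertion.

```python
def check_labels(labels, n_classes):
    """Return every label t that violates 0 <= t < n_classes."""
    return [int(t) for t in labels if not 0 <= int(t) < n_classes]


# Hypothetical batch of targets: the model has 10 outputs,
# so the only valid labels are 0..9.
labels = [1, 5, 10, 3]
bad = check_labels(labels, n_classes=10)
print(bad)  # [10]
```

With a PyTorch batch, one could pass the target tensor converted to a list on the CPU, for example check_labels(target.cpu().tolist(), n_classes), before calling the loss. If the list is non-empty, either the labels need remapping to start at 0 or the last layer needs more outputs.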