Why does "RuntimeError: CUDA out of memory" occur in testing?

【Question title】: Why "RuntimeError CUDA out of memory" in testing?
【Posted】: 2019-11-25 19:13:13
【Question】:

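The same model runs fine in training with batch size = 5. I had to reduce the batch size from 80 to 5 during training because of this same error. I am using a GPU with 11 GB of memory, not the Titan X (12 GB of memory) that the authors used in their experiments.

Now, however, testing does not run even with batch-size=1.

The problem occurs in the testing phase of the I-frame model; the other two models have already produced their test results successfully.

Here is my test command: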

time python test.py --arch resnet152 --data-name ucf101 --representation iframe --data-root data/ucf101/mpeg4_videos --test-list data/datalists/ucf101_split1_test.txt --weights ucf101_iframe_model_iframe_model_best.pth.tar --save-scores iframe_score_file

I used nvidia-smi to make sure nothing else was running on the GPU.

Here is the actual error message:

RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 10.92 GiB total capacity; 10.12 GiB already allocated; 245.50 MiB free; 21.69 MiB cached)

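What could be the problem, and how can I solve it?

Edit: After removing the following two lines from test.py, it runs without any memory problem, but it takes a very long time to process: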

net = torch.nn.DataParallel(net.cuda(devices[0]), device_ids=devices)
net.eval()

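Yes, I know the lines above are what enable GPU-based parallel processing.

But is there a solution to my problem?

【Comments】:

You can try explicitly deleting variables that live on the GPU with del var_name. You can also call the torch.cuda.memory_allocated function at different points in your code to see how much memory is allocated at that moment. That should help you work out which parts are eating up your GPU memory.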
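As a rough illustration of the comment above, the following sketch reads `torch.cuda.memory_allocated()` before and after creating and deleting a tensor. The helper name `allocated_mib` and the tensor `big` are made up for this example and do not come from test.py; on a machine without CUDA the sketch simply reports zeros.

```python
import torch

def allocated_mib():
    """MiB currently occupied by tensors on the default CUDA device (0.0 without CUDA)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.memory_allocated() / 2**20

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

before = allocated_mib()
# Hypothetical stand-in for a large intermediate result kept on the GPU.
big = torch.zeros(64, 3, 224, 224, device=device)
during = allocated_mib()

del big  # drop the last reference so the allocator can reuse the block
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # optional: return cached blocks to the driver
after = allocated_mib()

print(f"before={before:.1f} MiB, during={during:.1f} MiB, after={after:.1f} MiB")
```

Placing the same measurement around the model load and around the forward pass in the test loop should show which step is responsible for most of the allocation.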

Tags: python python-2.7 numpy deep-learning pytorch

【Solution 1】:

I suggest you check your test code first.

You can try wrapping the forward pass in:

with torch.no_grad():

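This reduces memory consumption for computations that would otherwise involve tensors with requires_grad=True, because autograd no longer keeps the intermediate results needed for a backward pass.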
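A minimal sketch of what that looks like in an evaluation step is shown here. The small `net` is a hypothetical stand-in, not the resnet152 model from the question, and the input shape is arbitrary; only the `net.eval()` plus `torch.no_grad()` pattern is the point.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real network; any nn.Module behaves the same way here.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 16 * 16, 4),
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = net.to(device)
net.eval()  # inference behaviour for dropout/batch norm; this alone does not disable autograd

batch = torch.randn(2, 3, 16, 16, device=device)

with torch.no_grad():  # no graph is recorded, so activations are not kept for backward
    scores = net(batch)

print(scores.shape, scores.requires_grad)
```

Note that `net.eval()` and `torch.no_grad()` do different jobs: the first changes layer behaviour, the second is what saves memory.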

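Original answer (worth trying if you have access to a larger GPU):

The model itself and its parameters may be taking up a lot of memory.

You could try "batch-size=1" on the Titan X GPU you used before and check whether GPU memory usage exceeds 11 GB. If it does, the GPU you are using now (11 GB of memory) may not be suitable for this job.

【Comments】:

The OP said it was the original authors who used a Titan X. He only has a GPU with 11 GB of memory (probably something like a GTX 1080 Ti).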

【Solution 2】:

I have run this model/test on a GPU with 8 GB of memory by adding the following flag to the test command given in the question:

--test-crops 1
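One possible reason this helps, stated as an assumption rather than something confirmed in this thread: if the test script stacks every sampled segment and every crop of a video into the batch dimension, then the number of images in a single forward pass grows with the crop count even when batch-size=1. The function name and the numbers below are illustrative only.

```python
def images_per_forward(batch_size, num_segments, test_crops):
    """Images in one forward pass if each video contributes segments x crops images."""
    return batch_size * num_segments * test_crops

# Hypothetical values, chosen only to illustrate the scaling.
for crops in (10, 1):
    n = images_per_forward(batch_size=1, num_segments=25, test_crops=crops)
    print(f"test_crops={crops}: {n} images per forward pass")
```

Under that assumption, activation memory scales roughly with this count, so cutting the crops to 1 shrinks it by the same factor.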

【Comments】: