Out of memory using Keras on GPU: answers

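【Question title】: Out of memory using Keras on GPU
【Posted】: 2017-04-25 05:33:57
【Question description】:

I want to apply deep learning to my classification problem, where the grayscale images in my dataset are 200x200. For now, I am testing DL on a very small subset (152 images) of my large dataset (more than 15,000 images). I am using the Keras library (version '1.1.2') with the Theano backend (version '0.9.0.dev4') in Python (Python 2.7.12 :: Anaconda 4.2.0 (64-bit)). My code runs on the CPU, but it is very slow, so I switched to the GPU. However, I get the following error: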

Using Theano backend.
Using gpu device 0: GeForce GTS 450 (CNMeM is enabled with initial size: 70.0% of memory, cuDNN not available)

Train on 121 samples, validate on 31 samples
Epoch 1/200
Traceback (most recent call last):

  File "<ipython-input-6-247bada3ec1a>", line 2, in <module>
    verbose=1, validation_data=(X_test, Y_test))

  File "/home/user1/anaconda2/envs/keras_env/lib/python2.7/site-packages/keras/models.py", line 652, in fit
    sample_weight=sample_weight)

  File "/home/user1/anaconda2/envs/keras_env/lib/python2.7/site-packages/keras/engine/training.py", line 1111, in fit
    initial_epoch=initial_epoch)

  File "/home/user1/anaconda2/envs/keras_env/lib/python2.7/site-packages/keras/engine/training.py", line 826, in _fit_loop
    outs = f(ins_batch)

  File "/home/user1/anaconda2/envs/keras_env/lib/python2.7/site-packages/keras/backend/theano_backend.py", line 811, in __call__
    return self.function(*inputs)

  File "/home/user1/anaconda2/envs/keras_env/lib/python2.7/site-packages/theano/compile/function_module.py", line 886, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))

  File "/home/user1/anaconda2/envs/keras_env/lib/python2.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)

  File "/home/user1/anaconda2/envs/keras_env/lib/python2.7/site-packages/theano/compile/function_module.py", line 873, in __call__
    self.fn() if output_subset is None else\

MemoryError: Error allocating 160579584 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).
Apply node that caused the error: GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuElemwise{Add}[(0, 0)].0)
Toposort index: 60
Inputs types: [CudaNdarrayType(float32, (True, True, True, True)), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(1, 1, 1, 1), (32, 32, 198, 198)]
Inputs strides: [(0, 0, 0, 0), (1254528, 39204, 198, 1)]
Inputs values: [CudaNdarray([[[[ 0.5]]]]), 'not shown']
Outputs clients: [[GpuContiguous(GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

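I tried the suggested solutions (optimizer=fast_compile and optimizer=None), but without success. I know the problem is related to the image size, because it works when I resize the images to 50x50.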
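
The size of the failed allocation can be reproduced from the shape Theano reports under "Inputs shapes". The following is a minimal sketch, assuming 4 bytes per float32 element. The helper name `tensor_bytes` is made up for illustration, and the 48x48 feature map for 50x50 inputs assumes a 3x3 convolution without padding (inferred from 200 - 3 + 1 = 198; the question does not show the model).

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor (float32 uses 4 bytes per element)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shape taken from "Inputs shapes" in the traceback: (32, 32, 198, 198).
failed = tensor_bytes((32, 32, 198, 198))

# Same tensor for 50x50 inputs, ASSUMING a 3x3 convolution without padding
# (50 - 3 + 1 = 48). This layer layout is a guess, not taken from the question.
small = tensor_bytes((32, 32, 48, 48))
```

Here `failed` equals the 160579584 bytes named in the MemoryError, while `small` is roughly 17 times smaller, which is consistent with the 50x50 run fitting in memory.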

Do you know how I can solve this so that it works with 200x200 images?

I am using Linux Mageia 5, and my GPU information is:

02:00.0 VGA compatible controller: NVIDIA Corporation GF106 [GeForce GTS 450] (rev a1)
[    64.299] (--) NVIDIA(0): Memory: 1048576 kBytes
[    64.313] (II) NVIDIA: Using 12288.00 MB of virtual memory for indirect memory
[    64.439] (==) NVIDIA(0): Disabling shared memory pixmaps
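
To put these numbers side by side, here is a rough sketch that compares the card's memory with the single allocation that failed. It treats the initial CNMeM pool (70% of memory, from the Theano startup message) as the whole budget, which is a simplifying assumption: the display and the CUDA context also take memory, and the pool may behave differently in practice. The function name `buffers_that_fit` is hypothetical.

```python
def buffers_that_fit(total_bytes, pool_fraction, buffer_bytes):
    """How many buffers of buffer_bytes fit in the pre-allocated pool."""
    pool_bytes = int(total_bytes * pool_fraction)
    return pool_bytes // buffer_bytes


total = 1048576 * 1024   # "Memory: 1048576 kBytes" from the X server log
failed = 160579584       # bytes Theano failed to allocate (from the MemoryError)

# ASSUMPTION: the 70% initial CNMeM pool is the entire budget.
fit = buffers_that_fit(total, 0.70, failed)
```

Under these assumptions only a handful of tensors of that size fit in the pool, and training generally needs more than one of them (for example the layer output and its gradient), so the error is plausible on a 1 GB card.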

I am not sure whether using cuDNN is the right way to solve my problem, but I tried to enable it by adding optimizer_including=cudnn to .theanorc; however, I got the following error:

AssertionError: cuDNN optimization was enabled, but Theano was not able to use it. We got this error: 
Device not supported

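I think this may be because my GPU's compute capability is 2.1, which is below the cuDNN requirement (3.0 or higher).

I would appreciate any help in solving this problem and running deep learning on my large dataset.

【Discussion】:

It is hard to say without seeing the code, but can you load the images onto the GPU in small batches?
cuDNN will not work on your GTS450 Fermi (GF106) GPU. cuDNN requires a Kepler GPU. Your GPU appears to be running out of memory, and the GTS450 is a fairly old, low-end GPU without much memory (1GB).
@Atirag I tried smaller batch sizes, but I get a similar error.
@RobertCrovella Is using cuDNN the only solution to this kind of problem?
@SaraG. In that case you may need to use something that allows more customization of how the data is handled. Maybe try tensorflow or theano? Or just run the code on a machine with a better GPU. I am not very familiar with keras, so I do not know whether you can customize it to load the data in a smarter way.

Tags: python deep-learning theano keras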

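【Solution 1】:

It says your GPU is out of memory. So change the batch size, and do not load all the data onto the GPU at once through shared variables; iterate over it in pieces instead. Otherwise, find another GPU with more memory.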
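
One way to read "iterate over it in pieces" is a generator that hands out one mini-batch at a time, so the full dataset stays in host RAM. The sketch below is an illustration under that reading, not the answerer's code; `batch_generator` and the demo data are hypothetical names.

```python
def batch_generator(X, Y, batch_size):
    """Yield (X_batch, Y_batch) slices forever, one mini-batch at a time.

    Works with anything that supports len() and slicing, such as lists
    or NumPy arrays. It loops indefinitely because the training loop,
    not the generator, decides when an epoch ends.
    """
    n = len(X)
    while True:
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            yield X[start:stop], Y[start:stop]


# Tiny stand-in data; real inputs would be image arrays kept in host RAM.
X_demo = list(range(10))
Y_demo = [x % 2 for x in X_demo]

gen = batch_generator(X_demo, Y_demo, batch_size=4)
first_pass = [next(gen) for _ in range(3)]  # batches of 4, 4 and 2 samples
```

In Keras 1.x a generator of this kind could be passed to `model.fit_generator` together with `samples_per_epoch` and `nb_epoch`. Note that this only limits how much input data is on the GPU at once; it does not by itself shrink the layer activations, which grow with both the batch size and the image size.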

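【Discussion】:

I changed the batch size to a smaller value, but that did not solve the problem. Could you elaborate on your second suggestion (using shared variables and iterating over them)? Do you mean that I need to split my input images into subsets and feed them to the CNN through shared variables in different iterations of each epoch?
When you put the images into the shared variable, load only a batch-sized number of images rather than the full dataset. Then, in each iteration, load new images and update the shared variable. What I am describing is reducing the amount of data loaded onto the GPU.
What an unhelpful answer. Clearly the problem is that too much data is being loaded onto the GPU. You give no insight into how to actually adjust the amount of data loaded; you just restate "reduce the amount of data".
@Feras Great answer, it helped me.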