Out of memory error with keras on gpu: answers

【Question title】: Out of memory error with keras on gpu
【Posted】: 2018-12-15 03:03:42
【Question description】:

I want to check whether keras with the tensorflow backend runs properly on the gpu. I ran this script and got the following output:

Using TensorFlow backend.
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170500096/170498071 [==============================] - 31s 0us/step
x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Using real-time data augmentation.
Epoch 1/100
2018-07-06 15:20:00.130371: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-06 15:20:00.209953: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-06 15:20:00.210289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 113.38MiB
2018-07-06 15:20:00.210305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-06 15:20:00.408052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-06 15:20:00.408100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
2018-07-06 15:20:00.408107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
2018-07-06 15:20:00.408248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 57 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-06 15:20:00.408744: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 57.38M (60162048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-07-06 15:20:00.683832: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-07-06 15:20:00.685728: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-07-06 15:20:00.688354: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-07-06 15:20:00.689038: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-07-06 15:20:00.689718: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-07-06 15:20:00.690388: E tensorflow/stream_executor/cuda/cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2018-07-06 15:20:00.698165: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-07-06 15:20:00.698238: F tensorflow/core/kernels/conv_ops.cc:713] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
Aborted (core dumped)

In the output I can read totalMemory: 3.95GiB freeMemory: 113.38MiB and failed to allocate 57.38M (60162048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY.

Why is the free memory so low? What should I do to get the script to run properly and finally enjoy gpu training?

OS: Fedora 28

Python 3.6.6

Keras 2.2.0

TensorFlow 1.8.0

GPU GeForce GTX 1050

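【Comments】:

The first thing I would check is that no old process is still holding GPU memory. You can check this with the nvidia-smi command.
As @JeremyBare mentioned, this was a problem of another program consuming the memory.

Tags: tensorflow keras out-of-memory gpu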
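The nvidia-smi check suggested in the comments can also be scripted. The following is a minimal sketch, not part of the original thread: the helper names `parse_compute_apps` and `list_gpu_processes` are made up for illustration, and it assumes `nvidia-smi` is on the PATH. Note that this query lists compute processes only; graphics clients such as an X server show up in the plain `nvidia-smi` table instead.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse CSV rows of the form 'pid, process_name, used_memory'.

    Expects the output of nvidia-smi run with
    --query-compute-apps=pid,process_name,used_memory
    --format=csv,noheader,nounits (memory is then reported in MiB).
    """
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or unexpected lines
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        try:
            used_mib = int(used.strip())
        except ValueError:
            used_mib = None  # nvidia-smi may print a placeholder such as [N/A]
        rows.append({"pid": int(pid.strip()),
                     "name": name.strip(),
                     "used_mib": used_mib})
    return rows


def list_gpu_processes():
    """Run nvidia-smi and return the compute processes holding GPU memory."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        check=True, stdout=subprocess.PIPE, universal_newlines=True)
    return parse_compute_apps(result.stdout)


# Usage (on a machine with an NVIDIA driver installed):
#     for proc in list_gpu_processes():
#         print(proc)
```

If a stale Python process appears in the list with most of the memory, stopping it should raise the freeMemory value reported by TensorFlow on the next run.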

【Solution 1】:

This worked for me.

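import tensorflow as tf

# Cap TensorFlow's allocation on the first GPU at 3 GB (memory_limit is given in MB).
# Requires a TensorFlow release that provides tf.config.experimental.
LIMIT = 3 * 1024
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Expose gpus[0] as a single logical device with a hard memory limit.
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=LIMIT)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)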

https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
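The `tf.config.experimental` calls in the answer above come from later TensorFlow releases than the 1.8.0 used in the question. For TensorFlow 1.x with standalone Keras, a roughly equivalent limit can be set through the session configuration. This is a sketch under that assumption and not something the thread itself confirms; `memory_fraction` and `configure_keras_session` are hypothetical helper names, and the 4045 MiB total in the comment is only an approximation of the 3.95GiB reported in the log.

```python
def memory_fraction(limit_mib, total_mib):
    """Return the share of total GPU memory that limit_mib represents (at most 1.0)."""
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    return min(1.0, limit_mib / total_mib)


def configure_keras_session(fraction):
    """Install a TF 1.x session with capped GPU memory as the Keras session.

    Assumes TensorFlow 1.x and standalone Keras, as in the question.
    """
    # Imported here so memory_fraction stays usable without TensorFlow installed.
    import tensorflow as tf
    from keras import backend as K

    config = tf.ConfigProto()
    # Upper bound on the share of GPU memory this process may take.
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    # Allocate on demand instead of reserving the whole share up front.
    config.gpu_options.allow_growth = True
    K.set_session(tf.Session(config=config))


# Usage, before building or loading any model:
#     configure_keras_session(memory_fraction(3 * 1024, 4045))
```

Neither this nor the limit above frees memory that another process already holds, so with only 113.38MiB free the other process still has to be stopped first.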

【Discussion】: