【Question Title】: Keras out of memory with small batch size
【Posted】: 2019-01-29 22:03:59
【Question Description】:

I built an autoencoder using only the tensorflow library, with the following network shape:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 168, 120, 3)       0
_________________________________________________________________
flatten_1 (Flatten)          (None, 60480)             0
_________________________________________________________________
dense_1 (Dense)              (None, 1024)              61932544
_________________________________________________________________
dense_2 (Dense)              (None, 256)               262400
_________________________________________________________________
dense_3 (Dense)              (None, 1024)              263168
_________________________________________________________________
dense_4 (Dense)              (None, 60480)             61992000
_________________________________________________________________
reshape_1 (Reshape)          (None, 168, 120, 3)       0
=================================================================
Total params: 124,450,112
Trainable params: 124,450,112
Non-trainable params: 0
_________________________________________________________________

In the tensorflow-only project I was able to train on my GPU with a batch size of 128 without any problems. I wanted to recreate the autoencoder using only keras, but I get an out-of-memory exception even with a batch size of one. Researching the issue, the most common advice is to reduce the batch size, but I cannot reduce it any further. My machine has 2 GTX 970 cards running in SLI (CUDA does not care about SLI), for a total of 8GB of memory. Why can't I train this network with keras, when I was able to train the same network with tensorflow at 64 times the batch size?
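For context, here is a back-of-the-envelope estimate of the training state that Adam keeps for the 124,450,112 parameters in the summary above (this is only a sketch; real usage also includes activations, cuDNN workspaces, and allocator overhead):

```python
# Rough GPU-memory estimate for training this autoencoder with Adam.
# Adam keeps, per trainable parameter: the weight itself, its gradient,
# and two moment accumulators (m and v) -- four float32 copies in total.

PARAMS = 124_450_112          # "Total params" from the model summary
BYTES_PER_FLOAT = 4           # float32
COPIES = 4                    # weights + gradients + Adam m + Adam v

training_bytes = PARAMS * BYTES_PER_FLOAT * COPIES
print(f"~{training_bytes / 2**30:.2f} GiB of parameter state alone")
# prints "~1.85 GiB of parameter state alone"
```

Note also that each GTX 970 has 4 GB, and CUDA does not pool memory across SLI, so a single device sees ~4 GB, not 8 GB; ~1.85 GiB of parameter state leaves limited headroom once activations are added.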

Here is the relevant code:

Constants:

# Constants

WIDTH = 120
HEIGHT = 168
CHANNELS = 3
NUM_INPUTS = WIDTH*HEIGHT*CHANNELS
BATCH_SIZE=1
NUM_SAMPLES=5000
VALIDATION_SIZE=1
VALIDATION_SAMPLES=100
EPOCHS=1000

HIDDEN_WIDTH = 1024
ENCODING_WIDTH = 256

INPUT_PATH = './input/'
VALIDATION_PATH = './validation/'
MODEL_PATH = './model/'

MODEL_FILE = 'my_model.h5'
EPOCH_FILE = 'initial_epoch.txt'  

Initialization and saving:

# this is our input placeholder
input_img = Input(shape=(constants.HEIGHT,constants.WIDTH,constants.CHANNELS))
# flatten image into one dimension
flatten = Flatten()(input_img)
# hidden layer 1
hidden = Dense(constants.HIDDEN_WIDTH, activation='relu')(flatten)
# "encoded" is the encoded representation of the input
encoded = Dense(constants.ENCODING_WIDTH, activation='relu')(hidden)
# hidden layer 3
hidden = Dense(constants.HIDDEN_WIDTH, activation='relu')(encoded)
# "decoded" is the lossy reconstruction of the input
decoded = Dense(constants.NUM_INPUTS, activation='relu')(hidden)
# reshape to image dimensions
reshape = Reshape((constants.HEIGHT,constants.WIDTH,constants.CHANNELS))(decoded)

# this model maps an input to its reconstruction
autoencoder = Model(input_img, reshape)

autoencoder.summary()

autoencoder.compile(optimizer='adam', loss='mean_squared_error')

train_datagen = ImageDataGenerator(data_format='channels_last',
                                   rescale=1./255)

test_datagen = ImageDataGenerator(data_format='channels_last',
                                  rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        constants.INPUT_PATH, 
        target_size=(constants.HEIGHT,constants.WIDTH),
        color_mode='rgb',
        class_mode='input',
        batch_size=constants.BATCH_SIZE)

validation_generator = test_datagen.flow_from_directory(
        constants.VALIDATION_PATH, 
        target_size=(constants.HEIGHT,constants.WIDTH),
        color_mode='rgb',
        class_mode='input',
        batch_size=constants.VALIDATION_SIZE)


autoencoder.fit_generator(train_generator,
        steps_per_epoch=constants.NUM_SAMPLES*1.0/constants.BATCH_SIZE,
        epochs=1,
        verbose=2,
        validation_data=validation_generator,
        validation_steps=constants.VALIDATION_SAMPLES*1.0/constants.VALIDATION_SIZE)


# Creates a HDF5 file 'my_model.h5'
autoencoder.save(constants.MODEL_PATH+constants.MODEL_FILE)
with open(constants.MODEL_PATH+constants.EPOCH_FILE, 'w') as f:
    f.write(str(1))

print("Done, model created in: " + constants.MODEL_PATH)

Partial error log:

2019-01-29 16:40:10.522222: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***********************************************************************************************_____
2019-01-29 16:40:10.525191: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[60480,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "init.py", line 53, in <module>
    validation_steps=constants.VALIDATION_SAMPLES*1.0/constants.VALIDATION_SIZE)
  File "C:\Users\dekke\Anaconda3\envs\tensorflow\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\dekke\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Users\dekke\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "C:\Users\dekke\Anaconda3\envs\tensorflow\lib\site-packages\keras\engine\training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\Users\dekke\Anaconda3\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "C:\Users\dekke\Anaconda3\envs\tensorflow\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "C:\Users\dekke\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "C:\Users\dekke\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1024,60480] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node training/Adam/gradients/dense_4/MatMul_grad/MatMul_1}} = MatMul[T=DT_FLOAT, _class=["loc:@training/Adam/gradients/dense_4/MatMul_grad/MatMul"], transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dense_3/Relu, training/Adam/gradients/dense_4/Relu_grad/ReluGrad)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

【Question Comments】:

  • Just to be sure, did you run your keras model after clearing the tensorflow model?
  • I didn't run them in the same session. Do I need to force-clear the tensorflow model before execution finishes?
  • I don't mean the same session. Unless you take steps to limit GPU usage, or close the tensorflow session, it will take all of the GPU. Is the session you ran tf in still alive?
  • No, it is still alive.
  • @Glen654, you have 100 million parameters. VGG16 has a similar number of parameters to yours. What you are attempting here is training VGG16 from scratch on 2 cards with 8 GB of total memory, which may be why you are running into problems. Did you build this model the same way as your successful tensorflow implementation, including the same number of parameters?

Tags: python tensorflow keras


【Solution 1】:

I get this from time to time with the anaconda tf_gpu package and keras. I think you are either running out of available memory through the python script, or tensorflow-gpu is trying to allocate a large chunk of memory at once:

I usually put this right under my imports and it works:

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config = config)

# Check available GPU devices.
print("The following GPU devices are available: %s" % tf.test.gpu_device_name())

Hope this helps.

【Discussion】:

  • I'm not using a tensorflow session, I just call .fit_generator. Is there a way to use allow_growth with just keras? I remember using it in my tensorflow model.
  • Hey Glen, yes, I believe so. If you are using a jupyter notebook, put it under the imports in the initial cell.
  • I'm running it locally right now, but I'm writing it to run on a jupyter notebook with the full dataset once I can get it working.
  • Give it a try. Your network does look large though, so that may be it. Let me know how it works for you.
  • I found some code that accesses the underlying tensorflow session, but I still get the same result. Any other ideas? Thanks for your help.
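Regarding the "allow_growth with just keras" question above: in the Keras 2.x / TF 1.x stack shown in the traceback, a session built with `ConfigProto` only takes effect for plain Keras calls like `fit_generator` if it is registered as the Keras backend session. A minimal sketch of that setup (the `allow_growth` option is from the answer above; `keras.backend.set_session` is the standard Keras 2.x call):

```python
import tensorflow as tf
from keras import backend as K

# Grow GPU memory on demand instead of grabbing (almost) all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Registering the session with Keras is the step that makes
# plain model.fit_generator() calls actually use this config.
K.set_session(tf.Session(config=config))
```

This must run before the model is built. Separately, if another TensorFlow process is still alive it may be holding nearly all GPU memory by default; ending that process (or calling `K.clear_session()` between runs inside one process) releases it.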