【Posted】: 2018-04-07 21:01:02
【Problem Description】:
So I'm trying to use multiple GPUs in Keras. When I run the example program from training_utils.py (given as comments inside the training_utils.py code), I end up with a ResourceExhaustedError. nvidia-smi tells me that barely any of the four GPUs are doing work. Running other programs on a single GPU works fine.
- TensorFlow 1.3.0
- Keras 2.0.8
- Ubuntu 16.04
- CUDA/cuDNN 8.0/6.0
Question: Does anyone know what is going on here?
Console output:
(...)
2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************X
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    parallel_model.fit(x, y, epochs=20, batch_size=256)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
    validation_steps=validation_steps)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in __call__
    **self.session_kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
	 [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
	 [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at:
  File "test.py", line 19, in <module>
    parallel_model = multi_gpu_model(model, gpus=2)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
    outputs = model(inputs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, mask)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call
    dilation_rate=self.dilation_rate)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d
    data_format=tf_data_format)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d
    name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d
    data_format=data_format, name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256]
	 [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
	 [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
EDIT (example code added):
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np

num_samples = 1000
height = 224
width = 224
num_classes = 100

# Instantiate the base model on the CPU so its weights live in host memory.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicate the model on 4 GPUs; each batch is split across the replicas.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

parallel_model.fit(x, y, epochs=20, batch_size=128)
【Discussion】:
-
Your model is too large for your device, or you are using batches with too many elements.
-
You are right, it works when I use a smaller batch size. I'm using four GTX 1080 Tis, so I didn't think running the example program would cause a size problem; by default, the example program targets 8 GPUs. But how much memory does this program actually take, and how do you calculate it? Is it really more than 11 GB x 4?
-
I don't know how to calculate it... I think the best approach is the old trial-and-error method.
-
OK, thanks. Your answer solved my problem.
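On the memory question in the comments above, a rough back-of-envelope estimate is possible even without a full calculation. The single tensor that failed to allocate has shape [128, 55, 55, 256] in float32, which alone needs roughly 378 MB; a deep network like Xception holds many such activations at once, plus gradients and RMSprop state, so an 11 GB card can fill up quickly. A minimal sketch (plain Python; the 4 bytes/element assumption is float32):

```python
# Back-of-envelope memory estimate for one dense float32 tensor.
# Shape taken from the OOM message: [128, 55, 55, 256].
def tensor_mb(shape, bytes_per_elem=4):
    """Megabytes needed to store one dense tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem / (1024.0 ** 2)

oom_shape = (128, 55, 55, 256)
print("%.1f MB" % tensor_mb(oom_shape))  # 378.1 MB for this single activation

# multi_gpu_model splits each batch across replicas: with batch_size=256 and
# gpus=2, each replica processes 128 samples, matching the leading 128 above.
print(256 // 2)  # per-replica batch size
```

This is only a lower bound for one tensor, not the program's total footprint, but it explains why shrinking batch_size (which scales the leading dimension of every activation) resolves the OOM.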
Tags: tensorflow keras multi-gpu