TensorFlow exhausted resource: answers

[Question title]: Tensorflow exhausted resource
[Posted]: 2017-06-20 20:26:13
[Question description]:

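I wrote a TensorFlow program that reads 128x128 images. The programs run fine on my laptop, which I use to check that the code works. The first program is based on the MNIST tutorial, and the second uses the MNIST example for a convolutional neural network. When I try to run them on the GPU, I get the following error message: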

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16384,20000]
 [[Node: inputLayer_1/weights/Variable/Adam_1/Assign = Assign[T=DT_FLOAT, _class=["loc:@inputLayer_1/weights/Variable"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](inputLayer_1/weights/Variable/Adam_1, inputLayer_1/weights/Variable/Adam_1/Initializer/Const)]]

From what I have been reading online, I have to use batches in my tests. This is how the feeding works:

...........................................
batchSize  = 40
img_height = 128
img_width  = 128


# 1st function to read images from TF_Record
def getImage(filename):
    # convert filenames to a queue for an input pipeline.
    filenameQ = tf.train.string_input_producer([filename],num_epochs=None)

    # object to read records
    recordReader = tf.TFRecordReader()

    # read the full set of features for a single example
    key, fullExample = recordReader.read(filenameQ)

    # parse the full example into its component features.
    features = tf.parse_single_example(
        fullExample,
        features={
            'image/height': tf.FixedLenFeature([], tf.int64),
            'image/width': tf.FixedLenFeature([], tf.int64),
            'image/colorspace': tf.FixedLenFeature([], dtype=tf.string,default_value=''),
            'image/channels':  tf.FixedLenFeature([], tf.int64),
            'image/class/label': tf.FixedLenFeature([],tf.int64),
            'image/class/text': tf.FixedLenFeature([], dtype=tf.string,default_value=''),
            'image/format': tf.FixedLenFeature([], dtype=tf.string,default_value=''),
            'image/filename': tf.FixedLenFeature([], dtype=tf.string,default_value=''),
            'image/encoded': tf.FixedLenFeature([], dtype=tf.string, default_value='')
        })

    # now we are going to manipulate the label and image features
    label = features['image/class/label']
    image_buffer = features['image/encoded']
    # Decode the jpeg
    with tf.name_scope('decode_jpeg',[image_buffer], None):
        # decode
        image = tf.image.decode_jpeg(image_buffer, channels=3)

        # and convert to single precision data type
        image = tf.image.convert_image_dtype(image, dtype=tf.float32)
    # cast image into a single array, where each element corresponds to the greyscale
    # value of a single pixel.
    # the "1-.." part inverts the image, so that the background is black.
    image=tf.reshape(1-tf.image.rgb_to_grayscale(image),[img_height*img_width])
    # re-define label as a "one-hot" vector
    # it will be [0,1] or [1,0] here.
    # This approach can easily be extended to more classes.
    label=tf.stack(tf.one_hot(label-1, numberOFclasses))
    return label, image

train_img,train_label = getImage(TF_Records+"/train-00000-of-00001")
validation_img,validation_label=getImage(TF_Records+"/validation-00000-of-00001")
# associate the "label_batch" and "image_batch" objects with a randomly selected batch
# of labels and images respectively
train_imageBatch, train_labelBatch = tf.train.shuffle_batch([train_img, train_label], batch_size=batchSize,capacity=50,min_after_dequeue=10)

# and similarly for the validation data
validation_imageBatch, validation_labelBatch = tf.train.shuffle_batch([validation_img, validation_label],
                                                batch_size=batchSize,capacity=50,min_after_dequeue=10)

.................................................. .........

    sess.run(tf.global_variables_initializer())

# start the threads used for reading files
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess,coord=coord)

# feeding function
def feed_dict(train):
    if train:
        #img_batch, labels_batch= tf.train.shuffle_batch([train_label,train_img],batch_size=batchSize,capacity=500,min_after_dequeue=200)
        img_batch , labels_batch = sess.run([ train_labelBatch ,train_imageBatch])
        dropoutValue = 0.7
    else:
        #   img_batch,labels_batch = tf.train.shuffle_batch([validation_label,validation_img],batch_size=batchSize,capacity=500,min_after_dequeue=200)
        img_batch,labels_batch = sess.run([ validation_labelBatch,validation_imageBatch])
        dropoutValue = 1
    return {x:img_batch,y_:labels_batch,keep_prob:dropoutValue}

for i  in range(max_numberofiteretion):
    if i%10 == 0:#Run a Test
        summary, acc = sess.run([merged,accuracy],feed_dict=feed_dict(False))
        test_writer.add_summary(summary,i)# Save to TensorBoard
    else: # Training
      if i % 100 == 99:  # Record execution stats
        run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
        run_metadata = tf.RunMetadata()
        summary, _ = sess.run([merged, train_step],
                              feed_dict=feed_dict(True),
                              options=run_options,
                              run_metadata=run_metadata)
        train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
        train_writer.add_summary(summary, i)
        print('Adding run metadata for', i)
      else:  # Record a summary
        summary, _ = sess.run([merged, train_step], feed_dict=feed_dict(True))
        train_writer.add_summary(summary, i)

# finalise
coord.request_stop()
coord.join(threads)
train_writer.close()
test_writer.close()

.................................................. ...

The validation folder contains 2100 files, so yes, I understand that this is a lot.

I found this suggestion:

config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
with tf.Session(config = config) as s:......

But this did not solve the problem! Any idea how to solve this?

[Question comments]:

Tags: python tensorflow

[Solution 1]:

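The problem seems to be that everything in the graph is done on the GPU. You should use the CPU for the preprocessing functions and the GPU for the rest of the graph. So make the input-processing functions, such as getImage() and the queues, run on the CPU instead of the GPU. Basically, while the GPU is processing tensors, the CPU should be filling the input pipeline queue, so that both the CPU and the GPU are used efficiently. This is explained in the TensorFlow performance guide: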

Placing preprocessing on the CPU can result in a 6x+ increase in samples/sec processed, which could lead to training in 1/6th of the time. https://www.tensorflow.org/performance/performance_guide
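
The overlap described above can be pictured without TensorFlow at all. The sketch below is only an analogy and does not use any TensorFlow API: a background thread stands in for the CPU input pipeline and fills a bounded queue, while the main loop stands in for the GPU training step and drains it. The names `produce_batches`, `consume_batches`, and `run_pipeline`, as well as the queue size and number of batches, are made up for illustration.

```python
import queue
import threading

BATCH_SIZE = 40      # same batch size as in the question
NUM_BATCHES = 5      # hypothetical, kept small for the demo
SENTINEL = None      # marks the end of the stream


def produce_batches(q, num_batches, batch_size):
    """Stand-in for the CPU side: build batches and push them into the queue."""
    for step in range(num_batches):
        batch = [step * batch_size + i for i in range(batch_size)]
        q.put(batch)          # blocks while the queue is full
    q.put(SENTINEL)


def consume_batches(q):
    """Stand-in for the GPU side: pull ready batches and count the samples."""
    seen = 0
    while True:
        batch = q.get()       # blocks until the producer has a batch ready
        if batch is SENTINEL:
            return seen
        seen += len(batch)


def run_pipeline():
    # A bounded queue plays the role of the `capacity` argument of a TF queue.
    q = queue.Queue(maxsize=2)
    producer = threading.Thread(
        target=produce_batches, args=(q, NUM_BATCHES, BATCH_SIZE), daemon=True)
    producer.start()
    total = consume_batches(q)
    producer.join()
    return total


if __name__ == "__main__":
    print(run_pipeline())
```

Because the queue is bounded, the producer never runs far ahead of the consumer, and the consumer only waits when no batch is ready. That is the behaviour you want from the CPU-side input queues.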

For example, you can create a function get_batch that runs on the CPU like this:

def get_batch(dataset):
    # Pin the whole input pipeline to the CPU.
    with tf.device('/cpu:0'):
        # 1. File name queue
        # 2. Get-image function implementation
        # 3. Shuffle batch to make batches
        ...  # placeholder: the steps above produce `images` and `labels`
    return images, labels

train_imageBatch, train_labelBatch = get_batch('train_dataset')
validation_imageBatch, validation_labelBatch = get_batch('valid_dataset')

Also take a look at the following link on how to switch between training and validation when using queues: Tensorflow Queues - Switching between train and validation data. Your code should look like this:

# A bool tensor to tell whether we are in the training loop or the testing loop
_is_train = tf.placeholder(dtype=tf.bool, name='is_train')

# Select the train or validation batch based on the _is_train tensor
images = tf.cond(_is_train, lambda: train_imageBatch, lambda: validation_imageBatch)
labels = tf.cond(_is_train, lambda: train_labelBatch, lambda: validation_labelBatch)

train_step = ...
...
for step in range(num_steps):

    # each step
    summary, _ = sess.run([merged, train_step], feed_dict={_is_train: True})
    ...
    if validate_step:
        summary, acc = sess.run([merged, accuracy], feed_dict={_is_train: False})
        ...

For the implementation of get_batch, you can look at this example from TensorFlow: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/fully_connected_reader.py.
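
As a rough sanity check on the error message in the question: the failing tensor has shape [16384, 20000], where 16384 = 128 x 128 is the flattened image size. The back-of-the-envelope estimate below assumes 4-byte float32 values (the DT_FLOAT in the log) and assumes that Adam keeps two extra slot tensors per variable (the `Adam` and `Adam_1` nodes). The helper name `tensor_bytes` is made up for illustration, and the total ignores gradients, activations, and all other layers.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


GIB = float(1024 ** 3)

weights_shape = (128 * 128, 20000)       # shape from the error message
one_copy = tensor_bytes(weights_shape)   # the weight matrix itself
# Assumption: Adam stores two slot tensors (first and second moment)
# with the same shape as the variable, so three copies in total.
with_adam = 3 * one_copy

print("one copy : %.2f GiB" % (one_copy / GIB))
print("with Adam: %.2f GiB" % (with_adam / GIB))
```

Under these assumptions the first layer's weights alone take about 1.22 GiB, and about 3.66 GiB once the Adam slots are counted. Reducing the width of the 20000-unit layer shrinks these numbers proportionally.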

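[Comments]:

Thanks for your reply; I still need some clarification. Could you explain how to use _is_train in the example, I mean in the call to sess.run? Second, how do the queues work in this case? In my example I run the validation and training batches each time to make sure new data is fetched, and I don't understand how that will work in your example. Thanks a lot for your help.
I have tried it, but the feeding did not work. The resource problem is solved, though. Thanks.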