Low GPU usage when training a CNN

【Question title】: Low GPU usage when training a CNN
【Posted】: 2018-03-17 16:14:27
【Question】:

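I just installed the GPU build of TensorFlow and started training my convolutional neural network. The problem is that my GPU usage stays at 0% and only occasionally rises to 20%. CPU usage is around 20% and disk usage is above 60%. To check that the installation was correct, I ran some matrix multiplications; in that case everything worked and GPU usage was above 90%.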

with tf.device("/gpu:0"):
    #here I set up the computational graph

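When I run the graph I use the following, so that TensorFlow decides whether an operation has a GPU implementation and falls back to the CPU otherwise: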

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:

I have an NVIDIA GeForce GTX 950M graphics card, and I get no errors at runtime. What am I doing wrong?

Later edit: my computational graph

with tf.device("/gpu:0"):
    X = tf.placeholder(tf.float32, shape=[None, height, width, channels], name="X")
    dropout_rate= 0.3


    training = tf.placeholder_with_default(False, shape=(), name="training")
    X_drop = tf.layers.dropout(X, dropout_rate, training = training)

    y = tf.placeholder(tf.int32, shape = [None], name="y")


    conv1 = tf.layers.conv2d(X_drop, filters=32, kernel_size=3,
                            strides=1, padding="SAME",
                            activation=tf.nn.relu, name="conv1")

    conv2 = tf.layers.conv2d(conv1, filters=64, kernel_size=3,
                            strides=2, padding="SAME",
                            activation=tf.nn.relu, name="conv2")

    pool3 = tf.nn.max_pool(conv2,
                            ksize=[1, 2, 2, 1],
                            strides=[1, 2, 2, 1],
                            padding="VALID")

    conv4 = tf.layers.conv2d(pool3, filters=128, kernel_size=4,
                            strides=3, padding="SAME",
                            activation=tf.nn.relu, name="conv4")

    pool5 = tf.nn.max_pool(conv4,
                            ksize=[1, 2, 2, 1],
                            strides=[1, 1, 1, 1],
                            padding="VALID")


    pool5_flat = tf.reshape(pool5, shape = [-1, 128*2*2])

    fullyconn1 = tf.layers.dense(pool5_flat, 128, activation=tf.nn.relu, name = "fc1")
    fullyconn2 = tf.layers.dense(fullyconn1, 64, activation=tf.nn.relu, name = "fc2")

    logits = tf.layers.dense(fullyconn2, 2, name="output")

    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)

    loss = tf.reduce_mean(xentropy)
    optimizer = tf.train.AdamOptimizer()
    training_op = optimizer.minimize(loss)

    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    init = tf.global_variables_initializer()
saver = tf.train.Saver()

hm_epochs = 100
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True

The batch size is 128.

with tf.Session(config=config) as sess:
        tbWriter = tf.summary.FileWriter(logPath, sess.graph)
        dataset = tf.data.Dataset.from_tensor_slices((training_images, training_labels))
        dataset = dataset.map(rd.decodeAndResize)
        dataset = dataset.batch(batch_size)

        testset = tf.data.Dataset.from_tensor_slices((test_images, test_labels))
        testset = testset.map(rd.decodeAndResize)
        testset = testset.batch(len(test_images))

        iterator = dataset.make_initializable_iterator()
        test_iterator = testset.make_initializable_iterator()
        next_element = iterator.get_next()
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            epoch_loss = 0
            sess.run(iterator.initializer)
            while True:
                try:
                    epoch_x, epoch_y = sess.run(next_element)
                    # _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                    # epoch_loss += c
                    sess.run(training_op, feed_dict={X:epoch_x, y:epoch_y, training:True})
                except tf.errors.OutOfRangeError:
                    break


            sess.run(test_iterator.initializer)
            # acc_train = accuracy.eval(feed_dict={X:epoch_x, y:epoch_y})
            try:
                next_test = test_iterator.get_next()
                test_images, test_labels = sess.run(next_test)
                acc_test = accuracy.eval(feed_dict={X:test_images, y:test_labels})
                print("Epoch {0}: Test accuracy {1}".format(epoch, acc_test))
            except tf.errors.OutOfRangeError:
                break
            # print("Epoch {0}: Train accuracy {1}, Test accuracy: {2}".format(epoch, acc_train, acc_test))
        save_path = saver.save(sess, "./my_first_model")

I have 9k training images and 3k test images.
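The reshape to `128*2*2` in the graph above only matches certain input sizes. As a rough sanity check, the sketch below traces the spatial size through the five conv/pool layers using TensorFlow's documented output-size rules for "SAME" and "VALID" padding. The post does not state `height` and `width`, so the 32x32 input is purely an assumption, and `out_size`, `LAYERS` and `trace` are hypothetical helper names.

```python
import math

def out_size(size, kernel, stride, padding):
    """Spatial output size of a conv/pool layer under TensorFlow's padding rules."""
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError("unknown padding: {}".format(padding))

# (name, kernel size, stride, padding) copied from the graph in the question.
LAYERS = [
    ("conv1", 3, 1, "SAME"),
    ("conv2", 3, 2, "SAME"),
    ("pool3", 2, 2, "VALID"),
    ("conv4", 4, 3, "SAME"),
    ("pool5", 2, 1, "VALID"),
]

def trace(size):
    """Return the spatial size after each layer for a square input."""
    sizes = {}
    for name, kernel, stride, padding in LAYERS:
        size = out_size(size, kernel, stride, padding)
        sizes[name] = size
    return sizes

sizes = trace(32)  # ASSUMPTION: 32x32 inputs; the post does not give the real size
print(sizes)
```

Under that assumption the last pooling layer produces a 2x2 map with 128 channels, which agrees with the hard-coded flatten shape.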

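【Comments】:

There could be many reasons for this, and without more details and code it is impossible to say exactly which one applies. One possible explanation is that reading and preparing the batches of input data (usually done on the CPU) takes a lot of time. Meanwhile, the GPU sits idle, waiting for work.
Hi mikkola, thanks for your reply. I have edited the post and added the code.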
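One way to test the input-bottleneck hypothesis from the first comment is to time the data-fetch call and the training call separately, since the loop in the question already issues them as two `sess.run` calls. The following is only a minimal sketch: `profile_steps`, `slow_fetch` and `fast_train` are hypothetical names, and the stand-in functions use `time.sleep` so that the example runs without TensorFlow.

```python
import time

def profile_steps(fetch_batch, train_step, num_steps):
    """Return (seconds spent fetching, seconds spent training) over num_steps."""
    fetch_s = 0.0
    train_s = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = fetch_batch()   # in the question: sess.run(next_element)
        t1 = time.perf_counter()
        train_step(batch)       # in the question: sess.run(training_op, feed_dict=...)
        t2 = time.perf_counter()
        fetch_s += t1 - t0
        train_s += t2 - t1
    return fetch_s, train_s

# Hypothetical stand-ins with made-up delays, not measurements from the post.
def slow_fetch():
    time.sleep(0.02)    # pretend decoding and resizing a batch takes 20 ms
    return [0] * 128

def fast_train(batch):
    time.sleep(0.002)   # pretend the training step takes 2 ms

fetch_s, train_s = profile_steps(slow_fetch, fast_train, num_steps=5)
print("fetch: {:.3f}s, train: {:.3f}s".format(fetch_s, train_s))
```

If the fetch time dominates in a real run, the GPU is most likely waiting on the input pipeline rather than being limited by the model itself.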

Tags: python tensorflow tensorflow-datasets

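【Answer 1】:

There are a few issues in your code that could cause low GPU usage.

1) Add a prefetch call at the end of your Dataset pipeline so that the CPU can keep a buffer of input batches ready to be moved to the GPU.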

# this should be the last thing in your pipeline
dataset = dataset.prefetch(1)

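2) You are feeding your model with feed_dict together with a Dataset iterator. This is not the intended usage! feed_dict is the slowest way to feed data into a model and is not recommended. Instead, define your model in terms of the iterator's next_element output.

Example: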

next_x, next_y = iterator.get_next()
with tf.device('/GPU:0'):
    conv1 = tf.layers.conv2d(next_x, filters=32, kernel_size=3,
                        strides=1, padding="SAME",
                        activation=tf.nn.relu, name="conv1")
    # rest of model here...
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, 
                 labels=next_y)

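You can then call your training operation without passing the input data through feed_dict; the iterator feeds data to your model behind the scenes. Your new training loop would look something like this: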

while True:
    try:
        sess.run(training_op, feed_dict={training:True})
    except tf.errors.OutOfRangeError:
        break

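You should only pass data through feed_dict that your iterator does not provide, and that data should generally be very lightweight.

For more performance tips, see the performance guide on the TensorFlow website.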
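To illustrate why point 1 helps, here is a plain-Python sketch of the prefetching idea: a background thread prepares the next batch while the consumer is still busy with the current one. It is not how tf.data implements `prefetch`. The names `prefetch`, `slow_batches` and `consume` are hypothetical, and the sleeps are made-up stand-ins for CPU preprocessing and the GPU training step.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the stream

def prefetch(iterable, buffer_size=1):
    """Yield items from iterable while a background thread prepares the next ones."""
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            q.put(item)   # blocks while the buffer is full
        q.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item

def slow_batches(n, delay=0.02):
    for i in range(n):
        time.sleep(delay)  # stands in for decoding and resizing on the CPU
        yield i

def consume(batches, delay=0.02):
    """Process every batch and return (batches seen, elapsed seconds)."""
    start = time.perf_counter()
    seen = []
    for batch in batches:
        time.sleep(delay)  # stands in for the training step
        seen.append(batch)
    return seen, time.perf_counter() - start

seen_serial, t_serial = consume(slow_batches(10))
seen_prefetch, t_prefetch = consume(prefetch(slow_batches(10), buffer_size=1))
print("serial: {:.2f}s, prefetched: {:.2f}s".format(t_serial, t_prefetch))
```

Without prefetching, each step pays for preparation plus training; with a buffer of one batch the two overlap, so the total approaches the cost of the slower stage alone.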

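【Comments】:

Thank you very much for your help. I will make these changes and get back to you.
I have a question. If I integrate the iterator into my computational graph, how do I feed in the test data?
@LaciSzakács You can use a reinitializable or a feedable iterator. See the in-depth guide here: tensorflow.org/programmers_guide/datasets#creating_an_iterator
I managed to use a reinitializable iterator. Now my usage periodically rises to 20%... but it still isn't using 100% of the GPU :(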

【Answer 2】:

You can try the following code to check whether TensorFlow recognizes your GPU:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

【Comments】:

Hi, thanks for your answer. I tried the code, and TensorFlow recognizes my GTX GPU.