Using GPU vs CPU in the TensorFlow Deep MNIST example

【Question title】: Using gpu vs cpu in tensorflow deep mnist example
【Posted】: 2017-12-05 17:44:29
【Question】:

The program I am using is copied and pasted from here, with a few changes. Here is my code, which is meant to speed up training:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))

def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

with tf.device('/gpu:0'):
  W_conv1 = weight_variable([5, 5, 1, 32])
  b_conv1 = bias_variable([32])
  x_image = tf.reshape(x, [-1, 28, 28, 1])
  h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
  h_pool1 = max_pool_2x2(h_conv1)

  W_conv2 = weight_variable([5, 5, 32, 64])
  b_conv2 = bias_variable([64])

  h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
  h_pool2 = max_pool_2x2(h_conv2)

  W_fc1 = weight_variable([7 * 7 * 64, 1024])
  b_fc1 = bias_variable([1024])

  h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
  h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

  keep_prob = tf.placeholder(tf.float32)
  h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

  W_fc2 = weight_variable([1024, 10])
  b_fc2 = bias_variable([10])

  y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

  cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
  train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
  correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
  accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

  with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
      batch = mnist.train.next_batch(50)
      if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={
          x: batch[0], y_: batch[1], keep_prob: 1.0})
        print('step %d, training accuracy %g' % (i, train_accuracy))
      train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    print('test accuracy %g' % accuracy.eval(feed_dict={
        x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

It produces the following output:

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
step 0, training accuracy 0.22
step 100, training accuracy 0.76
step 200, training accuracy 0.88
...

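The problem is that there is no measurable difference in running time between the original code from the tutorial (i.e. without with tf.device('/gpu:0'): on line 26) and this code: both take about 10 seconds per step. I have successfully installed cuda-8.0 and cuDNN (after hours of failed attempts). "$ nvidia-smi" returns the following output: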

Sun Jul  2 13:57:10 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 710      Off  | 0000:01:00.0     N/A |                  N/A |
| N/A   49C    P0    N/A /  N/A |    406MiB /  2000MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+


+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+

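So my questions are:

1) Is the workload too small for the choice between CPU and GPU to make a difference?
2) Or is there some silly mistake in my implementation?

Thank you for reading the whole question.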

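【Comments】:

That just means the GPU is used by default when it is available. You should explicitly use the CPU to measure the difference.
Thanks @user1735003. I tried your suggestion (replacing gpu with cpu). The result was 5 more seconds per step. It should be faster, right? Also, when I copy and paste the original code from the website and compare it with the code above, there is no noticeable difference. Can you tell me why?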
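To compare the two placements on something firmer than an impression, the time per training step can be measured directly. Below is a minimal sketch using only the standard library; the helper name seconds_per_step and the stand-in workload dummy_step are assumptions made for illustration and are not part of the original code.

```python
import time


def seconds_per_step(step_fn, num_steps=100, warmup=5):
    """Return the average wall-clock seconds taken by one call to step_fn.

    seconds_per_step is a hypothetical helper name used for illustration.
    """
    # Warm-up calls are excluded from the measurement, since the first
    # steps often include one-off setup cost.
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    return (time.perf_counter() - start) / num_steps


def dummy_step():
    """Stand-in workload so the sketch runs without TensorFlow."""
    sum(i * i for i in range(10000))


print('%.6f s per step' % seconds_per_step(dummy_step, num_steps=20))
```

With the code from the question, step_fn could be a function that calls train_step.run(feed_dict=...) on a fixed batch inside the session. Running the same measurement once under '/gpu:0' and once under '/cpu:0' gives two numbers that can be compared.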

Tags: python machine-learning tensorflow gpu conv-neural-network

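【Solution 1】:

The fact that you can run this code without errors suggests that TensorFlow can definitely run on the GPU. The issue here is that when you run TensorFlow as is, it tries to run on the GPU by default. There are a couple of ways to force it to run on the CPU.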

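1) Run it this way: CUDA_VISIBLE_DEVICES= python code.py. Note that if you do this and still have with tf.device('/gpu:0'), it will break unless allow_soft_placement=True is set, so remove it.
2) Change with tf.device('/gpu:0') to with tf.device('/cpu:0').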
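The first option can also be applied from Python instead of the shell. The following is a minimal sketch, not a definitive recipe: cpu_only_env and run_on_cpu are hypothetical helper names, and code.py is the script name taken from the command above.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None):
    """Return a copy of the environment with every CUDA device hidden.

    cpu_only_env is a hypothetical helper name used for illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    # An empty value means CUDA exposes no GPUs to the process.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def run_on_cpu(script_path):
    """Run a Python script in a child process that cannot see any GPU."""
    return subprocess.run([sys.executable, script_path], env=cpu_only_env())


# Usage (not executed here): run_on_cpu("code.py")
```

Setting os.environ["CUDA_VISIBLE_DEVICES"] = "" at the very top of the script, before import tensorflow, should have the same effect within a single process.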

Edit, in response to the question in the comments:

For more information on what allow_soft_placement and log_device_placement mean in ConfigProto, see here.

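【Discussion】:

Sorry for being unclear @jkschin, but does the statement "when you run TensorFlow as is, it tries to run on the GPU by default" still hold even if I don't pass config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True) inside the session's parentheses?
Those parameters do not affect whether it runs on the GPU. See here for more information.
Please add your last comment to the answer (for future readers arriving from Google) @jkschin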