TensorFlow with 2 GPUs is slower than with a single GPU

【Question title】: TensorFlow 2-gpu slower than single gpu
【Posted】: 2016-11-17 01:10:33
【Question description】:

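I have two gpus (TitanX (Pascal) and GTX 1080). I am trying to run a single-threaded graph computation. The graph is two separate matrix multiplication chains, each assigned to its own gpu.

Here is the code that I'm using: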

import tensorflow as tf
import numpy as np
import random
import time
import logging

from tensorflow.python.ops import init_ops
from tensorflow.python.client import timeline


def test():
    n = 5000

    with tf.Graph().as_default():
        A1 = tf.placeholder(tf.float32, shape=[n, n], name='A')
        A2 = tf.placeholder(tf.float32, shape=[n, n], name='A')
        with tf.device('/gpu:0'):
            B1 = A1
            for l in xrange(10):
                B1 = tf.matmul(B1, A1)

        with tf.device('/gpu:1'):
            B2 = A2
            for l in xrange(10):
                B2 = tf.matmul(B2, A2)
            C = tf.matmul(B1, B2)

        run_metadata = tf.RunMetadata()
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
            start = time.time()
            logging.info('started')
            A1_ = np.random.rand(n, n)
            A2_ = np.random.rand(n, n)
            sess.run([C],
                     feed_dict={A1: A1_, A2: A2_},
                     options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
                     run_metadata=run_metadata)
            logging.info('writing trace')
            trace = timeline.Timeline(step_stats=run_metadata.step_stats)
            trace_file = open('timeline.ctf.json', 'w')
            trace_file.write(trace.generate_chrome_trace_format())
            logging.info('trace written')
            end = time.time()
            logging.info('computed')
            logging.info(end - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
    test()

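It takes 20.4 seconds to finish.
If I place all ops on gpu0 (TitanX), it takes 14 seconds to finish.
If I place all ops on gpu1 (GTX 1080), it takes 19.8 seconds to finish.

I can see that tensorflow found the gpus and set all devices correctly. Why is there no speedup from using two gpus instead of one? Could the problem be that the gpus are different models (AFAIK cuda allows that)?

Thanks.

EDIT: I updated the code to use different initial matrices for the two chains, since otherwise tensorflow seems to do some optimization.

Here is a link to the timeline profile json file: https://api.myjson.com/bins/23csi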

[Screenshot of the timeline]

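This timeline raises more questions than answers:

Why does pid 7 (gpu0) have two lines of execution?
What are the long MatMuls in pid 3 and 5? (input0 "_recv_A_0/_3", input1 "_recv_A_0/_3", name "MatMul", op "MatMul")
It seems that every op is executed on gpu0 except pid 5.
Right after the long MatMul ops in pid 3 and pid 5 there are a lot of small MatMul ops (not visible in the screenshot). What are these?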
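
One way to approach questions like these is to aggregate the trace by process row instead of reading it by eye. The sketch below assumes the file follows the Chrome trace event layout: a top-level "traceEvents" list, "M" metadata events that name each pid, and "X" events that carry a "dur" in microseconds. The helper name summarize_trace and the sample events are made up for illustration and are not taken from the posted trace.

```python
import json
from collections import defaultdict


def summarize_trace(trace_json):
    """Return {process name: (event count, total duration in microseconds)}."""
    events = json.loads(trace_json)["traceEvents"]
    # Metadata events (ph == "M") map each pid to a readable process name.
    names = {
        e["pid"]: e["args"]["name"]
        for e in events
        if e.get("ph") == "M" and e.get("name") == "process_name"
    }
    totals = defaultdict(lambda: [0, 0])
    for e in events:
        if e.get("ph") == "X":  # complete events carry a duration
            key = names.get(e["pid"], "pid %s" % e["pid"])
            totals[key][0] += 1
            totals[key][1] += e.get("dur", 0)
    return {name: tuple(pair) for name, pair in totals.items()}


# Hypothetical hand-written sample standing in for timeline.ctf.json.
SAMPLE = json.dumps({"traceEvents": [
    {"ph": "M", "name": "process_name", "pid": 3, "args": {"name": "/gpu:0 Compute"}},
    {"ph": "M", "name": "process_name", "pid": 5, "args": {"name": "/gpu:1 Compute"}},
    {"ph": "X", "name": "MatMul", "pid": 3, "tid": 0, "ts": 0, "dur": 1200},
    {"ph": "X", "name": "MatMul", "pid": 3, "tid": 0, "ts": 1200, "dur": 300},
    {"ph": "X", "name": "MatMul", "pid": 5, "tid": 0, "ts": 0, "dur": 900},
]})

for name, (count, total_us) in sorted(summarize_trace(SAMPLE).items()):
    print("%-16s %d events, %d us" % (name, count, total_us))
```

For a real trace, the call would be summarize_trace(open('timeline.ctf.json').read()). Comparing the totals per row makes it easier to spot a single long op that dominates a device.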

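【Question discussion】:

You can look at the timeline to see where the bottleneck is: github.com/tensorflow/tensorflow/issues/…
Also, you could use sess.run(C.op) instead of sess.run(C) to take the TensorFlow->Python transfer out of your timing.
I'm getting the error "TypeError: __init__() got an unexpected keyword argument 'run_metadata'" in the Session constructor. I installed tensorflow from source in Sep '16 and tried reinstalling it from pip (still the same error).
@YaroslavBulatov Thanks for the timeline profiler suggestion. I've updated the post with more questions.
gpus have multiple streams, so some things are duplicated in the display: it can show the same computation in the GPU's "compute" channel as well as in the dedicated "stream" channel. The long op is probably due to initial kernel launch overhead; I posted a benchmark with pre-warming.

Tags: tensorflow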

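【Solution 1】:

There's a significant delay when a kernel is launched for the first time on a GPU, possibly caused by PTXAS compilation. This delay can be on the order of seconds, and it accumulates when you use more than one GPU, so in your case the run is slower because the time is dominated by an extra "initial kernel launch". One way to benchmark pure computation time is to "pre-warm" by executing each cuda op at least once on each GPU. I observed the same slowness when running your benchmark on 2 TitanX cards, but the delay disappeared once I "pre-warmed" the kernels.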
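
A rough back-of-the-envelope estimate is consistent with this. The sketch below counts the floating-point work in the question's graph and divides it by an assumed peak throughput. The figure of 10 TFLOP/s is a round assumption for a high-end GPU of that generation, not a number from the post, and real matmul throughput is lower than peak.

```python
n = 5000                        # matrix size from the question
matmuls = 10 + 10 + 1           # two chains of 10 plus the final product C
flops_per_matmul = 2 * n ** 3   # n^3 multiply-adds, two FLOPs each
total_flops = matmuls * flops_per_matmul

ASSUMED_PEAK_FLOPS = 10e12      # assumption: ~10 TFLOP/s FP32, not from the post
ideal_seconds = total_flops / ASSUMED_PEAK_FLOPS

print("total FLOPs: %.3g" % total_flops)
print("ideal compute time: %.3f s" % ideal_seconds)
```

Under that assumption the arithmetic itself would take about half a second on one device, so reported times of 14 to 20 seconds are unlikely to be pure compute. The timing in the question also covers generating the random inputs, feeding them to the devices, and writing the trace file.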

Before pre-warming:

After pre-warming:

Below is your code, modified to do pre-warming and to remove any TensorFlow<->Python transfers.

import tensorflow as tf

from tensorflow.python.ops import init_ops
from tensorflow.python.client import timeline
import logging, time
import numpy as np

def test():
    n = 5000

    with tf.device('/gpu:0'):
        A1 = tf.Variable(tf.ones_initializer(shape=[n, n]), name='A1')
        B1 = A1
        for l in xrange(10):
            B1 = tf.matmul(A1, B1, name="chain1")

    with tf.device('/gpu:1'):
        A2 = tf.Variable(tf.ones_initializer(shape=[n, n]), name='A2')
        B2 = A2
        for l in xrange(10):
            B2 = tf.matmul(A2, B2, name="chain2")
        C = tf.matmul(B1, B2)

    run_metadata = tf.RunMetadata()
    start = time.time()
    logging.info('started')
    sess = tf.InteractiveSession(config=tf.ConfigProto(allow_soft_placement=False, log_device_placement=True))
    sess.run(tf.initialize_all_variables())
    # do warm-run
    sess.run([C.op],
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_metadata)
    run_metadata = tf.RunMetadata()
    sess.run([C.op],
             options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
             run_metadata=run_metadata)
    logging.info('writing trace')
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    trace_file = open('timeline.ctf.json', 'w')
    trace_file.write(trace.generate_chrome_trace_format(show_memory=True))
    logging.info('trace written')
    end = time.time()
    logging.info('computed')
    logging.info(end - start)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
    test()
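
The same pre-warming idea can be written as a small timing helper that is independent of TensorFlow. This is only a sketch: time_after_warmup and fake_step are hypothetical names, the first-call cost is simulated with a sleep, and the code targets Python 3 (time.perf_counter), unlike the Python 2 code above.

```python
import time


def time_after_warmup(fn, warmup=1, repeats=3):
    """Call fn() `warmup` times untimed, then return the best of `repeats` timed calls."""
    for _ in range(warmup):
        fn()  # one-time costs (e.g. the first kernel launch) are paid here, untimed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload with an artificial one-time cost on its first call.
state = {"first": True}


def fake_step():
    if state["first"]:
        time.sleep(0.05)  # simulated first-call overhead
        state["first"] = False
    sum(range(10000))     # the steady-state work being measured


elapsed = time_after_warmup(fake_step)
print("steady-state time: %.6f s" % elapsed)
```

With the session and graph from the code above, the call would look like time_after_warmup(lambda: sess.run(C.op)). Taking the best of several runs is one common choice; a mean or median works too if run-to-run variance matters.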

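【Discussion】:

BTW, the labels are misleading: "gpu:0/stream:22" in the timeline is actually on gpu:1, as can be seen from log_device_placement.
Thanks for the clarification. But these timeline traces still look strange to me. Why, if I assign all ops to gpu0, are they always computed sequentially (I tried matrix sizes from 5000 to 20000 and chain lengths up to 100)? It seems that the two chains could be computed in two parallel streams even on a single gpu.
That's right, tensorflow doesn't schedule ops on parallel streams. Instead, each op can use all of the GPU's streams.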

【Solution 2】:

Isn't it because data has to be transferred between the GPUs when computing C? Can you try putting C on the cpu?

with tf.device('/cpu:0'):
  C = tf.matmul(B1, B2)

【Discussion】:

That didn't help. Also, I think it shouldn't matter, since each gpu does 500 matrix multiplications before the final one.