Tensorflow matmul calculations on GPU are slower than on CPU (answers)

【Question title】: Tensorflow matmul calculations on GPU are slower than on CPU
【Posted】: 2017-04-05 22:18:42
【Question】:

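This is my first attempt at GPU computing, and I was of course hoping for a big speed-up. However, with a basic example in tensorflow, the result is actually worse:

On cpu:0, each of the ten runs takes 2 seconds on average. gpu:0 takes 2.7 seconds, and gpu:1, at 3 seconds, is 50% worse than cpu:0.

Here is the code: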

import tensorflow as tf
import numpy as np
import time
import random

for _ in range(10):
    with tf.Session() as sess:
        # The timer starts before the graph is built, so data generation is timed too.
        start = time.time()
        with tf.device('/gpu:0'):  # swap for '/cpu:0' or whatever
            # One Python-level random.random() call per element (10^6 per matrix).
            a = tf.constant([random.random() for _ in range(1000 * 1000)], shape=[1000, 1000], name='a')
            b = tf.constant([random.random() for _ in range(1000 * 1000)], shape=[1000, 1000], name='b')
            # Chain of four 1000x1000 matrix multiplications.
            c = tf.matmul(a, b)
            d = tf.matmul(a, c)
            e = tf.matmul(a, d)
            f = tf.matmul(a, e)
            for _ in range(1000):
                sess.run(f)
        end = time.time()
        print(end - start)

What am I observing here? Is the run time dominated by copying data between RAM and the GPU?

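【Comments】:

Try increasing the matrix size and compare the GPU usage in nvidia-smi with the CPU usage in top.
@sygi Thanks, I didn't know about nvidia-smi. It shows that GPU-Util never exceeds 2%. Python does seem to take up most of the memory, though. Power usage is fairly steady at 40W / 180W.
It looks like the code you wrote is not GPU-bound. Could you try changing a and b to tf.random_uniform([1000, 1000])? As for memory, TF grabs all GPU memory by default (ugh!), but there is an option that forces dynamic allocation instead.
@sygi Using random_uniform is noticeably faster, very interesting!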
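
The comment above does not name the dynamic-allocation option. Assuming it refers to the `allow_growth` field of the TF 1.x GPU options, a minimal sketch could look like this. The helper name `make_growth_session` is hypothetical and only serves the illustration.

```python
def make_growth_session():
    """Return a TF 1.x session that allocates GPU memory on demand.

    Assumption: the option meant in the comment is gpu_options.allow_growth.
    """
    # Imported here so the module still loads where TensorFlow 1.x is missing.
    import tensorflow as tf

    config = tf.ConfigProto()
    # Start with a small allocation and grow it as needed, instead of
    # reserving nearly all GPU memory when the session is created.
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)


if __name__ == "__main__":
    with make_growth_session() as sess:
        pass  # build and run the graph here
```

With this setting, the memory figure shown by nvidia-smi should reflect what the graph actually uses rather than the full up-front reservation.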

Tags: python performance tensorflow gpu

【Solution 1】:

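The way you generate the data runs on the CPU (random.random() is a regular Python function, not a TF op). In addition, calling it 10^6 times is slower than requesting 10^6 random numbers in a single call. Change the code to: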

a = tf.random_uniform([1000, 1000], name='a')
b = tf.random_uniform([1000, 1000], name='b')

This way the data is generated in parallel on the GPU, and no time is wasted transferring it from RAM to the GPU.
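
To get a rough feel for how much of the measured time the Python-side data generation alone might account for, the list comprehensions from the question can be timed in isolation. This is a minimal sketch that needs only the standard library. The function name `time_python_generation` is hypothetical, and the sketch does not measure the further cost of passing the lists to tf.constant.

```python
import random
import time


def time_python_generation(n_values=1000 * 1000, n_tensors=2):
    """Time how long plain Python needs to build the input lists.

    Mirrors the list comprehensions in the question: one random.random()
    call per element, all executed by the Python interpreter on the CPU.
    """
    start = time.perf_counter()
    data = [[random.random() for _ in range(n_values)] for _ in range(n_tensors)]
    elapsed = time.perf_counter() - start
    return data, elapsed


if __name__ == "__main__":
    data, elapsed = time_python_generation()
    print("generated %d lists of %d floats in %.3f s"
          % (len(data), len(data[0]), elapsed))
```

Comparing the printed time with the per-run totals reported in the question gives an estimate of the share that has nothing to do with the chosen device.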

【Comments】: