TensorFlow vs. NumPy Performance: Answers

【Question title】: TensorFlow vs. NumPy Performance
【Posted】: 2017-07-30 20:46:36
【Question】:

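I am computing the mean and standard deviation in numpy. To improve performance, I tried the same computation in TensorFlow, but TensorFlow was at least ~10x slower. I tried 2 approaches in TensorFlow (code below). The first approach uses tf.nn.moments(), which has a bug that causes it to sometimes return a negative value for the variance. In the second approach I compute the variance with other TensorFlow functions.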
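A variance that comes out negative is usually a floating-point artifact rather than a mathematical result. As a minimal NumPy sketch (an illustration only; it is an assumption, not something the question confirms, that tf.nn.moments() ran into this particular effect), the one-pass formula E[x^2] - E[x]^2 can lose all significant digits when the mean is large relative to the spread, while the two-pass formula stays accurate. The function names and input values below are made up for the example.

```python
import numpy as np

# Four float64 values with a huge mean (1e9 + 10) and a tiny spread.
x = np.array([1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16], dtype=np.float64)

def variance_one_pass(a):
    """Population variance as E[x^2] - E[x]^2 (prone to cancellation)."""
    m = np.mean(a)
    return np.mean(a * a) - m * m

def variance_two_pass(a):
    """Population variance as the mean squared deviation from the mean."""
    m = np.mean(a)
    return np.mean((a - m) ** 2)

naive = variance_one_pass(x)
stable = variance_two_pass(x)

# The exact population variance is 22.5. The squares are near 1e18, where
# neighbouring float64 values are 128 apart, so the one-pass result can only
# be a multiple of 128 and may even be negative.
print("one-pass:", naive)
print("two-pass:", stable)
```

The two-pass form is the same idea as Approach 2 in the question, which subtracts the mean before squaring.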

I tried both CPU-only and GPU; numpy is always faster.

I used time.time() rather than time.clock() so that wall-clock time is measured when the GPU is used.
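The choice of clock matters for this kind of measurement. time.clock() reported CPU time on Unix; it was deprecated in Python 3.3 and removed in Python 3.8. A minimal sketch of the difference between wall-clock time and CPU time, using time.perf_counter() and time.process_time() from the standard library (the 50 ms sleep is an arbitrary stand-in for time spent waiting, not a real GPU call):

```python
import time

wall_start = time.perf_counter()   # wall-clock time, includes waiting
cpu_start = time.process_time()    # CPU time of this process only

time.sleep(0.05)                   # waiting: wall time advances, CPU time barely moves

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print("wall: %.1f ms, cpu: %.1f ms" % (1e3 * wall_elapsed, 1e3 * cpu_elapsed))
```

A CPU-time clock can under-report work that the host process merely waits for, which is why a wall-clock timer is the safer choice when a GPU is involved.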

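Why is TensorFlow slower? I thought it might be due to transferring the data to the GPU, but TF is slower even for very small datasets (where transfer time should be negligible) and when using only the CPU. Is it because of the extra time needed to initialize TF?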

import tensorflow as tf
import numpy
import time
import math

class Timer:
    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *args):
        self.end = time.time()
        self.interval = self.end - self.start

inData = numpy.random.uniform(low=-1, high=1, size=(40000000,))

with Timer() as t:
    mean = numpy.mean(inData)
print('python mean', mean, 'time', t.interval)

with Timer() as t:
    stdev = numpy.std(inData)
print('python stdev', stdev, 'time', t.interval)

# Approach 1 (Note tf.nn.moments() has a bug)
with Timer() as t:
    with tf.Graph().as_default():
        meanTF, varianceTF = tf.nn.moments(tf.constant(inData), axes=[0])
        init_op = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init_op)
            mean, variance = sess.run([meanTF, varianceTF])
            sess.close()
print('variance', variance)
stdev = math.sqrt(variance)
print('tensorflow mean', mean, 'stdev', stdev, 'time', t.interval)

# Approach 2
with Timer() as t:
    with tf.Graph().as_default():
        inputVector = tf.constant(inData)
        meanTF = tf.reduce_mean(inputVector)
        length = tf.size(inputVector)
        # population variance: mean squared deviation from the in-graph mean
        varianceTF = tf.divide(tf.reduce_sum(tf.squared_difference(inputVector, meanTF)), tf.to_double(length))
        init_op = tf.global_variables_initializer()
        with tf.Session() as sess:
            sess.run(init_op)
            mean, variance = sess.run([meanTF, varianceTF])
            sess.close()
print('variance', variance)
stdev = math.sqrt(variance)
print('tensorflow mean', mean, 'stdev', stdev, 'time', t.interval)

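【Comments on the question】:

"I thought it might be due to transferring data into the GPU, but TF is slower even for very small datasets" This looks like you swapped something. I would say your type of calculation is simple, so numpy gets quite close to the limit thanks to specialized functions and its use of BLAS (which may run in parallel depending on your BLAS setup; e.g. the default in Ubuntu). TensorFlow can't do much better (while guaranteeing the same accuracy).
In my tests, TensorFlow is consistently much slower than NumPy. Shouldn't TensorFlow be faster, since it uses the GPU while NumPy only uses the CPU? I'm running Ubuntu and haven't changed anything that affects BLAS (as far as I know).
It always depends on the task. Some algorithms parallelize well, others don't (you already mentioned other factors such as transfer; there are also dtypes and so on). Not every workload is a good fit for a GPU.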

Tags: performance numpy tensorflow

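【Answer 1】:

Here is a slightly better benchmark. Tested on a Xeon V3 with a CPU-only TensorFlow build compiled with all optimization options + XLA from here, against the numpy MKL build that ships with the latest Anaconda.

XLA probably makes no difference here, but it is left in for posterity.

Notes:

Exclude the first few runs from the timing; they can include initialization/profiling.
Use a variable to avoid copying the input into the TensorFlow runtime.
Perturb the variable between calls to make sure nothing is cached.

Results (minimum, median):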

   numpy 23.5 ms, 25.7 ms
      tf 14.7 ms, 20.5 ms

Code:

import numpy as np
import tensorflow as tf
import time
from tensorflow.contrib.compiler import jit
jit_scope = jit.experimental_jit_scope

inData = np.random.uniform(low=-1, high=1, size=(40000000,)).astype(np.float32)
#inDataFeed = tf.placeholder(inData.dtype)

with jit_scope(compile_ops=True):
    inDataVar = tf.Variable(inData)
    meanTF = tf.reduce_mean(inDataVar)


sess = tf.Session()
sess.run(tf.global_variables_initializer())
num_tries = 10


times = []
for i in range(num_tries):
    t0 = time.perf_counter()
    mean = np.mean(inData)
    times.append(time.perf_counter()-t0)

print("%10s %.1f ms, %.1f ms" %("numpy", 10**3*min(times),
                                10**3*np.median(times)))

times = []
perturb = inDataVar.assign_add(tf.random_uniform(inData.shape))
for i in range(num_tries):
    sess.run(perturb)
    t0 = time.perf_counter()
    mean, = sess.run([meanTF])
    times.append(time.perf_counter()-t0)

times = times[2:] # discard first few because they could include profiling runs
print("%10s %.1f ms, %.1f ms" %("tf", 10**3*min(times),
                                10**3*np.median(times)))
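The timing pattern used above (repeat the call, discard the warm-up runs, report minimum and median) can be factored into a small helper. The following is a sketch only: `bench`, `num_tries` and `warmup` are hypothetical names chosen for illustration, not part of TensorFlow or NumPy, and the array is smaller than the one in the answer to keep the example quick.

```python
import time
import numpy as np

def bench(fn, num_tries=10, warmup=2):
    """Time fn() num_tries times and drop the first `warmup` runs.

    Returns (min_ms, median_ms). All names here are illustrative.
    """
    times = []
    for _ in range(num_tries):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    times = times[warmup:]  # discard runs that may include one-time setup
    return 1e3 * min(times), 1e3 * float(np.median(times))

data = np.random.uniform(low=-1, high=1, size=(1000000,)).astype(np.float32)
best_ms, median_ms = bench(lambda: np.mean(data))
print("%10s %.1f ms, %.1f ms" % ("numpy", best_ms, median_ms))
```

Passing a callable keeps the measured region identical for every candidate, so a TensorFlow call such as `lambda: sess.run(meanTF)` could be timed the same way.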

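【Comments】:

Thanks, Yaroslav. My original goal was to get a performance boost from the GPU. Do you know whether loading data onto the GPU and starting a GPU session involve a lot of overhead?
Yes, starting a GPU session has significant overhead; in some unlucky cases it can take more than 30 seconds (when you use a graphics card whose compute capability has not been compiled in, as was the case for the GTX 1080).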

【Answer 2】:

Here is a benchmark from someone who claims that TF mean is significantly faster than in numpy or theano. The benchmark is here and was tested on

an Intel Core i5-4460 CPU with 16 GiB RAM and an Nvidia GTX 970 with 4 GiB RAM, using Theano 0.8.2, TensorFlow 0.11.0 and CUDA 8.0 on Linux Mint 18

Here are some other benchmarks, but they do not cover the mean.

【Comments】:

【Answer 3】:

Another benchmark, with an explanation, can be found at https://towardsdatascience.com/numpy-vs-tensorflow-speed-on-matrix-calculations-9cbff6b3ce04

【Comments】: