【Question Title】: Training a simple model in Tensorflow GPU slower than CPU
【Posted】: 2023-05-02 20:23:01
【Question】:

I have set up a simple linear regression problem in Tensorflow, and created simple conda environments for Tensorflow CPU and GPU on 1.13.1 (using CUDA 10.0 in the backend with an NVIDIA Quadro P600).

However, the GPU environment always seems to take longer than the CPU environment. The code I'm running is below.

import time
import warnings
import numpy as np
import scipy

import tensorflow as tf
import tensorflow_probability as tfp

from tensorflow_probability import edward2 as ed
from tensorflow.python.ops import control_flow_ops
from tensorflow_probability import distributions as tfd



# Handy snippet to reset the global graph and global session.
def reset_g():
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        tf.reset_default_graph()
        try:
            sess.close()
        except:
            pass


N = 35000
inttest = np.ones(N).reshape(N, 1)
stddev_raw = 0.09

true_int = 1.
true_b1 = 0.15
true_b2 = 0.7

np.random.seed(69)

X1 = (np.atleast_2d(np.linspace(
    0., 2., num=N)).T).astype(np.float64)
X2 = (np.atleast_2d(np.linspace(
    2., 1., num=N)).T).astype(np.float64)
Ytest = true_int + (true_b1*X1) + (true_b2*X2) + \
    np.random.normal(size=N, scale=stddev_raw).reshape(N, 1)

Ytest = Ytest.reshape(N, )
X1 = X1.reshape(N, )
X2 = X2.reshape(N, )

reset_g()

# Create data and param
model_X1 = tf.placeholder(dtype=tf.float64, shape=[N, ])
model_X2 = tf.placeholder(dtype=tf.float64, shape=[N, ])
model_Y = tf.placeholder(dtype=tf.float64, shape=[N, ])

alpha = tf.get_variable(shape=[1], name='alpha', dtype=tf.float64)
# these two params need shape of one if using trainable distro
beta1 = tf.get_variable(shape=[1], name='beta1', dtype=tf.float64)
beta2 = tf.get_variable(shape=[1], name='beta2', dtype=tf.float64)

# Yhat
tf_pred = (tf.multiply(model_X1, beta1) + tf.multiply(model_X2, beta2) + alpha)


# # Make difference of squares
# resid = tf.square(model_Y - tf_pred)
# loss = tf.reduce_sum(resid)

# # Make a Likelihood function based on simple stuff
stddev = tf.square(tf.get_variable(shape=[1],
                                   name='stddev', dtype=tf.float64))
covar = tfd.Normal(loc=model_Y, scale=stddev)
loss = -1.0*tf.reduce_sum(covar.log_prob(tf_pred))



# Trainer
lr=0.005
N_ITER = 20000

opt = tf.train.AdamOptimizer(lr, beta1=0.95, beta2=0.95)
train = opt.minimize(loss)


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    start = time.time()
    for step in range(N_ITER):
        out_l, out_b1, out_b2, out_a, laws = sess.run(
            [train, beta1, beta2, alpha, loss],
            feed_dict={model_X1: X1, model_X2: X2, model_Y: Ytest})

        if step % 500 == 0:
            print('Step: {s}, loss = {l}, alpha = {a:.3f}, beta1 = {b1:.3f}, beta2 = {b2:.3f}'.format(
                s=step, l=laws, a=out_a[0], b1=out_b1[0], b2=out_b2[0]))
    print(f"True: alpha = {true_int}, beta1 = {true_b1}, beta2 = {true_b2}")
    end = time.time()
    print(end-start)

The following output is printed, in case it indicates what is going on:

For the CPU run:

Colocations handled automatically by placer.
2019-04-18 09:00:56.329669: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-18 09:00:56.351151: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-04-18 09:00:56.351672: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x558fefe604c0 executing computations on platform Host. Devices:
2019-04-18 09:00:56.351698: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>

For the GPU run:

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0418 09:03:21.674947 139956864096064 deprecation.py:506] From /home/sadatnfs/.conda/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/training/slot_creator.py:187: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
2019-04-18 09:03:21.712913: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-18 09:03:21.717598: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-04-18 09:03:21.951277: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1009] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-18 09:03:21.952212: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e583bc4480 executing computations on platform CUDA. Devices:
2019-04-18 09:03:21.952225: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Quadro P600, Compute Capability 6.1
2019-04-18 09:03:21.971218: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-04-18 09:03:21.971816: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e58577f290 executing computations on platform Host. Devices:
2019-04-18 09:03:21.971842: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-18 09:03:21.972102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 0 with properties:
name: Quadro P600 major: 6 minor: 1 memoryClockRate(GHz): 1.5565
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.91GiB
2019-04-18 09:03:21.972147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1674] Adding visible gpu devices: 0
2019-04-18 09:03:21.972248: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-04-18 09:03:21.973094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1082] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-18 09:03:21.973105: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1088]      0
2019-04-18 09:03:21.973110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1101] 0:   N
2019-04-18 09:03:21.973279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1222] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1735 MB memory) -> physical GPU (device: 0, name: Quadro P600, pci bus id: 0000:01:00.0, compute capability: 6.1)

Any help would be greatly appreciated! I'm about to post another question about getting CUBLAS working in R, since that also gave me slower timings compared to Intel MKL, but I was hoping there might be a clear reason why even something as well built as TF (compared to a hacky R-plus-CUBLAS patch) runs slowly on the GPU.

Thanks!

EDIT: Following Vlad's suggestion, I wrote the following script to try throwing some larger objects at it and training on them, but I don't think I have set it up correctly, because the CPU environment still comes out ahead even as the matrix sizes increase. Any suggestions?

import time
import warnings
import numpy as np
import scipy

import tensorflow as tf
import tensorflow_probability as tfp

from tensorflow_probability import edward2 as ed
from tensorflow.python.ops import control_flow_ops
from tensorflow_probability import distributions as tfd

np.random.seed(69)

# Handy snippet to reset the global graph and global session.
def reset_g():
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        tf.reset_default_graph()
        try:
            sess.close()
        except:
            pass


# Loop over the different number of feature columns
for x_feat in [30, 50, 100, 1000, 10000]:

    y_feat = 10
    # Simulate data
    N = 5000
    inttest = np.ones(N).reshape(N, 1)
    stddev_raw = np.random.uniform(0.01, 0.25, size=y_feat)

    true_int = np.linspace(0.1 ,1., num=y_feat)
    xcols = x_feat
    true_bw = np.random.randn(xcols, y_feat)
    true_X = np.random.randn(N, xcols)
    true_errorcov = np.eye(y_feat)
    np.fill_diagonal(true_errorcov, stddev_raw)

    true_Y = true_int + np.matmul(true_X, true_bw) + \
        np.random.multivariate_normal(mean=np.array([0 for i in range(y_feat)]),
                                      cov=true_errorcov,
                                      size=N)


    ## Our model is:
    ## Y = a + b*X + error where, for N=5000 observations:
    ## Y : 10 outputs;
    ## X : 30,50,100,1000,10000 features
    ## a, b = bias and weights
    ## error: just... error

    # Number of iterations
    N_ITER = 1001

    # Training rate
    lr=0.005

    with tf.device('gpu'):

        # Create data and weights
        model_X = tf.placeholder(dtype=tf.float64, shape=[N, xcols])
        model_Y = tf.placeholder(dtype=tf.float64, shape=[N, y_feat])

        alpha = tf.get_variable(shape=[y_feat], name='alpha', dtype=tf.float64)
        # these two params need shape of one if using trainable distro
        betas = tf.get_variable(shape=[xcols, y_feat], name='beta1', dtype=tf.float64)


        # Yhat
        tf_pred = alpha + tf.matmul(model_X, betas)

        # Make difference of squares (loss fn) [CONVERGES TO TRUTH]
        resid = tf.square(model_Y - tf_pred)
        loss = tf.reduce_sum(resid)

        # Trainer
        opt = tf.train.AdamOptimizer(lr, beta1=0.95, beta2=0.95)
        train = opt.minimize(loss)


    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    start = time.time()
    for step in range(N_ITER):
        out_l, laws = sess.run([train, loss], feed_dict={model_X: true_X, model_Y: true_Y})

        if step % 500 == 0:
            print('Step: {s}, loss = {l}'.format(
                s=step, l=laws))
    end = time.time()
    print("y_feat: {n}, x_feat: {x2}, Time elapsed: {te}".format(n = y_feat, x2 = x_feat, te = end-start))

    reset_g()
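A side note on the script above: every sess.run call feeds the full true_X and true_Y matrices through feed_dict, which means a fresh host-to-device copy on every step. Below is a minimal sketch of one alternative (assuming the same TF 1.13 graph API; it reuses N_ITER, lr, true_X, true_Y, xcols and y_feat from the loop body, and the *_c names are just illustrative) that binds the data into the graph as constants so it is transferred to the device only once:

with tf.device('gpu'):
    # Bake the training data into the graph so it is copied to the device
    # once at startup instead of being re-fed on every iteration.
    const_X = tf.constant(true_X, dtype=tf.float64)
    const_Y = tf.constant(true_Y, dtype=tf.float64)

    alpha_c = tf.get_variable(shape=[y_feat], name='alpha_c', dtype=tf.float64)
    betas_c = tf.get_variable(shape=[xcols, y_feat], name='betas_c', dtype=tf.float64)

    pred_c = alpha_c + tf.matmul(const_X, betas_c)
    loss_c = tf.reduce_sum(tf.square(const_Y - pred_c))
    train_c = tf.train.AdamOptimizer(lr, beta1=0.95, beta2=0.95).minimize(loss_c)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for step in range(N_ITER):
    # No feed_dict: the only per-step host/device traffic is the fetched loss scalar.
    _, loss_val = sess.run([train_c, loss_c])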

【Comments】:

  • How many CPU cores do you have? Try increasing the number of parameters in the model. There is overhead in invoking GPU kernels and in copying data to and from the GPU. For small models there will be no performance gain from using a GPU.
  • My CPU has 8 cores (Core i7-7700T). Ah, I'll try more features! Thanks!!
  • You only benefit from the GPU when you have large weight matrices. For small matrices it will be faster on the CPU, which runs at a higher clock frequency, and the overhead of launching kernels and copying data only makes things worse. Try multiplying matrices of shape (10000, 10000) x (10000, 10000). You'll see that the GPU is much faster.

Tags: python tensorflow gpu


【Solution 1】:

As I said in the comments, the overhead of invoking GPU kernels and of copying data to and from the GPU is very high. For operations on models with very few parameters it is not worth using the GPU, since CPU cores run at a much higher frequency. If you compare matrix multiplication (which is what deep learning mostly consists of), you will see that for large matrices the GPU significantly outperforms the CPU.

Take a look at this plot. The x-axis is the size of the two square matrices and the y-axis is the time it takes to multiply them on the GPU and on the CPU. As you can see, at the beginning, for small matrices, the blue line is higher, meaning it was faster on the CPU. But as the matrix size grows, the benefit of using the GPU increases significantly.

Code to reproduce:

import tensorflow as tf
import time
cpu_times = []
sizes = [1, 10, 100, 500, 1000, 2000, 3000, 4000, 5000, 8000, 10000]
for size in sizes:
    tf.reset_default_graph()
    start = time.time()
    with tf.device('cpu:0'):
        v1 = tf.Variable(tf.random_normal((size, size)))
        v2 = tf.Variable(tf.random_normal((size, size)))
        op = tf.matmul(v1, v2)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(op)
    cpu_times.append(time.time() - start)
    print('cpu time took: {0:.4f}'.format(time.time() - start))

import tensorflow as tf
import time

gpu_times = []
for size in sizes:
    tf.reset_default_graph()
    start = time.time()
    with tf.device('gpu:0'):
        v1 = tf.Variable(tf.random_normal((size, size)))
        v2 = tf.Variable(tf.random_normal((size, size)))
        op = tf.matmul(v1, v2)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(op)
    gpu_times.append(time.time() - start)
    print('gpu time took: {0:.4f}'.format(time.time() - start))

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(sizes, gpu_times, label='GPU')
ax.plot(sizes, cpu_times, label='CPU')
plt.xlabel('MATRIX SIZE')
plt.ylabel('TIME (sec)')
plt.legend()
plt.show()
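One caveat about these numbers: the timed block also includes graph construction, session startup, and variable initialization, not just the multiplication itself. Below is a small variation (a sketch against the same TF 1.x API, with an illustrative helper name) that does a warm-up run first and then times only repeated sess.run calls of the matmul, which isolates the per-execution kernel-launch and copy overhead more cleanly:

import time
import tensorflow as tf

def time_matmul(device, size, n_runs=10):
    # Build a size x size matmul pinned to the requested device.
    tf.reset_default_graph()
    with tf.device(device):
        v1 = tf.Variable(tf.random_normal((size, size)))
        v2 = tf.Variable(tf.random_normal((size, size)))
        op = tf.matmul(v1, v2)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(op)                   # warm-up: one-off setup cost is excluded
        start = time.time()
        for _ in range(n_runs):
            sess.run(op)               # time only the repeated executions
        return (time.time() - start) / n_runs

for size in [100, 1000, 5000]:
    print('size {}: cpu {:.4f}s, gpu {:.4f}s'.format(
        size, time_matmul('/cpu:0', size), time_matmul('/gpu:0', size)))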

【Comments】:

  • Thanks so much for the code, Vlad! It's really interesting to see how this works!
  • Thanks! I've edited my original post to add a new script I wrote to try training my regression model on the GPU. I tried increasing the dimensions on each loop iteration, but I'm still not seeing gains like in the example you wrote, @Vlad. Any ideas..?
【Solution 2】:

Use tf.device() to select your device:

with tf.device('/cpu:0'):
    # build the tf ops / data for this device here

On a typical system there are multiple computing devices. In TensorFlow, the supported device types are CPU and GPU. They are represented as strings. For example:

"/cpu:0": The CPU of your machine.
"/device:GPU:0": The GPU of your machine, if you have one.
"/device:GPU:1": The second GPU of your machine, etc.

GPU:

with tf.device('/device:GPU:0'):
  #code here: tf data and model
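To check where each op actually ended up, here is a minimal sketch (assuming the TF 1.x Session API) that turns on device placement logging; TensorFlow then prints the device assigned to every op when the session runs:

import tensorflow as tf

# Pin the inputs to the CPU and the matmul to the GPU.
with tf.device('/cpu:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name='a')
    b = tf.constant([[1.0, 1.0], [1.0, 1.0]], name='b')
with tf.device('/device:GPU:0'):
    c = tf.matmul(a, b, name='c')

# log_device_placement=True makes TensorFlow log each op's placement,
# so you can verify the heavy ops really landed on the GPU.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))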

Reference: Link

【Comments】:

【Solution 3】:

Reproducing the answer from the first comment, for tensorflow 2.0+:

import time
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.python.framework import ops

# Run the graph/session-style code from the answer above under TF 2.x.
tf.compat.v1.disable_eager_execution()

cpu_times = []
sizes = [1, 10, 100, 500, 1000, 2000, 3000, 4000, 5000, 8000, 10000]
for size in sizes:
    ops.reset_default_graph()
    start = time.time()
    with tf.device('cpu:0'):
        v1 = tf.Variable(tf.random.normal((size, size)))
        v2 = tf.Variable(tf.random.normal((size, size)))
        op = tf.matmul(v1, v2)

    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        sess.run(op)
    cpu_times.append(time.time() - start)
    print('cpu time took: {0:.4f}'.format(time.time() - start))

gpu_times = []
for size in sizes:
    ops.reset_default_graph()
    start = time.time()
    with tf.device('gpu:0'):
        v1 = tf.Variable(tf.random.normal((size, size)))
        v2 = tf.Variable(tf.random.normal((size, size)))
        op = tf.matmul(v1, v2)

    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        sess.run(op)
    gpu_times.append(time.time() - start)
    print('gpu time took: {0:.4f}'.format(time.time() - start))

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(sizes, gpu_times, label='GPU')
ax.plot(sizes, cpu_times, label='CPU')
plt.xlabel('MATRIX SIZE')
plt.ylabel('TIME (sec)')
plt.legend()
plt.show()

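For comparison, here is a hedged sketch of the same measurement in native TF 2.x eager mode, without any sessions (run it in a fresh interpreter, since the snippet above disables eager execution; the helper name and device strings are illustrative and assume one visible GPU):

import time
import tensorflow as tf

def time_matmul_eager(device, size, n_runs=10):
    # Time n_runs eager matmuls of two size x size matrices on the given device.
    with tf.device(device):
        a = tf.random.normal((size, size))
        b = tf.random.normal((size, size))
        _ = tf.matmul(a, b)            # warm-up run
        start = time.time()
        for _ in range(n_runs):
            c = tf.matmul(a, b)
        _ = c.numpy()                  # pull the result back so async GPU work finishes
        return (time.time() - start) / n_runs

if tf.config.list_physical_devices('GPU'):
    for size in [100, 1000, 5000]:
        print('size {}: cpu {:.4f}s, gpu {:.4f}s'.format(
            size, time_matmul_eager('/CPU:0', size), time_matmul_eager('/GPU:0', size)))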
【Comments】: