Why is using the GPU slower than using the CPU? [duplicate]

【Question title】: Why is using the GPU slower than using the CPU? [duplicate]
【Posted】: 2021-05-11 14:51:04
【Question description】:

Consider the following network:

%%time
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim

class net_x(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 400)  # a,b,c,d

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        x = self.out(x)
        return x

nx = net_x()

#input
val = 100
t = torch.rand(val, requires_grad = True) #input vector
t = torch.reshape(t, (val,1)) #reshape for batch

#method 
dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)

This outputs:

CPU times: user 11.1 s, sys: 3.52 ms, total: 11.1 s
Wall time: 11.1 s

However, when I switch to the GPU using .to(device):

%%time
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class net_x(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 400)  # a,b,c,d

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        x = self.out(x)
        return x

nx = net_x()
nx.to(device)
#input
val = 100
t = torch.rand(val, requires_grad = True) #input vector
t = torch.reshape(t, (val,1)).to(device) #reshape for batch

#method 
dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)

This outputs:

CPU times: user 18.6 s, sys: 1.5 s, total: 20.1 s
Wall time: 19.5 s

Update 1: Checking how long it takes to move the input and the model to the device:

%%time
nx.to(device)
t.to(device)

This outputs:

CPU times: user 2.05 ms, sys: 0 ns, total: 2.05 ms
Wall time: 2.13 ms

Update 2: It looks like @Gulzar is right. I changed the batch size to 1000 (val=1000). The CPU run reports Wall time: 8min 44s, while the GPU run reports Wall time: 3min 12s.

【Question discussion】:

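Not sure whether it applies here, but it could be due to the added cost of copying data from the CPU to the GPU. Maybe you could also time the copy step separately. In general, the benefits of the GPU outweigh this copying cost at large scale.
Good point. I checked (see my update).
What is your batch size? The data is very small, so GPU parallelization may not bring much benefit for small batches.
@Gulzar Yes, you are right. See my update 2.
This question has been asked hundreds of times (see stackoverflow.com/search?q=gpu+slow+cpu); there is no need for a new question on this topic. My favorite answer is here: stackoverflow.com/questions/55749899/…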

Tags: python parallel-processing neural-network pytorch gpu

【Solution 1】:

Hand-wavy answer

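GPUs are "weaker" computers, with far more compute cores than CPUs.
Every once in a while, data has to be passed from RAM to the GPU's own memory (GRAM) in a "costly" manner so that those cores can process it.

If the data is "large" and the processing can be parallelized over it, computing on the GPU will likely be faster.

If the data is not "big enough", the cost of transferring it, or the cost of using weaker cores and synchronizing them, can outweigh the benefits of parallelization.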
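The trade-off above can be made concrete with a toy cost model. This is only a rough sketch, not a measurement: the function names `cpu_time` and `gpu_time` and every constant in them (per-item cost, core counts, fixed overhead) are made-up assumptions, chosen just to show where a break-even point comes from.

```python
import math

def cpu_time(n_items, per_item=1.0, cores=8):
    # Hypothetical CPU: a few fast cores, no transfer overhead.
    return math.ceil(n_items / cores) * per_item

def gpu_time(n_items, per_item=4.0, cores=1024, overhead=50.0):
    # Hypothetical GPU: many slower cores plus a fixed cost for
    # moving data and launching work (all constants are made up).
    return overhead + math.ceil(n_items / cores) * per_item

for n in (10, 100, 1_000, 10_000):
    winner = "GPU" if gpu_time(n) < cpu_time(n) else "CPU"
    print(f"n={n:>6}: cpu={cpu_time(n):8.1f} gpu={gpu_time(n):8.1f} -> {winner}")
```

With these invented constants the fixed overhead dominates for small inputs and the extra cores win for large ones, which is qualitatively the same pattern as the asker's val=100 versus val=1000 timings.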

When is the GPU useful?

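For larger networks or heavier computations, such as convolutions or larger fully connected layers (larger matrix multiplications).
For larger batches: batches are a very easy way to parallelize computation, because they are (almost*) independent. *Almost, because they do need to be synchronized programmatically at some point.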
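To check the two points above on your own hardware, one possible way is to time a small and a large workload on each available device. The sketch below is an assumption-laden illustration, not the asker's benchmark: the helper `time_forward`, the two-layer model, and the sizes are all hypothetical. It runs one warm-up pass first, because the first CUDA call pays one-time initialisation costs, and it calls `torch.cuda.synchronize()` before reading the clock, because CUDA kernels are launched asynchronously.

```python
import time
import torch

def time_forward(model, x, repeats=5):
    # Average wall time of one forward pass on the device x lives on.
    with torch.no_grad():
        model(x)  # warm-up pass, excluded from the measurement
        if x.is_cuda:
            torch.cuda.synchronize()  # wait for pending GPU work
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
results = {}
# (batch, width) pairs are illustrative sizes, not taken from the question.
for batch, width in [(100, 20), (2048, 1024)]:
    for dev in devices:
        model = torch.nn.Sequential(
            torch.nn.Linear(width, width),
            torch.nn.Tanh(),
            torch.nn.Linear(width, width),
        ).to(dev)
        x = torch.rand(batch, width, device=dev)
        results[(batch, width, dev)] = time_forward(model, x)
        print(f"batch={batch} width={width} {dev}: {results[(batch, width, dev)]:.6f} s")
```

On a machine without CUDA the loop only reports CPU timings. The actual numbers depend entirely on the hardware, so no expected output is given here.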

【Discussion】: