Why does it take so long to print the value of a GPU tensor in PyTorch? (Answers)

【Question title】: Why does it take so long to print the value of a GPU tensor in PyTorch?
【Posted】: 2021-06-03 05:43:20
【Question description】:

I wrote this PyTorch program to compute a 5000*5000 matrix multiplication on the GPU, for 100 iterations.

import torch
import numpy as np
import time

N = 5000
x1 = np.random.rand(N, N)

######## a 5000*5000 matrix multiplication on GPU, 100 iterations #######
x2 = torch.tensor(x1, dtype=torch.float32).to("cuda:0")

start_time = time.time()
for n in range(100):
    G2 = x2.t() @ x2
print(G2.size())
print("It takes", time.time() - start_time, "seconds to compute")
print("G2.device:", G2.device)

start_time2 = time.time()
# G4 = torch.zeros((5,5),device="cuda:0")
G4 = G2[:5, :5]
print("G4.device:", G4.device)
print("G4======", G4)
# G5=G4.cpu()
# print("G5.device:",G5.device)
print("It takes", time.time() - start_time2, "seconds to transfer or display")

Here is the result on my laptop:

torch.Size([5000, 5000])

It takes 0.22243595123291016 seconds to compute

G2.device: cuda:0

G4.device: cuda:0

G4====== tensor([[1636.3195, 1227.1913, 1252.6871, 1242.4584, 1235.8160],
        [1227.1913, 1653.0522, 1260.2621, 1246.9526, 1250.2871],
        [1252.6871, 1260.2621, 1685.1147, 1257.2373, 1266.2213],
        [1242.4584, 1246.9526, 1257.2373, 1660.5951, 1239.5414],
        [1235.8160, 1250.2871, 1266.2213, 1239.5414, 1670.0034]], device='cuda:0')

It takes 60.13639569282532 seconds to transfer or display

Process finished with exit code 0

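I am confused about why it takes so long to display the variable G4 on the GPU, since it is only 5*5 in size. By the way, when I use "G5=G4.cpu()" to transfer the variable from the GPU to the CPU, that also takes a long time.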

My development environment (a fairly old laptop):

pytorch 1.0.0
CUDA 8.0
NVIDIA GeForce GT 730M
Windows 10 Pro

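When I increase the number of iterations, the compute time does not increase noticeably, but the transfer or display time increases significantly. Why? Can anyone explain this? Thanks a lot.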

【Comments】:

Take a look at discuss.pytorch.org/t/copy-tensor-from-cuda-to-cpu-is-too-slow/…

Tags: pytorch nvidia

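【Solution 1】:

PyTorch CUDA operations are asynchronous. Most operations on GPU tensors are effectively non-blocking until a result derived from them is requested. This means that, until you ask for a CPU version of a tensor, commands such as matrix multiplication are essentially processed in parallel with your code. When you stop the timer, there is no guarantee that the operation has finished. You can read more about this in the docs.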
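The effect described above can be seen by timing the kernel launches and the first CPU read separately. The following is only a rough sketch: the matrix size, iteration count, and variable names (`t_launch`, `t_wait`, `corner`) are assumptions chosen for illustration, and it falls back to the CPU when no CUDA device is available, in which case both phases run synchronously and the split is not meaningful.

```python
import time
import torch

# Assumption: use the first CUDA device if present, otherwise fall back to CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
x = torch.rand(1000, 1000, dtype=torch.float32, device=device)

# Phase 1: on CUDA this loop mostly just queues kernels and returns early.
t0 = time.perf_counter()
for _ in range(10):
    g = x.t() @ x
t_launch = time.perf_counter() - t0

# Phase 2: copying a slice to the CPU has to wait for all queued kernels,
# so on CUDA the outstanding compute time is charged to this step.
t0 = time.perf_counter()
corner = g[:3, :3].cpu()
t_wait = time.perf_counter() - t0

print("launch loop:", t_launch, "seconds")
print("first CPU read:", t_wait, "seconds")
```

On a GPU, the second number typically absorbs most of the real compute time, which matches the 60 seconds attributed to "transfer or display" in the question.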

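To time a block of code correctly, you should add calls to torch.cuda.synchronize. This function should be called twice: once just before you start the timer, and once just before you stop it. Outside of profiling, you should avoid calling this function, because it may reduce overall performance.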
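A minimal sketch of that timing pattern follows. The helper name `timed` and the function `gram_loop` are hypothetical, introduced here only for illustration, and the matrix is smaller than the 5000*5000 one in the question so the sketch runs quickly. The synchronize calls are skipped when the code runs on the CPU.

```python
import time
import torch

def timed(fn, device):
    """Run fn() and return (result, elapsed_seconds).

    Hypothetical helper: on CUDA it synchronizes before starting and
    before stopping the timer, as the answer recommends.
    """
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish any work queued before the timer starts
    start = time.perf_counter()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the kernels launched by fn()
    return result, time.perf_counter() - start

# Assumption: use the first CUDA device if present, otherwise fall back to CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
N = 500  # smaller than the question's 5000, for a quick run
x = torch.rand(N, N, dtype=torch.float32, device=device)

def gram_loop(iterations=10):
    # Same computation as in the question: repeated x^T @ x.
    for _ in range(iterations):
        g = x.t() @ x
    return g

G, seconds = timed(gram_loop, device)
print("It takes", seconds, "seconds to compute")
```

With the synchronization in place, the measured compute time should grow with the iteration count, and printing a small slice afterwards should no longer appear slow.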

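【Discussion】:

Does this mean that the computation in my loop has not actually finished, and only completes when the result is displayed or the value is copied from the GPU to the CPU?
Yes, that is exactly what it means.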