Posted: 2018-05-22 11:36:54
Question:
I'm using code based on this article to look at GPU acceleration, but all I can see is a slowdown:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import sys

if len(sys.argv) != 3:
    exit("Usage: " + sys.argv[0] + " [cuda|cpu] N(100000-11500000)")

@vectorize(["float32(float32, float32)"], target=sys.argv[1])
def VectorAdd(a, b):
    return a + b

def main():
    N = int(sys.argv[2])
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    elapsed_time = timer() - start

    #print("C[:5] = " + str(C[:5]))
    #print("C[-5:] = " + str(C[-5:]))
    print("Time: {}".format(elapsed_time))

main()
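For comparison, plain NumPy on the host gives a baseline that both targets should be measured against. A minimal sketch (NumPy only, so it runs without a GPU or Numba); note it does one warm-up pass before timing, since timing the very first call to a JIT-compiled function would also include one-time compilation and, for the cuda target, CUDA context setup:

```python
import numpy as np
from timeit import default_timer as timer

N = 11_500_000
A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)

_ = A + B  # warm-up pass: fault in the pages before timing

start = timer()
C = A + B  # NumPy's own vectorized add: the baseline to beat
elapsed = timer() - start

print("NumPy baseline: {:.6f} s".format(elapsed))
```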
Results:
$ python speed.py cpu 100000
Time: 0.0001056949986377731
$ python speed.py cuda 100000
Time: 0.11871792199963238
$ python speed.py cpu 11500000
Time: 0.013704434997634962
$ python speed.py cuda 11500000
Time: 0.47120747699955245
I cannot send larger vectors, because that raises a numba.cuda.cudadrv.driver.CudaAPIError: Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE exception.
The output of nvidia-smi is:
Fri Dec 8 10:36:19 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98 Driver Version: 384.98 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro 2000D Off | 00000000:01:00.0 On | N/A |
| 30% 36C P12 N/A / N/A | 184MiB / 959MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 933 G /usr/lib/xorg/Xorg 94MiB |
| 0 985 G /usr/bin/gnome-shell 86MiB |
+-----------------------------------------------------------------------------+
CPU details:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 58
Model name: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
Stepping: 9
CPU MHz: 3300.135
CPU max MHz: 3700.0000
CPU min MHz: 1600.0000
BogoMIPS: 6600.27
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
The GPU is an Nvidia Quadro 2000D with 192 CUDA cores and 1 GB of RAM.
A more complex operation:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize
import sys

if len(sys.argv) != 3:
    exit("Usage: " + sys.argv[0] + " [cuda|cpu] N()")

@vectorize(["float32(float32, float32)"], target=sys.argv[1])
def VectorAdd(a, b):
    return a * b

def main():
    N = int(sys.argv[2])
    A = np.zeros((N, N), dtype='f')
    B = np.zeros((N, N), dtype='f')
    A[:] = np.random.randn(*A.shape)
    B[:] = np.random.randn(*B.shape)

    start = timer()
    C = VectorAdd(A, B)
    elapsed_time = timer() - start

    print("Time: {}".format(elapsed_time))

main()
Results:
$ python complex.py cpu 3000
Time: 0.010573603001830634
$ python complex.py cuda 3000
Time: 0.3956961739968392
$ python complex.py cpu 30
Time: 9.693001629784703e-06
$ python complex.py cuda 30
Time: 0.10848476299725007
Any idea why?
Comments:
- You are currently paying the cost of migrating the data to the GPU, which is very slow compared to a CPU with its L1 and L2 caches. To take full advantage of the GPU's power, you want to send much larger chunks of work to it.
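That migration cost can be put into rough numbers. A back-of-envelope sketch, assuming an effective host-to-device throughput of about 6 GB/s (a hypothetical figure; the real value depends on the PCIe generation and driver):

```python
# Rough host<->device traffic estimate for C = A + B with N float32 elements.
# The 6 GB/s effective PCIe bandwidth is an assumption, not a measured value.
N = 11_500_000
bytes_per_elem = 4                    # float32
traffic = 3 * N * bytes_per_elem      # copy A and B to the device, copy C back
pcie_bandwidth = 6e9                  # assumed effective throughput, bytes/s
transfer_s = traffic / pcie_bandwidth

print("traffic: {:.0f} MB, transfer alone: ~{:.4f} s".format(traffic / 1e6, transfer_s))
```

Under this assumption the transfers alone cost about 0.023 s, already more than the entire CPU run at N = 11500000 (0.0137 s), so a one-operation-per-element kernel cannot win here even if the GPU computed instantly.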
- Even with arrays 100 times larger, the CPU is still faster, although by a much smaller relative margin.
- The only problem here is your unrealistic performance expectations. Your example is completely memory- and latency-bound, and you are running it on a very modest GPU. You would probably need 6 to 10 orders of magnitude more floating-point operations before you see a useful speed-up.
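The "memory-bound" point can be quantified with arithmetic intensity (floating-point operations per byte of memory traffic). A sketch of the calculation for this kernel:

```python
# Arithmetic intensity of C = A + B on float32 arrays:
# each element does 1 FLOP but touches 12 bytes of memory.
flops_per_elem = 1
bytes_per_elem = 3 * 4                          # read a, read b, write c (float32)
intensity = flops_per_elem / bytes_per_elem     # FLOPs per byte

print("arithmetic intensity: {:.3f} FLOP/byte".format(intensity))
```

At roughly 0.08 FLOP/byte, throughput is set by memory bandwidth rather than compute on both the CPU and the GPU, which is why adding more CUDA cores cannot help this particular kernel.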
- @szabgab Does Nvidia claim a performance improvement for the matrix-addition example? As noted, the GPU configuration matters a lot here. You can raise the compute-to-memory ratio with reduction operations, trigonometric functions, and the like.
- @szabgab: I think you should re-read the blog post you cited. Does it, at any stage, claim that this example yields a performance improvement over plain Python? It isn't my job to supply you with performance-test examples. You asked what is wrong, and the simple answer is: nothing.