How to run a Python script on a discrete AMD GPU? Answers

[Question title]: How to run Python script on a Discrete Graphics AMD GPU?
[Posted]: 2021-01-13 14:08:59
[Question description]:

What I want to do:

I have a script that finds the prime numbers within a given range:

# Python program to display all the prime numbers within an interval

lower = 900
upper = 1000

print("Prime numbers between", lower, "and", upper, "are:")

for num in range(lower, upper + 1):
   # all prime numbers are greater than 1
   if num > 1:
       for i in range(2, num):
           if (num % i) == 0:
               break
       else:
           print(num)

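I would like to run a script like this on the GPU instead of the CPU, so that it is faster.

The problem:

My Intel NUC NUC8i7HVK does not have an NVIDIA GPU; it has a "Discrete GPU" instead.

If I run this code to check what my GPU is: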

import pyopencl as cl
import numpy as np

a = np.arange(32).astype(np.float32)
res = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

prg = cl.Program(ctx, """
    __kernel void sq(__global const float *a,
    __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] * a[gid];
    }
    """).build()

prg.sq(queue, a.shape, None, a_buf, dest_buf)

cl.enqueue_copy(queue, res, dest_buf)

print(a, res)

I get:

[0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ffb3d492fd0>
[1] <pyopencl.Platform 'Intel(R) OpenCL HD Graphics' at 0x187b648ed80>

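Possible approaches to the problem:

I found a guide that explains step by step how to run this on a GPU. But all the Python libraries that run Python code on the GPU, such as PyOpenGL, PyOpenCL, Tensorflow (Force python script on GPU), PyTorch, etc., are tailored for NVIDIA.

If you have an AMD GPU, all of these libraries ask for ROCm, but as far as I know that software still does not support integrated GPUs or discrete GPUs (see my own answer below).

I found only one guide that discusses this approach, but I could not get it to work.

Is there hope, or am I just trying to do something impossible?

EDIT: reply to @chapelo

If I choose 0, the reply is: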

Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]

If I choose 1, the reply is:

Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.] [  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]

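[Question comments]:

Did you look at the documentation of numba for AMD GPUs using ROCm?
@Lescurel Thanks, that looks like a good starting point. Apparently it is Linux-only, but I can work around that. I will take a look.
@Lescurel, I am now following this guide: shawonashraf.github.io/rocm-tf-ubuntu. By the way, I am doing all this on an Intel NUC8i7HVK ark.intel.com/content/www/us/en/ark/products/126143/… which does not actually have a GPU but two "integrated" GPUs. Am I wasting my time, or can Tensorflow, PyOpenGL, etc. also work with integrated GPUs?
I don't know, I have never tried. Good luck!
I had the same problem; basically I had to go with Nvidia.

Tags: python tensorflow pytorch pyopengl pyopencl

[Solution 1]:

After a lot of research and many attempts, I reached this conclusion:

PyOpenGL: mainly works with NVIDIA. If you have an AMD GPU, you need to install ROCm
PyOpenCL: mainly works with NVIDIA. If you have an AMD GPU, you need to install ROCm
TensorFlow: mainly works with NVIDIA. If you have an AMD GPU, you need to install ROCm
PyTorch: mainly works with NVIDIA. If you have an AMD GPU, you need to install ROCm

I installed ROCm, but if I run rocminfo it returns: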

ROCk module is NOT loaded, possibly no GPU devices
Unable to open /dev/kfd read-write: No such file or directory
Failed to get user name to check for video group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
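
The first two lines of that output point at the kernel side: the ROCk module is not loaded and the /dev/kfd device node does not exist. A minimal sketch for checking the same thing from Python before trying any ROCm-based library follows. The helper name `kfd_status` is hypothetical, and the only assumption is the one the error message itself makes, namely that ROCm expects a readable and writable /dev/kfd.

```python
import os

def kfd_status(path="/dev/kfd"):
    """Return 'missing', 'no-access' or 'ok' for the ROCm compute device node."""
    if not os.path.exists(path):
        return "missing"      # driver module not loaded, or GPU not supported
    if not os.access(path, os.R_OK | os.W_OK):
        return "no-access"    # node exists, but this user cannot open it read-write
    return "ok"

print("/dev/kfd:", kfd_status())
```

A result of "missing" matches the rocminfo output above and suggests that installing more Python packages will not help until the driver recognises the GPU.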

clinfo returns:

Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3212.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

rocm-smi returns:

Segmentation fault

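This is because the official guide says "The integrated GPUs of Ryzen are not officially supported targets for ROCm." Since mine is an integrated GPU, I am out of scope.

I will stop wasting my time and will probably buy an NVIDIA or AMD eGPU (external GPU).

[Comments]:

Do you have an ICD for your GPU?
@chapelo I have the file C:\Windows\System32\OpenCL.dll, and in Linux /etc/OpenCL/vendors/amdocl64.icd
@FrancescoMantovani I believe you have researched this extensively. However, I am not sure this should be considered an answer, especially when you say "I will stop wasting my time and will probably buy an NVIDIA or AMD eGPU (external GPU)". I suggest you edit your post instead. Thanks!
@AndrewNaguib, yes, of course. I am waiting for chapelo's answer, which I think is a good solution. I will test it and I may delete my answer.

[Solution 2]:

pyopencl does work with both your AMD and your Intel GPUs, and you have already verified that your installation is working. Just set the environment variable PYOPENCL_CTX='0' to use the AMD every time without being asked.

Alternatively, instead of using ctx = cl.create_some_context(), you can define the context in your program with: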

platforms = cl.get_platforms()
ctx = cl.Context(
   dev_type=cl.device_type.ALL,
   properties=[(cl.context_properties.PLATFORM, platforms[0])])
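
Platform indices can differ between machines, so it may be safer to select the platform by name. The following is only a sketch: `pick_platform` and `make_context` are hypothetical helper names, and the stand-in objects simply reuse the two platform names printed in the question so that the selection logic can be tried without a GPU. `make_context` relies on `cl.get_platforms()`, `Platform.get_devices()` and `cl.Context(devices=...)` from pyopencl and needs a working driver.

```python
from types import SimpleNamespace

def pick_platform(platforms, wanted):
    """Return the first platform whose name contains `wanted` (case-insensitive)."""
    for platform in platforms:
        if wanted.lower() in platform.name.lower():
            return platform
    raise LookupError(f"no OpenCL platform matching {wanted!r}")

def make_context(wanted="AMD"):
    """Build a context on the matching platform (needs pyopencl and a driver)."""
    import pyopencl as cl  # imported lazily so pick_platform works without it
    platform = pick_platform(cl.get_platforms(), wanted)
    devices = platform.get_devices()
    for device in devices:
        print("using", platform.name, "/", device.name)
    return cl.Context(devices=devices)

# Stand-in objects carrying the platform names reported in the question.
fake = [SimpleNamespace(name="AMD Accelerated Parallel Processing"),
        SimpleNamespace(name="Intel(R) OpenCL HD Graphics")]
print(pick_platform(fake, "intel").name)
```

Printing the device names when the context is created also makes it easy to confirm which GPU a run actually used.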

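Don't take for granted that your AMD GPU is better than your Intel one in every case. I have had cases where the Intel outperformed the other. I think this has to do with the cost of copying data from the CPU out to the other GPU.

That said, I think running your script in parallel will not bring much of an improvement compared with using a better algorithm:

Using a sieve algorithm, find the prime numbers up to the square root of the upper number.
Apply a similar sieve, using the primes from the previous step, to sieve the numbers from the lower bound to the upper bound.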
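
The two steps above can be sketched in plain Python as follows. This is an illustrative variant, not the answerer's code: the function names `small_primes` and `primes_in_range` are made up for this example, and crossing off starts at p*p so that a small lower bound does not remove the primes themselves.

```python
from math import isqrt

def small_primes(limit):
    """Step 1: sieve of Eratosthenes, all primes <= limit."""
    if limit < 2:
        return []
    flags = bytearray([1]) * (limit + 1)
    flags[0] = flags[1] = 0
    for p in range(2, isqrt(limit) + 1):
        if flags[p]:
            # Clear p*p, p*p + p, ... in one slice assignment.
            flags[p * p :: p] = bytes(len(range(p * p, limit + 1, p)))
    return [n for n, flag in enumerate(flags) if flag]

def primes_in_range(lo, hi):
    """Step 2: segmented sieve, all primes in the closed interval [lo, hi]."""
    lo = max(lo, 2)
    if hi < lo:
        return []
    window = bytearray([1]) * (hi - lo + 1)   # window[k] stands for lo + k
    for p in small_primes(isqrt(hi)):
        # First multiple of p inside the window, but never below p*p,
        # so p itself is not crossed off when lo is small.
        start = max(p * p, ((lo + p - 1) // p) * p)
        for multiple in range(start, hi + 1, p):
            window[multiple - lo] = 0
    return [lo + k for k, flag in enumerate(window) if flag]

print(primes_in_range(900, 1000))
```

The work per window is dominated by simple strided writes, which is why a good serial sieve is hard to beat with a naive parallel trial-division loop.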

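Perhaps this is not a good example of an algorithm that can easily be run in parallel, but you are all set up to try another example.

However, to show you how to solve this problem using your GPU, consider the following changes:

The serial algorithm looks like this: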

from math import sqrt

def primes_below(number):
    n = lambda a: 2 if a==0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number//2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            for j in range(i+num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag: yield n(i)

def primes_between(lo, hi):
    primes = list(primes_below(int(sqrt(hi))+1))
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    n = lambda a: 2*a + lo + (0 if lo%2 else 1)
    numbers = [True]*size
    for i, prime in enumerate(primes):
        if i == 0: continue
        start = 0
        while (n(start)%prime) != 0: 
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag: yield n(i)

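This prints the list of primes between 1e6 and 5e6 in 0.64 seconds.

Running your original script on my GPU did not finish in more than 5 minutes. For a problem 10 times smaller, the primes between 1e5 and 5e5, it took about 29 seconds.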
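
The answer does not say how these timings were taken, and the numbers will differ from machine to machine. One possible way to measure a generator function is sketched below; `timed` and `odd_numbers` are hypothetical names, with `odd_numbers` standing in for a function such as primes_between. The important detail is that a generator does no work until it is consumed, so the measurement has to include the call to list().

```python
import time

def timed(generator_fn, *args):
    """Run a generator function to completion; return (items, elapsed seconds)."""
    start = time.perf_counter()
    items = list(generator_fn(*args))   # list() forces the generator to run
    return items, time.perf_counter() - start

def odd_numbers(lo, hi):
    """Hypothetical stand-in for primes_between(lo, hi)."""
    for n in range(lo | 1, hi + 1, 2):
        yield n

items, seconds = timed(odd_numbers, 1, 10)
print(len(items), "items in", round(seconds, 4), "s")
```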

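After modifying the script so that each work item on the GPU divides one odd number (there is no need to test even numbers) by a precomputed list of primes up to the square root of the upper number, stopping once the prime exceeds the square root of the number itself, the same task completes in 0.50 seconds. That is an improvement!

The code is as follows: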

import numpy as np
import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_between_using_cl(lo, hi):

    primes = list(primes_below(int(sqrt(hi))+1))

    numbers_h = np.arange(  lo + (0 if lo&1 else 1), 
                            hi + 1,  # exclusive stop: keeps an odd hi, so len(numbers_h) == size
                            2,
                            dtype=np.int32)

    size = (hi - lo - (0 if hi%2 else 1))//2 + 1

    code = """\
    __kernel 
    void is_prime( __global const int *primes,
                   __global       int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int) (sqrt((float)num) + 1.0);
      for (; *primes; ++primes) {
   
        if (*primes <= max && num % *primes == 0) {
          numbers[gid] = 0;
          return;
        }
      }
    }
    """

    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
       properties=[(cl.context_properties.PLATFORM, platforms[0])])     
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)

    primes_d = cl.array.to_device(queue,
                                  np.array(primes[1:] + [0], # don't need 2; the trailing 0 ends the kernel loop
                                  dtype=np.int32))

    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)

    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]

    yield from array.get()[:length.get()]
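
To check the GPU output without extra libraries, the kernel's per-number test can be mirrored on the CPU. This is a rough sketch under the assumption that the window starts above 1; the names `is_prime_like_kernel` and `cpu_reference` are invented for this example. Its result can be compared with list(primes_between_using_cl(lo, hi)).

```python
from math import isqrt

def is_prime_like_kernel(num, odd_primes):
    """Mirror of the kernel's test: try odd primes until one exceeds sqrt(num) + 1."""
    limit = isqrt(num) + 1
    for p in odd_primes:
        if p > limit:
            break
        if num % p == 0:
            return False
    return True

def cpu_reference(lo, hi, odd_primes):
    """Odd numbers in [lo, hi] that survive the kernel-style test."""
    first = lo if lo % 2 else lo + 1
    return [n for n in range(first, hi + 1, 2)
            if is_prime_like_kernel(n, odd_primes)]

# Odd primes up to sqrt(5000), enough for any window with hi <= 5000.
odd_primes = [p for p in range(3, 72, 2) if all(p % d for d in range(3, p, 2))]
print(cpu_reference(1000, 1100, odd_primes))
```

Like the kernel, this only looks at odd numbers, so 2 is never reported and 1 would pass the test if the window included it.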

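[Comments]:

Check the edit I made; I posted an example at the end. Try running it.
Your code looks promising, but neither script returns anything: snipboard.io/M7N1Dy.jpg . Maybe it is because there is no print() in them? Also, how do I set hi and lo?
The functions I gave you are generators. I assumed you would use them in your program with whatever input values are relevant in each case, and that the output would feed into something useful; merely printing the results is rarely the goal. Call the function from your program with values for hi and lo (such as your lower and upper values) and consume the results one by one, or put them in a list. I thought you knew these simple things.
My knowledge is definitely lower than yours; for example, this is the first time I have seen yield, I did not know it existed. I just want to print the results in the terminal. If I replace yield from with return, it says NameError: name 'primes_below' is not defined. How can I print the numbers from 1 to 1000? I just want to check that your solution works. Thanks
I will post my reply as another answer, to keep it clear and focused.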
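
For readers who have not met generators before, here is a small, self-contained illustration of the two ways of consuming one that the comment above mentions. The functions `countdown` and `countdown_twice` are toy examples and are not part of the answer's code.

```python
def countdown(n):
    """A tiny generator: yields n, n-1, ..., 1."""
    while n > 0:
        yield n
        n -= 1

def countdown_twice(n):
    """`yield from` re-yields every item of another iterable."""
    yield from countdown(n)
    yield from countdown(n)

gen = countdown(3)
print(gen)                        # a generator object; nothing has run yet
for value in gen:                 # consume the results one by one
    print(value)
print(list(countdown_twice(2)))   # or collect them into a list
```

Calling a generator function only creates the generator object, which is why a script that never iterates over it prints nothing. The NameError quoted above is a separate issue: it indicates that primes_below was not defined in the same file as the function that calls it.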

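[Solution 3]:

The following code is an example of a complete Python program, which usually includes:

Import statements
Function definitions
A main() function
The if __name__ == "__main__": section.

I hope this helps you solve your problem.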

import pyprimes
from math import sqrt
import numpy as np

import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_below(number):
    """Generate a list of prime numbers below a specified  `number`"""
    n = lambda a: 2 if a==0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number//2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            if num > limit: break
            for j in range(i+num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag:
            yield n(i)

def primes_between(lo, hi):
    """Generate a list of prime numbers between `lo` and `hi` numbers"""
    primes = list(primes_below(int(sqrt(hi))+1))
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    n = lambda a: 2*a + lo + (0 if lo%2 else 1)
    numbers = [True]*size
    for i, prime in enumerate(primes):
        if i == 0: continue # avoid dividing by 2
        nlo = n(0)
        # slower # start = prime * (nlo//prime + 1) if nlo%prime else 0
        start = 0
        while (n(start)%prime) != 0: 
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag:
            yield n(i)

def primes_between_using_cl(lo, hi):
    """Generate a list of prime numbers between a lo and hi numbers
    this is a parallel algorithm using pyopencl"""
    primes = list(primes_below(int(sqrt(hi))+1))
    size_primes_h = np.array( (len(primes)-1, ), dtype=np.int32)
    numbers_h = np.arange(  lo + (0 if lo&1 else 1), 
                                  hi + 1,  # exclusive stop: keeps an odd hi, so len(numbers_h) == size
                                  2,
                                  dtype=np.int32)
    size = (hi - lo - (0 if hi%2 else 1))//2 + 1
    code = """\
    __kernel 
    void is_prime( __global const int *primes,
                        __global         int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int) (sqrt((float)num) + 1.0);
      for (; *primes; ++primes) {
         if (*primes > max) break;
         if (num % *primes == 0) {
            numbers[gid] = 0;
            return;
         }
      }
    }
    """
    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
        properties=[(cl.context_properties.PLATFORM, platforms[0])])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)
    # The trailing 0 is the sentinel that ends the kernel's `for (; *primes; ++primes)` loop.
    primes_d = cl.array.to_device(queue, np.array(primes[1:] + [0], dtype=np.int32))
    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)
    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]
    yield from array.get()[:length.get()]

def test(f, lo, hi):
    """Test that all prime numbers are generated by comparing with the
    output of the library `pyprimes`"""
    a = filter(lambda p: p>lo, pyprimes.primes_below(hi))
    b = f(lo, hi)
    result = True
    for p, q in zip (a, b):
        if p != q:
            print(p, q)
            result = False
    return result
    
def main():
    lower = 1000
    upper = 5000
    print("The prime numbers between {} and {}, are:".format(lower,upper))
    print()
    for p in primes_between_using_cl(lower, upper):
        print(p, end=' ')
    print()

if __name__ == '__main__':
    main()

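[Comments]:

Thanks a lot @chapelo. But I still have many doubts: I can set platforms[0] or platforms[1], but the script always uses the Radeon; it never uses the Intel. Also, the script only takes up 10% of the GPU, no more: snipboard.io/sNqzWk.jpg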