GKE - Unable to make cuda work with pytorch

【Question title】: GKE - Unable to make cuda work with pytorch
【Posted】: 2019-11-14 01:07:19
【Question】:

I have set up a kubernetes node with an nvidia tesla k80 and followed this tutorial to try to run a pytorch docker image with working nvidia and cuda drivers.

My nvidia drivers and cuda drivers are both accessible inside my pod under /usr/local:

$> ls /usr/local
bin  cuda  cuda-10.0  etc  games  include  lib  man  nvidia  sbin  share  src

My GPU is also recognized by my image, nvidia/cuda:10.0-runtime-ubuntu18.04:

$> /usr/local/nvidia/bin/nvidia-smi
Fri Nov  8 16:24:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8    35W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But after installing pytorch 1.3.0, I cannot get pytorch to recognize my cuda installation, even with LD_LIBRARY_PATH set to /usr/local/nvidia/lib64:/usr/local/cuda/lib64:

$> python3 -c "import torch; print(torch.cuda.is_available())"
False

$> python3
Python 3.6.8 (default, Oct  7 2019, 12:59:55)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print ('\t\ttorch.cuda.current_device()    =', torch.cuda.current_device())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 386, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 192, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 111, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError:
The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.
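
One way to narrow this down is to confirm that the driver library (libcuda.so) is actually present in one of the LD_LIBRARY_PATH directories. The following is only a minimal sketch: the helper name find_libcuda_dirs is hypothetical, and the fallback path is simply the value quoted above.

```python
import glob
import os


def find_libcuda_dirs(search_path):
    """Return the directories in a path list that contain libcuda.so*."""
    hits = []
    for directory in search_path.split(os.pathsep):
        # Empty entries are skipped; only explicitly named directories are checked.
        if directory and glob.glob(os.path.join(directory, "libcuda.so*")):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Falls back to the value quoted in the post when the variable is unset.
    path = os.environ.get(
        "LD_LIBRARY_PATH", "/usr/local/nvidia/lib64:/usr/local/cuda/lib64"
    )
    print(find_libcuda_dirs(path))
```

An empty list would suggest the driver libraries are not where the loader is looking. A non-empty list, as is likely here since nvidia-smi works, points away from a path problem and toward a version mismatch.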

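The error above is strange, because the cuda version of my image is 10.0 and Google GKE states:

The latest supported CUDA version is 10.0

Also, it is GKE's DaemonSet that installs the NVIDIA drivers automatically:

After adding GPU nodes to your cluster, you need to install NVIDIA's device drivers on the nodes.

Google provides a DaemonSet that automatically installs the drivers for you. Refer to the sections below for installation instructions for Container-Optimized OS (COS) and Ubuntu nodes.

To deploy the installation DaemonSet, run the following command: kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml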
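
Once the drivers and the device plugin are in place, GPU nodes normally advertise the nvidia.com/gpu resource under status.allocatable. As a rough sketch, the output of `kubectl get nodes -o json` could be checked as below. The function name gpu_allocatable and the trimmed sample document are hypothetical and not taken from the post.

```python
import json


def gpu_allocatable(nodes_json):
    """Map node name -> allocatable nvidia.com/gpu count.

    Expects the JSON text printed by `kubectl get nodes -o json`.
    """
    result = {}
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes without the resource are reported as 0 GPUs.
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical, heavily trimmed sample of the kubectl output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node"},
         "status": {"allocatable": {"nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node"},
         "status": {"allocatable": {"cpu": "4"}}},
    ]
})
print(gpu_allocatable(sample))
```

A count of 0 on a GPU node would suggest the driver installer or device plugin has not finished, which is a different failure from the one shown in the traceback.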

I have tried everything I could think of, without success...

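【Question comments】:

Does the same container work when run with docker run on a local machine (assuming you have nvidia hardware locally), or on a standalone GCE VM with a GPU?

Tags: kubernetes google-cloud-platform pytorch google-kubernetes-engine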

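【Solution 1】:

I solved my problem by downgrading my pytorch version: I now build my docker image from pytorch/pytorch:1.2-cuda10.0-cudnn7-devel.

I still don't really know why it did not work before. My guess is that pytorch 1.3.0 is not compatible with cuda 10.0.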
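
The number in the error message ("found version 10000") follows CUDA's integer encoding, 1000 * major + 10 * minor, so the driver reports support for CUDA 10.0, consistent with the nvidia-smi output. PyTorch raises this error when the installed build was compiled against a newer CUDA than the driver supports. The sketch below illustrates that comparison. Both helper names are hypothetical, and the "10.1" input is an assumption for illustration only, since the post does not say which CUDA version the 1.3.0 build was compiled against. In a real session that value can be read from torch.version.cuda.

```python
def decode_cuda_version(encoded):
    """Decode CUDA's integer version (1000 * major + 10 * minor) to (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_supports_build(driver_encoded, build_cuda):
    """True if the driver's CUDA level is at least the build's CUDA version string."""
    major, minor = (int(part) for part in build_cuda.split(".")[:2])
    return decode_cuda_version(driver_encoded) >= (major, minor)


print(decode_cuda_version(10000))            # (10, 0)
print(driver_supports_build(10000, "10.0"))  # True
print(driver_supports_build(10000, "10.1"))  # False (assumed build version)
```

If torch.version.cuda reports something newer than 10.0, installing a pytorch build compiled for CUDA 10.0 (as the 1.2-cuda10.0 image above is) would be expected to avoid the assertion.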

【Comments】: