【Question title】: TensorFlow CUDA_ERROR_UNKNOWN on Google Cloud Platform
【Posted】: 2021-07-05 17:22:58
【Question】:

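I deployed a virtual machine from the Deep Learning VM image with a Tesla A100 GPU, TensorFlow Enterprise 2.5, and CUDA 11.0. However, I cannot access the GPU/CUDA and get the following error:

E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

During deployment I received this warning:

tensorflow has resource level warnings. The resource 'projects/click-to-deploy-images/global/images/tf-2-5-cu110-v20210619-debian-10' is deprecated. A suggested replacement is 'projects/click-to-deploy-images/global/images/tf-2-5-cu110-v20210624-debian-10'.

This is an existing image built by Google that many people use, so why can't I access the GPU or CUDA with it?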

>>> import tensorflow as tf
2021-07-05 17:05:14.901743: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.__version__
'2.5.0'
>>> print(tf.config.list_physical_devices())
2021-07-05 17:05:44.757638: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-05 17:05:44.840142: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-07-05 17:05:44.840245: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: deeplearning-1-vm
2021-07-05 17:05:44.840258: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: deeplearning-1-vm
2021-07-05 17:05:44.841760: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 450.80.2
2021-07-05 17:05:44.841820: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 450.80.2
2021-07-05 17:05:44.841833: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 450.80.2
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
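
The `failed call to cuInit` line in the log above comes from the CUDA driver API, so one way to tell whether the failure is specific to TensorFlow is to call `cuInit` directly. A minimal sketch, assuming `libcuda.so.1` is on the loader path (as the log suggests); the helper names `try_cu_init`, `describe`, and `report` are illustrative and not part of any library:

```python
import ctypes

# A few CUresult codes from the CUDA driver API header (cuda.h).
CU_RESULTS = {
    0: "CUDA_SUCCESS",
    100: "CUDA_ERROR_NO_DEVICE",
    999: "CUDA_ERROR_UNKNOWN",
}


def describe(code):
    """Map a CUresult integer to its symbolic name where known."""
    return CU_RESULTS.get(code, f"CUresult {code}")


def try_cu_init(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly.

    Returns the CUresult integer, or None if the driver library
    cannot be loaded at all.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    return libcuda.cuInit(0)


def report():
    """Print the outcome of a direct cuInit call (intended to run on the VM)."""
    code = try_cu_init()
    if code is None:
        print("libcuda.so.1 could not be loaded")
    else:
        print("cuInit returned", describe(code))
```

Calling `report()` on the affected VM prints the raw result. If it also reports `CUDA_ERROR_UNKNOWN`, that would suggest the problem sits in the driver setup rather than in TensorFlow itself.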

The following details may help identify the problem.

a_k@deeplearning-1-vm:~$ nvidia-smi
Mon Jul  5 17:03:43 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    56W / 400W |      0MiB / 40537MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
a_k@deeplearning-1-vm:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0

a_k@deeplearning-1-vm:~$ cat /usr/local/cuda/version.txt
CUDA Version 11.0.207

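【Comments】:

  • This may be of interest.
  • I tried several of the prebuilt Deep Learning VM instances provided by Google Cloud, with TF 2.5, TF 2.4, and TF 2.3. None of them worked for me. Reinstalling TensorFlow or CUDA on these prebuilt instances produces installation, dependency, or permission errors. This solution did not work for me.

Tags: tensorflow google-cloud-platform google-dl-platform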


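【Solution 1】:

The problem is that the NVIDIA driver, CUDA, and TensorFlow versions on all the prebuilt instances provided by Google Cloud Platform are incompatible (TF 2.5 needs CUDA >= 11.2). I solved it by reinstalling the latest version of CUDA on the prebuilt instance (TensorFlow Enterprise 2.5, CUDA 11.0), and it now works even after restarting the instance. Google needs to update their prebuilt VM images to resolve this.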
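
To see which versions are actually in play, the driver's CUDA version from `nvidia-smi` can be put next to the CUDA version TensorFlow was built against. A rough sketch: `parse_smi_header` and `report_versions` are hypothetical helpers, the regular expression assumes the header layout shown in the `nvidia-smi` output in this thread, and `tf.sysconfig.get_build_info()` is assumed to be available (TF 2.3 or later). The sketch only reports versions; it does not decide whether they are compatible.

```python
import re
import subprocess

# Matches the header line of nvidia-smi, e.g.
# "| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |"
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_smi_header(text):
    """Return (driver_version, cuda_version) from nvidia-smi output, or None."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


def report_versions():
    """Print driver-side and TensorFlow-side CUDA versions (run on the VM)."""
    import tensorflow as tf  # imported here so the parser works without TF

    smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    print("nvidia-smi (driver, CUDA):", parse_smi_header(smi))

    build = tf.sysconfig.get_build_info()
    # .get() is used because CPU-only builds may not carry these keys.
    print("TF built against CUDA:", build.get("cuda_version"))
    print("TF built against cuDNN:", build.get("cudnn_version"))
```

Running `report_versions()` before and after the reinstall gives a compact record of what changed on the instance.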

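This discussion helped me find the solution. To reinstall CUDA I did not uninstall anything; I simply followed these instructions exactly (they are for Debian 10). Although I have Ubuntu 18.04, it still worked. The installer also asks whether you want to uninstall the previous CUDA version (yes!).

Now I have the following: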

a_k@a100-tfe25-vm:~$ nvidia-smi
Tue Jul  6 09:56:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    52W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

a_k@a100-tfe25-vm:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jun__2_19:15:15_PDT_2021
Cuda compilation tools, release 11.4, V11.4.48
Build cuda_11.4.r11.4/compiler.30033411_0

a_k@a100-tfe25-vm:~$ python3
Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-07-06 09:57:08.277452: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.__version__
'2.5.0'
>>> tf.config.list_physical_devices()
2021-07-06 09:57:30.897584: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-06 09:57:31.689883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-06 09:57:31.689997: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-06 09:57:31.696712: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-06 09:57:31.696809: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-07-06 09:57:31.699051: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-07-06 09:57:31.699981: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-07-06 09:57:31.734585: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2021-07-06 09:57:31.735833: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-07-06 09:57:31.738230: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-07-06 09:57:31.743485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
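
The same check can be scripted so that it fails loudly when no GPU is visible, which is handy after a reboot. A small sketch; `group_by_type` and `require_gpu` are illustrative names, and `tf.config.list_physical_devices()` is the only TensorFlow call used:

```python
from collections import defaultdict


def group_by_type(devices):
    """Group device names by their device_type attribute ('CPU', 'GPU', ...)."""
    grouped = defaultdict(list)
    for device in devices:
        grouped[device.device_type].append(device.name)
    return dict(grouped)


def require_gpu():
    """Raise if TensorFlow sees no GPU; otherwise return the GPU device names."""
    import tensorflow as tf  # imported lazily; needed only on the VM

    grouped = group_by_type(tf.config.list_physical_devices())
    if not grouped.get("GPU"):
        raise RuntimeError(f"no GPU visible to TensorFlow: {grouped}")
    return grouped["GPU"]
```

On the VM, `require_gpu()` either returns the list of GPU device names or raises with the devices TensorFlow did find.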


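    【Solution 2】:

    According to the fix provided in this Google Cloud Platform public forum, the issue can be mitigated in one of the following ways:

    • Fix #1: Use the latest DLVM image (M74 or later) in a new VM instance. A fix has been released in the DLVM images as of M74, so new instances are no longer affected by this issue.
    • Fix #2: Patch existing instances that run images older than M74:

    Run the following commands in an SSH session on the affected instance: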

    gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
    
    chmod +x /tmp/restart_patch.sh
    
    sudo /tmp/restart_patch.sh
    
    sudo service jupyter restart
    

    This only needs to be done once; it does not need to be rerun each time the instance is rebooted.
