【Question title】: Can't get tensorflow-gpu to work in R due to CUDA issues
【Posted】: 2021-01-14 14:13:28
【Question】:

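I'm trying to get started with Keras, and I have a recent Nvidia GPU, but I can't seem to get it working even though I'm on a fresh install of Ubuntu (20.04).

On my first attempt, I noticed that Ubuntu had detected my graphics card, so I installed its driver through "Additional Drivers". I then installed Keras and TensorFlow with the following commands, which produced no errors.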

install.packages("keras")
library(keras)

install_keras(tensorflow = "gpu")

However, when I try to actually set up a Keras model,

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

I get this dreadful error message:

2021-01-14 09:04:53.188680: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-14 09:04:53.189214: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-14 09:04:53.224466: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-14 09:04:53.224843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.785GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-01-14 09:04:53.224860: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-14 09:04:53.226413: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-14 09:04:53.226446: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-14 09:04:53.226935: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-14 09:04:53.227061: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-14 09:04:53.227139: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/arta/.local/share/r-miniconda/envs/r-reticulate/lib:/usr/lib/R/lib:/usr/local/cuda-11.2/lib64:::/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server:/usr/local/cuda-11.2/lib64
2021-01-14 09:04:53.227437: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-14 09:04:53.227513: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-14 09:04:53.227519: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-01-14 09:04:53.228275: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-14 09:04:53.228290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-14 09:04:53.228293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]   
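In a log this long, the one line that matters is easy to miss. As a minimal sketch (not part of the original post), the following Python snippet separates the libraries that `dso_loader` opened from the ones it could not load. The function name `summarize_dso_log` and the shortened sample lines are illustrative assumptions; only the two message formats are taken from the log above.

```python
import re

# Patterns for the two dso_loader messages that appear in the log above.
LOADED = re.compile(r"Successfully opened dynamic library (\S+)")
MISSING = re.compile(r"Could not load dynamic library '([^']+)'")


def summarize_dso_log(log_text):
    """Return (loaded, missing) library names in order of appearance.

    Hypothetical helper for illustration; it only does text matching.
    """
    loaded, missing = [], []
    for line in log_text.splitlines():
        m = MISSING.search(line)
        if m:
            missing.append(m.group(1))
            continue
        m = LOADED.search(line)
        if m:
            loaded.append(m.group(1))
    return loaded, missing


# Shortened copies of three lines from the log (timestamps and paths trimmed).
sample = """\
I dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
W dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file
I dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
"""

loaded, missing = summarize_dso_log(sample)
print("loaded: ", loaded)
print("missing:", missing)
```

Run against the full log, the `missing` list would contain only `libcusolver.so.10`, which is the single library that makes TensorFlow skip GPU registration here.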

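You may notice that this error message mentions cuda-11.2; however, I got a nearly identical error message when I was using the system default cuda-10.1, which I assume came with the driver.

I have tried many things, including downloading cuDNN directly from Nvidia's website and installing it per their documentation, and adding cuda to PATH and LD_LIBRARY_PATH, but to no avail.

Finally, I deleted my r-reticulate conda environment so I could reinstall TensorFlow from scratch, but with cuda 11.2 instead of the default 10.1.

I followed the instructions in this blog post, but replaced every instance of 10.1 with 11.2 and libcudnn.so.7 with libcudnn.so.8, since that is the latest version available and the one I downloaded to my system. That led to the error message above, which is nearly identical to the one I got with 10.1, my computer's default.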

Also, when I tried to use TensorFlow in R again, I noticed something odd. I installed it with install_keras(tensorflow = "gpu") with no apparent problems, but when I called the following:

imdb <- dataset_imdb(num_words = 10000)

it started downloading and installing TensorFlow all over again, and gave me this warning:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-gpu 2.2.0 requires tensorboard<2.3.0,>=2.2.0, but you have tensorboard 2.4.0 which is incompatible.
tensorflow-gpu 2.2.0 requires tensorflow-estimator<2.3.0,>=2.2.0, but you have tensorflow-estimator 2.4.0 which is incompatible.

What should I do? Why can it use the right CUDA installation here:

2021-01-14 09:00:06.766462: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

but fail to load another library elsewhere?

2021-01-14 09:04:53.227139: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/arta/.local/share/r-miniconda/envs/r-reticulate/lib:/usr/lib/R/lib:/usr/local/cuda-11.2/lib64:::/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server:/usr/local/cuda-11.2/lib64

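What should I do now? Why can't I get GPU acceleration to work? My plan is to follow the instructions in that blog post, purge all Nvidia software from Ubuntu, and retry with 10.1, since that seems to be the most stable version.

【Comments】: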

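  • Your TF expects CUDA 11.0. You have CUDA 11.2. You cannot use CUDA 11.2 as a substitute for CUDA 11.0.
  • @RobertCrovella If I were to purge all Nvidia CUDA-related software and graphics drivers and start over, which version of CUDA would you recommend for compatibility with libcudnn and TensorFlow?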
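The mismatch described in the first comment can be checked directly: the log shows TensorFlow asking for `libcusolver.so.10` while searching `/usr/local/cuda-11.2/lib64`. The sketch below, which is an illustration and not part of the thread, looks up exact sonames in the directories of an `LD_LIBRARY_PATH`-style string. The helper names are hypothetical, the demo uses a temporary directory in place of a real CUDA install, and the presence of a `libcusolver.so.11` file is an assumption made for the demo. The real dynamic loader also consults the `ldconfig` cache and default system paths, which this sketch ignores.

```python
import os
import tempfile


def split_search_path(ld_library_path):
    """Split an LD_LIBRARY_PATH string, dropping empty entries such as those from ':::'."""
    return [p for p in ld_library_path.split(":") if p]


def find_libraries(sonames, search_dirs):
    """Map each soname to the first directory that contains it, or None if absent."""
    found = {}
    for name in sonames:
        found[name] = next(
            (d for d in search_dirs if os.path.exists(os.path.join(d, name))),
            None,
        )
    return found


# Sonames requested by the TensorFlow build, copied from the log in the question.
REQUESTED = [
    "libcudart.so.11.0", "libcublas.so.11", "libcublasLt.so.11",
    "libcufft.so.10", "libcurand.so.10", "libcusolver.so.10",
    "libcusparse.so.11", "libcudnn.so.8",
]

# Demo with a fake lib64 directory that holds only a differently numbered soname
# (assumption for illustration; substitute a real search path to check a machine).
with tempfile.TemporaryDirectory() as fake_lib64:
    open(os.path.join(fake_lib64, "libcusolver.so.11"), "w").close()
    dirs = split_search_path(fake_lib64 + ":::/nonexistent")
    result = find_libraries(["libcusolver.so.10", "libcusolver.so.11"], dirs)
    for name, where in result.items():
        print(name, "->", where or "not found")
```

To inspect an actual machine, one could call `find_libraries(REQUESTED, split_search_path(os.environ.get("LD_LIBRARY_PATH", "")))`. A soname that comes back as `None` points at a library whose version does not match what the installed TensorFlow build was compiled against.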

Tags: r tensorflow keras


【Solution 1】:

Thanks to @RobertCrovella: because of the version mismatch, I uninstalled CUDA, cuDNN, etc., and reinstalled CUDA 11.0 with cuDNN 8.0.

> tensorflow::tf_gpu_configured()
...
tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 8779 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3080, pci bus id: 0000:09:00.0, compute capability: 8.6)
GPU device name:  /device:GPU:0[1] TRUE

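【Comments】:

    【Solution 2】:

    Do I understand correctly that if I install cuda 11.0 and the cuDNN 8.0 build for cuda 11.0, all of these errors will go away?

    I have installed cuda 11.2 and found cuDNN 8 for cuda 11.1. I then installed them, along with tensorflow and so on, using python3 (3.8, the Ubuntu 20.04.1 LTS default) and pip3. In Python it seems to work, but in R it is broken. I have created symlinks pointing to the existing versions, and the R code gets to the point where it should use the GPU, but then it aborts with a core dump.

    【Comments】:

    • This would be better posted as a comment rather than as an answer.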