Can't get tensorflow-gpu to work in R due to CUDA issues

【Question title】: Can't get tensorflow-gpu to work in R due to CUDA issues
【Posted】: 2021-01-14 14:13:28
【Question description】:

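I'm trying to get started with Keras, and I have one of the newer Nvidia GPUs, but I can't get it running even though I'm on a fresh install of Ubuntu (20.04).

On my first attempt, I noticed that Ubuntu had detected my graphics card, so I installed its driver through "Additional Drivers". I then installed Keras and TensorFlow with the following commands, which produced no errors.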

install.packages("keras")
library(keras)

install_keras(tensorflow = "gpu")

However, when I try to actually set up a Keras model,

model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

I get this horrible error message:

2021-01-14 09:04:53.188680: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-14 09:04:53.189214: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-14 09:04:53.224466: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-14 09:04:53.224843: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:09:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.785GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-01-14 09:04:53.224860: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-14 09:04:53.226413: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-14 09:04:53.226446: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-14 09:04:53.226935: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-14 09:04:53.227061: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-14 09:04:53.227139: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/arta/.local/share/r-miniconda/envs/r-reticulate/lib:/usr/lib/R/lib:/usr/local/cuda-11.2/lib64:::/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server:/usr/local/cuda-11.2/lib64
2021-01-14 09:04:53.227437: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-14 09:04:53.227513: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-14 09:04:53.227519: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-01-14 09:04:53.228275: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-14 09:04:53.228290: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-14 09:04:53.228293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]

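You may notice that this error message mentions cuda-11.2. However, I got almost the same error message when I used the system default cuda-10.1, which I assume came with the driver.

I tried a lot of things, including downloading cuDNN directly from Nvidia's website and installing it following their documentation, and adding cuda to PATH and LD_LIBRARY_PATH, but to no avail.

Finally, I deleted my r-reticulate conda environment so that I could reinstall TensorFlow from scratch, this time with cuda 11.2 instead of the default 10.1.

I followed the instructions in this blog post, but replaced every instance of 10.1 with 11.2 and libcudnn.so.7 with libcudnn.so.8, since that is the latest version available and the one I downloaded to my system. That led to the error message above, which is almost identical to the one I got with 10.1, my computer's default.

Also, when I tried to use TensorFlow in R again, I noticed something strange. I installed it with install_keras(tensorflow = "gpu") without any apparent problems, but when I call the following command: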

imdb <- dataset_imdb(num_words = 10000)

it starts downloading and installing TensorFlow all over again, and gives me this warning:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-gpu 2.2.0 requires tensorboard<2.3.0,>=2.2.0, but you have tensorboard 2.4.0 which is incompatible.
tensorflow-gpu 2.2.0 requires tensorflow-estimator<2.3.0,>=2.2.0, but you have tensorflow-estimator 2.4.0 which is incompatible.

What should I do? Why is it able to use the correct CUDA installation here:

2021-01-14 09:00:06.766462: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

but unable to find another file elsewhere?

2021-01-14 09:04:53.227139: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/arta/.local/share/r-miniconda/envs/r-reticulate/lib:/usr/lib/R/lib:/usr/local/cuda-11.2/lib64:::/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server:/usr/local/cuda-11.2/lib64

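What do I do now? Why can't I get GPU acceleration to work? My plan is to purge all Nvidia software from Ubuntu, following the instructions in that blog post, and try again with 10.1, since that seems to be the most stable version.

【Question comments】:

Your TF expects CUDA 11.0. You have CUDA 11.2. You can't use CUDA 11.2 as a substitute for CUDA 11.0.
@RobertCrovella If I were to purge all Nvidia CUDA-related software and graphics drivers and start over, which version of CUDA would you recommend for compatibility with libcudnn and TensorFlow?

Tags: r tensorflow keras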

【Solution 1】:

Thanks to @RobertCrovella: because of the version mismatch, I uninstalled CUDA, cuDNN, etc., and reinstalled CUDA 11.0 with cuDNN 8.0. TensorFlow now finds the GPU:

> tensorflow::tf_gpu_configured()
...
tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 8779 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3080, pci bus id: 0000:09:00.0, compute capability: 8.6)
GPU device name:  /device:GPU:0
[1] TRUE
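
The mismatch can be read directly from the dso_loader lines in the question's log: every library loads except libcusolver.so.10. As far as I know, CUDA 11.0 ships libcusolver under the .so.10 name while CUDA 11.1 and later ship it as .so.11, which would explain why a TensorFlow build targeting 11.0 cannot find it in /usr/local/cuda-11.2/lib64. Below is a minimal base-R sketch for pulling that information out of such a log. The helper name summarise_dso_log is made up for illustration, and the log lines are shortened copies of the ones in the question.

```r
# Shortened copies of the dso_loader messages from the question's log.
log_lines <- c(
  "dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0",
  "dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11",
  "dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file",
  "dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8"
)

# Hypothetical helper (not part of tensorflow or keras): report, for each
# dso_loader line, which shared library it names and whether it was loaded.
summarise_dso_log <- function(lines) {
  pattern <- "lib[A-Za-z]+\\.so(\\.[0-9]+)*"
  # Keep only lines that actually mention a shared library.
  lines <- grep(pattern, lines, value = TRUE)
  data.frame(
    library = regmatches(lines, regexpr(pattern, lines)),
    loaded = grepl("Successfully opened", lines, fixed = TRUE),
    stringsAsFactors = FALSE
  )
}

status <- summarise_dso_log(log_lines)
missing_libs <- status$library[!status$loaded]

print(status)
print(missing_libs)
```

The version suffix of each missing library is the thing to compare against the files the installed CUDA toolkit actually provides.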

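【Comments】:

【Solution 2】:

Do I understand correctly that if I install CUDA 11.0, and the cuDNN 8.0 build for CUDA 11.0, all of these errors will go away?

I have installed CUDA 11.2 and found cuDNN 8 for CUDA 11.1. I then installed them along with TensorFlow etc. using pip3 under python3 (3.8, the default on Ubuntu 20.04.1 LTS). In Python it seems to work, but in R it is broken. I created symlinks pointing to the existing versions, and the R code gets to the point where it should use the GPU, but then it aborts with a core dump.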
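
Before resorting to symlinks, it may help to check whether the exact file TensorFlow asks for exists in any directory on the loader path. A symlink can make the name resolve while pointing at a library built for a different CUDA release, which may be why the symlinked setup ends in a core dump. Below is a rough base-R sketch, assuming the search path is a colon-separated string like the LD_LIBRARY_PATH printed in the question's log. find_library is a hypothetical helper, not part of tensorflow or reticulate.

```r
# Hypothetical helper: return every path on a colon-separated search path
# where a shared library with the given file name exists.
find_library <- function(soname, search_path = Sys.getenv("LD_LIBRARY_PATH")) {
  dirs <- strsplit(search_path, ":", fixed = TRUE)[[1]]
  # Drop the empty entries produced by sequences such as ":::".
  dirs <- unique(dirs[nzchar(dirs)])
  candidates <- file.path(dirs, soname)
  candidates[file.exists(candidates)]
}

# Search path shortened from the question's log; adjust for your machine.
question_path <- "/usr/local/cuda-11.2/lib64:::/lib:/usr/lib/x86_64-linux-gnu"
hits <- find_library("libcusolver.so.10", question_path)

if (length(hits) == 0) {
  message("libcusolver.so.10 not found on the search path")
} else {
  print(hits)
}
```

Calling find_library("libcusolver.so.10") with no second argument checks the LD_LIBRARY_PATH of the current R session instead.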

【Comments】:

This would be better posted as a comment than as an answer.