【Question title】: Keras with TensorFlow backend does not use GPU
【Posted】: 2020-02-18 13:06:37
【Question】:

I installed Keras with the TensorFlow backend by following these instructions:

library(keras)
install_keras(tensorflow = "gpu")

The installation went smoothly, with no error messages.

If I run:

k = backend()
sess = k$get_session()
sess$list_devices()

As far as I understand the output, my GPU appears to be recognized:

[[1]]
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 3277741456357329757)

[[2]]
_DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 14524037525637335634)

[[3]]
_DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 5788527260077506513)

My .profile file looks like this:

export CUDA_HOME=${CUDA_PATH}
export PATH="${CUDA_PATH}/bin:$PATH"
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CUDA_PATH}/lib64/

I can also list all the NVIDIA-related packages:

[ben@Solgaleo ~]$ pacman -Qs nvidia*
local/cuda 10.2.89-3
    NVIDIA's GPU programming toolkit
local/cudnn 7.6.5.32-3
    NVIDIA CUDA Deep Neural Network library
local/lib32-nvidia-utils 440.59-1
    NVIDIA drivers utilities (32-bit)
local/libvdpau 1.3-1
    Nvidia VDPAU library
local/libxnvctrl 440.59-1
    NVIDIA NV-CONTROL X extension
local/nvidia 440.59-8
    NVIDIA drivers for linux
local/nvidia-settings 440.59-1
    Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 440.59-1
    NVIDIA drivers utilities
local/nvtop 1.0.0-2
    An htop like monitoring tool for NVIDIA GPUs
local/opencl-nvidia 440.59-1
    OpenCL implemention for NVIDIA

But when I build a Keras model, some library files are not found:

library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
# reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
# rescale
x_train <- x_train / 255
x_test <- x_test / 255
y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)
model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')

Here is the error output:

2020-02-18 13:45:23.530693: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-18 13:45:23.609674: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-18 13:45:23.610276: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 2070 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:09:00.0
2020-02-18 13:45:23.610420: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64/R/lib::/opt/cuda/lib64/:::/lib:/usr/lib/jvm/java-7-openjdk/jre/lib/amd64/server::/opt/cuda/lib64/
2020-02-18 13:45:23.610508: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64/R/lib::/opt/cuda/lib64/:::/lib:/usr/lib/jvm/java-7-openjdk/jre/lib/amd64/server::/opt/cuda/lib64/
2020-02-18 13:45:23.610597: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64/R/lib::/opt/cuda/lib64/:::/lib:/usr/lib/jvm/java-7-openjdk/jre/lib/amd64/server::/opt/cuda/lib64/
2020-02-18 13:45:23.610680: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64/R/lib::/opt/cuda/lib64/:::/lib:/usr/lib/jvm/java-7-openjdk/jre/lib/amd64/server::/opt/cuda/lib64/
2020-02-18 13:45:23.610761: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64/R/lib::/opt/cuda/lib64/:::/lib:/usr/lib/jvm/java-7-openjdk/jre/lib/amd64/server::/opt/cuda/lib64/
2020-02-18 13:45:23.610842: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64/R/lib::/opt/cuda/lib64/:::/lib:/usr/lib/jvm/java-7-openjdk/jre/lib/amd64/server::/opt/cuda/lib64/
2020-02-18 13:45:23.646497: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-18 13:45:23.646508: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-02-18 13:45:23.647124: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-18 13:45:23.669292: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3794460000 Hz
2020-02-18 13:45:23.670124: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559b541a72c0 executing computations on platform Host. Devices:
2020-02-18 13:45:23.670138: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-02-18 13:45:23.670530: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-18 13:45:23.670542: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      
2020-02-18 13:45:23.982097: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-18 13:45:23.982507: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x559b5423b030 executing computations on platform CUDA. Devices:
2020-02-18 13:45:23.982529: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2070 SUPER, Compute Capability 7.5

Indeed, libcudart.so.10.0 (for example) cannot be found, because it does not exist:

[ben@Solgaleo ~]$ ll /opt/cuda/lib64/libcudart.so*
lrwxrwxrwx 1 root root   20 Dec 31 09:07 /opt/cuda/lib64/libcudart.so -> libcudart.so.10.2.89
lrwxrwxrwx 1 root root   20 Dec 31 09:07 /opt/cuda/lib64/libcudart.so.10 -> libcudart.so.10.2.89
lrwxrwxrwx 1 root root   20 Dec 31 09:07 /opt/cuda/lib64/libcudart.so.10.2 -> libcudart.so.10.2.89
-rwxr-xr-x 1 root root 498K Dec 31 09:07 /opt/cuda/lib64/libcudart.so.10.2.89

So TensorFlow is looking for version 10.0, while I have 10.2 installed.
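
To see at a glance which libraries TensorFlow failed to open, the warnings can be filtered out of a saved copy of the log. This is only a sketch: the function name `missing_cuda_libs` and the log file name `tf.log` are made up for illustration, and it assumes the warning text has exactly the form shown in the output above.

```shell
# Print each shared library that TensorFlow reported it could not load,
# one per line, without duplicates.
# $1 = path to a file holding the TensorFlow log (hypothetical name below).
missing_cuda_libs() {
  # Keep only the "Could not load dynamic library '<name>'" fragments,
  # then take the text between the single quotes.
  grep -o "Could not load dynamic library '[^']*'" "$1" \
    | cut -d"'" -f2 \
    | LC_ALL=C sort -u
}

# Usage (tf.log is an assumed file name):
#   missing_cuda_libs tf.log
```

Every name it prints for the log above ends in `.so.10.0`, which is what points to a CUDA 10.0 expectation rather than a broken library path.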

When I train my model, only the CPU is used.

What went wrong in my Keras/TensorFlow installation? How can I fix it?

Edit: here are the versions of the Keras and TensorFlow R packages:

keras_2.2.5.0
tensorflow_2.0.0

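【Comments】:

  • Looking at the first output, I would say it only recognizes the CPU: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 3277741456357329757). Each TensorFlow version works with a specific CUDA version, so for TF 2.0 you need to downgrade the CUDA libraries (or run the two versions side by side if possible).
  • If that is the case (see also stackoverflow.com/questions/55925068/…), things are not that simple... When I tried to downgrade CUDA, it refused because it could not find gcc7 (my system is up to date, so I run gcc 9). If I downgrade gcc, I am afraid of breaking something...
  • Apparently TensorFlow 2.1.0 supports CUDA 10.1 (tensorflow.org/install/gpu), which might be easier to downgrade to (I have not tried it, though). How do I upgrade to 2.1.0 after TF was installed through the R script?
  • It looks like this can be done by specifying the TF version, see tensorflow.rstudio.com/installation: install_tensorflow(version = "2.1.0"). Regarding the CUDA installation problem, which operating system are you using?
  • Never mind the OS question, I just saw the pacman command. I installed cuda-10.1 with yaourt cuda-10.1, so that any dependencies (such as gcc) get resolved.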

Tags: r tensorflow keras


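【Answer 1】:

Adding a partial answer to the question, based on our discussion in the comments (partial because there are some errors I do not know how to solve, but maybe someone can add to this).

On the original question, TensorFlow appears to see only the CPU: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 3277741456357329757)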
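
The device list in the question contains an XLA_GPU entry but no plain GPU entry, which matches the "Skipping registering GPU devices..." line in the log. A rough way to tell the two apart is to look for `/device:GPU:` in the printed list. The helper name `has_regular_gpu` is made up for this sketch, and it assumes the list has been saved or piped as text in the format shown above.

```shell
# Succeed only if the device list on stdin contains a regular GPU device
# such as /device:GPU:0. Entries like /device:XLA_GPU:0 do not match,
# because the pattern requires "GPU" directly after "/device:".
has_regular_gpu() {
  grep -q '/device:GPU:[0-9]'
}

# Usage (devices.txt is an assumed file holding the printed device list):
#   has_regular_gpu < devices.txt && echo "GPU registered" || echo "no regular GPU"
```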

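TensorFlow versions are tied to specific CUDA versions, so you need to make sure the two are compatible. TF 2.0 expects CUDA 10.0, so always double-check. You can upgrade to TF 2.1 and downgrade CUDA to 10.1 by running install_tensorflow(version = "2.1.0") in R and, on Arch Linux, yaourt cuda-10.1, which gets you the right CUDA version together with all its dependencies.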
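
The pairing can be written down as a small check. This is a sketch under assumptions: the function names are invented for illustration, the table contains only the two pairs mentioned in this thread (TF 2.0 with CUDA 10.0, TF 2.1 with CUDA 10.1), and the CUDA version is assumed to be given in the form pacman prints, such as 10.2.89-3.

```shell
# Print the CUDA major.minor version a TensorFlow release expects.
# Only the two pairs named in this thread are listed; anything else fails.
expected_cuda_for_tf() {
  case "$1" in
    2.0.*) echo "10.0" ;;
    2.1.*) echo "10.1" ;;
    *) return 1 ;;
  esac
}

# $1 = TensorFlow version, $2 = installed CUDA package version.
# Returns 0 on a match, 1 on a mismatch, 2 if the TF version is not in the table.
# Note: "want" is a plain (global) shell variable, to stay POSIX-compatible.
cuda_matches_tf() {
  want=$(expected_cuda_for_tf "$1") || return 2
  case "$2" in
    "$want"|"$want".*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage with the versions from the question:
#   cuda_matches_tf 2.0.0 10.2.89-3 || echo "CUDA does not match this TensorFlow"
```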

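Since CUDA 10.0 you also need to install the TensorRT dependency to use some of the acceleration features (which TensorFlow uses). To do so, download the TensorRT package from NVidia developer downloads (account required) and install it using the AUR repository.

Regarding the progbar error, I am not 100% sure, as I have not seen it before, but it looks like it might be related to tensorboard, so make sure you have the appropriate version of that installed as well.

【Comments】:

  • Regarding the progbar error, apparently I am not the only one with this problem: github.com/rstudio/keras/issues/992. Oddly, calling the function that raises the error a second time "fixes" the problem.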