Why does Keras not see my GPU while TensorFlow does?

【Question title】: Why does Keras not see my GPU while TensorFlow does?
【Posted】: 2019-05-10 19:42:23
【Question description】:

Following an answer from SO, I ran:

# confirm TensorFlow sees the GPU
from tensorflow.python.client import device_lib
assert 'GPU' in str(device_lib.list_local_devices())

# confirm Keras sees the GPU
from keras import backend
assert len(backend.tensorflow_backend._get_available_gpus()) > 0

# confirm PyTorch sees the GPU
from torch import cuda
assert cuda.is_available()
assert cuda.device_count() > 0
print(cuda.get_device_name(cuda.current_device()))

The first test passes, but the others fail.

Running nvcc --version gives:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

nvidia-smi also works fine.

list_local_devices() gives:

[name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 459307207819325532,
 name: "/device:XLA_GPU:0" device_type: "XLA_GPU" memory_limit: 17179869184 locality { } incarnation: 9054555249843627113 physical_device_desc: "device: XLA_GPU device",
 name: "/device:XLA_CPU:0" device_type: "XLA_CPU" memory_limit: 17179869184 locality { } incarnation: 5902450771458744885 physical_device_desc: "device: XLA_CPU device"]
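Note that this list contains no device whose type is exactly "GPU", only "XLA_GPU". The first assertion in the question still passes because it does a substring search, and "XLA_GPU" contains "GPU". A minimal sketch of the difference, using an abridged copy of the output above as sample text (the names `DEVICE_DUMP`, `device_types` and `has_real_gpu` are made up for this illustration):

```python
import re

# Abridged copy of the list_local_devices() output shown above.
DEVICE_DUMP = '''
name: "/device:CPU:0" device_type: "CPU"
name: "/device:XLA_GPU:0" device_type: "XLA_GPU"
name: "/device:XLA_CPU:0" device_type: "XLA_CPU"
'''


def device_types(dump):
    """Return every device_type value found in a device list dump."""
    return re.findall(r'device_type:\s*"([^"]+)"', dump)


def has_real_gpu(dump):
    """True only if some device has the type exactly "GPU"."""
    return "GPU" in device_types(dump)


print(device_types(DEVICE_DUMP))
print("substring check:", "GPU" in DEVICE_DUMP)
print("exact type check:", has_real_gpu(DEVICE_DUMP))
```

With a live TensorFlow 1.x install, the same exact-match check could presumably be written against the `device_type` attribute of each entry returned by `device_lib.list_local_devices()` rather than against the printed text.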

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True)) returns:

Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device

Why can't Keras and PyTorch run on my GPU? (RTX 2070)

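【Comments】:

Which Keras version is this?
Actually, it does not work with tf either: tf.test.is_gpu_available() returns False
@ParitoshSingh Keras is 2.2.4
Oh, okay. If it does not work with TensorFlow either, then you need to install TensorFlow for GPU. That involves more steps than just a pip install.
What do you mean? I installed tensorflow-gpu with pip

Tags: python tensorflow keras gpu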

【Solution 1】:

I had a hard time tracking down the problem. What finally gave me good insight was running the CUDA samples:

CUDA error at ../../common/inc/helper_cuda.h:1162 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"

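With sudo:

MapSMtoCores for SM 7.5 is undefined. Default to use 64 Cores/SM
GPU Device 0: "GeForce RTX 2070" with compute capability 7.5

Since the sample failed as a regular user but worked with sudo, the problem was that my CUDA libraries were not readable by every user.

My error was fixed with: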

sudo chmod -R a+r /usr/local/cuda*
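To see which files are affected before (or after) running the command above, one could scan the CUDA directories for entries that lack the "others read" permission bit. This is only a sketch under assumptions: the helper name `find_not_world_readable` is hypothetical, the check looks at mode bits only (not ACLs), and it assumes a POSIX system with CUDA under /usr/local/cuda* as in the answer.

```python
import glob
import os
import stat


def find_not_world_readable(root):
    """Yield paths under root whose mode lacks the 'others read' bit."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # entry vanished or cannot be inspected
            if stat.S_ISLNK(mode):
                continue  # a symlink's own mode bits are not meaningful here
            if not mode & stat.S_IROTH:
                yield path


if __name__ == "__main__":
    for root in glob.glob("/usr/local/cuda*"):
        for path in find_not_world_readable(root):
            print(path)
```

An empty result suggests that read permissions are not the cause; directories that the current user cannot enter are reported themselves but not descended into.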


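【Solution 2】:

I ran into this problem recently. It turned out that the required packages installed with pip (e.g. keras) did not include the XLA-related flags. Once I switched to a full miniconda or anaconda install of the required packages, I was able to run my code. In my case, I was running Facebook AI code.

An early indicator that something is wrong is to run: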

nvidia-smi

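and see that your deep network is using only kilobytes of GPU memory rather than gigabytes. Then, even without a warning (which can be hard to find in the logs), you know the problem lies in how the necessary software was compiled. You know this because the GPU does not get a match on device type, so execution defaults to the CPU and the code is offloaded to the CPU.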
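The memory check described above can also be scripted. The sketch below assumes that `nvidia-smi` is on the PATH and uses its CSV query mode, which reports values in MiB; the helper names `parse_memory` and `gpu_memory` are made up for this example.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' lines (in MiB) into a list of (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Return per-GPU (used_mib, total_mib); empty if nvidia-smi is not found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)
```

Calling `gpu_memory()` while training is running and seeing only a few MiB in use on every GPU would match the symptom described in this answer.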

In my case, I used miniconda to install tensorflow-gpu, ipython, imutils, imgaug and a few other packages. If you find that a required package is missing from conda, use:

conda install -c conda-forge <package-name>

to pick up the missing ones, such as imutils and imgaug.
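To find out which packages still need to be installed in the active environment, a small check like the following may help. It is a sketch only: the helper `missing_modules` is hypothetical, and the list uses import names, which can differ from package names (tensorflow-gpu is imported as `tensorflow`, ipython as `IPython`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Import names for the packages mentioned in this answer.
REQUIRED = ["tensorflow", "IPython", "imutils", "imgaug"]
print(missing_modules(REQUIRED))
```

Each name it prints is a candidate for the conda-forge install command above.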
