【Posted】: 2021-03-10 18:04:30
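【Question】:
I installed TensorFlow through Anaconda. It worked well and had been detecting my GPU for some time. Then, a few days ago, every environment with TensorFlow suddenly stopped detecting the GPU. Does anyone know what I should check?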
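What I have tried:
- Created a brand-new environment with python=3.7 and installed tensorflow-gpu=2.1
- Reinstalled Anaconda
- Created a brand-new environment with python=3.6 and installed tensorflow-gpu=1.9
- Installed tensorflow-gpu=2.3 together with the missing cudatoolkit=10.1 and cudnn=7.6
- Installed tensorflow-gpu with a specific build number, as suggested in an open GitHub issue
- Set the environment variable CUDA_VISIBLE_DEVICES to 0 from Python (see "TensorFlow : failed call to cuInit: CUDA_ERROR_NO_DEVICE")
- Updated my graphics driver
- Deleted the registry entry I had modified, HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrDelay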
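For reference, a minimal sketch of the CUDA_VISIBLE_DEVICES step from the list above. It assumes the variable is set from Python before TensorFlow initializes CUDA; doing it before the first import is the safe choice. The helper name `select_gpu` is hypothetical and not part of TensorFlow.

```python
import os

def select_gpu(index=0):
    """Expose only the GPU with this index to CUDA (hypothetical helper)."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]

# Call this before the first `import tensorflow`, so the value is already
# in place when TensorFlow initializes CUDA.
print(select_gpu(0))
```

A value that matches no device (for example -1) hides every GPU and typically produces the same CUDA_ERROR_NO_DEVICE message, so it may be worth confirming that the variable is not set to such a value elsewhere on the system.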
My test script for checking which devices are detected:
import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
This is the output I get in every configuration:
> python check.py
2021-03-10 18:48:12.880629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2021-03-10 18:48:14.637784: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2021-03-10 18:48:19.201572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2021-03-10 18:48:19.705910: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-03-10 18:48:19.715756: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: NB-170
2021-03-10 18:48:19.721085: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: NB-170
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 10539449374211484676
]
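The decisive line in this output is the cuInit failure: cudart64_101.dll and nvcuda.dll both load, but the driver then reports no usable device. A small sketch for pulling that error code out of a captured log, assuming the log format shown above; `find_cuinit_error` is a hypothetical helper, not a TensorFlow function.

```python
import re

# Matches the error name that follows "failed call to cuInit:" in the log.
CUINIT_RE = re.compile(r"failed call to cuInit: (CUDA_ERROR_[A-Z0-9_]+)")

def find_cuinit_error(log_text):
    """Return the CUDA error name from a cuInit failure line, or None."""
    match = CUINIT_RE.search(log_text)
    return match.group(1) if match else None

sample = (
    "2021-03-10 18:48:19.705910: E tensorflow/stream_executor/cuda/"
    "cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: "
    "no CUDA-capable device is detected"
)
print(find_cuinit_error(sample))
```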
System information
- OS: Windows 10 Pro (version 10.0.18363, Build 18363)
- Graphics card: NVIDIA GeForce GTX 1650
- Anaconda 1.10
- Registry: changed HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrDelay to 15 in order to train Matterport's Mask R-CNN implementation
- Graphics driver: GeForce Game Ready Driver, version 461.72 WHQL; release date: 2021.2.25; operating system: Windows 10 64-bit; language: English
My nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 461.72 Driver Version: 461.72 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1650 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 54C P8 6W / N/A | 132MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
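When comparing machines, the two version numbers in the nvidia-smi header are the useful part. A minimal sketch for extracting them, assuming the header layout shown above; `parse_smi_header` is a hypothetical helper, and the sample string is copied from this output rather than read from the tool.

```python
import re

# Captures the driver version and the CUDA version from the header line.
HEADER_RE = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")

def parse_smi_header(text):
    """Return (driver_version, cuda_version) as strings, or None if absent."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

header = "| NVIDIA-SMI 461.72 Driver Version: 461.72 CUDA Version: 11.2 |"
print(parse_smi_header(header))
```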
Update 1 (2021-03-14)
I did a fresh Anaconda install and created an environment (conda create --name tf-gpu tensorflow-gpu=2.1) on another computer I own. On that machine the GPU is detected without any problems.
2021-03-14 14:21:33.934222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2021-03-14 14:21:37.608844: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
2021-03-14 14:21:37.612173: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2021-03-14 14:21:37.658982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 970 computeCapability: 5.2
coreClock: 1.253GHz coreCount: 13 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 208.91GiB/s
2021-03-14 14:21:37.659525: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2021-03-14 14:21:38.216002: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2021-03-14 14:21:38.625300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2021-03-14 14:21:38.660856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2021-03-14 14:21:38.971988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2021-03-14 14:21:39.247585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2021-03-14 14:21:39.564512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2021-03-14 14:21:39.565268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-14 14:21:41.272007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-14 14:21:41.272272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2021-03-14 14:21:41.272582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2021-03-14 14:21:41.283835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/device:GPU:0 with 2993 MB memory) -> physical GPU (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 17009642916451828901
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 3139148187
locality {
bus_id: 1
links {
}
}
incarnation: 5677250807137925801
physical_device_desc: "device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2"
]
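【Comments】:
-
Which version of the NVIDIA driver do you have installed?
-
GeForce Game Ready Driver: version 461.72 WHQL; release date: 2021.2.25; operating system: Windows 10 64-bit; language: English
-
For CUDA 11.2 you need cuDNN 8.0; you can find the list here: developer.nvidia.com/rdp/cudnn-archive
-
Don't forget to add them to your PATH
-
I am installing CUDA and cuDNN through Anaconda. If I am reading my output correctly, CUDA loads fine, so a wrong PATH should not be the problem. The CUDA version shown in the nvidia-smi output is not the installed version but the highest version the graphics driver supports. So my graphics driver supports CUDA up to version 11.2. (See the first comment on this question: stackoverflow.com/q/53422407/4295853)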
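The reasoning in the last comment can be sketched as a simple version comparison: the toolkit installed in the conda environment (10.1 here) only has to be no newer than the maximum CUDA version the driver reports (11.2 here). The function names below are hypothetical, and the check is a simplification that looks only at this version ceiling.

```python
def parse_version(text):
    """Turn a dotted version string such as '10.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def driver_supports(toolkit_version, driver_max_version):
    """Return True if the toolkit version does not exceed the driver's maximum."""
    return parse_version(toolkit_version) <= parse_version(driver_max_version)

# cudatoolkit from the conda environment vs. "CUDA Version" from nvidia-smi
print(driver_supports("10.1", "11.2"))
```

Versions are compared as integer tuples rather than strings so that, for example, 10.1 sorts below 11.2 and 9.2 sorts below 10.0.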
Tags: tensorflow anaconda