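Could not identify NUMA node of platform GPU

【Question title】: Could not identify NUMA node of platform GPU
【Posted】: 2019-04-04 08:26:00
【Question】:

I am trying to get TensorFlow running on my machine, but I keep getting the "could not identify NUMA node" error message.

I am using a Conda environment: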

tensorflow-gpu 1.12.0
cudatoolkit 9.0
cudnn 7.1.2
nvidia-smi reports: driver version 418.43, CUDA version 10.1

Here is the error output:

>>> import tensorflow as tf
>>> tf.Session()
2019-04-04 09:56:59.851321: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-04 09:56:59.950066: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:950] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2019-04-04 09:56:59.950762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 750 Ti major: 5 minor: 0 memoryClockRate(GHz): 1.0845
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.84GiB
2019-04-04 09:56:59.950794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-04 09:59:45.338767: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-04 09:59:45.338799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-04 09:59:45.338810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-04 09:59:45.339017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1193] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Unfortunately, I have no idea what to do with this error output.
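For reference, the log names the exact sysfs file TensorFlow failed to open, and it also says TensorFlow falls back to node 0, so the process is actually terminated by the later std::bad_alloc. A minimal sketch for inspecting that file by hand follows. The helper names `read_numa_node` and `describe_numa_node` are made up for illustration, and reading a negative value as "no NUMA affinity" is an assumption about how the Linux kernel fills this file.

```python
import os


def read_numa_node(pci_bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node the kernel reports for a PCI device, or None.

    None means the numa_node file could not be read or parsed, which is the
    situation the TensorFlow log in the question describes.
    """
    path = os.path.join(sysfs_root, pci_bus_id, "numa_node")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def describe_numa_node(value):
    """Turn the raw value into a short human-readable message."""
    if value is None:
        return "numa_node file not readable"
    if value < 0:
        # Assumption: a negative value means the kernel has no NUMA
        # affinity recorded for this device.
        return "kernel reports no NUMA affinity for this device"
    return "device is attached to NUMA node %d" % value
```

With the bus ID from the log, a call would look like `print(describe_numa_node(read_numa_node("0000:01:00.0")))`.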

【Comments】:

Tags: python tensorflow keras

【Answer 1】:

I was able to fix it with a new conda environment:

conda create --name tf python=3
conda activate tf
conda install cudatoolkit=9.0 tensorflow-gpu=1.11.0

There is a published table of compatible CUDA/TF combinations. In my case, the combination of cudatoolkit=9.0 and tensorflow-gpu=1.12 inexplicably caused the std::bad_alloc error, whereas cudatoolkit=9.0 with tensorflow-gpu=1.11.0 works fine.
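To confirm that an environment really contains the pair that worked here, the output of `conda list` can be compared against it. The following is only a sketch: `WORKING_COMBO` records nothing more than the combination reported in this answer, and the function names are hypothetical.

```python
# The pair reported as working in this answer. Other combinations are not
# claimed to be broken; they are simply not covered by this check.
WORKING_COMBO = {"cudatoolkit": "9.0", "tensorflow-gpu": "1.11.0"}


def parse_conda_list(text):
    """Map package name -> version from `conda list`-style text."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and conda's header comments
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


def matches_working_combo(packages):
    """True if every package in WORKING_COMBO is installed at that version."""
    for name, wanted in WORKING_COMBO.items():
        installed = packages.get(name)
        if installed is None:
            return False
        # Accept an exact match or a more specific patch release of it.
        if installed != wanted and not installed.startswith(wanted + "."):
            return False
    return True
```

Fed the package list from the question (tensorflow-gpu 1.12.0, cudatoolkit 9.0), `matches_working_combo` returns False. After the three commands above it should return True.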

【Comments】:

【Answer 2】:

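I ran into the same problem and finally found out that it happens because you are using Adam to optimize the model. Once you switch to another optimizer, it should work.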
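Trying a different optimizer is easier if the choice lives in one place. This is a rough sketch assuming the TensorFlow 1.x `tf.train` optimizer classes, the version family used in this thread. The helper `make_optimizer`, the short names, and the default learning rate are all illustrative. Note that the log in the question shows the crash right at `tf.Session()`, before any optimizer is built, so this may not apply to every case.

```python
# Short name -> class name in tf.train (TensorFlow 1.x).
OPTIMIZER_CLASS_NAMES = {
    "adam": "AdamOptimizer",
    "sgd": "GradientDescentOptimizer",
    "rmsprop": "RMSPropOptimizer",
}


def make_optimizer(name, learning_rate=0.01):
    """Return a tf.train optimizer selected by a short name."""
    key = name.lower()
    if key not in OPTIMIZER_CLASS_NAMES:
        raise ValueError("unknown optimizer: %r" % name)
    # Imported lazily so the name check above runs before TensorFlow loads.
    import tensorflow as tf
    optimizer_cls = getattr(tf.train, OPTIMIZER_CLASS_NAMES[key])
    return optimizer_cls(learning_rate=learning_rate)
```

Replacing `make_optimizer("adam")` with `make_optimizer("sgd")` in the training script is then a one-word change.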

【Comments】: