【问题标题】:Tensorflow is unable to run code on too many GPUs?TensorFlow 无法在太多 GPU 上运行代码?
【发布时间】:2018-03-30 18:14:08
【问题描述】:

我有以下测试代码:

    import tensorflow as tf
import numpy as np

def body(x):
    a = tf.random_uniform(shape=[2, 2], dtype=tf.int32, maxval=100)
    b = tf.constant(np.array([[1, 2], [3, 4]]), dtype=tf.int32)
    c = a + b
    return tf.nn.relu(x + c)

def condition(x):
    return tf.reduce_sum(x) < 100

x = tf.Variable(tf.constant(0, shape=[2, 2]))

with tf.Session():
    tf.initialize_all_variables().run()
    result = tf.while_loop(condition, body, [x])
    print(result.eval())

当我在我的 GPU 集群上运行它时,我会产生以下错误:

2018-03-30 18:10:33.473913: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10415 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:3d:00.0, compute capability: 6.1)
2018-03-30 18:10:33.591203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10415 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:3e:00.0, compute capability: 6.1)
2018-03-30 18:10:33.688390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10415 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:60:00.0, compute capability: 6.1)
2018-03-30 18:10:33.806845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10415 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:61:00.0, compute capability: 6.1)
2018-03-30 18:10:33.913200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 10415 MB memory) -> physical GPU (device: 4, name: GeForce GTX 1080 Ti, pci bus id: 0000:b1:00.0, compute capability: 6.1)
2018-03-30 18:10:34.018533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 10415 MB memory) -> physical GPU (device: 5, name: GeForce GTX 1080 Ti, pci bus id: 0000:b2:00.0, compute capability: 6.1)
Killed

当我使用CUDA_VISIBLE_DEVICES='6' python script.py 运行脚本时,它会使用 GPU 中止。这可能是什么原因造成的?这可能是有缺陷的 GPU 吗?

nvidia-smi 报告以下内容:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:3D:00.0 Off |                  N/A |
| 28%   21C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:3E:00.0 Off |                  N/A |
| 28%   21C    P8     7W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:60:00.0 Off |                  N/A |
| 28%   24C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:61:00.0 Off |                  N/A |
| 28%   25C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  Off  | 00000000:B1:00.0 Off |                  N/A |
| 28%   19C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  Off  | 00000000:B2:00.0 Off |                  N/A |
| 28%   20C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  Off  | 00000000:DA:00.0 Off |                  N/A |
| 28%   22C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  Off  | 00000000:DB:00.0 Off |                  N/A |
| 28%   21C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Tensorflow 版本为1.7.0,CUDA 版本为9.0.176

【问题讨论】:

  • 您的代码在我的环境中没有产生任何错误。
  • 因此,这是设备的问题?
  • 你能告诉CUDA版本和Tensorflow版本吗?
  • 然后运行nvidia-smi 并显示输出
  • 如果我使用较低的 TF 版本,可以吗?

标签: python tensorflow gpu


【解决方案1】:

问题是我在创建作业以使用这么多 GPU 时没有请求足够的 RAM 空间。要使用 8 个 GPU,您需要大量空间,可能需要 ~60 Gi。

【讨论】:

    猜你喜欢
    • 1970-01-01
    • 2018-04-12
    • 2017-02-08
    • 2018-12-16
    • 1970-01-01
    • 1970-01-01
    • 2018-05-12
    • 1970-01-01
    相关资源
    最近更新 更多