【问题标题】:How can I make TensorFlow train.py use all the available GPUs?
【发布时间】:2018-05-05 08:08:36
【问题描述】:

I am running TensorFlow 1.7 on a local machine with 2 GPUs, each with about 8 GB of memory.

Object-detection training (train.py) works when I use the model "faster_rcnn_resnet101_coco". But when I try to run "faster_rcnn_nas_coco", it shows a "Resource exhausted" error:

Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-05-02 16:14:53.963966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0, 1
2018-05-02 16:14:53.964071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-02 16:14:53.964083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 1 
2018-05-02 16:14:53.964091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N Y 
2018-05-02 16:14:53.964097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 1:   Y N 
2018-05-02 16:14:53.964566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7385 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-05-02 16:14:53.966360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7552 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:03:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
INFO:tensorflow:Restoring parameters from training/model.ckpt-0


Limit:                  7744048333
InUse:                  7699536896
MaxInUse:               7699551744
NumAllocs:                   10260
MaxAllocSize:           4076716032

2018-05-02 16:16:52.223943: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ***********************************************************************************x****************
2018-05-02 16:16:52.223967: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at depthwise_conv_op.cc:358 : Resource exhausted: OOM when allocating tensor with shape[64,672,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I am not sure whether it is using both GPUs simultaneously, because the in-use memory shows as 7699536896. I also tried passing the following to train.py:

python train.py \
    --logtostderr \
    --worker_replicas=2 \
    --pipeline_config_path=training/faster_rcnn_resnet101_coco.config \
    --train_dir=training
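As a quick sanity check (my own arithmetic, not part of the original logs): the allocator's `Limit` and `InUse` figures match the 7385 MB reported for GPU:0 alone, which suggests only one card's memory pool is being exhausted:

```python
# Convert the BFC allocator's byte counts (from the log above) to familiar units.
limit_bytes = 7_744_048_333    # "Limit"  -> 7744048333 / 2**20 ~= 7385 MB, i.e. GPU:0's pool
in_use_bytes = 7_699_536_896   # "InUse"  -> almost the entire single-GPU pool

in_use_gib = in_use_bytes / 2**30
print(round(in_use_gib, 2))    # 7.17  (one ~8 GB card, not two)
```

So the OOM is consistent with the whole NASNet model being placed on a single GPU.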

If 2 GPUs are available, does TensorFlow pick both of them by default, or does it need some flag?
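To my understanding, TF 1.x reserves memory on every *visible* GPU by default (which is why both cards appear in the log), but it places the graph on /gpu:0 unless the training script explicitly shards it. Which devices are visible at all is controlled by `CUDA_VISIBLE_DEVICES`, a sketch:

```shell
# Make both cards visible to TensorFlow (device ids 0 and 1, as in the log).
# Note: visibility alone does NOT shard the model across GPUs; train.py still
# needs a multi-GPU flag to actually use both for compute.
export CUDA_VISIBLE_DEVICES=0,1
echo "$CUDA_VISIBLE_DEVICES"   # prints: 0,1
```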

【问题讨论】:

    Tags: tensorflow gpu object-detection


    【解决方案1】:

    The number of GPUs is specified via worker_replicas. For the NASNet case, try reducing the batch size so the network fits in GPU memory.
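    For example, assuming the pipeline config follows the usual Object Detection API layout, the batch size lives in the `train_config` block (the value below is illustrative, not a recommendation):

    ```
    # fragment of training/faster_rcnn_nas_coco.config (protobuf text format)
    train_config {
      batch_size: 1        # reduce until the NASNet model fits in ~8 GB per GPU
      # ... other training options unchanged ...
    }
    ```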

    【讨论】:

    • Hi, I set export CUDA_VISIBLE_DEVICES=0,1,2 and specified worker_replicas=3, but it still uses only one GPU. I tried setting --num_clones=3 --ps_tasks=1, but got an unpacking ValueError. Any ideas?
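    • For what it's worth, the commonly suggested multi-GPU invocation for the Object Detection API's train.py uses the num_clones/ps_tasks pair rather than worker_replicas (which targets distributed workers). A sketch for the two local GPUs, reusing the flags from the question (untested here):

    ```shell
    # num_clones=2 clones the model graph once per local GPU (two GTX 1070s);
    # ps_tasks=1 keeps the variables on a single parameter-server task so
    # the clones can share them.
    python train.py \
        --logtostderr \
        --num_clones=2 \
        --ps_tasks=1 \
        --pipeline_config_path=training/faster_rcnn_resnet101_coco.config \
        --train_dir=training
    ```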