
【Question title】: TensorFlow hangs when running inference with GPU enabled
【Posted】: 2020-04-23 08:16:09
【Question】:

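I'm new to AI and TensorFlow, and I'm trying to use the TensorFlow Object Detection API on Windows.
My current goal is real-time human detection in a video stream.
To do this, I modified one of the Python examples from the TensorFlow Model Garden (https://github.com/tensorflow/models).
At the moment it detects all objects (not just humans) and shows the bounding boxes using OpenCV.

It works fine when I disable the GPU (os.environ["CUDA_VISIBLE_DEVICES"] = "-1"),
but when I enable the GPU and start the script, it hangs on the first frame.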
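
For reference, a minimal sketch of the CPU-only switch mentioned above. The variable is read when CUDA is first initialized in the process, so the sketch sets it before importing TensorFlow, which is the safest ordering. The helper name `hide_gpus` is made up for illustration and is not part of any library.

```python
import os


def hide_gpus():
    """Hide all CUDA devices from this process (hypothetical helper).

    Call this before TensorFlow touches the GPU; doing it before
    `import tensorflow` is the safest ordering.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"


hide_gpus()
# import tensorflow as tf  # import only after the variable is set
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Commenting out the `hide_gpus()` call re-enables the GPU, which makes it easy to switch between the working and the hanging configuration.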

Output:

2020-04-22 16:00:53.597492: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-22 16:00:56.942141: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-04-22 16:00:56.976635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 74.65GiB/s
2020-04-22 16:00:56.989129: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-22 16:00:57.000622: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-22 16:00:57.012247: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-22 16:00:57.020575: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-22 16:00:57.031536: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-22 16:00:57.042564: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-22 16:00:57.066289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-22 16:00:57.075760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-04-22 16:00:59.239211: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-04-22 16:00:59.256577: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1f3f73cd670 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-22 16:00:59.264241: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-22 16:00:59.272280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 960M computeCapability: 5.0
coreClock: 1.176GHz coreCount: 5 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 74.65GiB/s
2020-04-22 16:00:59.281409: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-22 16:00:59.288204: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-22 16:00:59.293112: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-22 16:00:59.298222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-22 16:00:59.305446: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-22 16:00:59.310590: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-22 16:00:59.316250: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-22 16:00:59.324588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0
2020-04-22 16:01:00.831569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-22 16:01:00.839147: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0
2020-04-22 16:01:00.842279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N
2020-04-22 16:01:00.846140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1024 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
2020-04-22 16:01:00.865546: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1f39174cba0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-22 16:01:00.873656: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 960M, Compute Capability 5.0
[<tf.Tensor 'image_tensor:0' shape=(None, None, None, 3) dtype=uint8>]
2020-04-22 16:01:10.876733: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-22 16:01:11.814909: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2020-04-22 16:01:11.852909: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-22 16:01:12.149312: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.04GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-22 16:01:12.179484: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.04GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-22 16:01:12.209036: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-22 16:01:12.237205: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.05GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-22 16:01:12.266147: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-22 16:01:12.295182: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-22 16:01:12.325645: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-22 16:01:12.357550: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-22 16:01:12.405332: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-04-22 16:01:12.436336: W tensorflow/core/common_runtime/bfc_allocator.cc:245] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

This is the code I'm using:

#!/usr/bin/env python
# coding: utf-8

import os
import pathlib

if "models" in pathlib.Path.cwd().parts:
  while "models" in pathlib.Path.cwd().parts:
    os.chdir('..')

import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile

from collections import defaultdict
from io import StringIO
from PIL import Image
from IPython.display import display

import cv2 
cap = cv2.VideoCapture(1)

from object_detection.utils import ops as utils_ops
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util

# patch tf1 into `utils.ops`
utils_ops.tf = tf.compat.v1

# Patch the location of gfile
tf.gfile = tf.io.gfile

# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

def load_model(model_name):
  base_url = 'http://download.tensorflow.org/models/object_detection/'
  model_file = model_name + '.tar.gz'
  model_dir = tf.keras.utils.get_file(
    fname=model_name, 
    origin=base_url + model_file,
    untar=True)

  model_dir = pathlib.Path(model_dir)/"saved_model"

  model = tf.saved_model.load(str(model_dir))
  model = model.signatures['serving_default']

  return model

# List of the strings that is used to add correct label for each box.
PATH_TO_LABELS = 'models/research/object_detection/data/mscoco_label_map.pbtxt'
category_index = label_map_util.create_category_index_from_labelmap(PATH_TO_LABELS, use_display_name=True)

model_name = 'ssd_mobilenet_v1_coco_2017_11_17'
# model_name= 'faster_rcnn_inception_v2_coco_2017_11_08';
detection_model = load_model(model_name)

print(detection_model.inputs)

detection_model.output_dtypes
detection_model.output_shapes

def run_inference_for_single_image(model, image):
    image = np.asarray(image)
    # The input needs to be a tensor, convert it using `tf.convert_to_tensor`.
    input_tensor = tf.convert_to_tensor(image)
    # The model expects a batch of images, so add an axis with `tf.newaxis`.
    input_tensor = input_tensor[tf.newaxis,...]

    # Run inference (it hangs here)
    output_dict = model(input_tensor)

    # All outputs are batches tensors.
    # Convert to numpy arrays, and take index [0] to remove the batch dimension.
    # We're only interested in the first num_detections.
    num_detections = int(output_dict.pop('num_detections'))
    output_dict = {key:value[0, :num_detections].numpy() 
                 for key,value in output_dict.items()}
    output_dict['num_detections'] = num_detections

    # detection_classes should be ints.
    output_dict['detection_classes'] = output_dict['detection_classes'].astype(np.int64)

    # Handle models with masks:
    if 'detection_masks' in output_dict:
        # Reframe the bbox mask to the image size.
        detection_masks_reframed = utils_ops.reframe_box_masks_to_image_masks(output_dict['detection_masks'], output_dict['detection_boxes'],image.shape[0], image.shape[1])      
        detection_masks_reframed = tf.cast(detection_masks_reframed > 0.5,tf.uint8)
        output_dict['detection_masks_reframed'] = detection_masks_reframed.numpy()

    return output_dict

def show_inference(model):
    # the array based representation of the image will be used later in order to prepare the
    # result image with boxes and labels on it.
    ret, image_np = cap.read()

    #percent by which the image is resized
    #scale_percent = 30

    #calculate the scaled dimensions (scale_percent of the original)
    #width = int(image_np.shape[1] * scale_percent / 100)
    #height = int(image_np.shape[0] * scale_percent / 100)

    # dsize
    #dsize = (width, height)

    # resize image
    #image_np = cv2.resize(image_np, dsize)

    # Actual detection.
    output_dict = run_inference_for_single_image(model, image_np)

    # Visualization of the results of a detection.
    vis_util.visualize_boxes_and_labels_on_image_array(
      image_np,
      output_dict['detection_boxes'],
      output_dict['detection_classes'],
      output_dict['detection_scores'],
      category_index,
      instance_masks=output_dict.get('detection_masks_reframed', None),
      use_normalized_coordinates=True,
      line_thickness=8)

    cv2.imshow('object detection', cv2.resize(image_np, (800,600)))

while True:
  show_inference(detection_model)
  if cv2.waitKey(25) & 0xFF == ord('q'):
    cv2.destroyAllWindows()
    break

I have the following versions installed:
Python: 3.7, 64-bit
TensorFlow: 2.2.0-rc3
CUDA: 10.1
cuDNN: 7.6.5.32

I tried this on two different machines:
Machine 1:
- CPU: i7-6700HQ
- RAM: 16 GB
- GPU: NVIDIA GeForce GTX 960M

Machine 2:
- CPU: i5-6400
- RAM: 16 GB
- GPU: NVIDIA GeForce GTX 960

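I'm not sure how to debug this. I tried the same code on two different machines, with almost identical results.
The only difference is how long it takes to hang: machine 1 hangs immediately, while machine 2 takes about 30 seconds.
Machine 2 is able to process the video and detect objects until it hangs.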
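
One way to get more information out of a hang like this is Python's standard `faulthandler` module, which can dump the stack of every thread if a block of code runs longer than a timeout. The sketch below is only an illustration: the `hang_watchdog` name and the 30-second default are assumptions, and the `time.sleep` call stands in for the real `model(input_tensor)` call.

```python
import faulthandler
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=30.0, label="inference"):
    """Dump all Python thread stacks to stderr if the block exceeds timeout_s.

    Yields a dict whose "elapsed_s" entry is filled in when the block finishes.
    """
    timing = {"label": label, "elapsed_s": None}
    # Arms a watchdog thread; it fires even if the main thread is blocked.
    faulthandler.dump_traceback_later(timeout_s, exit=False)
    start = time.perf_counter()
    try:
        yield timing
    finally:
        # Disarm the watchdog once the block has returned (or raised).
        faulthandler.cancel_dump_traceback_later()
        timing["elapsed_s"] = time.perf_counter() - start


with hang_watchdog(timeout_s=30.0) as timing:
    time.sleep(0.01)  # stand-in for: output_dict = model(input_tensor)
print(f"{timing['label']} took {timing['elapsed_s']:.3f} s")
```

If the wrapped call never returns, the traceback printed after the timeout shows which Python line is blocked. It does not show what the native TensorFlow or CUDA code is doing underneath.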

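I looked into the "Allocator (GPU_0_bfc) ran out of memory" warnings.
I tried a few options that limit the amount of GPU memory available to TensorFlow, but that didn't help.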
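
The post does not say which memory options were tried. For readers unfamiliar with them, the two usual TensorFlow 2.x settings are memory growth and a fixed per-GPU memory limit. The sketch below assumes the `tf.config.experimental` API of the TensorFlow 2.x releases discussed here; the helper name `configure_gpu_memory` and the 1024 MB figure are illustrative only.

```python
def configure_gpu_memory(memory_limit_mb=None):
    """Enable memory growth, or cap each GPU at memory_limit_mb if given.

    Returns the names of the GPUs that were configured (empty if none).
    Must run before TensorFlow initializes the GPUs.
    """
    # Imported lazily so the helper can be defined without TensorFlow present.
    import tensorflow as tf

    configured = []
    for gpu in tf.config.experimental.list_physical_devices("GPU"):
        try:
            if memory_limit_mb is None:
                # Allocate GPU memory on demand instead of all at once.
                tf.config.experimental.set_memory_growth(gpu, True)
            else:
                # Expose the GPU as one logical device with a hard cap.
                tf.config.experimental.set_virtual_device_configuration(
                    gpu,
                    [tf.config.experimental.VirtualDeviceConfiguration(
                        memory_limit=memory_limit_mb)],
                )
            configured.append(gpu.name)
        except RuntimeError as err:
            # Raised when the GPU has already been initialized.
            print(f"could not configure {gpu.name}: {err}")
    return configured


# Example calls (run one of them before loading the model):
# configure_gpu_memory()                      # memory growth
# configure_gpu_memory(memory_limit_mb=1024)  # hard limit of 1024 MB
```

As the question notes, settings of this kind did not resolve the hang here, so this is a diagnostic step rather than a fix.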

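Several posts also suggest reducing the batch size.
My understanding is that this only helps when training your own model,
and since I'm using a pretrained model, it doesn't apply here.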

I also tried different models: ssd_mobilenet_v1_coco_2017_11_17 and faster_rcnn_inception_v2_coco_2017_11_08. The result is the same for both.

The last thing I tried was reducing the image size before processing. That didn't help either.
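
The downscaling step corresponds to the commented-out lines in `show_inference`. A small sketch of the size calculation is given here, with the hypothetical helper name `scaled_size`; the `cv2.resize` call is shown only as a comment because it needs a real frame.

```python
def scaled_size(width, height, scale_percent):
    """Return (width, height) scaled to scale_percent, at least 1 px each."""
    new_w = max(1, int(width * scale_percent / 100))
    new_h = max(1, int(height * scale_percent / 100))
    return new_w, new_h


print(scaled_size(640, 480, 30))  # (192, 144)

# With an OpenCV frame (shape is rows x cols x channels):
# dsize = scaled_size(image_np.shape[1], image_np.shape[0], 30)
# image_np = cv2.resize(image_np, dsize)
```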

Any help would be greatly appreciated.

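Update
I also tried this on an RTX 2070 SUPER GPU. There are no memory-allocation warnings, but it still cannot complete a single inference. For completeness, here is the console output [the text "inference start" is printed before running inference; if inference completed, it would print "inference end"]: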

2020-04-24 11:30:16.579805: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-24 11:30:18.916146: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2020-04-24 11:30:18.941805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.785GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-04-24 11:30:18.946134: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-24 11:30:18.951172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-24 11:30:18.954809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-24 11:30:18.957258: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-24 11:30:18.961662: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-24 11:30:18.965553: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-24 11:30:18.978671: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-24 11:30:18.980998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-24 11:30:18.982226: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-04-24 11:30:18.984167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 SUPER computeCapability: 7.5
coreClock: 1.785GHz coreCount: 40 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2020-04-24 11:30:18.987291: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2020-04-24 11:30:18.988809: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2020-04-24 11:30:18.990303: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2020-04-24 11:30:18.991792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2020-04-24 11:30:18.993320: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2020-04-24 11:30:18.996960: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2020-04-24 11:30:18.998497: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-24 11:30:19.000191: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-24 11:30:19.430864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-24 11:30:19.433076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0
2020-04-24 11:30:19.434566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2020-04-24 11:30:19.436400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6281 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
[<tf.Tensor 'image_tensor:0' shape=(None, None, None, 3) dtype=uint8>]
inference start
2020-04-24 11:30:24.728554: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-24 11:30:25.608426: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-04-24 11:30:25.625904: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll

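Update 2
With eager mode disabled, everything runs fine (even on the GPU), but then I can't retrieve the detected objects.
The next thing I tried was running it with a session (as in TensorFlow 1, I believe). Here the session.run() call blocks indefinitely on the GPU. Again, it runs fine on the CPU.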

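【Question comments】:

Hi, just a suggestion. Maybe you could try your code on Google Colab with the device set to GPU (and perhaps upload a single picture to Colab). They offer free access to a K40 with at least 12 GB of memory, which should be enough for inference. If it works, you know it's a memory issue.
Hi, thanks for the reply. I'll try to get it working there.

Tags: python tensorflow object-detection-api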

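【Solution 1】:

If you're using a GPU, try installing tensorflow-gpu. According to the documentation, the tensorflow package you're using should already support the GPU, but you can try specifying it explicitly. Try it in a Python virtual environment first.

Uninstall tensorflow: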

    pip uninstall tensorflow

Uninstall tensorflow-gpu (run this even if you're not sure whether it's installed):

    pip uninstall tensorflow-gpu

Install a specific tensorflow-gpu version:

    pip install tensorflow-gpu==2.0.0

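【Comments】:

I'm using TensorFlow 2. I understand from the documentation (tensorflow.org/install/gpu) that this already includes GPU support. For TensorFlow 1.15 and earlier there was a separate GPU package.
Are you sure you're using the GPU? With the GPU enabled you get these messages: "Adding visible gpu devices: 0" and "Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2", whereas it works fine on the CPU only.
I uninstalled tensorflow and installed tensorflow-gpu 2.0.0, as you suggested. I'm also new to pip and Python. I install things with the following command: python -m pip install ..... Does that matter? Anyway, I ran the code again, and now it gets stuck on the line 'model = tf.saved_model.load(str(model_dir))'. The terminal output looks similar to what I had before. So maybe the way I'm using the model is wrong? The current approach is based on the example in the Object Detection API: it downloads the model from the server and then uses it.
I think that's fine as long as you don't have multiple Python versions installed. You can always check with python --version and python3 --version. What error are you getting now?
I had forgotten to close the previous hung process. It turns out that loading the model simply takes a long time (a few minutes) the first time (I don't know why). When I run it now, it loads in about a second. However, it still gets stuck when running inference: I get 10 processed frames and then it stops. I left it hanging for a while (in case it was a one-time event), but so far (after waiting 2 minutes) it is still stuck. I don't get any errors; it just hangs forever.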
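
Two points raised in these comments can be checked from inside Python: which interpreter and package versions are actually in use (relevant to the `pip` versus `python -m pip` question), and whether TensorFlow sees a GPU at all. The following is a rough sketch, not taken from the thread. The helper names are made up, and it assumes Python 3.8 or newer for `importlib.metadata` (on Python 3.7 the `importlib_metadata` backport offers the same interface).

```python
import sys
from importlib import metadata


def describe_environment(packages=("tensorflow", "tensorflow-gpu")):
    """Report the running interpreter and the installed version of each package."""
    info = {"python": sys.version.split()[0], "executable": sys.executable}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # not installed for this interpreter
    return info


def visible_gpus():
    """List the GPUs TensorFlow can see, or None if TensorFlow is not importable."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [gpu.name for gpu in tf.config.experimental.list_physical_devices("GPU")]


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running the script with the same `python` that is used for `python -m pip install` confirms that the package went into the interpreter that executes the detection script. Calling `visible_gpus()` afterwards returns an empty list when TensorFlow is installed but cannot see a GPU.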