TensorFlow Object Detection API and bboxes by image frame

[Question title]: TensorFlow Object Detection API and bboxes by image frame
[Posted]: 2018-07-30 23:40:19
[Question]:

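While reading other people's questions and answers under the [python] tag, I came across Banach Tarski's impressive work, "TensorFlow Object Detection API Weird Behavior". I wanted to repeat what he did to get a deeper understanding of the TensorFlow Object Detection API. I followed his steps one by one and, like him, used the Grocery Dataset. I used the faster_rcnn_resnet101 model with default parameters and batch_size = 1.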

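The real difference is that I did not use Shelf_Images with the annotations and bboxes for each class. I used Product_Images instead, which has 10 folders (one per class); each folder contains full-size images of cigarette packs without any background. Product_Images average 600*1200 in size, while Shelf_Images are 3900*2100. So I thought: why not take these whole images, derive the bounding boxes from them, train on that, and get a good result? Incidentally, I did not need to crop the images by hand as Banach Tarski did, because 600*1200 fits the faster_rcnn_resnet101 model and its default input image parameters well.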

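Take an image from the Pall Mall class as an example.

This looked simple, because I could create the bbox from the outline of the image alone. All I had to do was create an annotation for each image and build tf_records from the annotations for training. I used this formula to create the bbox from the image outline: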

x_min = str(1)
y_min = str(1)
x_max = str(img.width - 10)
y_max = str(img.height - 10)
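
As a minimal sketch of the formula above, the box can be computed from the image width and height alone. The helper name `frame_bbox` and the `margin` parameter are assumptions made for illustration; the post itself only shows the four assignments.

```python
def frame_bbox(width, height, margin=10):
    """Return (x_min, y_min, x_max, y_max) as strings for a near-full-frame box.

    Mirrors the formula in the post: the top-left corner is fixed at (1, 1)
    and the bottom-right corner sits `margin` pixels inside the image size.
    """
    if width <= margin + 1 or height <= margin + 1:
        raise ValueError("image is too small for the requested margin")
    return str(1), str(1), str(width - margin), str(height - margin)


# The 811 x 1274 image from the annotation example in the post.
print(frame_bbox(811, 1274))
```

For the 811 x 1274 image this gives the same corner values as the `<bndbox>` in the XML example.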

Example XML annotation:

<annotation>
    <folder>VOC2007</folder>
    <filename>B1_N1.jpg</filename>
    <path>/.../grocery-detection/data/images/1/B1_N1.jpg</path>
    <source>
        <database>The VOC2007 Database</database>
        <annotation>PASCAL VOC2007</annotation>
        <image>flickr</image>
        <flickrid>192073981</flickrid>
    </source>
    <owner>
        <flickrid>tobeng</flickrid>
        <name>?</name>
    </owner>
    <size>
        <width>811</width>
        <height>1274</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>1</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>1</xmin>
            <ymin>1</ymin>
            <xmax>801</xmax>
            <ymax>1264</ymax>
        </bndbox>
    </object>
</annotation>

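After a script iterated over the images in all folders, I had an annotation for each image in the VOC2007 XML format shown above. I then created the tf_records by iterating over each annotation, as TensorFlow does in its pet_running example, and at that point everything looked ready for training on an AWS Nvidia Tesla K80.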
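
The per-image annotation step might look roughly like the sketch below. This is not the script from the post, which is not shown. The function name `make_annotation`, the reduced set of fields, and the fixed depth of 3 (RGB) are assumptions for illustration. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET


def make_annotation(filename, width, height, class_name, margin=10):
    """Build a minimal VOC-style annotation with one near-full-frame box."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename

    # Image size block; depth 3 assumes an RGB image.
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)

    # One object per image: the whole product picture.
    obj = ET.SubElement(root, "object")
    ET.SubElement(obj, "name").text = class_name
    ET.SubElement(obj, "pose").text = "Unspecified"
    ET.SubElement(obj, "truncated").text = "0"
    ET.SubElement(obj, "difficult").text = "0"

    box = ET.SubElement(obj, "bndbox")
    corners = (("xmin", 1), ("ymin", 1),
               ("xmax", width - margin), ("ymax", height - margin))
    for tag, value in corners:
        ET.SubElement(box, tag).text = str(value)
    return root


annotation = make_annotation("B1_N1.jpg", 811, 1274, "1")
print(ET.tostring(annotation, encoding="unicode"))
```

In a real run the width and height would come from the image file and the class name from the folder the image sits in.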

Example of the feature_dict used to create the tf_records:

feature_dict = {
      'image/height': dataset_util.int64_feature(height),
      'image/width': dataset_util.int64_feature(width),
      'image/filename': dataset_util.bytes_feature(
          data['filename'].encode('utf8')),
      'image/source_id': dataset_util.bytes_feature(
          data['filename'].encode('utf8')),
      'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
      'image/encoded': dataset_util.bytes_feature(encoded_jpg),
      'image/format': dataset_util.bytes_feature('jpeg'.encode('utf8')),
      'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
      'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
      'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
      'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
      'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
      'image/object/class/label': dataset_util.int64_list_feature(classes),
      'image/object/difficult': dataset_util.int64_list_feature(difficult_obj),
      'image/object/truncated': dataset_util.int64_list_feature(truncated),
      'image/object/view': dataset_util.bytes_list_feature(poses),
}
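
The four bbox entries are float lists. In TensorFlow's pet and PASCAL VOC conversion scripts, the pixel coordinates are divided by the image width and height, so each value lies between 0 and 1. The sketch below shows that step for the annotation above, using only the standard library. The function name `normalized_boxes` and the trimmed XML string are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the annotation from the post: only the fields used below.
ANNOTATION_XML = """
<annotation>
    <size><width>811</width><height>1274</height></size>
    <object>
        <name>1</name>
        <bndbox><xmin>1</xmin><ymin>1</ymin><xmax>801</xmax><ymax>1264</ymax></bndbox>
    </object>
</annotation>
"""


def normalized_boxes(xml_text):
    """Return four lists (xmins, xmaxs, ymins, ymaxs) scaled to [0, 1]."""
    root = ET.fromstring(xml_text)
    width = float(root.findtext("size/width"))
    height = float(root.findtext("size/height"))
    xmins, xmaxs, ymins, ymaxs = [], [], [], []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        # Divide x values by width and y values by height.
        xmins.append(float(box.findtext("xmin")) / width)
        xmaxs.append(float(box.findtext("xmax")) / width)
        ymins.append(float(box.findtext("ymin")) / height)
        ymaxs.append(float(box.findtext("ymax")) / height)
    return xmins, xmaxs, ymins, ymaxs


xmins, xmaxs, ymins, ymaxs = normalized_boxes(ANNOTATION_XML)
print(xmins, xmaxs, ymins, ymaxs)
```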

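After 12458 steps, at 1 image per step, the model converged to a local minimum. I saved all the checkpoints and graphs. Next I created the inference graph and ran object_detection_tutorial.py to see how it works on my test images. I am not at all satisfied with the results. P.S. The last image is 1024 × 760 and is a crop of the top of the third image, which is 3264 × 2448. I tried cigarette images of different sizes in case image detail was being lost when the model rescales the images.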

Output: classified images with predicted bboxes

[Comments]:

Tags: python tensorflow computer-vision deep-learning object-detection

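[Answer 1]:

I think the problem is that your network has learned that objects are almost the same size as the input image, because every training image contains only one positive object, and that object is almost as large as the image itself.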
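
To put a rough number on this, the sketch below computes how much of the image the annotated box covers, using the values from the XML example in the question. The function name `box_area_fraction` is an assumption for illustration.

```python
def box_area_fraction(width, height, xmin, ymin, xmax, ymax):
    """Fraction of the image area covered by the box (pixel coordinates)."""
    box_area = (xmax - xmin) * (ymax - ymin)
    return box_area / (width * height)


# Values from the annotation example in the question.
fraction = box_area_fraction(811, 1274, 1, 1, 801, 1264)
print(f"{fraction:.3f}")
```

For that example the box covers about 98% of the image, so almost nothing in the training image is left over as background.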

I think your dataset is a good starting point for a cigarette pack classifier, but not for an object detector.

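A Faster R-CNN model needs samples that contain objects, but it also needs background. The model finds objects in an image in two steps. In the first step, the so-called region proposal network looks for regions of interest in the image. In the second step, those regions are classified, and this is where the model decides whether a region is an actual object or just background.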

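So to train a cigarette object detector you need a lot of samples like the last image in your post, with every object (cigarette pack) labeled with its own bbox and class label.

[Comments]:

Great explanation. Thank you for the advice. I hope this helps others too.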