【Question Title】: Training keeps stopping. Tuple error. (Tensorflow Object_detection API)
【Posted】: 2019-02-24 13:22:42
【Question Description】:

I am using TensorFlow's Object Detection API, and whenever I run training it stops after a few iterations. Initially my images were in jpg format; I created XML annotations from them and converted those to CSV. People suggested the error might be caused by using jpg instead of jpeg (although others have gotten it to work with jpg), so I converted my images to jpeg, repeated the remaining steps, and ran training again, and the same problem appeared. I have been stuck on this for a long time with no luck, and there do not seem to be many working solutions out there. I would appreciate any ideas on how to fix it. The logs and code are below.

Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:root:Variable [Conv/biases/Momentum] is not available in checkpoint
WARNING:root:Variable [Conv/weights/Momentum] is not available in checkpoint
WARNING:root:Variable [FirstStageBoxPredictor/BoxEncodingPredictor/biases/Momentum] is not available in checkpoint
WARNING:root:Variable [FirstStageBoxPredictor/BoxEncodingPredictor/weights/Momentum] is not available in checkpoint

....

    INFO:tensorflow:global step 1: loss = 1.6760 (13.660 sec/step)
INFO:tensorflow:global step 1: loss = 1.6760 (13.660 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
INFO:tensorflow:Finished training! Saving model to disk.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened.
  warnings.warn("Attempting to use a closed FileWriter. "
Traceback (most recent call last):
  File "train.py", line 185, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "train.py", line 181, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/usr/local/lib/python3.6/dist-packages/object_detection-0.1-py3.6.egg/object_detection/legacy/trainer.py", line 416, in train
    saver=saver)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 832, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 257, in _run
    enqueue_callable()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1257, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape mismatch in tuple component 18. Expected [1,?,?,3], got [1,1,314,384,3]
     [[{{node batch/padding_fifo_queue_enqueue}}]]

Train.py

    """Training executable for detection models.

This executable is used to train DetectionModels. There are two ways of
configuring the training job:

1) A single pipeline_pb2.TrainEvalPipelineConfig configuration file
can be specified by --pipeline_config_path.

Example usage:
    ./train \
        --logtostderr \
        --train_dir=path/to/train_dir \
        --pipeline_config_path=pipeline_config.pbtxt

2) Three configuration files can be provided: a model_pb2.DetectionModel
configuration file to define what type of DetectionModel is being trained, an
input_reader_pb2.InputReader file to specify what training data will be used and
a train_pb2.TrainConfig file to configure training parameters.

Example usage:
    ./train \
        --logtostderr \
        --train_dir=path/to/train_dir \
        --model_config_path=model_config.pbtxt \
        --train_config_path=train_config.pbtxt \
        --input_config_path=train_input_config.pbtxt
"""
#changed  object_detection.builders/legacy/utils to builders...

import functools
import json
import os
import tensorflow as tf

from builders import dataset_builder
from builders import graph_rewriter_builder
from builders import model_builder
from legacy import trainer
from utils import config_util

tf.logging.set_verbosity(tf.logging.INFO)

flags = tf.app.flags
flags.DEFINE_string('master', '', 'Name of the TensorFlow master to use.')
flags.DEFINE_integer('task', 0, 'task id')
flags.DEFINE_integer('num_clones', 1, 'Number of clones to deploy per worker.')
flags.DEFINE_boolean('clone_on_cpu', False,
                     'Force clones to be deployed on CPU.  Note that even if '
                     'set to False (allowing ops to run on gpu), some ops may '
                     'still be run on the CPU if they have no GPU kernel.')
flags.DEFINE_integer('worker_replicas', 1, 'Number of worker+trainer '
                     'replicas.')
flags.DEFINE_integer('ps_tasks', 0,
                     'Number of parameter server tasks. If None, does not use '
                     'a parameter server.')
flags.DEFINE_string('train_dir', '',
                    'Directory to save the checkpoints and training summaries.')

flags.DEFINE_string('pipeline_config_path', '',
                    'Path to a pipeline_pb2.TrainEvalPipelineConfig config '
                    'file. If provided, other configs are ignored')

flags.DEFINE_string('train_config_path', '',
                    'Path to a train_pb2.TrainConfig config file.')
flags.DEFINE_string('input_config_path', '',
                    'Path to an input_reader_pb2.InputReader config file.')
flags.DEFINE_string('model_config_path', '',
                    'Path to a model_pb2.DetectionModel config file.')

FLAGS = flags.FLAGS


@tf.contrib.framework.deprecated(None, 'Use object_detection/model_main.py.')
def main(_):

  assert FLAGS.train_dir, '`train_dir` is missing.'
  if FLAGS.task == 0: tf.gfile.MakeDirs(FLAGS.train_dir)
  if FLAGS.pipeline_config_path:
    configs = config_util.get_configs_from_pipeline_file(
        FLAGS.pipeline_config_path)
    if FLAGS.task == 0:
      tf.gfile.Copy(FLAGS.pipeline_config_path,
                    os.path.join(FLAGS.train_dir, 'pipeline.config'),
                    overwrite=True)
  else:
    configs = config_util.get_configs_from_multiple_files(
        model_config_path=FLAGS.model_config_path,
        train_config_path=FLAGS.train_config_path,
        train_input_config_path=FLAGS.input_config_path)
    if FLAGS.task == 0:
      for name, config in [('model.config', FLAGS.model_config_path),
                           ('train.config', FLAGS.train_config_path),
                           ('input.config', FLAGS.input_config_path)]:
        tf.gfile.Copy(config, os.path.join(FLAGS.train_dir, name),
                      overwrite=True)

  model_config = configs['model']
  train_config = configs['train_config']
  input_config = configs['train_input_config']

  model_fn = functools.partial(
      model_builder.build,
      model_config=model_config,
      is_training=True)

  def get_next(config):
    return dataset_builder.make_initializable_iterator(
        dataset_builder.build(config)).get_next()

  create_input_dict_fn = functools.partial(get_next, input_config)

  env = json.loads(os.environ.get('TF_CONFIG', '{}'))
  cluster_data = env.get('cluster', None)
  cluster = tf.train.ClusterSpec(cluster_data) if cluster_data else None
  task_data = env.get('task', None) or {'type': 'master', 'index': 0}
  task_info = type('TaskSpec', (object,), task_data)

  # Parameters for a single worker.
  ps_tasks = 0
  worker_replicas = 1
  worker_job_name = 'lonely_worker'
  task = 0
  is_chief = True
  master = ''

  if cluster_data and 'worker' in cluster_data:
    # Number of total worker replicas include "worker"s and the "master".
    worker_replicas = len(cluster_data['worker']) + 1
  if cluster_data and 'ps' in cluster_data:
    ps_tasks = len(cluster_data['ps'])

  if worker_replicas > 1 and ps_tasks < 1:
    raise ValueError('At least 1 ps task is needed for distributed training.')

  if worker_replicas >= 1 and ps_tasks > 0:
    # Set up distributed training.
    server = tf.train.Server(tf.train.ClusterSpec(cluster), protocol='grpc',
                             job_name=task_info.type,
                             task_index=task_info.index)
    if task_info.type == 'ps':
      server.join()
      return

    worker_job_name = '%s/task:%d' % (task_info.type, task_info.index)
    task = task_info.index
    is_chief = (task_info.type == 'master')
    master = server.target

  graph_rewriter_fn = None
  if 'graph_rewriter_config' in configs:
    graph_rewriter_fn = graph_rewriter_builder.build(
        configs['graph_rewriter_config'], is_training=True)

  trainer.train(
      create_input_dict_fn,
      model_fn,
      train_config,
      master,
      task,
      FLAGS.num_clones,
      worker_replicas,
      FLAGS.clone_on_cpu,
      ps_tasks,
      worker_job_name,
      is_chief,
      FLAGS.train_dir,
      graph_hook_fn=graph_rewriter_fn)


if __name__ == '__main__':
  tf.app.run()

【Question Discussion】:

    Tags: python tensorflow training-data object-detection-api


    【Solution 1】:

    This line should give you a hint: Expected [1,?,?,3], got [1,1,314,384,3]. TensorFlow uses 4D tensors as image input to the model, which is why a tensor of shape [1,?,?,3] is expected. You are feeding it a 5D tensor instead. My guess is that there is a tf.expand_dims() somewhere in your code.
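
    As a minimal, self-contained sketch of what this answer is suggesting (the image and shapes below are illustrative, not taken from the question's pipeline), an accidental second tf.expand_dims() on an already-batched image produces exactly the 5D shape reported in the error:

    import tensorflow as tf

    # A single decoded image in HWC layout (height, width, channels).
    image = tf.zeros([314, 384, 3])

    # Adding the batch dimension once gives the 4D shape the input queue expects.
    batched = tf.expand_dims(image, axis=0)
    print(batched.shape)          # (1, 314, 384, 3)

    # Expanding again (e.g. on data that is already batched) yields the 5D
    # shape from the error message: [1, 1, 314, 384, 3].
    double_batched = tf.expand_dims(batched, axis=0)
    print(double_batched.shape)   # (1, 1, 314, 384, 3)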

    【Discussion】:

    • Thanks for the feedback! I have been trying to solve this for quite a while. There is a strange quirk where some images trigger the error above and others don't. For some reason TensorFlow's Object Detection API doesn't train correctly when the dataset contains JPG images (it works with JPEG), which matters when generating the TFRecords. My dataset originally consisted of JPG images, and I tried to work around it by re-labelling the dataset through edits to the CSV file, without success. I redid the whole process with JPEG and it works now!! Learned a lot while trying to fix it!
    • ARGH, I'm hitting the same problem again! It seemed solved, but once I expanded my dataset (all JPEG) the issue came back. I tried editing the eval_util.py file in the object_detection directory, but it doesn't seem to make a difference. Even deleting eval_util.py changes nothing, so editing the tensor sizes there doesn't affect the result either. I don't know what's wrong, and I can hardly find anyone else running into this.
    • As the error traceback "train.py", line 185, in <module> tf.app.run() shows, your error occurs in the train.py file. You should post that code so people can help you.
    • I have attached the full code of train.py. Honestly I don't see how train.py could be the problem, since I haven't changed any part of it; it is identical to the source. Also, the error does not happen on every training step. By the way, I really appreciate your continued feedback.
    • Solved it! There was nothing wrong with the code. I hadn't checked the CSV file for entries whose height and width were stored as 0. It turned out there were some, and after deleting them the error did not come back! Thanks for the constant feedback! Now I can rest easy.
    【Solution 2】:

    For anyone facing this issue, check your training and testing CSV files for entries whose Width and Height are 0. This usually happens when an image's actual format does not match its file extension. The problem is fixed either by removing those images or by converting them to the correct format with:

    import cv2

    # Re-encode the file in place as a real JPEG so its contents match the extension.
    img = cv2.imread(test_full_path)
    cv2.imwrite(test_full_path, img, [int(cv2.IMWRITE_JPEG_QUALITY), 100])

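    A quick sketch for locating the offending rows, assuming the usual label CSV layout with filename, width, and height columns (the file name 'train_labels.csv' here is just an example; point it at your own train/test label files):

    import pandas as pd

    # Load the label CSV generated from the XML annotations.
    labels = pd.read_csv('train_labels.csv')

    # Rows whose width or height was recorded as 0 point at images whose
    # real encoding does not match their file extension.
    bad = labels[(labels['width'] == 0) | (labels['height'] == 0)]
    print(bad[['filename', 'width', 'height']])
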

    【Discussion】:
