【Question Title】: Training keeps stopping. Tuple error. (Tensorflow Object_detection API)
【Posted】: 2019-02-24 13:22:42
【Question Description】:

I am using TensorFlow's Object Detection API, and whenever I run training it stops after a few iterations. Initially my images were in jpg format; I created XML annotations from them and converted those to CSV. People suggested the error might be caused by using jpg instead of jpeg (although others have gotten it to work with jpg), so I converted my images to jpeg, repeated the remaining steps, and ran training again, and the same problem appeared. I have been stuck on this for a long time with no luck, and there do not seem to be many working solutions out there. I would appreciate any ideas on how to fix it. The logs and code are below.

Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:root:Variable [Conv/biases/Momentum] is not available in checkpoint
WARNING:root:Variable [Conv/weights/Momentum] is not available in checkpoint
WARNING:root:Variable [FirstStageBoxPredictor/BoxEncodingPredictor/biases/Momentum] is not available in checkpoint
WARNING:root:Variable [FirstStageBoxPredictor/BoxEncodingPredictor/weights/Momentum] is not available in checkpoint

....

    INFO:tensorflow:global step 1: loss = 1.6760 (13.660 sec/step)
INFO:tensorflow:global step 1: loss = 1.6760 (13.660 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
INFO:tensorflow:Finished training! Saving model to disk.
/usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened.
  warnings.warn("Attempting to use a closed FileWriter. "
Traceback (most recent call last):
  File "train.py", line 185, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "train.py", line 181, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/usr/local/lib/python3.6/dist-packages/object_detection-0.1-py3.6.egg/object_detection/legacy/trainer.py", line 416, in train
    saver=saver)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 785, in train
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/supervisor.py", line 832, in stop
    ignore_live_threads=ignore_live_threads)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 257, in _run
    enqueue_callable()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1257, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape mismatch in tuple component 18. Expected [1,?,?,3], got [1,1,314,384,3]
     [[{{node batch/padding_fifo_queue_enqueue}}]]

Train.py

    """Training executable for detection models.

This executable is used to train DetectionModels. There are two ways of
configuring the training job:

1) A single pipeline_pb2.TrainEvalPipelineConfig configuration file
can be specified by --pipeline_config_path.

Example usage:
    ./train \
        --logtostderr \
        --train_dir=path/to/train_dir \
        --pipeline_config_path=pipeline_config.pbtxt

2) Three configuration files can be provided: a model_pb2.DetectionModel
configuration file to define what type of DetectionModel is being trained, an
input_reader_pb2.InputReader file to specify what training data will be used and
a train_pb2.TrainConfig file to configure training parameters.

Example usage:
    ./train \
        --logtostderr \
        --train_dir=path/to/train_dir \
        --model_config_path=model_config.pbtxt \
        --train_config_path=train_config.pbtxt \
        --input_config_path=train_input_config.pbtxt
"""
#changed  object_detection.builders/legacy/utils to builders...

import functools
import json
import os
import tensorflow as tf

from builders import dataset_builder
from builders import graph_rewriter_builder
from builders import model_builder
from legacy import trainer
from utils import config_util

tf.logging.set_verbosity(tf.logging.INFO)

flags = tf.app.flags
flags.DEFINE_string('master', '', 'Name of the TensorFlow master to use.')
flags.DEFINE_integer('task', 0, 'task id')
flags.DEFINE_integer('num_clones', 1, 'Number of clones to deploy per worker.')
flags.DEFINE_boolean('clone_on_cpu', False,
                     'Force clones to be deployed on CPU.  Note that even if '
                     'set to False (allowing ops to run on gpu), some ops may '
                     'still be run on the CPU if they have no GPU kernel.')
flags.DEFINE_integer('worker_replicas', 1, 'Number of worker+trainer '
                     'replicas.')
flags.DEFINE_integer('ps_tasks', 0,
                     'Number of parameter server tasks. If None, does not use '
                     'a parameter server.')
flags.DEFINE_string('train_dir', '',
                    'Directory to save the checkpoints and training summaries.')

flags.DEFINE_string('pipeline_config_path', '',
                    'Path to a pipeline_pb2.TrainEvalPipelineConfig config '
                    'file. If provided, other configs are ignored')

flags.DEFINE_string('train_config_path', '',
                    'Path to a train_pb2.TrainConfig config file.')
flags.DEFINE_string('input_config_path', '',
                    'Path to an input_reader_pb2.InputReader config file.')
flags.DEFINE_string('model_config_path', '',
                    'Path to a model_pb2.DetectionModel config file.')

FLAGS = flags.FLAGS


@tf.contrib.framework.deprecated(None, 'Use object_detection/model_main.py.')
def main(_):

  assert FLAGS.train_dir, '`train_dir` is missing.'
  if FLAGS.task == 0: tf.gfile.MakeDirs(FLAGS.train_dir)
  if FLAGS.pipeline_config_path:
    configs = config_util.get_configs_from_pipeline_file(
        FLAGS.pipeline_config_path)
    if FLAGS.task == 0:
      tf.gfile.Copy(FLAGS.pipeline_config_path,
                    os.path.join(FLAGS.train_dir, 'pipeline.config'),
                    overwrite=True)
  else:
    configs = config_util.get_configs_from_multiple_files(
        model_config_path=FLAGS.model_config_path,
        train_config_path=FLAGS.train_config_path,
        train_input_config_path=FLAGS.input_config_path)
    if FLAGS.task == 0:
      for name, config in [('model.config', FLAGS.model_config_path),
                           ('train.config', FLAGS.train_config_path),
                           ('input.config', FLAGS.input_config_path)]:
        tf.gfile.Copy(config, os.path.join(FLAGS.train_dir, name),
                      overwrite=True)

  model_config = configs['model']
  train_config = configs['train_config']
  input_config = configs['train_input_config']

  model_fn = functools.partial(
      model_builder.build,
      model_config=model_config,
      is_training=True)

  def get_next(config):
    return dataset_builder.make_initializable_iterator(
        dataset_builder.build(config)).get_next()

  create_input_dict_fn = functools.partial(get_next, input_config)

  env = json.loads(os.environ.get('TF_CONFIG', '{}'))
  cluster_data = env.get('cluster', None)
  cluster = tf.train.ClusterSpec(cluster_data) if cluster_data else None
  task_data = env.get('task', None) or {'type': 'master', 'index': 0}
  task_info = type('TaskSpec', (object,), task_data)

  # Parameters for a single worker.
  ps_tasks = 0
  worker_replicas = 1
  worker_job_name = 'lonely_worker'
  task = 0
  is_chief = True
  master = ''

  if cluster_data and 'worker' in cluster_data:
    # Number of total worker replicas include "worker"s and the "master".
    worker_replicas = len(cluster_data['worker']) + 1
  if cluster_data and 'ps' in cluster_data:
    ps_tasks = len(cluster_data['ps'])

  if worker_replicas > 1 and ps_tasks < 1:
    raise ValueError('At least 1 ps task is needed for distributed training.')

  if worker_replicas >= 1 and ps_tasks > 0:
    # Set up distributed training.
    server = tf.train.Server(tf.train.ClusterSpec(cluster), protocol='grpc',
                             job_name=task_info.type,
                             task_index=task_info.index)
    if task_info.type == 'ps':
      server.join()
      return

    worker_job_name = '%s/task:%d' % (task_info.type, task_info.index)
    task = task_info.index
    is_chief = (task_info.type == 'master')
    master = server.target

  graph_rewriter_fn = None
  if 'graph_rewriter_config' in configs:
    graph_rewriter_fn = graph_rewriter_builder.build(
        configs['graph_rewriter_config'], is_training=True)

  trainer.train(
      create_input_dict_fn,
      model_fn,
      train_config,
      master,
      task,
      FLAGS.num_clones,
      worker_replicas,
      FLAGS.clone_on_cpu,
      ps_tasks,
      worker_job_name,
      is_chief,
      FLAGS.train_dir,
      graph_hook_fn=graph_rewriter_fn)


if __name__ == '__main__':
  tf.app.run()

【Question Discussion】:

    Tags: python tensorflow training-data object-detection-api


    【Solution 1】:

    This line should give you a hint: Expected [1,?,?,3], got [1,1,314,384,3]. TensorFlow uses 4D tensors as image input to the model, which is why a tensor of shape [1,?,?,3] is expected. You are feeding it a 5D tensor instead. My guess is that there is a tf.expand_dims() somewhere in your code.
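
    As a minimal, self-contained sketch of what this answer is suggesting (the image and shapes below are illustrative, not taken from the question's pipeline), an accidental second tf.expand_dims() on an already-batched image produces exactly the 5D shape reported in the error:

    import tensorflow as tf

    # A single decoded image in HWC layout (height, width, channels).
    image = tf.zeros([314, 384, 3])

    # Adding the batch dimension once gives the 4D shape the input queue expects.
    batched = tf.expand_dims(image, axis=0)
    print(batched.shape)          # (1, 314, 384, 3)

    # Expanding again (e.g. on data that is already batched) yields the 5D
    # shape from the error message: [1, 1, 314, 384, 3].
    double_batched = tf.expand_dims(batched, axis=0)
    print(double_batched.shape)   # (1, 1, 314, 384, 3)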

    【Discussion】:

    • Thanks for the feedback! I have been trying to solve this for quite a while. There is a strange quirk where some images trigger the error above and others don't. For some reason TensorFlow's Object Detection API doesn't train correctly when the dataset contains JPG images (it works with JPEG), which matters when generating the TFRecords. My dataset originally consisted of JPG images, and I tried to work around it by re-labelling the dataset through edits to the CSV file, without success. I redid the whole process with JPEG and it works now!! Learned a lot while trying to fix it!
    • ARGH, I'm hitting the same problem again! It seemed solved, but once I expanded my dataset (all JPEG) the issue came back. I tried editing the eval_util.py file in the object_detection directory, but it doesn't seem to make a difference. Even deleting eval_util.py changes nothing, so editing the tensor sizes there doesn't affect the result either. I don't know what's wrong, and I can hardly find anyone else running into this.
    • As the error traceback "train.py", line 185, in <module> tf.app.run() shows, your error occurs in the train.py file. You should post that code so people can help you.
    • I have attached the full code of train.py. Honestly I don't see how train.py could be the problem, since I haven't changed any part of it; it is identical to the source. Also, the error does not happen on every training step. By the way, I really appreciate your continued feedback.
    • Solved it! There was nothing wrong with the code. I hadn't checked the CSV file for entries whose height and width were stored as 0. It turned out there were some, and after deleting them the error did not come back! Thanks for the constant feedback! Now I can rest easy.
    【Solution 2】:

    For anyone facing this issue, check your training and testing CSV files for entries whose Width and Height are 0. This usually happens when an image's actual format does not match its file extension. The problem is fixed either by removing those images or by converting them to the correct format with:

    import cv2

    # Re-encode the file in place as a real JPEG so its contents match the extension.
    img = cv2.imread(test_full_path)
    cv2.imwrite(test_full_path, img, [int(cv2.IMWRITE_JPEG_QUALITY), 100])

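    A quick sketch for locating the offending rows, assuming the usual label CSV layout with filename, width, and height columns (the file name 'train_labels.csv' here is just an example; point it at your own train/test label files):

    import pandas as pd

    # Load the label CSV generated from the XML annotations.
    labels = pd.read_csv('train_labels.csv')

    # Rows whose width or height was recorded as 0 point at images whose
    # real encoding does not match their file extension.
    bad = labels[(labels['width'] == 0) | (labels['height'] == 0)]
    print(bad[['filename', 'width', 'height']])
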

    【Discussion】:
