[Question Title]: Segmentation Fault training Deeplab with Cityscapes
[Posted]: 2020-03-15 06:12:06
[Problem Description]:

I am currently following the steps for training DeepLab with the xception_65 backbone on the Cityscapes dataset, but unfortunately I run into a segmentation fault. I cannot pin the error down: training on the PASCAL dataset, for example, works fine. I have checked the paths as well as several versions and combinations of TensorFlow and the drivers. Even when I run the train.py script without GPU support I get the same segmentation fault. I performed the same steps on another machine and it worked there. Does anyone know where the problem lies?

My setup:

  • Ubuntu 18.04
  • NVIDIA RTX 2080, driver version 430.65 (installed from the .run file)
  • CUDA 10.0 (installed from the .run file)
  • cuDNN 7.6.5
  • Python 3.6
  • TensorFlow 1.15

Launched by running:

python3 "${WORK_DIR}"/train.py \
  --logtostderr \
  --training_number_of_steps=${NUM_ITERATIONS} \
  --train_split="train_fine" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_crop_size="769,769" \
  --train_batch_size=1 \
  --fine_tune_batch_norm=False \
  --dataset="cityscapes" \
  --tf_initial_checkpoint="${INIT_FOLDER}/deeplabv3_cityscapes_train/model.ckpt" \
  --train_logdir="${TRAIN_LOGDIR}" \
  --dataset_dir="${CITYSCAPES_DATASET}" 

I get the following output:

I1119 16:52:49.856512 139832269989696 learning.py:768] Starting Queues.
Fatal Python error: Segmentation fault

Thread 0x00007f2cd086b700 (most recent call first):
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/threading.py", line 296 in wait
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/queue.py", line 170 in get
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/threading.py", line 926 in _bootstrap_inner
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/threading.py", line 890 in _bootstrap

Thread 0x00007f2d3cc7e740 (most recent call first):
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443 in _call_tf_sessionrun
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350 in _run_fn
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365 in _do_call
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359 in _do_run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180 in _run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956 in run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 490 in train_step
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/contrib/slim/python/slim/learning.py", line 775 in train
  File "/home/kuschnig/tensorflow/models/research/deeplab/train.py", line 466 in main
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/absl/app.py", line 250 in _run_main
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/absl/app.py", line 299 in run
  File "/home/kuschnig/anaconda3/envs/conda-tf/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
  File "/home/kuschnig/tensorflow/models/research/deeplab/train.py", line 472 in <module>
Segmentation fault (core dumped)

A backtrace with gdb shows: GDB Output

[Question Discussion]:

    Tags: python tensorflow deeplab


    [Solution 1]:

    I ran into the same problem as described. I managed to solve it by doing two things:

    1. Make sure the names of your tfrecords (for me they were named train-00000-of-00010.tfrecord) match --train_split="train".
    2. In data_generator.py, around line 72, change splits_to_sizes={'train_fine': 2975 to splits_to_sizes={'train': 2975.

    The trick is to use the same name (for me, train) in the .sh that launches the training, in data_generator.py, and in the tfrecord folder.
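    The naming rule above can be sketched as a quick pre-flight check. This is a hypothetical helper, not part of DeepLab; it assumes the shard naming scheme train-00000-of-00010.tfrecord from this answer and a '%s-*' glob pattern like the one in data_generator.py:

```python
import glob
import os
import tempfile

def check_split_consistency(dataset_dir, split_name, splits_to_sizes):
    """Hypothetical sanity check: the --train_split name must match both
    a key in splits_to_sizes and the tfrecord file prefix on disk."""
    if split_name not in splits_to_sizes:
        return 'split %r not registered in data_generator.py' % split_name
    pattern = os.path.join(dataset_dir, '%s-*' % split_name)
    files = glob.glob(pattern)
    if not files:
        return 'no tfrecords matching %r' % pattern
    return 'ok (%d shards)' % len(files)

# Demo with a throwaway dataset directory containing one shard.
tmp = tempfile.mkdtemp()
open(os.path.join(tmp, 'train-00000-of-00010.tfrecord'), 'w').close()

print(check_split_consistency(tmp, 'train', {'train': 2975}))       # ok (1 shards)
print(check_split_consistency(tmp, 'train_fine', {'train': 2975}))  # not registered
```

    Running this before training would have flagged the train / train_fine mismatch immediately.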

    [Discussion]:

      [Solution 2]:

      My problem looked like yours, and I realized that --dataset_dir should point to the directory containing the Cityscapes tfrecord data, not to the Cityscapes directory itself.

      The code in data_generator that retrieves the data:

      def _get_all_files(self):
          """Gets all the files to read data from.
      
          Returns:
            A list of input files.
          """
          file_pattern = _FILE_PATTERN
          file_pattern = os.path.join(self.dataset_dir,
                                      file_pattern % self.split_name)
          return tf.gfile.Glob(file_pattern)
      

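      Assuming a file pattern of '%s-*' as in data_generator.py, the glob silently returns an empty list when --dataset_dir is set one level too high, so the input queue gets no files to read. A minimal illustration with a throwaway directory layout (the paths are made up):

```python
import glob
import os
import tempfile

_FILE_PATTERN = '%s-*'  # pattern assumed from DeepLab's data_generator.py

# Fake layout: the shards live in <cityscapes>/tfrecord, not in <cityscapes>.
cityscapes = tempfile.mkdtemp()
tfrecord_dir = os.path.join(cityscapes, 'tfrecord')
os.mkdir(tfrecord_dir)
open(os.path.join(tfrecord_dir, 'train-00000-of-00010.tfrecord'), 'w').close()

pattern = _FILE_PATTERN % 'train'
wrong = glob.glob(os.path.join(cityscapes, pattern))    # --dataset_dir too high
right = glob.glob(os.path.join(tfrecord_dir, pattern))  # --dataset_dir on tfrecord/

print(len(wrong), len(right))  # 0 1
```

      Note that the wrong directory produces no error message at this stage, only an empty file list, which matches the crash happening later when the queues start.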
      [Discussion]:

      • Applying this solution seems to have fixed my problem.
      [Solution 3]:

      I still don't know what caused the segmentation fault, but my solution was to register a new dataset for Cityscapes in data_generator.py.

      [Discussion]:
