[Title]: model_main.py faster-rcnn CUDA_ERROR_OUT_OF_MEMORY
[Posted]: 2020-06-29 03:01:30
[Description]:

Background:

I can train a faster-rcnn model with legacy/train.py, but when I try to train with model_main.py using the same config settings, I run into the errors below. Image resolution: 1920x1080

tensorflow/stream_executor/cuda/cuda_driver.cc:890] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
.\tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592

tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (256):     Total Chunks: 4753, Chunks in use: 4753. 1.16MiB allocated for chunks. 1.16MiB in use in bin. 144.3KiB client-requested in use in bin.

tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0000000203800000 next 1 of size 256
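For scale, the failed allocation in the first log line is exactly 8 GiB of pinned host memory, and at 1920x1080 the input queues fill memory quickly. A rough back-of-envelope sketch (float32 decoding and 3 channels are assumptions; the queue capacities are taken from the config further down):

```python
# Size of the failed pinned-host allocation reported in the log.
failed_alloc = 8589934592
print(failed_alloc / 2**30)  # 8.0 (GiB)

# One decoded 1920x1080 RGB image as float32 (assumed 4 bytes per value).
image_bytes = 1920 * 1080 * 3 * 4
print(image_bytes / 2**20)  # ~23.7 MiB per image

# With batch_queue_capacity 60 + prefetch_queue_capacity 40 (from the config),
# the input queues alone can hold ~100 decoded images at once.
queued = (60 + 40) * image_bytes
print(queued / 2**30)  # ~2.3 GiB just for queued images
```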

What I have tried:

  1. Setting the batch size to 1
  2. Enabling GPU memory growth

import tensorflow as tf

# Let the allocator grow GPU memory on demand instead of grabbing it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

# The same option passed through to the Estimator that model_main.py builds.
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,
                                session_config=session_config,
                                log_step_count_steps=10,
                                save_summary_steps=20,
                                keep_checkpoint_max=20,
                                save_checkpoints_steps=100)

  3. Not allocating the entire GPU memory

# Cap this process at 60% of total GPU memory.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
session = tf.Session(config=config)

session_config = tf.ConfigProto()
session_config.gpu_options.per_process_gpu_memory_fraction = 0.6
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,
                                session_config=session_config,
                                log_step_count_steps=10,
                                save_summary_steps=20,
                                keep_checkpoint_max=20,
                                save_checkpoints_steps=100)

TensorFlow CUDA_ERROR_OUT_OF_MEMORY

  4. Tuning queue_capacity, min_after_dequeue, num_readers, batch_queue_capacity, num_batch_queue_threads, and prefetch_queue_capacity

Out Of Memory when training on Big Images

  5. Lowering min_dimension and max_dimension to 270 and 480
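In pipeline.config terms, attempt 5 amounts to changing the image_resizer block, e.g.:

```
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 270
    max_dimension: 480
  }
}
```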

None of these worked for me.

Environment:

  • OS platform and distribution: Windows 10 Pro, version 1909
  • TensorFlow installed from: pip (tensorflow-gpu)
  • TensorFlow version: 1.14
  • Object Detection API: 0.1; CUDA/cuDNN versions: CUDA 10.0, cuDNN 10.0
  • GPU model and memory: NVIDIA GeForce RTX 2070 SUPER, 8 GB
  • System RAM: 32 GB

My config:

# Faster R-CNN with Inception v2, configured for Oxford-IIIT Pets Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 1080
        max_dimension: 1920
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_v2'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0002
          schedule {
            step: 900000
            learning_rate: .00002
          }
          schedule {
            step: 1200000
            learning_rate: .000002
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: ""
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  batch_queue_capacity: 60
  num_batch_queue_threads: 30
  prefetch_queue_capacity: 40
}


train_input_reader: {
  tf_record_input_reader {
    input_path: "D:\\object_detection\\train_data\\train.record"
  }
  label_map_path: "D:\\object_detection\\pascal_label_map.pbtxt"
  queue_capacity: 2
  min_after_dequeue: 1
  num_readers: 1
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  num_examples: 1101
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "D:\\object_detection\\eval_data\\eval.record"
  }
  label_map_path: "D:\\object_detection\\pascal_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

Any other solutions would be greatly appreciated.

[Question discussion]:

    Tags: tensorflow out-of-memory object-detection object-detection-api faster-rcnn


    [Solution 1]:

    Object detection models consume a lot of memory. This comes from the way they work and from the huge number of anchors they generate to find the boxes.

    You are doing everything right, but your GPU may simply not be big enough to train this kind of model. Things you can try:

    • Downscale the images, for example to 720x512
    • Use plain SGD as the optimizer instead of something like Adam. Adam keeps two extra moment tensors per weight, so it consumes roughly 3x the memory of plain SGD.

    Also worth mentioning: you are doing well with a mini-batch of 1 instance. If I remember correctly, Faster R-CNN is typically trained with only 2 images per batch.
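    To put a rough number on the optimizer point: the extra memory is the per-parameter state the optimizer keeps alongside the weights. A minimal sketch (the ~13M parameter count for an Inception v2 based detector is an assumption for illustration only):

```python
def optimizer_state_bytes(num_params, slots_per_param, bytes_per_value=4):
    """Extra float32 memory an optimizer keeps on top of the weights."""
    return num_params * slots_per_param * bytes_per_value

params = 13_000_000  # assumed parameter count, for illustration only

# Plain SGD keeps no extra state; momentum keeps one slot per weight;
# Adam keeps two (first- and second-moment estimates).
plain_sgd = optimizer_state_bytes(params, 0)
momentum = optimizer_state_bytes(params, 1)
adam = optimizer_state_bytes(params, 2)

print(plain_sgd)           # 0 extra bytes
print(momentum / 2**20)    # ~50 MiB extra
print(adam / 2**20)        # ~100 MiB extra
```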

    [Comments]:

    • Thanks for the reply. I also suspected it was because faster-rcnn consumes a lot of memory, but I can train the model with legacy/train.py (with the same config settings, e.g. batch size), so I think my GPU should be able to train this model. Another possibility is that model_main.py runs training and evaluation at the same time, so it consumes more memory than legacy/train.py, but I would have to dig into the code to check. I forgot to mention that I also tried shrinking the images to 480x270, but I still hit OOM after a few steps. I will try SGD later, cheers.
    [Solution 2]:

    I just found that if I set batch_size to 3, training works fine. When I set batch_size back to 1, I hit the OOM problem again.

    This is strange and I still don't know why, since a smaller batch size should always use less memory.

    If you run into the same situation, you can try increasing the batch size a bit, though I can't guarantee it will work.
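    In pipeline.config terms, the change this answer describes is just:

```
train_config: {
  batch_size: 3   # was 1
  ...
}
```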

    [Comments]:
