【Question title】: TensorFlow GPU memory allocation
【Posted】: 2020-12-29 14:33:07
【Question】:

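I am trying to train a custom object detection model on my GPU instead of the CPU. I followed all the instructions in this tutorial: https://tensorflow-object-detection-api-tutorial.readthedocs.io/

I have tested my setup; everything is installed and running correctly.

Currently using:

  • Windows 10
  • NVIDIA Quadro P1000
  • TensorFlow version 2.4.0
  • CUDA 11.0
  • cuDNN 8.0.4
  • Pre-trained model = ssd_resnet50_v1_fpn_640x640_coco17_tpu-8
  • Number of classes to detect = 8
  • Batch size: 1

The problem is that a few seconds into training it stops using the GPU and prints the following warning messages.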


2020-12-29 15:01:15.444931: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-29 15:01:18.923079: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-12-29 15:01:18.928526: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2020-12-29 15:01:19.830691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro P1000 computeCapability: 6.1
coreClock: 1.5185GHz coreCount: 4 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 89.53GiB/s
2020-12-29 15:01:19.838069: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-29 15:01:19.849650: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-29 15:01:19.854098: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-29 15:01:19.861632: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-29 15:01:19.867525: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-29 15:01:19.879754: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-29 15:01:19.886521: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-29 15:01:19.891603: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-29 15:01:19.895368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-29 15:01:19.900144: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-29 15:01:19.910485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro P1000 computeCapability: 6.1
coreClock: 1.5185GHz coreCount: 4 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 89.53GiB/s
2020-12-29 15:01:19.917796: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-29 15:01:19.922273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-29 15:01:19.926687: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-29 15:01:19.930618: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-29 15:01:19.934399: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-29 15:01:19.938808: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-29 15:01:19.943155: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-29 15:01:19.947005: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-29 15:01:19.950826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-29 15:01:20.491701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-29 15:01:20.496963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2020-12-29 15:01:20.500990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2020-12-29 15:01:20.504027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2991 MB memory) -> physical GPU (device: 0, name: Quadro P1000, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-12-29 15:01:20.512219: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1229 15:01:20.515150  5872 mirrored_strategy.py:350] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I1229 15:01:20.515150  5872 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I1229 15:01:20.515150  5872 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\model_lib_v2.py:523: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W1229 15:01:20.530780  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\model_lib_v2.py:523: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['annotations/train.record']
I1229 15:01:20.546404  5872 dataset_builder.py:148] Reading unweighted datasets: ['annotations/train.record']
INFO:tensorflow:Reading record datasets for input file: ['annotations/train.record']
I1229 15:01:20.546404  5872 dataset_builder.py:77] Reading record datasets for input file: ['annotations/train.record']
INFO:tensorflow:Number of filenames to read: 1
I1229 15:01:20.546404  5872 dataset_builder.py:78] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W1229 15:01:20.546404  5872 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
W1229 15:01:20.546404  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W1229 15:01:20.562029  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W1229 15:01:25.685788  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
W1229 15:01:27.908942  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W1229 15:01:29.229117  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2020-12-29 15:01:31.781125: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\keras\backend.py:434: UserWarning: `tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
  warnings.warn('`tf.keras.backend.set_learning_phase` is deprecated and '
2020-12-29 15:01:48.972736: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-29 15:01:49.258182: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-29 15:01:49.287771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-29 15:01:49.822205: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

2020-12-29 15:01:49.866004: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

WARNING:tensorflow:Unresolved object in checkpoint: (root).model._groundtruth_lists
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._groundtruth_lists
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._batched_prediction_tensor_names
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._batched_prediction_tensor_names
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._sorted_head_names
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._sorted_head_names
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._head_scope_conv_layers
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._head_scope_conv_layers
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head._box_encoder_layers
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head._box_encoder_layers
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads.class_predictions_with_background
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads.class_predictions_with_background
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.0
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.0
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.1
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.1
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.2
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.2
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.3
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.3
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.4
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.4
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_variance
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.axis
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.axis
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W1229 15:01:53.076874  5872 util.py:169] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.484427  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.484427  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.484427  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:605: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W1229 15:01:59.423827 15152 deprecation.py:537] From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:605: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
2020-12-29 15:02:11.320699: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.73GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:11.351326: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.74GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:11.751709: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:11.784850: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:12.607912: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:12.644507: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.057969: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.092341: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.299573: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.04GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.331704: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

Also, I am not running any other programs on this machine, so running out of memory seems odd.

【Question comments】:

    Tags: python windows tensorflow gpu


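    【Answer 1】:

    I have seen this error message for four different reasons, each with a different solution:

    1. You are out of memory

    Your GPU memory may be full. The problem occurs when TensorFlow initializes and your computation graph ends up using all the memory of the physical device. The fix is to enable memory growth for the GPU (the allow_growth = True GPU option in TensorFlow 1.x). With memory growth enabled, runtime initialization does not allocate all of the memory on the device. Placing the following snippet right after your imports may solve the problem.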

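    import tensorflow as tf

    # Must run before any operation initializes the GPU.
    physical_devices = tf.config.experimental.list_physical_devices('GPU')
    if physical_devices:
        # Allocate GPU memory on demand instead of reserving it all at startup.
        tf.config.experimental.set_memory_growth(physical_devices[0], True)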
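
When training is started through a script you would rather not edit, the same behaviour can be requested with the TF_FORCE_GPU_ALLOW_GROWTH environment variable described in the TensorFlow GPU guide. The following is a minimal sketch, assuming it runs before TensorFlow is imported anywhere in the process; the helper name request_gpu_memory_growth is made up for illustration.

```python
import os


def request_gpu_memory_growth(env=os.environ):
    """Ask TensorFlow to grow GPU memory on demand via an environment variable.

    Assumption: called before TensorFlow is imported, since the variable is
    read when the GPU devices are initialized.
    """
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env["TF_FORCE_GPU_ALLOW_GROWTH"]


request_gpu_memory_growth()
# import tensorflow as tf  # import TensorFlow only after the variable is set
```

Setting the variable in the shell before launching the training command should have the same effect, and it avoids touching the training script at all.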
    

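    2. You have a cache problem

    I regularly work around this error by shutting down my Python process, removing the ~/.nv directory (on Linux, rm -rf ~/.nv), and restarting the Python process. I do not know exactly why this works. It may be at least partly related to the next cause: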
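
The cache cleanup above can also be sketched in Python, which is handy on systems without rm. This is only an illustration: the ~/.nv location comes from the answer, while the function name clear_nv_cache and its optional cache_dir argument are hypothetical.

```python
import shutil
from pathlib import Path


def clear_nv_cache(cache_dir=None):
    """Delete the NVIDIA cache directory; return True if something was removed.

    Assumption: the cache lives in ~/.nv, as stated in the answer. Pass
    cache_dir to point at a different location.
    """
    target = Path(cache_dir) if cache_dir is not None else Path.home() / ".nv"
    if not target.is_dir():
        return False  # nothing to clean up
    shutil.rmtree(target)
    return True
```

Call clear_nv_cache() only after every Python process that uses the GPU has exited, then start training again.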

    3. With Keras, the layers (classes) are imported directly from keras instead of tensorflow.keras

    Keras is bundled with TensorFlow 2.0 and above. So:

    Remove import keras, and replace every from keras.module.module import class statement with from tensorflow.keras.module.module import class.

    For example, instead of

        from keras.layers import Conv3D, ConvLSTM2D, Conv3DTranspose, Input

    use

        from tensorflow.keras.layers import Conv3D, ConvLSTM2D, Conv3DTranspose, Input
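
For a project with many files, the import replacement described above could be automated. The sketch below is one possible approach, not part of the original answer: the function name use_tensorflow_keras is hypothetical, and turning a bare import keras into from tensorflow import keras (rather than deleting it) is an assumption made so that existing keras.xyz references keep resolving.

```python
import re

# "from keras..." at the start of a line; the lookahead skips packages whose
# names merely begin with "keras", such as keras_tuner.
_FROM_KERAS = re.compile(r"^([ \t]*)from[ \t]+keras(?=[.\s])", re.MULTILINE)
# A bare "import keras" on a line of its own.
_IMPORT_KERAS = re.compile(r"^([ \t]*)import[ \t]+keras[ \t]*$", re.MULTILINE)


def use_tensorflow_keras(source):
    """Rewrite standalone-Keras imports in source code to tensorflow.keras."""
    source = _FROM_KERAS.sub(r"\1from tensorflow.keras", source)
    return _IMPORT_KERAS.sub(r"\1from tensorflow import keras", source)


example = "from keras.layers import Conv3D, ConvLSTM2D, Conv3DTranspose, Input\n"
print(use_tensorflow_keras(example), end="")
```

Review the rewritten files by hand afterwards; a regular expression cannot tell whether a module path exists under tensorflow.keras.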

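    4. Your CUDA, TensorFlow, NVIDIA driver, etc. versions are incompatible.

    If you have never had a similar model working, you are not running out of VRAM, your imports are correct as described in step 3, and your cache is clean, I would go back and set up CUDA + TensorFlow using the best available installation guide. I have had the most success following the instructions at https://www.tensorflow.org/install/gpu rather than those on the NVIDIA / CUDA site. Lambda Stack (https://lambdalabs.com/lambda-stack-deep-learning-software) is also a good option.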
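
As a quick sanity check before reinstalling, the installed versions can be compared against the tested build configurations listed in the TensorFlow install guide. The sketch below is an assumption-laden illustration: the table holds only the two releases mentioned in this thread (2.4 with CUDA 11.0 / cuDNN 8.0, and 2.3 with CUDA 10.1 / cuDNN 7.6), and the names TESTED_BUILDS and matches_tested_build are hypothetical.

```python
# Assumption: pairs copied from the tested-build table in the TensorFlow
# install guide; extend the dictionary for other releases.
TESTED_BUILDS = {
    "2.4": {"cuda": "11.0", "cudnn": "8.0"},
    "2.3": {"cuda": "10.1", "cudnn": "7.6"},
}


def _major_minor(version):
    """Reduce a version such as '8.0.4' to its 'major.minor' part."""
    return ".".join(version.split(".")[:2])


def matches_tested_build(tf_version, cuda_version, cudnn_version):
    """Return True/False for a listed TensorFlow release, None if unlisted."""
    expected = TESTED_BUILDS.get(_major_minor(tf_version))
    if expected is None:
        return None
    return (_major_minor(cuda_version) == expected["cuda"]
            and _major_minor(cudnn_version) == expected["cudnn"])


# The setup from the question: TensorFlow 2.4.0, CUDA 11.0, cuDNN 8.0.4.
print(matches_tested_build("2.4.0", "11.0", "8.0.4"))
```

A True result only means the versions line up with a tested combination; the NVIDIA driver version still has to be checked separately.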

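    【Comments】:

    • I have tried all of the options and even reinstalled CUDA 11.0 and cuDNN 8.0.4, but it still does not work...
    • Please check your GPU driver version; it must be 450. Another option is to install CUDA toolkit 10.1 and cuDNN 7.
    • I have installed TensorFlow 2.3 with CUDA 10.1 and cuDNN 7.6.5, but I still get the same kind of error: "(1) Resource exhausted: OOM when allocating tensor with shape[8,256,80,80] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"
    • The problem now is allocating memory for your tensors. I think your GPU has very little VRAM. I would suggest either using a smaller data type, for example float32 or float16 instead of float64, or reducing the batch size.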
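
The trade-off in the last comment can be put into numbers. The sketch below estimates the size of a single dense tensor from its shape and data type, using the shape from the OOM message quoted above. The helper name tensor_mib is hypothetical, and the figure covers one tensor only, not the whole model or its gradients.

```python
import math

# Bytes per element for the floating-point types mentioned in the comment.
BYTES_PER_ELEMENT = {"float16": 2, "float32": 4, "float64": 8}


def tensor_mib(shape, dtype="float32"):
    """Return the size of one dense tensor in MiB."""
    return math.prod(shape) * BYTES_PER_ELEMENT[dtype] / 2**20


shape = [8, 256, 80, 80]  # shape from the OOM message; "float" means float32
for dtype in ("float64", "float32", "float16"):
    print(dtype, tensor_mib(shape, dtype), "MiB")
# The same activation with a batch size of 1 instead of 8.
print("batch 1:", tensor_mib([1, 256, 80, 80], "float32"), "MiB")
```

Halving the element size or the batch size halves this tensor, which is why both suggestions reduce memory pressure on a 4 GiB card.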