TensorFlow GPU 内存分配答案

【问题标题】：Tensorflow GPU memory allocationTensorFlow GPU 内存分配
【发布时间】：2020-12-29 14:33:07
【问题描述】：

我正在尝试使用我的 GPU 而不是 CPU 来训练自定义对象检测模型。我已按照以下教程中的所有说明进行操作：https://tensorflow-object-detection-api-tutorial.readthedocs.io/

我已经测试了我的软件，一切都已安装并正常运行。

目前使用：

Windows 10
英伟达 Quadro P1000
Tensorflow 版本 2.4.0
CUDA 11.0
CuDNN 8.0.4
预训练模型 = ssd_resnet50_v1_fpn_640x640_coco17_tpu-8
要检测的类数 = 8
批量：1

但问题是，在训练几秒钟后，它会停止使用 GPU，并给出以下警告消息。


2020-12-29 15:01:15.444931: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-29 15:01:18.923079: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-12-29 15:01:18.928526: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2020-12-29 15:01:19.830691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro P1000 computeCapability: 6.1
coreClock: 1.5185GHz coreCount: 4 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 89.53GiB/s
2020-12-29 15:01:19.838069: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-29 15:01:19.849650: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-29 15:01:19.854098: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-29 15:01:19.861632: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-29 15:01:19.867525: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-29 15:01:19.879754: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-29 15:01:19.886521: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-29 15:01:19.891603: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-29 15:01:19.895368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-29 15:01:19.900144: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-29 15:01:19.910485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro P1000 computeCapability: 6.1
coreClock: 1.5185GHz coreCount: 4 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 89.53GiB/s
2020-12-29 15:01:19.917796: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-29 15:01:19.922273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-29 15:01:19.926687: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-29 15:01:19.930618: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-29 15:01:19.934399: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-29 15:01:19.938808: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-29 15:01:19.943155: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-29 15:01:19.947005: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-29 15:01:19.950826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-29 15:01:20.491701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-29 15:01:20.496963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2020-12-29 15:01:20.500990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2020-12-29 15:01:20.504027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2991 MB memory) -> physical GPU (device: 0, name: Quadro P1000, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-12-29 15:01:20.512219: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1229 15:01:20.515150  5872 mirrored_strategy.py:350] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I1229 15:01:20.515150  5872 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I1229 15:01:20.515150  5872 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\model_lib_v2.py:523: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W1229 15:01:20.530780  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\model_lib_v2.py:523: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['annotations/train.record']
I1229 15:01:20.546404  5872 dataset_builder.py:148] Reading unweighted datasets: ['annotations/train.record']
INFO:tensorflow:Reading record datasets for input file: ['annotations/train.record']
I1229 15:01:20.546404  5872 dataset_builder.py:77] Reading record datasets for input file: ['annotations/train.record']
INFO:tensorflow:Number of filenames to read: 1
I1229 15:01:20.546404  5872 dataset_builder.py:78] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W1229 15:01:20.546404  5872 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
W1229 15:01:20.546404  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W1229 15:01:20.562029  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W1229 15:01:25.685788  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
W1229 15:01:27.908942  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W1229 15:01:29.229117  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2020-12-29 15:01:31.781125: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\keras\backend.py:434: UserWarning: `tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
  warnings.warn('`tf.keras.backend.set_learning_phase` is deprecated and '
2020-12-29 15:01:48.972736: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-29 15:01:49.258182: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-29 15:01:49.287771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-29 15:01:49.822205: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

2020-12-29 15:01:49.866004: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

WARNING:tensorflow:Unresolved object in checkpoint: (root).model._groundtruth_lists
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._groundtruth_lists
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._batched_prediction_tensor_names
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._batched_prediction_tensor_names
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._sorted_head_names
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._sorted_head_names
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._head_scope_conv_layers
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._head_scope_conv_layers
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head._box_encoder_layers
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head._box_encoder_layers
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads.class_predictions_with_background
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads.class_predictions_with_background
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.0
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.0
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.1
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.1
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.2
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.2
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.3
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.3
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.4
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.4
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_variance
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.axis
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.axis
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W1229 15:01:53.076874  5872 util.py:169] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.484427  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.484427  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.484427  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:605: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W1229 15:01:59.423827 15152 deprecation.py:537] From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:605: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
2020-12-29 15:02:11.320699: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.73GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:11.351326: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.74GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:11.751709: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:11.784850: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:12.607912: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:12.644507: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.057969: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.092341: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.299573: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.04GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.331704: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

此外，我没有在我的设备上运行任何其他程序，因此内存不足似乎有点奇怪。

【问题讨论】：

标签： python windows tensorflow gpu

【解决方案1】：

我看到这个错误消息有四个不同的原因，有不同的解决方案：

1。你内存不足

也许您的 GPU 内存已满，当 TensorFlow 进行初始化并且您的计算图最终使用了您物理设备的所有内存时，就会出现此问题。解决方案是在 GPU 选项中使用 allow growth = True。如果为 GPU 启用了内存增长，则运行时初始化不会分配设备上的所有内存。导入后使用以下代码 sn-p 可能会解决您的问题。

import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

2。您有缓存问题

我经常通过关闭我的 python 进程、删除 ~/.nv 目录（在 linux 上，rm -rf ~/.nv）并重新启动 Python 进程来解决此错误。我不完全知道为什么会这样。它可能至少部分与第二个选项有关：

3。使用 Keras 时，Keras 层（类）是直接从 keras 导入的，而不是 tensorflow.keras

Keras 包含在上述 TensorFlow 2.0 中。所以

删除 import keras 和将 from keras.module.module import class 语句替换为 --> 从 tensorflow.keras.module.module import class

例如代替 from keras.layers import Conv3D,ConvLSTM2D,Conv3DTranspose, Input 有了这个： from tensorflow.keras.layers import Conv3D,ConvLSTM2D,Conv3DTranspose, Input

3。您的 CUDA、TensorFlow、NVIDIA 驱动程序等版本不兼容。

如果您从未使用过类似的模型，您的 VRAM 并没有用完，您的导入正确，如步骤 3 中所述，并且您的缓存是干净的，我会返回并使用设置 CUDA + TensorFlow最佳可用安装指南 - 我在遵循https://www.tensorflow.org/install/gpu 的说明而不是 NVIDIA / CUDA 站点上的说明方面取得了最大的成功。 Lambda Stack：https://lambdalabs.com/lambda-stack-deep-learning-software 也是不错的选择。

【讨论】：

我已经尝试了所有选项，甚至重新安装了 CUDA 11.0 和 CuDNN 8.0.4，但仍然无法正常工作......
请检查您的 GPU 驱动版本，它必须是 450 另一个解决方案是请尝试安装 cuda toolkit 10.1 和 cudnn 7
我已经安装了带有 CUDA 10.1 的 tensorflow 2.3 和 CUDDN 7.6.5，但我仍然遇到相同类型的错误“(1) 资源耗尽：分配具有形状的张量时的 OOM [8,256, 80,80] 并通过分配器 GPU_0_bfc"在 /job:localhost/replica:0/task:0/device:GPU:0 上键入 float"
现在问题是为您的张量分配内存，我认为您的 GPU 的 VRAM 非常低，我建议您知道的是，要么降低数据类型的大小，例如 float32，要么float16 而不是 float64，其他解决方案是降低浴缸尺寸。