TensorFlow crashing when trying to train model

【Question title】: TensorFlow crashing when trying to train model
【Posted】: 2022-01-26 01:12:32
【Question description】:

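I'm trying to train a model in TensorFlow. My code was running fine, but it suddenly started crashing during the training phase. I've tried multiple "fixes", from copying CUDA .dll files to inserting the following code after the imports, but to no avail.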

physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

This is the error that pops up when compiling the model:

tensorflow/stream_executor/cuda/cuda_driver.cc:794] failed to alloc 4294967296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory 2021-12-26 10:53:00.265328:

Model architecture:

Model: "sequential"
_______________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
resizing (Resizing)          (None, 128, 128, 1)       0
_________________________________________________________________
normalization (Normalization (None, 128, 128, 1)       3
_________________________________________________________________
conv2d (Conv2D)              (None, 128, 128, 32)      320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 64, 64, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 64, 64, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 32, 64)        0
_________________________________________________________________
dropout (Dropout)            (None, 32, 32, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 65536)             0
_________________________________________________________________
dense (Dense)                (None, 216)               14155992
_________________________________________________________________
dropout_1 (Dropout)          (None, 216)               0
_________________________________________________________________
dense_1 (Dense)              (None, 36)                7812
=================================================================
Total params: 14,182,623
Trainable params: 14,182,620
Non-trainable params: 3
_______________________________________________________________

And this is the error that appears when training starts (I have cut out the repeated "failed to allocate memory" logs):

2021-12-26 10:54:08.890289: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-12-26 10:54:08.891029: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-12-26 10:54:08.899859: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-12-26 10:54:08.933109: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti64_112.dll'; dlerror: cupti64_112.dll not found
2021-12-26 10:54:08.947342: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti.dll'; dlerror: cupti.dll not found
2021-12-26 10:54:08.948462: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-12-26 10:54:08.956260: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-12-26 10:54:08.958977: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1757] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
Epoch 1/50
2021-12-26 10:54:11.849166: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
2021-12-26 10:54:13.674500: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 144.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-12-26 10:54:13.674920: W tensorflow/core/kernels/gpu_utils.cc:49] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.

tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats:
Limit:                       519487488
InUse:                       515392000
MaxInUse:                    515465728
NumAllocs:                       23263
MaxAllocSize:                134217728
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0

2021-12-26 10:54:29.147248: W tensorflow/core/common_runtime/bfc_allocator.cc:468] ***********xxxxxxxxxx**********************************************************************xxxxxxxxx
2021-12-26 10:54:29.151731: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at pooling_ops_common.cc:225 : Resource exhausted: OOM when allocating tensor with shape[64,64,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "train.py", line 196, in <module>
    callbacks=[tf.keras.callbacks.ModelCheckpoint("models/net7_e50", monitor="val_loss", verbose=1, save_freq="epoch"), tf.keras.callbacks.TensorBoard("./logs/net7e50")],
  File "<Project_Directory>\venv\lib\site-packages\keras\engine\training.py", line 1184, in fit
    tmp_logs = self.train_function(iterator)
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 3040, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 1964, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 596, in call
    ctx=ctx)
  File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[64,64,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node sequential/max_pooling2d_1/MaxPool (defined at train.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
 [Op:__inference_train_function_1432]

Function call stack:
train_function

Any help would be greatly appreciated!

【Comments】:

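Since you are running out of memory, have you tried this: if the cause is memory fragmentation, the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' may improve the situation.
@JJ When I try to set TF_GPU_ALLOCATOR I get the following error: NameError: name 'cuda_malloc_async' is not defined
TF_GPU_ALLOCATOR is an environment variable, not Python code.
More importantly, exactly what model are you training, and on which GPU?
Use dotenv to set environment variables and access them.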
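
The NameError in the comments above comes from writing the allocator name as a bare Python identifier. A minimal sketch of setting it as an environment variable from inside the training script, assuming it is set in the script rather than in the shell; setting it before the TensorFlow import is the simplest way to make sure it is in place before the GPU is initialized. Whether this allocator helps in this particular case is not established by the thread.

```python
import os

# Set the allocator as an environment variable (a string), not a Python name.
# Writing TF_GPU_ALLOCATOR = cuda_malloc_async raises the NameError quoted above.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

# Import TensorFlow only after this point so that it sees the variable:
# import tensorflow as tf

print(os.environ["TF_GPU_ALLOCATOR"])
```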

Tags: python tensorflow keras deep-learning

【Solution 1】:

(Paraphrasing @Dr. Snoopy)

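"1 GB of GPU RAM is very little, and your code is trying to allocate 4 GB of GPU RAM. This is not something you can fix with some DLLs or environment variables; you need to make your code use significantly less GPU memory."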
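
The figures in the question's logs can be checked with simple arithmetic. The sketch below estimates the float32 size of each layer output listed in the model summary. The batch size of 64 is an assumption inferred from the leading dimension of the shape [64,64,32,32] in the OOM message, and 16 is a hypothetical smaller value for comparison. The estimate ignores gradients, weights, and cuDNN workspace, so real training usage is higher.

```python
BYTES_PER_FLOAT32 = 4
MIB = 1024 * 1024

# Per-sample output shapes copied from the model summary in the question.
LAYER_OUTPUTS = {
    "conv2d": (128, 128, 32),
    "max_pooling2d": (64, 64, 32),
    "conv2d_1": (64, 64, 64),
    "max_pooling2d_1": (32, 32, 64),
}


def activation_mib(shape, batch_size):
    """Size in MiB of one float32 activation tensor for the given batch size."""
    elements = batch_size
    for dim in shape:
        elements *= dim
    return elements * BYTES_PER_FLOAT32 / MIB


def total_activation_mib(batch_size):
    """Sum of the listed layer outputs; only a lower bound on training memory."""
    return sum(activation_mib(s, batch_size) for s in LAYER_OUTPUTS.values())


# "Limit" value reported in the BFC allocator stats in the log.
gpu_limit_mib = 519487488 / MIB

# 64 is inferred from the log; 16 is a hypothetical smaller batch size.
for batch_size in (64, 16):
    print(batch_size, total_activation_mib(batch_size), round(gpu_limit_mib, 1))
```

With a batch of 64, the conv2d output alone comes to 128 MiB, which matches the MaxAllocSize of 134217728 bytes in the allocator stats. The forward outputs listed here already take up roughly half of the allocator limit before the backward pass is counted, which is consistent with the answer's advice to reduce memory use, for example with a smaller batch size.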

【Discussion】: