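【Posted】: 2022-01-26 01:12:32
【Question】:
I am trying to train a model in TensorFlow. My code used to run fine, but it suddenly started crashing during the training phase. I have tried several "fixes", from copying the CUDA .dll files to inserting the following code right after the imports, but none of them helped.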
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
This is the error that pops up when the model is compiled:
tensorflow/stream_executor/cuda/cuda_driver.cc:794] failed to alloc 4294967296 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory 2021-12-26 10:53:00.265328:
Model architecture:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
resizing (Resizing)          (None, 128, 128, 1)       0
_________________________________________________________________
normalization (Normalization (None, 128, 128, 1)       3
_________________________________________________________________
conv2d (Conv2D)              (None, 128, 128, 32)      320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 64, 64, 32)        0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 64, 64, 64)        18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 32, 64)        0
_________________________________________________________________
dropout (Dropout)            (None, 32, 32, 64)        0
_________________________________________________________________
flatten (Flatten)            (None, 65536)             0
_________________________________________________________________
dense (Dense)                (None, 216)               14155992
_________________________________________________________________
dropout_1 (Dropout)          (None, 216)               0
_________________________________________________________________
dense_1 (Dense)              (None, 36)                7812
=================================================================
Total params: 14,182,623
Trainable params: 14,182,620
Non-trainable params: 3
_________________________________________________________________
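As a sanity check on the summary, the parameter counts can be reproduced with plain arithmetic. The sketch below assumes 3x3 convolution kernels with a bias term, which is inferred from the counts in the table rather than stated in the question. The helper names `conv2d_params` and `dense_params` are made up for illustration.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    """Weight matrix plus one bias per unit."""
    return inputs * units + units


# Assumption: 3x3 kernels (this is what makes the counts match the summary).
conv2d = conv2d_params(3, 3, 1, 32)
conv2d_1 = conv2d_params(3, 3, 32, 64)
dense = dense_params(32 * 32 * 64, 216)  # the Flatten output has 65536 values
dense_1 = dense_params(216, 36)
non_trainable = 3  # taken from the Normalization row of the summary

trainable = conv2d + conv2d_1 + dense + dense_1
total = trainable + non_trainable

print("total params:", total)
print("share held by the first Dense layer: {:.2%}".format(dense / total))
print("float32 weights: {:.1f} MiB".format(trainable * 4 / 2**20))
```

Nearly all parameters sit in the first Dense layer, yet the weights themselves come to only about 54 MiB as float32, so the weights alone probably do not account for the out-of-memory error.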
And this is the error that appears when training starts (I have trimmed the repeated "failed to allocate memory" log lines):
2021-12-26 10:54:08.890289: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2021-12-26 10:54:08.891029: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2021-12-26 10:54:08.899859: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 1 GPUs
2021-12-26 10:54:08.933109: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti64_112.dll'; dlerror: cupti64_112.dll not found
2021-12-26 10:54:08.947342: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cupti.dll'; dlerror: cupti.dll not found
2021-12-26 10:54:08.948462: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1666] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
2021-12-26 10:54:08.956260: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2021-12-26 10:54:08.958977: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1757] function cupti_interface_->Finalize()failed with error CUPTI could not be loaded or symbol could not be found.
Epoch 1/50
2021-12-26 10:54:11.849166: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8204
2021-12-26 10:54:13.674500: W tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 144.00MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-12-26 10:54:13.674920: W tensorflow/core/kernels/gpu_utils.cc:49] Failed to allocate memory for convolution redzone checking; skipping this check. This is benign and only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats:
Limit: 519487488
InUse: 515392000
MaxInUse: 515465728
NumAllocs: 23263
MaxAllocSize: 134217728
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2021-12-26 10:54:29.147248: W tensorflow/core/common_runtime/bfc_allocator.cc:468] ***********xxxxxxxxxx**********************************************************************xxxxxxxxx
2021-12-26 10:54:29.151731: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at pooling_ops_common.cc:225 : Resource exhausted: OOM when allocating tensor with shape[64,64,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "train.py", line 196, in <module>
callbacks=[tf.keras.callbacks.ModelCheckpoint("models/net7_e50", monitor="val_loss", verbose=1, save_freq="epoch"), tf.keras.callbacks.TensorBoard("./logs/net7e50")],
File "<Project_Directory>\venv\lib\site-packages\keras\engine\training.py", line 1184, in fit
tmp_logs = self.train_function(iterator)
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 885, in __call__
result = self._call(*args, **kwds)
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 950, in _call
return self._stateless_fn(*args, **kwds)
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 3040, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 1964, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\function.py", line 596, in call
ctx=ctx)
File "<Project_Directory>\venv\lib\site-packages\tensorflow\python\eager\execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,64,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node sequential/max_pooling2d_1/MaxPool (defined at train.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_train_function_1432]
Function call stack:
train_function
Any help would be greatly appreciated!
【Comments】:
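The numbers in the log can be put in perspective with a rough back-of-the-envelope calculation. The sketch below is only an estimate under stated assumptions: activations are float32, one copy of every layer output is kept for the backward pass, and gradients, cuDNN workspaces, and optimizer state are ignored. It also assumes that the leading 64 in the failed shape [64,64,32,32] is the batch size, which the question does not confirm.

```python
from math import prod

MIB = 2**20
BYTES_PER_FLOAT32 = 4


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in bytes of a dense tensor with the given shape."""
    return prod(shape) * bytes_per_element


# Figures copied from the allocator stats in the log.
limit = 519487488
in_use = 515392000
free = limit - in_use

# The allocation that failed: shape [64, 64, 32, 32], type float.
failed = tensor_bytes((64, 64, 32, 32))

# Per-sample output shapes from the model summary (batch dimension dropped).
layer_outputs = [
    (128, 128, 1),   # resizing
    (128, 128, 1),   # normalization
    (128, 128, 32),  # conv2d
    (64, 64, 32),    # max_pooling2d
    (64, 64, 64),    # conv2d_1
    (32, 32, 64),    # max_pooling2d_1
    (32, 32, 64),    # dropout
    (65536,),        # flatten
    (216,),          # dense
    (216,),          # dropout_1
    (36,),           # dense_1
]
per_sample = sum(tensor_bytes(shape) for shape in layer_outputs)

print("allocator limit: {:.1f} MiB".format(limit / MIB))
print("free at failure: {:.1f} MiB".format(free / MIB))
print("failed tensor:   {:.1f} MiB".format(failed / MIB))
for batch_size in (64, 32, 16):
    estimate = per_sample * batch_size / MIB
    print("batch {:>2}: ~{:.0f} MiB of forward activations".format(batch_size, estimate))
```

Under these assumptions the allocator has roughly 495 MiB to work with, and a batch of 64 already needs a large fraction of that for forward activations alone. A smaller batch size, or making more GPU memory available to the process, would therefore be reasonable things to try first.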
-
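Since you are running out of memory, have you tried this: if the cause is memory fragmentation, the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' may improve the situation.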
-
@JJ I get the following error when I try to set TF_GPU_ALLOCATOR:
NameError: name 'cuda_malloc_async' is not defined
-
TF_GPU_ALLOCATOR is an environment variable, not Python code.
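The NameError presumably comes from writing the setting as a bare Python assignment, which makes Python look up `cuda_malloc_async` as a variable name. A minimal sketch of setting it from inside the script instead, assuming the variable has to be in place before TensorFlow initializes the GPU (setting it before the import is the simplest way to guarantee that):

```python
import os

# Set the variable as a string in the process environment.
# This has to run before TensorFlow sets up its GPU allocator,
# so it is placed ahead of the TensorFlow import.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

# import tensorflow as tf  # import only after the variable is set

print(os.environ["TF_GPU_ALLOCATOR"])
```

Alternatively, the variable can be set in the shell before launching the script, for example with `set TF_GPU_ALLOCATOR=cuda_malloc_async` in a Windows command prompt, followed by `python train.py`.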
-
More importantly, which model exactly are you training, and on which GPU?
-
Use dotenv to set environment variables and access them.
Tags: python tensorflow keras deep-learning