【Posted on】: 2021-01-29 17:11:45
【Question】:
I have been stuck on this for a while and would appreciate any insight. I am running a simple LSTM model with tf.keras on a single GPU (an NVIDIA Quadro P2000) and get the following error:
Epoch 1/10000
2021-01-29 11:03:37.958865: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-29 11:03:38.161824: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-01-29 11:03:38.445486: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-01-29 11:03:38.472038: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
947/1612 [================>.............] - ETA: 35s - loss: 3207.0856 - root_mean_squared_error: 56.5747
2021-01-29 11:04:29.588631: E tensorflow/stream_executor/dnn.cc:616] CUDNN_STATUS_INTERNAL_ERROR
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1859): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2021-01-29 11:04:29.606739: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cudnn_rnn_ops.cc:1521 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 17, 128, 1, 284, 128, 128]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\user\Documents\folder\file.py", line 98, in <module>
model.fit(
File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1100, in fit
tmp_logs = self.train_function(iterator)
File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
result = self._call(*args, **kwds)
File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\def_function.py", line 855, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\function.py", line 2942, in __call__
return graph_function._call_flat(
File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\function.py", line 1918, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\function.py", line 555, in call
outputs = execute.execute(
File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 17, 128, 1, 284, 128, 128]
[[{{node CudnnRNN}}]]
[[sequential/bidirectional/backward_lstm/PartitionedCall]] [Op:__inference_train_function_5421]
Function call stack:
train_function -> train_function -> train_function
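(A commonly suggested mitigation for CUDNN_STATUS_INTERNAL_ERROR on a single-GPU machine, not confirmed as the fix for this particular case, is to enable GPU memory growth so TensorFlow allocates GPU memory on demand instead of reserving it all up front. A minimal sketch:)

```python
import tensorflow as tf

# Must run before any op creates the GPU context (i.e. at the top of
# the script, before building or fitting the model).
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```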
【Discussion】:
- CUDNN_STATUS_INTERNAL_ERROR suggests a cuDNN error. Could you share the source code? The output of nvidia-smi might also help.
- @JakubBiały I added drop_remainder=True when batching the data and this error went away. I am still not sure why adding it fixes the problem, or why I cannot use drop_remainder=False (for example, drop_remainder=False works fine on the CPU, and a teammate has no problem even with a GPU).
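(The effect of the drop_remainder fix mentioned above can be sketched with a toy pipeline; batch size 3 over 10 elements stands in for the real data. With drop_remainder=True the final short batch is discarded, so every batch handed to the cuDNN LSTM kernel has the same static batch dimension:)

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# Without drop_remainder the last batch is ragged: sizes 3, 3, 3, 1.
ragged = [int(b.shape[0]) for b in ds.batch(3)]

# With drop_remainder=True every batch has the same static size: 3, 3, 3.
uniform = [int(b.shape[0]) for b in ds.batch(3, drop_remainder=True)]

print(ragged)   # [3, 3, 3, 1]
print(uniform)  # [3, 3, 3]
```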
Tags: tensorflow gpu