【Question title】: Issue with TensorFlow, cuDNN
【Posted】: 2021-01-29 17:11:45
【Question】:

I have been stuck on this for a while and would appreciate any insight. I get the following error (I have one GPU, an NVIDIA Quadro P2000) while running a simple LSTM model with tf.keras.

Epoch 1/10000
2021-01-29 11:03:37.958865: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-01-29 11:03:38.161824: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-01-29 11:03:38.445486: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-01-29 11:03:38.472038: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
947/1612 [================>.............] - ETA: 35s - loss: 3207.0856 - root_mean_squared_error: 56.5747
2021-01-29 11:04:29.588631: E tensorflow/stream_executor/dnn.cc:616] CUDNN_STATUS_INTERNAL_ERROR
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1859): 'cudnnRNNForwardTraining( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.handles(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2021-01-29 11:04:29.606739: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cudnn_rnn_ops.cc:1521 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 17, 128, 1, 284, 128, 128]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user\Documents\folder\file.py", line 98, in <module>
    model.fit(
  File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\def_function.py", line 855, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\function.py", line 555, in call
    outputs = execute.execute(
  File "C:\Users\user\Anaconda3\envs\nenv\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError:    Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 17, 128, 1, 284, 128, 128]
         [[{{node CudnnRNN}}]]
         [[sequential/bidirectional/backward_lstm/PartitionedCall]] [Op:__inference_train_function_5421]
 
Function call stack:
train_function -> train_function -> train_function

【Comments】:

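  • CUDNN_STATUS_INTERNAL_ERROR suggests this is a cuDNN error. Could you share the source code? The output of nvidia-smi might also help.
  • @JakubBiały I added drop_remainder=True when batching the data and the error went away. I am still not sure why adding it fixes the problem, or why I cannot use drop_remainder=False (for example, I can use drop_remainder=False on the CPU, and a teammate has no problem even on a GPU).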

Tags: tensorflow gpu


【Answer 1】:

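I found that this error only occurs when all of the following conditions hold:

  1. The model uses a cuDNN LSTM.
  2. The data is loaded from .tfrecord files.
  3. drop_remainder=False
  4. It happens only on the last batch of training.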
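
Conditions 3 and 4 are linked: with drop_remainder=False, the last batch is smaller than the others whenever the number of examples is not a multiple of the batch size. A minimal pure-Python sketch of that arithmetic follows; batch_sizes is a hypothetical helper written for illustration, and the dataset size of 1000 is an assumption, since the thread does not state the real one.

```python
def batch_sizes(num_examples, batch_size, drop_remainder):
    """Return the size of every batch when num_examples is split into batches."""
    full_batches, remainder = divmod(num_examples, batch_size)
    sizes = [batch_size] * full_batches
    # A short final batch is only emitted when the remainder is kept.
    if remainder and not drop_remainder:
        sizes.append(remainder)
    return sizes


# Assumed dataset size (not from the thread); batch size 128 matches the log.
kept = batch_sizes(1000, 128, drop_remainder=False)
dropped = batch_sizes(1000, 128, drop_remainder=True)

print(len(kept), kept[-1])        # 8 batches, the last one holds 104 examples
print(len(dropped), dropped[-1])  # 7 batches, all of size 128
```

Whether the odd-sized batch is what actually triggers the cuDNN failure is not established in the thread; the sketch only shows where such a batch comes from.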

Setting drop_remainder=True is the most direct fix; updating cuDNN may resolve the problem for good.
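
The workaround can be sketched with tf.data as below. This is only an illustration under assumptions: the in-memory zero tensors stand in for the .tfrecord pipeline, which the thread never shows, and the example count of 300 is arbitrary. The batch size of 128, sequence length of 284, and 17 input features are taken from the model config in the error log.

```python
import tensorflow as tf

BATCH_SIZE = 128  # batch_size reported in the logged model config

# Stand-in data (assumption): 300 sequences of 284 steps with 17 features.
features = tf.zeros([300, 284, 17])
targets = tf.zeros([300, 1])
dataset = tf.data.Dataset.from_tensor_slices((features, targets))

# drop_remainder=True discards the final short batch (300 - 2 * 128 = 44
# examples here), so every batch reaching the LSTM has the same size.
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

batch_shapes = [x.shape.as_list() for x, _ in dataset]
print(batch_shapes)
```

The cost is that the examples in the short final batch are skipped in each epoch; shuffling before batching changes which ones are skipped from epoch to epoch.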
