【Question Title】: PyTorch's autograd.detect_anomaly equivalent in TensorFlow
【Posted】: 2021-11-29 16:21:18
【Question】:

I am trying to debug TensorFlow code that suddenly produces a NaN loss after about 30 epochs. My specific problem and the things I have tried are described in this SO question.

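During training I monitored the weights of every layer for each mini-batch. The weights suddenly jumped to NaN, even though all weight values were below 1 in the previous iteration (I have set kernel_constraint to max_norm with a limit of 1). This makes it hard to determine which operation is the culprit.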

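PyTorch has a neat debugging method, torch.autograd.detect_anomaly, which raises an error and shows a traceback for any backward computation that produces NaN values. This makes the code easy to debug.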

Is there something similar in TensorFlow? If not, can you suggest a way to debug this?

【Question Comments】:

    Tags: python tensorflow machine-learning gradient


    【Answer 1】:

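    TensorFlow does have a similar debugging tool: see tf.debugging.check_numerics.

    It can be used to track down tensors that produce inf or NaN values during training. As soon as such a value is found, TensorFlow raises an InvalidArgumentError.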

    tf.debugging.check_numerics(LayerN, "LayerN is producing nans!")
    

    If the tensor LayerN contains NaNs, you will get an error like this:

    Traceback (most recent call last):
      File "trainer.py", line 506, in <module>
        worker.train_model()
      File "trainer.py", line 211, in train_model
        l, tmae = train_step(*batch)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
        result = self._call(*args, **kwds)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 855, in _call
        return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
        filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
        ctx, args, cancellation_manager=cancellation_manager))
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
        ctx=ctx)
      File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
        inputs, attrs, num_outputs)
    tensorflow.python.framework.errors_impl.InvalidArgumentError:  LayerN is producing nans! : Tensor had NaN values
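
Since torch.autograd.detect_anomaly targets the backward pass, it may help to apply the same check to the gradients as well as to layer outputs. The following is only a minimal sketch run in eager mode. The helper `checked_gradients`, the toy `loss_fn`, and the values of `x` and `w` are hypothetical and chosen so that the forward value is finite while the gradient is not (the derivative of sqrt at 0 is infinite).

```python
import tensorflow as tf


def checked_gradients(loss_fn, variables):
    """Compute loss and gradients, raising InvalidArgumentError on inf/NaN."""
    with tf.GradientTape() as tape:
        loss = loss_fn()
    # Check the forward result first.
    tf.debugging.check_numerics(loss, "loss is not finite")
    grads = tape.gradient(loss, variables)
    # Then check every gradient, so the message names the offending one.
    for i, grad in enumerate(grads):
        if grad is not None:  # variables unused by the loss have no gradient
            tf.debugging.check_numerics(grad, f"gradient {i} is not finite")
    return loss, grads


# Hypothetical toy problem: loss = sqrt(x @ w).
x = tf.constant([[1.0, 2.0]])
w = tf.Variable([[0.5], [0.25]])


def loss_fn():
    return tf.reduce_sum(tf.sqrt(tf.matmul(x, w)))


# x @ w is positive here, so both the loss and the gradient are finite.
loss, grads = checked_gradients(loss_fn, [w])

# With x @ w == 0 the loss is still finite (0.0), but d sqrt(u)/du is
# infinite at u = 0, so the gradient check fails.
w.assign([[0.0], [0.0]])
caught = None
try:
    checked_gradients(loss_fn, [w])
except tf.errors.InvalidArgumentError as err:
    caught = err
```

In a real training loop the same two checks would sit between `tape.gradient(...)` and the optimizer update. TensorFlow 2.x also provides tf.debugging.enable_check_numerics(), which instruments every op globally and is closer in spirit to detect_anomaly, at the cost of slower execution.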
    

    【Comments】:
