"Loss is NaN" error in Keras, how to debug?

【Question title】: "Loss is NaN" error in Keras, how to debug?
【Posted】: 2021-10-12 01:38:11
【Question】:

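I know there are other "Loss is NaN" questions here, but I am using example code from François Chollet (the author of Keras), which should be about the simplest example there is. I think my problem may be different from what others have run into.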

Mainly, I want to know what I can do with this API to gain insight into what is going wrong.

Here is the code. Apart from a few small modifications, it comes straight from Chollet's "Deep Learning with Python".

import os
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

from keras.datasets import mnist
(train_images,train_labels),(test_images,test_labels) = mnist.load_data()

from keras import models
from keras import layers

network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,), name="layer1"))
network.add(layers.Dense(10, activation='softmax', name="layer2"))

network.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32')/255

from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)

network.fit(train_images, train_labels, epochs=5, batch_size=128)

test_loss, test_acc = network.evaluate(test_images, test_labels)

print('Accuracy: ', test_acc)
print('Loss: ', test_loss)

I suspect the problem may lie in the backend (the first two lines).

The symptom is that while training the network all I see is "loss: nan" and an accuracy of 0.0987.

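I have read other threads about this error, and they suggest changing the optimizer or some of its parameters. For example, based on those threads, I have tried: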

from keras import optimizers
opt = optimizers.Adam(1e-2)

network.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

It does not change the symptoms. It does not matter what value I pass to the optimizer's constructor; I tried values from 0.01 to 1e-8. Same symptoms.

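I am running on a MacBook Air. It has no Nvidia GPU, so it cannot run the CUDA libraries that Keras uses for acceleration. Instead, following the advice of an article I found online, I installed OpenCL and PlaidML and added the first two lines to the script.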

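I strongly suspect something is wrong with the PlaidML setup, but it passed the tests that the installer provides, and I have not seen any warnings or error messages related to the graphics card.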

If that is not it, I have no idea what it could be.

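Anyway, being new to deep learning, new to Keras, and still relatively inexperienced with Python, I am completely at a loss as to what I can do to dig into what is happening here. Does Keras give me any debugging tools?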

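【Comments】:

I ran your code as is and it worked perfectly. I think your environment is causing the problem, so I suggest using a free online resource such as Google Colab or Kaggle.

Tags: python keras deep-learning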

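【Answer 1】:

I have learned a few relevant things over the past few days.

This is a known issue with PlaidML, and apparently it is still unresolved. There is a discussion of it on GitHub.
Although I had read in many places that TensorFlow only runs on Nvidia GPUs, I found on the Intel website that TensorFlow will run on a Mac's Intel CPU.
The pip install commands for TensorFlow that I found here and there did not work for me, but conda did: "conda install tensorflow" or "conda install tensorflow -c anaconda".
Since I now have both TensorFlow Keras and PlaidML Keras installed, I can choose the CPU or GPU version by using either "from tensorflow.keras import XXX" or "from keras import XXX". That is handy.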
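
The import switch described in the last point could be wrapped in a small helper. This is only a minimal sketch, assuming both packages are installed as described; `select_keras` and `load_keras` are hypothetical names made up for illustration and are not part of Keras, TensorFlow, or PlaidML. The backend string is the one from the question's script.

```python
import importlib
import os

# Backend string taken from the question's script.
PLAIDML_BACKEND = "plaidml.keras.backend"


def select_keras(use_plaidml):
    """Return the name of the Keras package to import (hypothetical helper).

    When PlaidML is requested, KERAS_BACKEND is set first, because
    standalone Keras reads that variable when it is imported.
    """
    if use_plaidml:
        os.environ["KERAS_BACKEND"] = PLAIDML_BACKEND
        return "keras"
    return "tensorflow.keras"


def load_keras(use_plaidml):
    """Import and return the selected Keras package (hypothetical helper)."""
    return importlib.import_module(select_keras(use_plaidml))


# Example, assuming TensorFlow is installed:
#   keras = load_keras(use_plaidml=False)
#   network = keras.models.Sequential()
```

Keeping the choice in one place means the rest of the script can refer to `keras.models`, `keras.layers` and so on without caring which backend is underneath.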

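Not yet investigated: according to many websites, an external Nvidia eGPU can be added to a Mac that has a Thunderbolt 3 port (one of my Macs has one, the other does not). Whether that would work remains to be seen. Gamer discussion boards seem to say it is not good enough for their purposes, but Keras-related sites say TensorFlow will use an eGPU just fine.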

In any case, I can carry on with my project using tensorflow.keras on the CPU. It is not ideal, but it works. With the test code above I see about 100 μs/sample, 6 s/epoch.

Bottom line for anyone having trouble with PlaidML-Keras: this is a widespread problem. PlaidML is broken on a lot of GPUs. For now, use TensorFlow on the Intel CPU, and keep an eye on Issue #168 for a fix.
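
On the original question of what can be done to narrow the problem down: one simple check is to look for NaN or Inf values stage by stage (inputs, weights, predictions) and see where they first appear. The sketch below uses only NumPy, with stand-in arrays so that it runs without Keras; `count_nonfinite` and `first_bad_stage` are hypothetical helper names, not Keras functions. The reported accuracy of 0.0987 is roughly what always predicting a single digit would score on MNIST, which would fit a model whose outputs are all NaN, though that is an interpretation rather than something confirmed here.

```python
import numpy as np


def count_nonfinite(array):
    """Return (nan_count, inf_count) for any array-like of numbers."""
    values = np.asarray(array, dtype=np.float64)
    return int(np.isnan(values).sum()), int(np.isinf(values).sum())


def first_bad_stage(stages):
    """Given ordered (name, array) pairs, return the first name whose
    array contains NaN or Inf, or None if every stage is clean."""
    for name, array in stages:
        nans, infs = count_nonfinite(array)
        if nans or infs:
            return name
    return None


# Stand-in data so the sketch runs without Keras installed.
inputs = np.random.rand(4, 784).astype("float32")
predictions = np.full((4, 10), np.nan, dtype="float32")  # simulated broken output
print(first_bad_stage([("inputs", inputs), ("predictions", predictions)]))

# With the question's script, the same check could be applied to real arrays:
#   batch = train_images[:128]
#   weights = np.concatenate([w.ravel() for w in network.get_weights()])
#   preds = network.predict(batch)
#   print(first_bad_stage([("inputs", batch), ("weights", weights),
#                          ("predictions", preds)]))
# Keras also ships a callback that stops training at the first NaN loss:
#   network.fit(train_images, train_labels, epochs=5, batch_size=128,
#               callbacks=[keras.callbacks.TerminateOnNaN()])
```

If the inputs and the freshly initialized weights are clean but the predictions are not, that points at the backend's computation rather than at the data or the optimizer settings.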
