Are these acceptable gradient differences in my gradient checking implementation? (Answers)

【Question title】: Are these acceptable gradient differences in my gradient checking implementation?
【Posted】: 2020-09-23 12:33:14
【Question description】:

I am building a CNN with a few FC layers to predict the class depicted in an image.

Architecture:

X -> CNN -> ReLU -> POOL -> FC -> ReLU -> FC -> SOFTMAX -> Y_hat

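I am implementing gradient checking to verify that my gradient descent implementation is correct. I read that an acceptable difference is around 10e-9. Do the differences below look acceptable?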

Epoch: 0
Cost: 2.8568426944476157
Numerical Grad           Computed Grad
-5.713070134419862e-11   -6.616929226765933e-11
-5.979710331310053e-11   -6.94999613415348e-11
-5.87722383797037e-11    -6.816769371198461e-11
-5.948114792212038e-11   -6.905587213168474e-11
-5.756886551189494e-11   -6.683542608243442e-11
-5.995452767971952e-11   -6.94999613415348e-11
-5.772401095738584e-11   -6.705747068735946e-11
-5.5480026579651e-11     -6.439293542825908e-11
-5.8138150324971285e-11  -6.727951529228449e-11
-5.76037967235867e-11    -6.683542608243442e-11

For reference, here is my gradient checking implementation:

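def gradient_check(self, layer):
    # get grads from layer (the backprop gradients cached by the backward pass)
    grads = layer.backward_cache['dW']
    # flatten layer W into a flat copy; layer.W is reassigned from it below
    shape = layer.W.shape
    W_flat = layer.W.flatten()

    # step size for the central difference
    epsilon = 0.001

    print('Numerical Grad', 'Computed Grad')
    # loop through first few W's
    for i in range(0, 10):
        W_initial = W_flat[i]
        W_plus = W_initial + epsilon
        W_minus = W_initial - epsilon

        # cost with W[i] nudged up by epsilon
        W_flat[i] = W_plus
        layer.W = W_flat.reshape(shape)
        cost_plus = self.compute_cost(self.forward_propogate())

        # cost with W[i] nudged down by epsilon
        W_flat[i] = W_minus
        layer.W = W_flat.reshape(shape)
        cost_minus = self.compute_cost(self.forward_propogate())

        # central-difference (numerical) estimate of dCost/dW[i]
        computed_grad = (cost_plus - cost_minus) / (2 * epsilon)

        # note: this prints the backprop gradient first and the numerical
        # estimate second, the opposite order from the header printed above
        print(grads.flatten()[i], computed_grad)

        # reset layers W's
        W_flat[i] = W_initial
        layer.W = W_flat.reshape(shape)

    return layer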
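
For comparison, the same central-difference idea can be tried on a function whose gradient is known exactly. The following is a minimal sketch, not the asker's network: it assumes a toy cost, the log-sum-exp of the weights, whose exact gradient is the softmax of the weights. The names `cost`, `analytic_grad`, and `numerical_grad` are made up for this illustration, and it uses plain Python lists rather than arrays.

```python
import math


def cost(w):
    """Toy cost (an assumption for this sketch): log(sum(exp(w_i)))."""
    m = max(w)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in w))


def analytic_grad(w):
    """Exact gradient of the toy cost, which is softmax(w)."""
    m = max(w)
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]


def numerical_grad(f, w, epsilon=1e-5):
    """Central-difference estimate of df/dw_i, one weight at a time."""
    grad = []
    for i in range(len(w)):
        original = w[i]
        w[i] = original + epsilon
        cost_plus = f(w)
        w[i] = original - epsilon
        cost_minus = f(w)
        w[i] = original  # restore before moving to the next weight
        grad.append((cost_plus - cost_minus) / (2 * epsilon))
    return grad


weights = [0.3, -1.2, 0.8, 2.0]
exact = analytic_grad(weights)
approx = numerical_grad(cost, weights)
for a, n in zip(exact, approx):
    print(f"analytic={a:.10f}  numerical={n:.10f}")
```

In this toy case the gradient entries lie roughly between 0.01 and 1, so the two columns can be compared digit by digit.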

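【Question discussion】:

1e-11 is basically 0, so I don't think the data you are checking against is very meaningful (if all the "true" gradients are below the precision of interest, an equally acceptable implementation would be "return 0").
The data is a set of images, each representing one of 7 classes. I previously trained the model to predict the classes accurately (on the training set only). Could you explain what you mean by "not meaningful" in this context? Thanks.
I mean that the gradients shown in your output are so small that an equally correct gradient estimate would output 0. My guess is that your learning rate is huge to compensate, or that these gradients grow later in training, or that these 10 dimensions are degenerate and the rest have larger values. Either way, for the question asked (checking the numerical precision of a gradient computation), something that always outputs values on the order of 1e-11 makes the estimates very hard to check.
As a preface, I am new to ML and to more complex things like CNNs. I am trying to build my own from scratch. I noticed that if I remove the ReLU activation after the initial CNN, training is much faster. I posted the architecture above for reference.
Instead of printing each dimension of the gradient, measure the L2 distance between the gradients (or other norms/statistics, such as min/max error). That will give you a more appropriate picture.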
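
The suggestion above (summarize the mismatch with a norm or min/max statistics instead of printing every entry) could be sketched as follows. This is an illustrative helper, not part of the asker's code: the function name `gradient_error_stats` and the dictionary keys are assumptions, and both gradients are taken as flat sequences of equal length.

```python
import math


def gradient_error_stats(analytic, numerical):
    """Summarize the mismatch between two flat gradient vectors."""
    if len(analytic) != len(numerical):
        raise ValueError("gradient vectors must have the same length")
    abs_errors = [abs(a - n) for a, n in zip(analytic, numerical)]
    return {
        "l2_distance": math.sqrt(sum(e * e for e in abs_errors)),
        "min_abs_error": min(abs_errors),
        "max_abs_error": max(abs_errors),
    }


# Made-up numbers, chosen only so the result is easy to verify by hand.
stats = gradient_error_stats([3.0, 0.0], [0.0, 4.0])
print(stats)
```

In practice the two arguments would be the full flattened backprop gradient and the full numerical estimate, not just the first ten entries.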

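Tags: python machine-learning gradient-descent

【Solution 1】:

After looking into why the gradients are close to zero, I found that my network may be suffering from a gradient plateau problem. The fix is to add one or all of the following: momentum, RMSprop, or Adam optimization. I will try implementing Adam, since it combines momentum and RMSprop, and if it works I will mark my answer as correct.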
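
As a reminder of why Adam is described as combining the two, a single Adam update could be sketched as below. This is a generic sketch of the update rule, not the asker's implementation. The function name `adam_step` is hypothetical, and the defaults (beta1 = 0.9, beta2 = 0.999, eps = 1e-8) are the commonly quoted ones, assumed here.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on flat lists; t is the 1-based step count."""
    new_w, new_m, new_v = [], [], []
    for w_i, g_i, m_i, v_i in zip(w, grad, m, v):
        # Momentum part: running average of the gradient.
        m_i = beta1 * m_i + (1 - beta1) * g_i
        # RMSprop part: running average of the squared gradient.
        v_i = beta2 * v_i + (1 - beta2) * g_i * g_i
        # Bias correction for the zero-initialized averages.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        new_w.append(w_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_w, new_m, new_v


# Made-up weights and gradients for a single first step.
w, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
w, m, v = adam_step(w, [0.5, -3.0], m, v, t=1)
print(w)
```

On the first step the bias-corrected ratio is close to ±1, so each weight moves by roughly `lr` whatever the scale of its gradient, provided the gradient is much larger than `eps`.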

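Follow-up edit: unfortunately, when I implemented Adam it only led to exploding gradients, even with a very small learning rate of 1e-5. By adding two more conv->relu->pool layers I did make some progress in increasing the numerical gradients. Either way, the gradient computation still does not look correct. The problem must be my backpropagation implementation.

【Discussion】:

【Solution 2】:

You can use this formula to look at the relative error between these numbers: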

diff = (|grads - computed_grad|)/(|grads| + |computed_grad|)

If the implementation is correct, this is expected to be less than 1e-7.

See: https://towardsdatascience.com/how-to-debug-a-neural-network-with-gradient-checking-41deec0357a9
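
To make the formula concrete, here is a small sketch that applies it to the ten value pairs posted in the question. It reads |·| as the Euclidean norm over the ten entries, which is an assumption about the answer's intent, and the names `l2_norm` and `relative_error` are made up for the example.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat sequence of numbers."""
    return math.sqrt(sum(x * x for x in values))


def relative_error(grads, computed_grad):
    """||a - b|| / (||a|| + ||b||) for two flat gradient vectors."""
    diff = [a - b for a, b in zip(grads, computed_grad)]
    return l2_norm(diff) / (l2_norm(grads) + l2_norm(computed_grad))


# The two columns printed in the question (the formula is symmetric in them).
first_column = [
    -5.713070134419862e-11, -5.979710331310053e-11, -5.87722383797037e-11,
    -5.948114792212038e-11, -5.756886551189494e-11, -5.995452767971952e-11,
    -5.772401095738584e-11, -5.5480026579651e-11, -5.8138150324971285e-11,
    -5.76037967235867e-11,
]
second_column = [
    -6.616929226765933e-11, -6.94999613415348e-11, -6.816769371198461e-11,
    -6.905587213168474e-11, -6.683542608243442e-11, -6.94999613415348e-11,
    -6.705747068735946e-11, -6.439293542825908e-11, -6.727951529228449e-11,
    -6.683542608243442e-11,
]

diff = relative_error(first_column, second_column)
print(f"relative error: {diff:.3e}")
print("below 1e-7" if diff < 1e-7 else "above 1e-7")
```

For the posted numbers the relative error comes out at roughly 0.07, far above the 1e-7 threshold.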

【Discussion】: