Implementing backpropagation gradient descent using scipy.optimize.minimize (answers)

【Question title】: Implementing backpropagation gradient descent using scipy.optimize.minimize
【Posted】: 2018-05-14 18:02:55
【Question】:

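I am trying to train an autoencoder NN (3 layers: 2 visible, 1 hidden) on the MNIST digit image dataset using numpy and scipy. The implementation is based on the notation given here. Below is my code: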

import numpy

def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
  """
  The input theta is a 1-dimensional array because scipy.optimize.minimize expects
  the parameters being optimized to be a 1d array.
  First convert theta from a 1d array to the (W1, W2, b1, b2)
  matrix/vector format, so that this follows the notation convention of the
  lecture notes and tutorial.
  You must compute the:
      cost : scalar representing the overall cost J(theta)
      grad : array representing the corresponding gradient of each element of theta
  """

  training_size = data.shape[1]
  # unroll theta to get (W1,W2,b1,b2) #
  W1 = theta[0:hidden_size*visible_size]
  W1 = W1.reshape(hidden_size,visible_size)

  W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
  W2 = W2.reshape(visible_size,hidden_size)

  b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
  b2 = theta[2*hidden_size*visible_size + hidden_size: 2*hidden_size*visible_size + hidden_size + visible_size]

  #feedforward pass
  a_l1 = data

  z_l2 = W1.dot(a_l1) + numpy.tile(b1,(training_size,1)).T
  a_l2 = sigmoid(z_l2)

  z_l3 = W2.dot(a_l2) + numpy.tile(b2,(training_size,1)).T
  a_l3 = sigmoid(z_l3)

  #backprop
  delta_l3 = numpy.multiply(-(data-a_l3),numpy.multiply(a_l3,1-a_l3))
  delta_l2 = numpy.multiply(W2.T.dot(delta_l3),
                             numpy.multiply(a_l2, 1 - a_l2))

  b2_derivative = numpy.sum(delta_l3,axis=1)/training_size
  b1_derivative = numpy.sum(delta_l2,axis=1)/training_size

  W2_derivative = numpy.dot(delta_l3,a_l2.T)/training_size + lambda_*W2
  #print(W2_derivative.shape)
  W1_derivative = numpy.dot(delta_l2,a_l1.T)/training_size + lambda_*W1

  W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
  W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
  b1_derivative = b1_derivative.reshape(hidden_size)
  b2_derivative = b2_derivative.reshape(visible_size)


  grad = numpy.concatenate((W1_derivative,W2_derivative,b1_derivative,b2_derivative))
  cost = 0.5*numpy.sum((data-a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
  return cost,grad
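The function above calls `sigmoid`, which is not shown in the question. Since the backprop step uses `a*(1-a)` as the activation derivative, it is presumably the standard logistic function. A minimal sketch under that assumption (the author's actual helper may differ):

```python
import numpy

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z)) (assumed definition)."""
    return 1.0 / (1.0 + numpy.exp(-z))

# The derivative of the logistic function is a * (1 - a), which is the
# factor used for delta_l3 and delta_l2 in the backprop step.
a = sigmoid(numpy.array([0.0, 2.0]))
print(a * (1 - a))
```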

I have also implemented a function that estimates the gradient numerically, to verify the correctness of my implementation (below).

def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
  """
  :param J: a loss (cost) function that computes the real-valued loss given parameters and data
  :param theta: array of parameters
  :param epsilon: amount to vary each parameter in order to estimate
                  the gradient by numerical difference
  :return: array of numerical gradient estimate
  """

  gradient = numpy.zeros(theta.shape)

  eps_vector = numpy.zeros(theta.shape)
  for i in range(0,theta.size):

      eps_vector[i] = epsilon
      cost1,grad1 = J(theta+eps_vector)
      cost2,grad2 = J(theta-eps_vector)
      gradient[i] = (cost1 - cost2)/(2*epsilon)
      eps_vector[i] = 0


  return gradient

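The norm of the difference between the numerical estimate and the gradient computed by the function is about 6.87165125021e-09, which seems acceptable. My main problem seems to be getting the "L-BFGS-B" gradient descent algorithm to work with the scipy.optimize.minimize function, as shown below: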

# theta is the 1-D array of(W1,W2,b1,b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)

This gives me the following output:

scipy.optimize.minimize() details:
  fun: 90.802022224079778
 hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
  jac: array([ -6.83667742e-06,  -2.74886002e-06,  -3.23531941e-06, ...,
     1.22425735e-01,   1.23425062e-01,   1.28091250e-01])
message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
 nfev: 21
  nit: 0
 status: 2
success: False
    x: array([-0.06836677, -0.0274886 , -0.03235319, ...,  0.        ,
    0.        ,  0.        ])

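Now, this post seems to suggest that the error could mean the gradient function implementation is wrong? But my numerical gradient estimate seems to confirm that my implementation is correct. I have tried varying the initial weights by using a uniform distribution as specified here, but the problem persists. Is there anything wrong with my backpropagation implementation?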

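【Comments】:

L-BFGS-B is not gradient descent. (And it looks like your optimization problem is badly scaled.) To verify your gradient, use scipy's optimize.check_grad.
By scaling, do you mean the maxiter parameter in the options?
Of course not. Your optimizer is not even completing a single iteration. How would 4000 matter here (especially for a line-search-based algorithm with second-order convergence checks)? I mean the differences in the magnitudes of the gradient.
I just checked the gradient difference using scipy's optimize.check_grad function as suggested. I get a value of 9.59683630072e-05. Is this error high?
Interestingly, if I modify autoencoder_cost_and_grad to return only the cost and remove jac=True from the minimize call, the minimization runs successfully. So my gradient must be wrong.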
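The check_grad suggestion above can be sketched as follows. check_grad expects the cost and the gradient as two separate callables, so a function that returns a (cost, grad) pair has to be split. The toy cost here is a hypothetical stand-in for autoencoder_cost_and_grad, not the author's code:

```python
import numpy
from scipy.optimize import check_grad

def toy_cost_and_grad(x):
    """Hypothetical stand-in: returns (cost, gradient) like the autoencoder cost."""
    cost = 0.5 * numpy.sum(x ** 2)
    grad = x.copy()
    return cost, grad

# The wrapper that would be handed to scipy.optimize.minimize with jac=True.
J = lambda x: toy_cost_and_grad(x)

x0 = numpy.array([1.0, -2.0, 3.0])

# Split the (cost, grad) pair into the two callables check_grad expects.
err = check_grad(lambda x: J(x)[0], lambda x: J(x)[1], x0)
print(err)
```

The returned value is the absolute 2-norm of the difference between the analytic gradient and a finite-difference estimate, so whether a given number is "high" depends on the size of the gradient itself. It is also worth running the check on the exact callable that is passed to minimize, rather than on the underlying cost function alone.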

Tags: numpy scipy neural-network backpropagation

【Solution 1】:

It turns out the problem was a (really silly) mistake in this line:

J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)

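I was not even using the lambda's argument x in the function call. So whenever J was called, the theta array supplied by the optimizer was never passed on: J always evaluated the cost and gradient at the initial theta, and therefore returned the same values at every trial point, which is why the line search could not make progress.

This fixed it: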

J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)
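The difference between the two wrappers can be reproduced on a small problem. This is only a sketch: `toy_cost_and_grad`, `theta0` and `target` are hypothetical names standing in for the autoencoder cost, the initial parameters and the data.

```python
import numpy
import scipy.optimize

def toy_cost_and_grad(theta, target):
    """Hypothetical stand-in for autoencoder_cost_and_grad: returns (cost, grad)."""
    diff = theta - target
    return 0.5 * numpy.sum(diff ** 2), diff

theta0 = numpy.zeros(3)
target = numpy.array([1.0, -2.0, 3.0])

# Buggy wrapper: ignores x, so cost and gradient never change.
J_bad = lambda x: toy_cost_and_grad(theta0, target)
# Fixed wrapper: evaluates at the point the optimizer asks for.
J_good = lambda x: toy_cost_and_grad(x, target)

res_bad = scipy.optimize.minimize(J_bad, theta0, method='L-BFGS-B', jac=True)
res_good = scipy.optimize.minimize(J_good, theta0, method='L-BFGS-B', jac=True)

print(res_bad.success, res_bad.nit, res_bad.message)
print(res_good.success, res_good.x)
```

With the buggy wrapper the optimizer sees a constant cost paired with a nonzero gradient, so no step can satisfy the line search's decrease condition; that is consistent with the nit: 0 and the line-search termination message in the question. With the fixed wrapper, the result should approach `target`.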
