Answers: scipy.optimize.minimize function with L-BFGS-B method maxiter attribute not working

【Question title】: scipy.optimize.minimize function with L-BFGS-B method maxiter attribute not working
【Posted】: 2017-06-13 17:37:51
【Question】:

I have a simple cost function that I want to optimize using the scipy.optimize.minimize function.

opt_solution = scipy.optimize.minimize(costFunction, theta, args = (training_data,), method = 'L-BFGS-B', jac = True, options = {'maxiter': 100})

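Here costFunction is the function to be optimized and theta holds the parameters to optimize. Inside costFunction I print the value of the cost. However, changing maxiter from 10 to 100000 seems to have no effect: the run takes the same amount of time. I also expected the number of printed cost values to equal maxiter. So it looks to me as if maxiter has no effect. What could be the problem? The cost function is: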

def costFunction(self, theta, input):

    """ Extract weights and biases from 'theta' input """

    W1 = theta[self.limit0 : self.limit1].reshape(self.hidden_size, self.visible_size)
    W2 = theta[self.limit1 : self.limit2].reshape(self.visible_size, self.hidden_size)
    b1 = theta[self.limit2 : self.limit3].reshape(self.hidden_size, 1)
    b2 = theta[self.limit3 : self.limit4].reshape(self.visible_size, 1)

    """ Compute output layers by performing a feedforward pass
        Computation is done for all the training inputs simultaneously """

    hidden_layer = self.sigmoid(numpy.dot(W1, input) + b1)
    output_layer = self.sigmoid(numpy.dot(W2, hidden_layer) + b2)

    """ Compute intermediate difference values using Backpropagation algorithm """

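    diff = output_layer - input
    sum_of_squares_error = 0.5 * numpy.sum(numpy.multiply(diff, diff)) / input.shape[1]
    weight_decay         = 0.5 * self.lamda * (numpy.sum(numpy.multiply(W1, W1)) + numpy.sum(numpy.multiply(W2, W2)))
    cost                 = sum_of_squares_error + weight_decay

    # del_out and del_hid were not defined in the posted snippet. The two lines
    # below are a reconstruction: the usual backpropagation deltas for a
    # squared-error loss, assuming self.sigmoid is the logistic function.
    del_out = numpy.multiply(diff, numpy.multiply(output_layer, 1 - output_layer))
    del_hid = numpy.multiply(numpy.dot(numpy.transpose(W2), del_out),
                             numpy.multiply(hidden_layer, 1 - hidden_layer))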

    """ Compute the gradient values by averaging partial derivatives
        Partial derivatives are averaged over all training examples """

    W1_grad = numpy.dot(del_hid, numpy.transpose(input))
    W2_grad = numpy.dot(del_out, numpy.transpose(hidden_layer))
    b1_grad = numpy.sum(del_hid, axis = 1)
    b2_grad = numpy.sum(del_out, axis = 1)

    W1_grad = W1_grad / input.shape[1] + self.lamda * W1
    W2_grad = W2_grad / input.shape[1] + self.lamda * W2
    b1_grad = b1_grad / input.shape[1]
    b2_grad = b2_grad / input.shape[1]

    """ Transform numpy matrices into arrays """

    W1_grad = numpy.array(W1_grad)
    W2_grad = numpy.array(W2_grad)
    b1_grad = numpy.array(b1_grad)
    b2_grad = numpy.array(b2_grad)

    """ Unroll the gradient values and return as 'theta' gradient """

    theta_grad = numpy.concatenate((W1_grad.flatten(), W2_grad.flatten(),
                                    b1_grad.flatten(), b2_grad.flatten()))
    # Update the call counter and report the current cost
    self.counter += 1
    print("Index ", self.counter, "cost ", cost)
    return [cost, theta_grad]

【Comments】:

What is your cost function?
The cost function is a simple mean squared error, like (x-x')^2.

Tags: python optimization scipy

【Solution 1】:

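maxiter is the maximum number of iterations scipy will attempt before it gives up on improving the solution. It is only an upper bound, though: the optimizer may well be satisfied with the solution and stop earlier.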

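If you look at the docs for minimize when using the 'l-bfgs-b' method, note that there are three parameters (factr, ftol and gtol) that can also cause the iteration to stop. Through the minimize interface the options to pass are ftol and gtol; factr is the name used by scipy.optimize.fmin_l_bfgs_b, and the two are related by ftol = factr * numpy.finfo(float).eps.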
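To see the stopping tolerances at work, here is a minimal sketch. It is not the asker's autoencoder: it assumes SciPy's built-in Rosenbrock test function (rosen, rosen_der) as a stand-in cost, and the starting point and tolerance values are arbitrary choices for the demo. Both runs use the same maxiter; only ftol and gtol differ.

```python
from scipy.optimize import minimize, rosen, rosen_der

x0 = [-1.2, 1.0, -1.2, 1.0]  # arbitrary starting point for the demo

# Loose tolerances: the optimizer is satisfied sooner.
loose = minimize(rosen, x0, method='L-BFGS-B', jac=rosen_der,
                 options={'maxiter': 100000, 'ftol': 1e-2, 'gtol': 1e-2})

# Tight tolerances: same maxiter, but at least as many iterations are run.
tight = minimize(rosen, x0, method='L-BFGS-B', jac=rosen_der,
                 options={'maxiter': 100000, 'ftol': 1e-15, 'gtol': 1e-10})

# nit is the number of iterations actually performed;
# message states which stopping rule fired.
print("loose:", loose.nit, loose.message)
print("tight:", tight.nit, tight.message)
```

In both cases the message field of the result reports why the run ended, which is the quickest way to check whether maxiter or a tolerance was responsible.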

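In a simple case like yours, especially when the cost function also supplies the gradient (as jac=True in your call indicates), convergence usually happens within the first few iterations, well before the maxiter limit is reached.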
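The point can be checked with a small experiment. The sketch below uses a hypothetical stand-in cost (a plain quadratic that returns its value and gradient together, as jac=True expects) rather than the cost function from the question, and runs the same problem with two very different maxiter values.

```python
import numpy as np
from scipy.optimize import minimize

def quadratic(x):
    """Hypothetical stand-in cost: returns (value, gradient) for jac=True."""
    return 0.5 * np.dot(x, x), x

x0 = np.array([3.0, -2.0, 1.0])  # arbitrary starting point

results = {}
for maxiter in (10, 100000):
    res = minimize(quadratic, x0, method='L-BFGS-B', jac=True,
                   options={'maxiter': maxiter})
    results[maxiter] = res
    # nit: iterations actually performed; nfev: function evaluations
    print(maxiter, res.nit, res.nfev, res.success, res.message)
```

Both runs stop after the same small number of iterations, because convergence is detected long before either limit matters. That is consistent with the unchanged run time described in the question.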

【Discussion】:

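OK, I see. That may be it. One more odd thing: shouldn't the number of calls to the cost function equal the maxiter value? In my case the two values differ. I am printing the counter to verify this.
If you give it the gradient, I think they should be very close, although there may be a few extra calls during setup; it is hard to tell without looking at the actual source code.
I have now edited my question to include the cost function. The function is long, but I hope it makes things easier to understand.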
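One way to compare the number of cost-function calls with the number of iterations is to count the calls and read nit and nfev from the returned result. This is a minimal sketch under assumptions: the counted_cost wrapper and the calls dictionary are hypothetical names, and SciPy's Rosenbrock function stands in for the real cost. Each iteration includes a line search that may evaluate the function more than once, so the call count can exceed nit. L-BFGS-B also accepts a separate maxfun option that caps function evaluations.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

calls = {'n': 0}  # hypothetical counter, analogous to self.counter in the question

def counted_cost(x):
    """Stand-in cost returning (value, gradient); counts every call."""
    calls['n'] += 1
    return rosen(x), rosen_der(x)

res = minimize(counted_cost, np.array([-1.2, 1.0]), method='L-BFGS-B',
               jac=True, options={'maxiter': 500})

print("iterations (nit):", res.nit)
print("evaluations reported (nfev):", res.nfev)
print("calls counted:", calls['n'])
```

Comparing the three printed numbers on the real cost function shows directly how far the call count is from the iteration count.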