Title: (Stochastic) Gradient Descent implementation in Python
Posted: 2022-12-02 01:01:07
Question:

I am trying to use (preferably stochastic) gradient descent to minimize a custom loss function. I tried scikit-learn's SGDRegressor class. However, SGDRegressor doesn't appear to let me minimize a custom loss function without data, and even where a custom loss is supported, it can only be used for regression, fitting data through the fit() method.

Is there a way to use scikit implementation or any other Python implementation of stochastic gradient descent to minimize a custom function without data?

Comments:

  • What do you mean without data? This sounds like standard usage of Keras and TensorFlow, where autodiff computes the gradients of your custom loss for you.

Tags: python tensorflow keras scikit-learn gradient-descent


Answer 1:

Implementation of Basic Gradient Descent

Now that you know how basic gradient descent works, you can implement it in Python. You'll use only plain Python and NumPy, which enables you to write concise code when working with arrays (or vectors) and gain a performance boost.

This is a basic implementation of the algorithm that starts with an arbitrary point, start, iteratively moves it toward the minimum, and returns a point that is hopefully at or near the minimum:

def gradient_descent(gradient, start, learn_rate, n_iter):
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)
        vector += diff
    return vector

gradient_descent() takes four arguments:

  • gradient is the function or any Python callable object that takes a vector and returns the gradient of the function you're trying to minimize.
  • start is the point where the algorithm starts its search, given as a sequence (tuple, list, NumPy array, and so on) or scalar (in the case of a one-dimensional problem).
  • learn_rate is the learning rate that controls the magnitude of the vector update.
  • n_iter is the number of iterations.

This function does exactly what's described above: it takes a starting point, iteratively updates it according to the learning rate and the value of the gradient, and finally returns the last position found.

Before you apply gradient_descent(), you can add another termination criterion:

import numpy as np

def gradient_descent(
    gradient, start, learn_rate, n_iter=50, tolerance=1e-06):
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector

You now have the additional parameter tolerance, which specifies the minimal allowed movement in each iteration. You've also defined default values for tolerance and n_iter, so you don't have to specify them each time you call gradient_descent().

The tolerance check enables gradient_descent() to stop iterating and return the result before n_iter is reached if the vector update in the current iteration is less than or equal to tolerance. This often happens near the minimum, where gradients are usually very small. Unfortunately, it can also happen near a local minimum or a saddle point.

The check uses the convenient NumPy functions numpy.all() and numpy.abs() to compare the absolute values of diff against tolerance in a single statement, which is why the code now imports numpy.
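To see the early stopping in action (a quick sketch; the function is repeated from above so the snippet runs on its own), compare two calls that differ only in n_iter. Both stop at the same point because the tolerance check breaks out of the loop long before either limit is reached:

```python
import numpy as np

def gradient_descent(
        gradient, start, learn_rate, n_iter=50, tolerance=1e-06):
    # Same function as above
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector

# With learn_rate=0.2 each update shrinks the distance to the minimum
# by a factor of 0.6, so the tolerance is hit after roughly 31 iterations.
a = gradient_descent(lambda v: 2 * v, start=10.0, learn_rate=0.2, n_iter=50)
b = gradient_descent(lambda v: 2 * v, start=10.0, learn_rate=0.2, n_iter=5000)
print(a == b)  # True: both runs stop at the same point
```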

Now that you have the first version of gradient_descent(), it's time to test your function. You'll start with a small example and find the minimum of the function C = v².

This function has only one independent variable (v), and its gradient is the derivative 2v. It's a differentiable convex function, and the analytical way to find its minimum is straightforward. However, in practice, analytical differentiation can be difficult or even impossible and is often approximated with numerical methods.
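As a small illustration of that last point (the helper name num_gradient is my own, not part of the tutorial), a central finite difference can stand in for the analytical derivative when it is hard to obtain:

```python
def num_gradient(f, v, h=1e-6):
    # Central difference: approximates f'(v) as (f(v+h) - f(v-h)) / (2h)
    return (f(v + h) - f(v - h)) / (2 * h)

# For f(v) = v**2 the exact derivative at v = 3 is 2 * 3 = 6
print(num_gradient(lambda v: v ** 2, 3.0))  # close to 6.0
```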

You need only one statement to test your gradient descent implementation:

>>> gradient_descent(
...     gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2)
2.210739197207331e-06

You use the lambda function lambda v: 2 * v to provide the gradient of v². You start from the value 10.0 and set the learning rate to 0.2. You get a result that's very close to zero, which is the correct minimum.

The figure below shows the movement of the solution through the iterations:

(Figure omitted: the iterates of gradient descent on C = v², moving from v = 10 toward the minimum at v = 0.)

You start from the rightmost green dot (v = 10) and move toward the minimum (v = 0). The updates are larger at first because the magnitude of the gradient (and the slope) is higher. As you approach the minimum, they become smaller.
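The learning rate governs how large those updates are. For C = v² each step multiplies v by (1 - 2 * learn_rate), so a tiny rate converges very slowly and a rate of 1.0 or more never converges at all (a sketch reusing the gradient_descent() defined above, repeated so it runs on its own):

```python
import numpy as np

def gradient_descent(
        gradient, start, learn_rate, n_iter=50, tolerance=1e-06):
    # Same function as above
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector

grad = lambda v: 2 * v  # gradient of C = v**2

slow = gradient_descent(grad, 10.0, learn_rate=0.005)  # still far from 0
good = gradient_descent(grad, 10.0, learn_rate=0.2)    # very close to 0
stuck = gradient_descent(grad, 10.0, learn_rate=1.0)   # oscillates: v -> -v
print(slow, good, stuck)
```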

Improvement of the Code

You can make gradient_descent() more robust, comprehensive, and better-looking without modifying its core functionality:

import numpy as np

def gradient_descent(
    gradient, x, y, start, learn_rate=0.1, n_iter=50, tolerance=1e-06,
    dtype="float64"):
    # Checking if the gradient is callable
    if not callable(gradient):
        raise TypeError("'gradient' must be callable")

    # Setting up the data type for NumPy arrays
    dtype_ = np.dtype(dtype)

    # Converting x and y to NumPy arrays
    x, y = np.array(x, dtype=dtype_), np.array(y, dtype=dtype_)
    if x.shape[0] != y.shape[0]:
        raise ValueError("'x' and 'y' lengths do not match")

    # Initializing the values of the variables
    vector = np.array(start, dtype=dtype_)

    # Setting up and checking the learning rate
    learn_rate = np.array(learn_rate, dtype=dtype_)
    if np.any(learn_rate <= 0):
        raise ValueError("'learn_rate' must be greater than zero")

    # Setting up and checking the maximal number of iterations
    n_iter = int(n_iter)
    if n_iter <= 0:
        raise ValueError("'n_iter' must be greater than zero")

    # Setting up and checking the tolerance
    tolerance = np.array(tolerance, dtype=dtype_)
    if np.any(tolerance <= 0):
        raise ValueError("'tolerance' must be greater than zero")

    # Performing the gradient descent loop
    for _ in range(n_iter):
        # Recalculating the difference
        diff = -learn_rate * np.array(gradient(x, y, vector), dtype=dtype_)

        # Checking if the absolute difference is small enough
        if np.all(np.abs(diff) <= tolerance):
            break

        # Updating the values of the variables
        vector += diff

    return vector if vector.shape else vector.item()
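As a usage sketch of this extended signature (the ssr_gradient helper and the sample data are my own illustration, not part of the answer), you can fit a straight line y ≈ b[0] + b[1] * x by passing a gradient of the sum of squared residuals, taken up to a constant factor. A condensed core of the function above is repeated so the snippet runs on its own:

```python
import numpy as np

def gradient_descent(
        gradient, x, y, start, learn_rate=0.1, n_iter=50, tolerance=1e-06,
        dtype="float64"):
    # Condensed core of the extended version above (input checks omitted)
    dtype_ = np.dtype(dtype)
    x, y = np.array(x, dtype=dtype_), np.array(y, dtype=dtype_)
    vector = np.array(start, dtype=dtype_)
    for _ in range(n_iter):
        diff = -learn_rate * np.array(gradient(x, y, vector), dtype=dtype_)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector if vector.shape else vector.item()

def ssr_gradient(x, y, b):
    # Gradient (up to a constant factor) of the sum of squared residuals
    # of the line y ≈ b[0] + b[1] * x
    res = b[0] + b[1] * x - y
    return res.mean(), (res * x).mean()

x = np.array([5, 15, 25, 35, 45, 55])
y = np.array([5, 20, 14, 32, 22, 38])

coeffs = gradient_descent(
    ssr_gradient, x, y, start=[0.5, 0.5], learn_rate=0.0008,
    n_iter=100_000)
print(coeffs)  # close to the least-squares fit, about [5.63, 0.54]
```

The small learning rate is needed here because the gradient scales with the magnitudes of x, so larger rates make the iterates diverge.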

Comments:
