【问题标题】:Why does simple gradient descent diverge?为什么简单的梯度下降会发散?
【发布时间】:2016-12-28 01:24:19
【问题描述】:

这是我在一个变量中实现梯度下降的第二次尝试,它总是发散的。有什么想法吗?

这是一种简单的线性回归,用于最小化一个变量中的残差平方和。

def gradient_descent_wtf(xvalues, yvalues):
    tolerance = 0.1

    #y=mx+b
    #some line to predict y values from x values
    m=1.
    b=1.

    #a predicted y-value has value mx + b

    for i in range(0,10):

        #calculate y-value predictions for all x-values
        predicted_yvalues = list()
        for x in xvalues:
            predicted_yvalues.append(m*x + b)

        # predicted_yvalues holds the predicted y-values

        #now calculate the residuals = y-value - predicted y-value for each point
        residuals = list()
        number_of_points = len(yvalues)
        for n in range(0,number_of_points):
            residuals.append(yvalues[n] - predicted_yvalues[n])

        ## calculate the residual sum of squares from the residuals, that is,
        ## square each residual and add them all up. we will try to minimize
        ## the residual sum of squares later.
        residual_sum_of_squares = 0.
        for r in residuals:
            residual_sum_of_squares += r**2
        print("RSS = %s" % residual_sum_of_squares)
        ##
        ##
        ##

        #now make a version of the residuals which is multiplied by the x-values
        residuals_times_xvalues = list()
        for n in range(0,number_of_points):
            residuals_times_xvalues.append(residuals[n] * xvalues[n])

        #now create the sums for the residuals and for the residuals times the x-values
        residuals_sum = sum(residuals)

        residuals_times_xvalues_sum = sum(residuals_times_xvalues)

        # now multiply the sums by a positive scalar and add each to m and b.

        residuals_sum *= 0.1
        residuals_times_xvalues_sum *= 0.1

        b += residuals_sum
        m += residuals_times_xvalues_sum

        #and repeat until convergence.
        #convergence occurs when ||sum vector|| < some tolerance.
        # ||sum vector|| = sqrt( residuals_sum**2 + residuals_times_xvalues_sum**2 )

        #check for convergence
        magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
        if magnitude_of_sum_vector < tolerance:
            break

    return (b, m)

结果:

gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
RSS = 370433.0
RSS = 300170125.7
RSS = 4.86943013045e+11
RSS = 7.90447409339e+14
RSS = 1.28312217794e+18
RSS = 2.08287421094e+21
RSS = 3.38110045417e+24
RSS = 5.48849288217e+27
RSS = 8.90939341376e+30
RSS = 1.44624932026e+34
Out[108]:
(-3.475524066284303e+16, -2.4195981188763203e+17)

【问题讨论】:

    标签: python regression gradient


    【解决方案1】:

    梯度是巨大的——因此您要跟踪长距离的大向量(大数字的 0.1 倍是大的)。在适当的方向上找到单位向量。像这样的东西(用理解替换你的循环):

    def gradient_descent_wtf(xvalues, yvalues):
        tolerance = 0.1
    
        m=1.
        b=1.
    
        for i in range(0,10):
            predicted_yvalues = [m*x+b for x in xvalues]
    
            residuals = [y-y_hat for y,y_hat in zip(yvalues,predicted_yvalues)]
    
            residual_sum_of_squares = sum(r**2 for r in residuals) #only needed for debugging purposes
            print("RSS = %s" % residual_sum_of_squares)
    
            residuals_times_xvalues = [r*x for r,x in zip(residuals,xvalues)]
    
            residuals_sum = sum(residuals)
    
            residuals_times_xvalues_sum = sum(residuals_times_xvalues)
    
            # (residuals_sum,residual_times_xvalues_sum) is a vector which points in the negative
            # gradient direction. *Find a unit vector which points in same direction*
    
            magnitude = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
    
            residuals_sum /= magnitude
            residuals_times_xvalues_sum /= magnitude
    
            b += residuals_sum * (0.1)
            m += residuals_times_xvalues_sum * (0.1)
    
            #check for convergence -- this needs work!
            magnitude_of_sum_vector = (residuals_sum**2 + residuals_times_xvalues_sum**2)**0.5
            if magnitude_of_sum_vector < tolerance:
                break
    
        return (b, m)
    

    例如:

    >>> gradient_descent_wtf([1,2,3,4,5,6,7,8,9,10],[6,23,8,56,3,24,234,76,59,567])
    RSS = 370433.0
    RSS = 368732.1655050716
    RSS = 367039.18363896786
    RSS = 365354.0543519137
    RSS = 363676.7775934381
    RSS = 362007.3533123621
    RSS = 360345.7814567845
    RSS = 358692.061974069
    RSS = 357046.1948108295
    RSS = 355408.17991291644
    (1.1157111313023558, 1.9932828425473605)
    

    这当然更合理。

    制作一个数值稳定的梯度下降算法并非易事。您可能想参考一本不错的数值分析教科书。

    【讨论】:

      【解决方案2】:

      首先,您的代码是正确的。

      但是当你进行线性回归时,你应该考虑一些关于数学的事情。

      例如,残差是-205.8,你的学习率是0.1,所以你会得到一个巨大的下降步-25.8

      这是一个很大的步骤,您无法回到正确的 mb。你必须让你的步子足够小。

      有两种方法可以使梯度下降步长合理:

      1. 初始化一个小的学习率,例如 0.001 和 0.0003。
      2. 将步长除以输入值的总量。

      【讨论】:

        猜你喜欢
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 2012-08-17
        • 1970-01-01
        • 2021-01-25
        • 2021-11-07
        • 2014-04-30
        • 2020-02-29
        相关资源
        最近更新 更多