【问题标题】:Segmented Least Squares分段最小二乘
【发布时间】:2014-03-15 09:37:17
【问题描述】:

给出一个算法,它将平面中的一系列点 (x_1, y_1), (x_2, y_2), ...., (x_n, y_n) 和一个整数 k 作为输入并返回最佳分段线性函数f 由最多 k 个片段组成,使平方和误差最小化。您可以假设您可以访问一种算法,该算法通过 Θ(n) 时间内的一组 n 个点计算一个段的平方和误差。解决方案应使用 O(n^2k) 时间和 O(nk) 空间。

谁能帮我解决这个问题?非常感谢!

【问题讨论】:

  • O(n^(2k))O(n^2 * k)?
  • 另外,函数必须是连续的吗?
  • O(n^2 * k),函数不需要连续。
  • 我认为 k = o(n) 是不可能的,因为您可能需要为输入的每个中缀(子字符串/连续子序列)至少计算最小二乘。仅这些就花费了您的 Ω(n^3) OP 和您的 O(n) 算法黑盒。问题的其余部分暗示像我提出的那样的 DP 解决方案是所需的解决方案,所以 O(n^2 * k) 可能是问题设置者的错误,或者您遗漏了一些信息?

标签: algorithm dynamic-programming


【解决方案1】:

如果您可以对n^2 中的某些段进行最小二乘,那么使用动态编程很容易在n^2 k^2 中执行您想要的操作。您可能只能将其优化为单个 k

【讨论】:

    【解决方案2】:

    (这对你的家庭作业来说太晚了,但希望它对你有帮助。)
    首先是 python / numpy 中的动态编程,仅适用于k = 4, 帮助您了解动态编程的工作原理; 一旦你理解了这一点,为任何 k 编写一个循环应该很容易。
    此外,Cost[] 是一个二维矩阵,空间 O(n^2); 下到空间 O(n k) 见文末注释

    #!/usr/bin/env python
    """ split4.py: min-cost split into 4 pieces, dynamic programming k=4 """
    
    from __future__ import division
    import numpy as np
    
    __version__ = "2014-03-09 mar denis"
    
    #...............................................................................
    def split4( Cost, verbose=1 ):
        """ split4.py: min-cost split into 4 pieces, dynamic programming k=4
            min Cost[0:a] + Cost[a:b] + Cost[b:c] + Cost[c:n]
            Cost[a,b] = error in least-squares line fit to xy[a] .. xy[b] *including b*
            or error in lsq horizontal lines, sum (y_j - av y) ^2 for each piece --
    
              o--
            o-
                 o---
                     o----
            | |  |   |
            0 2  5   9
    
            (Why 4 ? to walk through step by step, then put in a loop)
        """
            # speedup: maxlen 2 n/k or so
    
        Cost = np.asanyarray(Cost)
        n = Cost.shape[1]
            # C2 C3 ... costs, J2 J3 ... indices of best splits
        J2 = - np.ones(n, dtype=int)  # -1, NaN mark undefined / bug
        C2 = np.ones(n) * np.NaN
        J3 = - np.ones(n, dtype=int)
        C3 = np.ones(n) * np.NaN
    
            # best 2-splits of the left 2 3 4 ...
        for nleft in range( 1, n ):
            J2[nleft] = j = np.argmin([ Cost[0,j-1] + Cost[j,nleft]  for j in range( 1, nleft+1 )]) + 1
            C2[nleft] =                 Cost[0,j-1] + Cost[j,nleft]
                # an idiom for argmin j, min value c together
    
            # best 3-splits of the left 3 4 5 ...
        for nleft in range( 2, n ):
            J3[nleft] = j = np.argmin([ C2[j-1] + Cost[j,nleft]  for j in range( 2, nleft+1 )]) + 2
            C3[nleft] =                 C2[j-1] + Cost[j,nleft]
    
            # best 4-split of all n --
        j4 = np.argmin([ C3[j-1] + Cost[j,n-1]  for j in range( 3, n )]) + 3
        c4 =             C3[j4-1] + Cost[j4,n-1]
        j3 = J3[j4]
        j2 = J2[j3]
        jsplit = np.array([ 0, j2, j3, j4, n ])
        if verbose:
            print "split4: len %s  pos %s  cost %.3g" % (np.diff(jsplit), jsplit, c4)
            print "split4: J2 %s  C2 %s" %(J2, C2)
            print "split4: J3 %s  C3 %s" %(J3, C3)
        return jsplit
    
    #...............................................................................
    if __name__ == "__main__":
        import random
        import sys
        import spread
    
        n = 10
        ncycle = 2
        plot = 0
        seed = 0
    
            # run this.py a=1 b=None c=[3] 'd = expr' ...  in sh or ipython
        for arg in sys.argv[1:]:
            exec( arg )
        np.set_printoptions( 1, threshold=100, edgeitems=10, linewidth=100, suppress=True )
        np.random.seed(seed)
        random.seed(seed)
    
        print "\n", 80 * "-"
        title = "Dynamic programming least-square horizontal lines  %s  n %d  seed %d" % (
                __file__, n, seed)
        print title
    
        x = np.arange( n + 0. )
        y = np.sin( 2*np.pi * x * ncycle / n )
            # synthetic time series ?
        print "y: %s  av %.3g  variance %.3g" % (y, y.mean(), np.var(y))
    
        print "Cost[j,k] = sum (y - av y)^2 --"  # len * var y[j:k+1]
        Cost = spread.spreads_allij( y )
        print Cost  # .round().astype(int)
    
        jsplit = split4( Cost )
            # split4: len [3 2 3 2]  pos [ 0  3  5  8 10]
    
        if plot:
            import matplotlib.pyplot as pl
            title += "\n lengths: %s" % np.diff(jsplit)
            pl.title( title )
            pl.plot( y )
            for js, js1 in zip( jsplit[:-1], jsplit[1:] ):
                if js1 <= js:  continue
                yav = y[js:js1].mean() * np.ones( js1 - js + 1 )
                pl.plot( np.arange( js, js1 + 1 ), yav )
            # pl.legend()
            pl.show()
    

    然后,以下代码仅对水平线执行Cost[],斜率为0; 在 O(n) 时间内将其扩展到任何斜率的线段,作为练习。

    """ spreads( all y[:j] ) in time O(n)
    
    define spread( y[] ) = sum (y - average y)^2
        e.g. spread of 24 hourly temperatures y[0:24] i.e. y[0] .. y[23]
        around a horizontal line at the average temperature
            (spread = 0 for constant temperature,
            24 c^2 for constant + [c -c c -c ...],
            24 * variance(y) )
    
        How fast can one compute all 24 spreads
        1 hour (midnight to 1 am), 2 hours ... all 24 ?
    
        A simpler problem: compute all 24 averages in time O(n):
            N = np.arange( 1, len(y)+1 )
            allav = np.cumsum(y) / N
                = [ y0,  (y0 + y1) / 2,  (y0 + y1 + y2) / 3 ...]
        An identity:
            spread(y) = sum(y^2) - n * (av y)^2
        Voila: the code below, all spreads() in time O(n).
    
        Exercise: extend this to spreads around least-squares lines
        fit to [ y0,  [y0 y1],  [y0 y1 y2] ... ], not just horizontal lines.
    """
    
    from __future__ import division
    import sys
    import numpy as np
    
    
    #...............................................................................
    def spreads( y ):
        """ [ spread y[:1], spread y[:2] ... spread y ] in time O(n)
            where spread( y[] ) = sum (y - average y )^2
                = n * variance(y)
        """
        N = np.arange( 1, len(y)+1 )
        return np.cumsum( y**2 ) - np.cumsum( y )**2 / N
    
    def spreads_allij( y ):
        """ -> A[i,j] = sum (y - av y)^2, spread of y around its average
            for all y[i:j+1] 
            time, space O(n^2)
        """
        y = np.asanyarray( y, dtype=float )
        n = len(y)
        A = np.zeros((n,n))
        for i in range(n):
            A[i,i:] = spreads( y[i:] )
        return A
    

    到目前为止,我们有一个 n x n 成本矩阵,空间 O(n^2)。 下到空间 O(n k ), 仔细查看 dyn-prog 代码中Cost[i,j] 访问的模式:

    for nleft .. to n:
        Cost_nleft = Cost[j,nleft ]  -- time nleft or nleft^2
        for k in 3 4 5 ...:
            min [ C[k-1, j-1] + Cost_nleft[j]  for j .. to nleft ]
    

    这里Cost_nleft 是完整的 n x n 成本矩阵的一行,约 n 段,根据需要生成。 对于线段,这可以在 O(n) 时间内完成。 但是如果“通过一组 n 个点的 一个 段的错误需要 O(n) 时间”, 看来我们已经到了 O(n^3) 的时间了。有人评论吗?

    【讨论】:

      猜你喜欢
      • 2011-05-04
      • 2011-04-20
      • 1970-01-01
      • 2016-07-29
      • 2021-08-14
      • 1970-01-01
      • 2012-07-13
      • 2014-04-28
      • 2012-02-10
      相关资源
      最近更新 更多