分段最小二乘答案

【问题标题】：Segmented Least Squares分段最小二乘
【发布时间】：2014-03-15 09:37:17
【问题描述】：

给出一个算法，它将平面中的一系列点 (x_1, y_1), (x_2, y_2), ...., (x_n, y_n) 和一个整数 k 作为输入并返回最佳分段线性函数f 由最多 k 个片段组成，使平方和误差最小化。您可以假设您可以访问一种算法，该算法通过 Θ(n) 时间内的一组 n 个点计算一个段的平方和误差。解决方案应使用 O(n^2k) 时间和 O(nk) 空间。

谁能帮我解决这个问题？非常感谢！

【问题讨论】：

O(n^(2k)) 或 O(n^2 * k)?
另外，函数必须是连续的吗？
O(n^2 * k)，函数不需要连续。
我认为 k = o(n) 是不可能的，因为您可能需要为输入的每个中缀（子字符串/连续子序列）至少计算最小二乘。仅这些就花费了您的 Ω(n^3) OP 和您的 O(n) 算法黑盒。问题的其余部分暗示像我提出的那样的 DP 解决方案是所需的解决方案，所以 O(n^2 * k) 可能是问题设置者的错误，或者您遗漏了一些信息？

标签： algorithm dynamic-programming

【解决方案1】：

如果您可以对n^2 中的某些段进行最小二乘，那么使用动态编程很容易在n^2 k^2 中执行您想要的操作。您可能只能将其优化为单个 k。

【讨论】：

【解决方案2】：

（这对你的家庭作业来说太晚了，但希望它对你有帮助。）
首先是 python / numpy 中的动态编程，仅适用于k = 4，帮助您了解动态编程的工作原理；一旦你理解了这一点，为任何 k 编写一个循环应该很容易。
此外，Cost[] 是一个二维矩阵，空间 O(n^2)；下到空间 O(n k) 见文末注释

#!/usr/bin/env python
""" split4.py: min-cost split into 4 pieces, dynamic programming k=4 """

from __future__ import division
import numpy as np

__version__ = "2014-03-09 mar denis"

#...............................................................................
def split4( Cost, verbose=1 ):
    """ split4.py: min-cost split into 4 pieces, dynamic programming k=4
        min Cost[0:a] + Cost[a:b] + Cost[b:c] + Cost[c:n]
        Cost[a,b] = error in least-squares line fit to xy[a] .. xy[b] *including b*
        or error in lsq horizontal lines, sum (y_j - av y) ^2 for each piece --

          o--
        o-
             o---
                 o----
        | |  |   |
        0 2  5   9

        (Why 4 ? to walk through step by step, then put in a loop)
    """
        # speedup: maxlen 2 n/k or so

    Cost = np.asanyarray(Cost)
    n = Cost.shape[1]
        # C2 C3 ... costs, J2 J3 ... indices of best splits
    J2 = - np.ones(n, dtype=int)  # -1, NaN mark undefined / bug
    C2 = np.ones(n) * np.NaN
    J3 = - np.ones(n, dtype=int)
    C3 = np.ones(n) * np.NaN

        # best 2-splits of the left 2 3 4 ...
    for nleft in range( 1, n ):
        J2[nleft] = j = np.argmin([ Cost[0,j-1] + Cost[j,nleft]  for j in range( 1, nleft+1 )]) + 1
        C2[nleft] =                 Cost[0,j-1] + Cost[j,nleft]
            # an idiom for argmin j, min value c together

        # best 3-splits of the left 3 4 5 ...
    for nleft in range( 2, n ):
        J3[nleft] = j = np.argmin([ C2[j-1] + Cost[j,nleft]  for j in range( 2, nleft+1 )]) + 2
        C3[nleft] =                 C2[j-1] + Cost[j,nleft]

        # best 4-split of all n --
    j4 = np.argmin([ C3[j-1] + Cost[j,n-1]  for j in range( 3, n )]) + 3
    c4 =             C3[j4-1] + Cost[j4,n-1]
    j3 = J3[j4]
    j2 = J2[j3]
    jsplit = np.array([ 0, j2, j3, j4, n ])
    if verbose:
        print "split4: len %s  pos %s  cost %.3g" % (np.diff(jsplit), jsplit, c4)
        print "split4: J2 %s  C2 %s" %(J2, C2)
        print "split4: J3 %s  C3 %s" %(J3, C3)
    return jsplit

#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    import spread

    n = 10
    ncycle = 2
    plot = 0
    seed = 0

        # run this.py a=1 b=None c=[3] 'd = expr' ...  in sh or ipython
    for arg in sys.argv[1:]:
        exec( arg )
    np.set_printoptions( 1, threshold=100, edgeitems=10, linewidth=100, suppress=True )
    np.random.seed(seed)
    random.seed(seed)

    print "\n", 80 * "-"
    title = "Dynamic programming least-square horizontal lines  %s  n %d  seed %d" % (
            __file__, n, seed)
    print title

    x = np.arange( n + 0. )
    y = np.sin( 2*np.pi * x * ncycle / n )
        # synthetic time series ?
    print "y: %s  av %.3g  variance %.3g" % (y, y.mean(), np.var(y))

    print "Cost[j,k] = sum (y - av y)^2 --"  # len * var y[j:k+1]
    Cost = spread.spreads_allij( y )
    print Cost  # .round().astype(int)

    jsplit = split4( Cost )
        # split4: len [3 2 3 2]  pos [ 0  3  5  8 10]

    if plot:
        import matplotlib.pyplot as pl
        title += "\n lengths: %s" % np.diff(jsplit)
        pl.title( title )
        pl.plot( y )
        for js, js1 in zip( jsplit[:-1], jsplit[1:] ):
            if js1 <= js:  continue
            yav = y[js:js1].mean() * np.ones( js1 - js + 1 )
            pl.plot( np.arange( js, js1 + 1 ), yav )
        # pl.legend()
        pl.show()

然后，以下代码仅对水平线执行Cost[]，斜率为0；在 O(n) 时间内将其扩展到任何斜率的线段，作为练习。

""" spreads( all y[:j] ) in time O(n)

define spread( y[] ) = sum (y - average y)^2
    e.g. spread of 24 hourly temperatures y[0:24] i.e. y[0] .. y[23]
    around a horizontal line at the average temperature
        (spread = 0 for constant temperature,
        24 c^2 for constant + [c -c c -c ...],
        24 * variance(y) )

    How fast can one compute all 24 spreads
    1 hour (midnight to 1 am), 2 hours ... all 24 ?

    A simpler problem: compute all 24 averages in time O(n):
        N = np.arange( 1, len(y)+1 )
        allav = np.cumsum(y) / N
            = [ y0,  (y0 + y1) / 2,  (y0 + y1 + y2) / 3 ...]
    An identity:
        spread(y) = sum(y^2) - n * (av y)^2
    Voila: the code below, all spreads() in time O(n).

    Exercise: extend this to spreads around least-squares lines
    fit to [ y0,  [y0 y1],  [y0 y1 y2] ... ], not just horizontal lines.
"""

from __future__ import division
import sys
import numpy as np


#...............................................................................
def spreads( y ):
    """ [ spread y[:1], spread y[:2] ... spread y ] in time O(n)
        where spread( y[] ) = sum (y - average y )^2
            = n * variance(y)
    """
    N = np.arange( 1, len(y)+1 )
    return np.cumsum( y**2 ) - np.cumsum( y )**2 / N

def spreads_allij( y ):
    """ -> A[i,j] = sum (y - av y)^2, spread of y around its average
        for all y[i:j+1] 
        time, space O(n^2)
    """
    y = np.asanyarray( y, dtype=float )
    n = len(y)
    A = np.zeros((n,n))
    for i in range(n):
        A[i,i:] = spreads( y[i:] )
    return A

到目前为止，我们有一个 n x n 成本矩阵，空间 O(n^2)。下到空间 O(n k )，仔细查看 dyn-prog 代码中Cost[i,j] 访问的模式：

for nleft .. to n:
    Cost_nleft = Cost[j,nleft ]  -- time nleft or nleft^2
    for k in 3 4 5 ...:
        min [ C[k-1, j-1] + Cost_nleft[j]  for j .. to nleft ]

这里Cost_nleft 是完整的 n x n 成本矩阵的一行，约 n 段，根据需要生成。对于线段，这可以在 O(n) 时间内完成。但是如果“通过一组 n 个点的一个段的错误需要 O(n) 时间”，看来我们已经到了 O(n^3) 的时间了。有人评论吗？

【讨论】：