Repeatedly removing the maximum average subarray: answers

【Question title】: Repeatedly removing the maximum average subarray
【Posted】: 2022-03-01 02:19:28
【Question description】:

I have an array of positive integers. For example:

[1, 7, 8, 4, 2, 1, 4]

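A "reduction operation" finds the array prefix with the highest average and deletes it. Here, an array prefix means a contiguous subarray whose left end is the start of the array, such as [1] or [1, 7] or [1, 7, 8] above. Ties are broken by taking the longer prefix.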

Original array:  [  1,   7,   8,   4,   2,   1,   4]

Prefix averages: [1.0, 4.0, 5.3, 5.0, 4.4, 3.8, 3.9]

-> Delete [1, 7, 8], with maximum average 5.3
-> New array -> [4, 2, 1, 4]
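
As a minimal sketch of a single reduction step (the variable names are illustrative and not from the original post), the prefix averages can be computed exactly with Fraction, and comparing (average, length) pairs lets the longer prefix win ties:

```python
from fractions import Fraction
from itertools import accumulate

arr = [1, 7, 8, 4, 2, 1, 4]

# Exact average of every prefix arr[:k], for k = 1..len(arr).
prefix_averages = [Fraction(s, k) for k, s in enumerate(accumulate(arr), 1)]

# Comparing (average, length) tuples breaks ties in favour of the longer prefix.
best_average, best_length = max(
    (avg, k) for k, avg in enumerate(prefix_averages, 1)
)

remaining = arr[best_length:]
print(best_length, remaining)  # 3 [4, 2, 1, 4]
```

Exact fractions are used here only to keep the comparison free of rounding; the rest of the thread discusses why that matters.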

I repeat the reduction operation until the array is empty:

[1, 7, 8, 4, 2, 1, 4]
^       ^
[4, 2, 1, 4]
^ ^
[2, 1, 4]
^       ^
[]

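Actually performing these array modifications is unnecessary; I'm only looking for the list of prefix lengths that this procedure would delete, for example [3, 1, 3] above.

What is an efficient algorithm to compute these prefix lengths?

The naive approach recomputes all sums and averages from scratch in every iteration, which gives an O(n^2) algorithm; I've attached Python code below. I'm looking for any improvement on this approach: preferably a solution below O(n^2), but an algorithm with the same complexity and better constant factors would also help.

Here are some things I've tried (without success):

Dynamically maintaining prefix sums, for example with a Binary Indexed Tree. While I can easily update prefix sums or find the maximum prefix sum in O(log n) time, I haven't found any data structure that can update prefix averages, because the denominator of the average keeps changing.
Reusing the previous "ranking" of prefix averages. These rankings can change: in some arrays, the prefix ending at index 5 may have a larger average than the prefix ending at index 6, but after removing the first 3 elements, the prefix now ending at index 2 may have a smaller average than the one ending at index 3.
Looking for patterns in where prefixes end; for example, the rightmost element of any max-average prefix is always a local maximum in the array, but it's not clear how much that helps.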
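
To make the second point concrete, here is a small sketch with a hypothetical array (chosen for illustration, not taken from the original post) in which the relative order of two prefix averages flips after the first reduction removes three elements:

```python
from fractions import Fraction

def prefix_average(nums, end):
    """Exact average of nums[0..end], where end is an inclusive 0-based index."""
    return Fraction(sum(nums[: end + 1]), end + 1)

# Hypothetical example array, chosen for illustration.
arr = [100, 100, 100, 1, 1, 1, 50]

# Before any removal: the prefix ending at index 5 beats the one ending at 6.
before = prefix_average(arr, 5) > prefix_average(arr, 6)

# The first reduction removes [100, 100, 100] (a three-way tie at 100, longest
# prefix wins), so old indices 5 and 6 become indices 2 and 3.
rest = arr[3:]
after = prefix_average(rest, 2) < prefix_average(rest, 3)

print(before, after)  # True True
```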

Here is a working Python implementation of the naive quadratic approach:

import math
from fractions import Fraction
from typing import List, Tuple


def find_array_reductions(nums: List[int]) -> List[int]:
    """Return list of lengths of max average prefix reductions."""

    def max_prefix_avg(arr: List[int]) -> Tuple[float, int]:
        """Return value and length of max average prefix in arr."""
        if len(arr) == 0:
            return (-math.inf, 0)

        best_length = 1
        best_average = Fraction(0, 1)
        running_sum = 0

        for i, x in enumerate(arr, 1):
            running_sum += x
            new_average = Fraction(running_sum, i)
            if new_average >= best_average:
                best_average = new_average
                best_length = i

        return (float(best_average), best_length)

    removed_lengths = []
    total_removed = 0

    while total_removed < len(nums):
        _, new_removal = max_prefix_avg(nums[total_removed:])
        removed_lengths.append(new_removal)
        total_removed += new_removal

    return removed_lengths

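Edit: The originally posted code had a rare bug on large inputs, caused by comparing floats with Python's math.isclose() and its default parameters instead of doing a proper fraction comparison. This has been fixed in the current code. If you're curious, an example of the bug can be found at this Try it online link, along with a preamble explaining exactly what causes it.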
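
A minimal sketch of how this kind of bug can arise, using two hypothetical averages (not the failing input from the linked demo): with its default relative tolerance of 1e-09, math.isclose() treats two large but distinct averages as equal, while an exact Fraction comparison tells them apart.

```python
import math
from fractions import Fraction

# Two hypothetical prefix averages that differ by exactly 1/1000.
a = Fraction(10**10, 1000)       # 10_000_000
b = Fraction(10**10 + 1, 1000)   # 10_000_000.001

exactly_less = a < b                              # exact rational comparison
looks_equal = math.isclose(float(a), float(b))    # default rel_tol is 1e-09

print(exactly_less, looks_equal)  # True True
```

A comparison of the form `new >= best or math.isclose(new, best)` would therefore accept a strictly smaller average as a tie and extend the prefix too far.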

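【Question discussion】:

How large are the numbers that make the float version fail? I also tried a version using Fraction instead of truediv, but if I remember correctly it was about 10 times slower :-(. I considered adding my own lightweight Fraction class, but that would make the code about twice as long (and I don't know whether it would be any faster).
Hmm... actually I suspect it's math.isclose rather than the floats that causes the trouble. And maybe the fact that the sum is a float instead of an int. I'd really like to see the data where the error occurs.
@Pychopath Yes, it was math.isclose() that caused the problem: I just ran the failing test cases against your solutions, and only my original one was affected. I'm currently working on an online code runner (TIO, like your link) with the failing case and all our solutions. However, I haven't yet tested your new solution, or any solution that uses only floating-point comparisons, on large random inputs. I think any solution without true fraction comparison is likely to be incorrect on some inputs.
With random inputs I suspect they're still very likely to succeed :-). Of course, you can make them fail with specially designed inputs, e.g. [2**55, 2**55 - 1] (demo).
@Pychopath I've added a link to the test case and more explanation. Apart from very large inputs, I think your solutions will generally give the correct answer. For further discussion, perhaps it would be better to continue this in chat? :)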
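
The specially designed input mentioned above can be checked directly. The sketch below (variable names are illustrative) compares the two prefix averages of [2**55, 2**55 - 1] once with float division and once exactly:

```python
from fractions import Fraction

nums = [2**55, 2**55 - 1]
total = sum(nums)

# Float averages of the two prefixes: the second one rounds up to 2**55.
float_first = nums[0] / 1
float_both = total / 2

# Exact averages: the one-element prefix is strictly larger.
exact_first = Fraction(nums[0], 1)
exact_both = Fraction(total, 2)

print(float_first == float_both)  # True: floats see a tie
print(exact_first > exact_both)   # True: exactly, the first prefix wins
```

A float-based solution that breaks ties toward the longer prefix would therefore report [2] here, whereas the exact answer is [1, 1].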

Tags: python arrays algorithm data-structures

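【Solution 1】:

This problem has a fun O(n) solution.

If you draw a graph of cumulative sum vs. index, then:

The average of the values in the subarray between any two indexes is the slope of the line between those points on the graph.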
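
As a quick check of this observation on the example array (a sketch; the helper names `slope` and `average` are made up for illustration), the slope between two points of the cumulative-sum graph equals the average of the elements between them:

```python
from fractions import Fraction
from itertools import accumulate

arr = [1, 7, 8, 4, 2, 1, 4]

# sums[k] is the sum of the first k elements, so the graph has the
# points (0, 0), (1, 1), (2, 8), (3, 16), ...
sums = [0, *accumulate(arr)]

def slope(i, j):
    """Slope of the line from point (i, sums[i]) to point (j, sums[j])."""
    return Fraction(sums[j] - sums[i], j - i)

def average(i, j):
    """Average of the subarray arr[i:j]."""
    return Fraction(sum(arr[i:j]), j - i)

print(slope(0, 3), average(0, 3))  # 16/3 16/3
```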

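The first highest-average prefix ends at the point that makes the highest angle from 0. The next highest-average prefix must then have a smaller average, and it ends at the point that makes the highest angle from the first ending point. Continuing to the end of the array, we find that...

These segments of highest average are exactly the segments in the upper convex hull of the cumulative sum graph.

Find these segments using the monotone chain algorithm. Since the points are already sorted, it takes O(n) time.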

# Lengths of the segments in the upper convex hull
# of the cumulative sum graph
def upperSumHullLengths(arr):
    if len(arr) < 2:
        if len(arr) < 1:
            return []
        else:
            return [1]
    
    hull = [(0, 0),(1, arr[0])]
    for x in range(2, len(arr)+1):
        # this has x coordinate x-1
        prevPoint = hull[len(hull) - 1]
        # next point in cumulative sum
        point = (x, prevPoint[1] + arr[x-1])
        # remove points not on the convex hull
        while len(hull) >= 2:
            p0 = hull[len(hull)-2]
            dx0 = prevPoint[0] - p0[0]
            dy0 = prevPoint[1] - p0[1]
            dx1 = x - prevPoint[0]
            dy1 = point[1] - prevPoint[1]
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
            prevPoint = p0
        hull.append(point)
    
    return [hull[i+1][0] - hull[i][0] for i in range(0, len(hull)-1)]


print(upperSumHullLengths([  1,   7,   8,   4,   2,   1,   4]))

prints:

[3, 1, 3]

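【Discussion】:

Very nice! The animation at the link is a picture worth 1000 words.
Brilliant! This question originally came from a process scheduling problem, so the connection to computational geometry is surprising and quite beautiful.
@kcsquared No, it chooses the longest prefix. In the == case (a tie), it continues removing the middle point instead of breaking out of the loop.
Yes, you're right, this code chooses the longest prefix; my apologies. In fact it was my code that was incorrect: it had a rare bug on large inputs due to floating-point precision, which is now fixed. On one input with integers around 10^7, my code gave [221, 1] while yours correctly gave [222], which I had taken to be an error in your cross-product equation. This bug only occurs when array length * sum exceeds 10^9 (since Python's math.isclose() has an accuracy of 10^-9), and even then only about one in a million tests on random inputs fails.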
@kcsquared I didn't write it as a cross product. It's a fraction comparison that avoids the inaccuracy caused by division: a/b < c/d => ad < bc
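
The last two comments can be illustrated with a small sketch. The helper `slope_less` is a hypothetical name that mirrors the test `dy1*dx0 < dy0*dx1` from the answer's code, and the array [2, 2, 1] is an example chosen here to produce a tie:

```python
def slope_less(dy1, dx1, dy0, dx0):
    """True if dy1/dx1 < dy0/dx0, for positive dx0 and dx1, using integers only."""
    return dy1 * dx0 < dy0 * dx1

# Cumulative-sum points for the hypothetical array [2, 2, 1].
p0, p1, p2, p3 = (0, 0), (1, 2), (2, 4), (3, 5)

# Segments p0->p1 and p1->p2 have equal slope, so the new slope is not
# "less": the loop does not break and the middle point p1 is popped.
tie_breaks_loop = slope_less(p2[1] - p1[1], p2[0] - p1[0],
                             p1[1] - p0[1], p1[0] - p0[0])

# With p1 gone, p2->p3 is compared against p0->p2 and really is flatter.
turn_breaks_loop = slope_less(p3[1] - p2[1], p3[0] - p2[0],
                              p2[1] - p0[1], p2[0] - p0[0])

print(tie_breaks_loop, turn_breaks_loop)  # False True

# The same test stays exact for [2**55, 2**55 - 1], where float division ties.
print(slope_less(2**55 - 1, 1, 2**55, 1))  # True
```

For [2, 2, 1] the hull therefore keeps the points p0, p2 and p3, which corresponds to the prefix lengths [2, 1].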

【Solution 2】:

Simplified versions of Matt's and kcsquared's solutions, plus some benchmarks:

from itertools import accumulate, pairwise

def Matt_Pychoed(arr):
    hull = [(0, 0)]
    for x, y in enumerate(accumulate(arr), 1):
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2:]
            dx0 = x1 - x0
            dy0 = y1 - y0
            dx1 = x - x1
            dy1 = y - y1
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
        hull.append((x, y))
    return [q[0] - p[0] for p, q in pairwise(hull)]

from itertools import accumulate, count
from operator import truediv

def kc_Pychoed_2(nums):
    removals = []
    while nums:
        averages = map(truediv, accumulate(nums), count(1))
        remove = max(zip(averages, count(1)))[1]
        removals.append(remove)
        nums = nums[remove:]
    return removals

Benchmark with 20 different arrays of 100,000 random integers from 1 to 1000:

  min   median   mean     max  
 65 ms  164 ms  159 ms  249 ms  kc
 38 ms   98 ms   92 ms  146 ms  kc_Pychoed_1
 58 ms  127 ms  120 ms  189 ms  kc_Pychoed_2
134 ms  137 ms  138 ms  157 ms  Matt
101 ms  102 ms  103 ms  111 ms  Matt_Pychoed

Here kc_Pychoed_1 is kcsquared's solution, but with an integer running_sum and without math.isclose. I verified that all solutions compute the same result for every input.
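
One way such a verification could be sketched (this is not the author's benchmark harness; `naive_lengths` and `hull_lengths` are compact restatements written for this example) is a randomized comparison of the hull method against an exact quadratic reference. Small values are used so that ties occur often:

```python
import random
from fractions import Fraction
from itertools import accumulate

def naive_lengths(nums):
    """Quadratic reference: repeatedly remove the longest max-average prefix."""
    lengths = []
    start = 0
    while start < len(nums):
        best_avg, best_len, total = None, 0, 0
        for length, x in enumerate(nums[start:], 1):
            total += x
            avg = Fraction(total, length)
            if best_avg is None or avg >= best_avg:  # >= prefers longer prefixes
                best_avg, best_len = avg, length
        lengths.append(best_len)
        start += best_len
    return lengths

def hull_lengths(nums):
    """Segment lengths of the upper convex hull of the cumulative-sum graph."""
    hull = [(0, 0)]
    for x, y in enumerate(accumulate(nums), 1):
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2:]
            if (y - y1) * (x1 - x0) < (y1 - y0) * (x - x1):
                break
            hull.pop()
        hull.append((x, y))
    return [q[0] - p[0] for p, q in zip(hull, hull[1:])]

random.seed(0)
mismatches = []
for _ in range(500):
    nums = random.choices(range(1, 6), k=random.randint(0, 30))
    if naive_lengths(nums) != hull_lengths(nums):
        mismatches.append(nums)

print(len(mismatches))  # 0
```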

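For random data like this, kcsquared's solution seems to be somewhere between O(n) and O(n log n). But it degrades to quadratic if the array is strictly decreasing. For arr = [1000, 999, 998, ..., 2, 1], I got: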

  min   median   mean     max  
102 ms  106 ms  107 ms  116 ms  kc
 60 ms   61 ms   61 ms   62 ms  kc_Pychoed_1
 76 ms   77 ms   77 ms   86 ms  kc_Pychoed_2
  0 ms    1 ms    1 ms    1 ms  Matt
  0 ms    0 ms    0 ms    0 ms  Matt_Pychoed
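
The quadratic behaviour can also be seen without timing anything. The sketch below (a hypothetical instrumented variant of the naive method, using exact integer comparisons) counts how many elements the inner loop visits. On a strictly decreasing array every removal has length 1, so the scans add up to n + (n-1) + ... + 1:

```python
def naive_with_scan_count(nums):
    """Naive reductions; returns (removed_lengths, elements scanned in total)."""
    removed_lengths = []
    scanned = 0
    start = 0
    while start < len(nums):
        best_sum, best_len = nums[start], 1
        running_sum = 0
        for length, x in enumerate(nums[start:], 1):
            running_sum += x
            scanned += 1
            # running_sum/length >= best_sum/best_len, cross-multiplied.
            if running_sum * best_len >= best_sum * length:
                best_sum, best_len = running_sum, length
        removed_lengths.append(best_len)
        start += best_len
    return removed_lengths, scanned

n = 1000
lengths, scanned = naive_with_scan_count(list(range(n, 0, -1)))
print(set(lengths), scanned)  # {1} 500500
```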

Benchmark code (Try it online!):

from timeit import default_timer as timer
from statistics import mean, median
import random
from typing import List, Tuple
import math
from itertools import accumulate, count
from operator import truediv

def kc(nums: List[int]) -> List[int]:
    """Return list of lengths of max average prefix reductions."""

    def max_prefix_avg(arr: List[int]) -> Tuple[float, int]:
        """Return value and length of max average prefix in arr"""
        if len(arr) == 0:
            return (-math.inf, 0)
        
        best_length = 1
        best_average = -math.inf
        running_sum = 0.0

        for i, x in enumerate(arr, 1):
            running_sum += x
            new_average = running_sum / i
            
            if (new_average >= best_average
                or math.isclose(new_average, best_average)):
                
                best_average = new_average
                best_length = i

        return (best_average, best_length)

    removed_lengths = []
    total_removed = 0

    while total_removed < len(nums):
        _, new_removal = max_prefix_avg(nums[total_removed:])
        removed_lengths.append(new_removal)
        total_removed += new_removal

    return removed_lengths

def kc_Pychoed_1(nums: List[int]) -> List[int]:
    """Return list of lengths of max average prefix reductions."""

    def max_prefix_avg(arr: List[int]) -> Tuple[float, int]:
        """Return value and length of max average prefix in arr"""
        if len(arr) == 0:
            return (-math.inf, 0)
        
        best_length = 1
        best_average = -math.inf
        running_sum = 0

        for i, x in enumerate(arr, 1):
            running_sum += x
            new_average = running_sum / i
            
            if new_average >= best_average:
                
                best_average = new_average
                best_length = i

        return (best_average, best_length)

    removed_lengths = []
    total_removed = 0

    while total_removed < len(nums):
        _, new_removal = max_prefix_avg(nums[total_removed:])
        removed_lengths.append(new_removal)
        total_removed += new_removal

    return removed_lengths

def kc_Pychoed_2(nums):
    removals = []
    while nums:
        averages = map(truediv, accumulate(nums), count(1))
        remove = max(zip(averages, count(1)))[1]
        removals.append(remove)
        nums = nums[remove:]
    return removals

# Lengths of the segments in the upper convex hull
# of the cumulative sum graph
def Matt(arr):
    if len(arr) < 2:
        if len(arr) < 1:
            return []
        else:
            return [1]
    
    hull = [(0, 0),(1, arr[0])]
    for x in range(2, len(arr)+1):
        # this has x coordinate x-1
        prevPoint = hull[len(hull) - 1]
        # next point in cumulative sum
        point = (x, prevPoint[1] + arr[x-1])
        # remove points not on the convex hull
        while len(hull) >= 2:
            p0 = hull[len(hull)-2]
            dx0 = prevPoint[0] - p0[0]
            dy0 = prevPoint[1] - p0[1]
            dx1 = x - prevPoint[0]
            dy1 = point[1] - prevPoint[1]
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
            prevPoint = p0
        hull.append(point)
    
    return [hull[i+1][0] - hull[i][0] for i in range(0, len(hull)-1)]

def pairwise(lst):
    return zip(lst, lst[1:])

def Matt_Pychoed(arr):
    hull = [(0, 0)]
    for x, y in enumerate(accumulate(arr), 1):
        while len(hull) >= 2:
            (x0, y0), (x1, y1) = hull[-2:]
            dx0 = x1 - x0
            dy0 = y1 - y0
            dx1 = x - x1
            dy1 = y - y1
            if dy1*dx0 < dy0*dx1:
                break
            hull.pop()
        hull.append((x, y))
    return [q[0] - p[0] for p, q in pairwise(hull)]

funcs = kc, kc_Pychoed_1, kc_Pychoed_2, Matt, Matt_Pychoed
stats = min, median, mean, max
tss = [[] for _ in funcs]
for r in range(1, 21):
    print(f'After round {r}:')
    arr = random.choices(range(1, 1001), k=100_000)
    # arr = list(range(1000, 0, -1))  # strictly decreasing: 1000, 999, ..., 1
    expect = None
    print(*(f'{stat.__name__:^7}' for stat in stats))
    for func, ts in zip(funcs, tss):
        t0 = timer()
        result = func(arr)
        t1 = timer()
        ts.append(t1 - t0)
        if expect is None:
            expect = result
        assert result == expect
        print(*('%3d ms ' % (stat(ts) * 1e3) for stat in stats), func.__name__)
    print()

【Discussion】: