python代码完成时间太长答案

【问题标题】：python code takes too long to finishpython代码完成时间太长
【发布时间】：2016-06-20 23:54:50
【问题描述】：

让我坦率地说，我编写代码是为了好玩，这是我过去几天在业余时间一直在进行的代码挑战。挑战在于我得到一堆由空格（文档）分隔的单词，然后是列表中的几个搜索词。我必须在文档中找到最接近这些 searchTerms 的位置。基本上，找到包含所有 searchTerms 的文档的最小子集并输出该子集。到目前为止，我的功能似乎在我的系统上运行。但是，当我上传时，我被告知我的算法执行时间太长。我的想法是在文档中找到 searchTerm 的每个实例，然后针对它运行 itertools.product()。然后我测试每一个，根据索引值确定哪一个是最短的。这是我目前所拥有的：

def answer(document, searchTerms):
    from itertools import product

    #build a list of the input document
    document = document.split()

    index = []
    #find all indexes for the searchTerms and build a list of lists 
    for w in searchTerms:
        index.append([i for i,x in enumerate(document) if x == w])

    #build iterator of all possible combinations of indexes for each of the searchTerms
    combinations = product(*index)

    #recover memory
    del index

    #build tuple of minimum distance between all search terms
    shortest  = min(((max(x) - min(x),x) for x in combinations),key=lambda x: x[0])

    return (' '.join(document[min(shortest[1]):max(shortest[1])+1]))

我尝试使用多处理来加速我的代码部分，但还没有完全掌握正确的语法。例如：

from multiprocessing import Pool
p = Pool(processes=2)
shortest = p.map(min_max,combinations)

def min_max(combinations):
    return min(((max(x) - min(x),x) for x in combinations))

结果：

Traceback (most recent call last):
File "./searchTerms2.py", line 65, in <module>
print (answer(document,searchTerms))
File "./searchTerms2.py", line 45, in answer
shortest = p.map(min_max,combinations)
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
TypeError: 'int' object is not iterable

任何指针将不胜感激。有没有更好的方法来解决这个问题？在哪些方面我可以提高效率？

--编辑-- 问题进一步解释：

document = 'this is a song that never ends it goes on and on my friend some people started singing it not knowing what it was and they will continue singing it forever just because this is the song'

searchTerms = ['this', 'goes','on']

应该导致：

'this is a song that never ends it goes on'

这适用于我当前的算法，但如果给定更大的文档和 searchTerms，则速度不够快。我希望这更清楚...

我一直在为我的代码计时，看来我最大的性能影响来自：

shortest  = min(((max(x) - min(x),x) for x in combinations),key=lambda x: x[0])

随着我增加“文档”中的字数并在“搜索词”中添加额外的搜索词，我看到该行的性能受到很大影响。其他一切都与我能说的相差无几..

【问题讨论】：

如果您的目标是速度，那么任何带有#recover memory 的行都可能不存在，除非您正在处理可能会将您送入虚拟内存的多 GB 数组。
不是为每个搜索词执行index.append([i for i,x in enumerate(document) if x == w])，而是执行index.append([i for i,x in enumerate(document) if x in searchTerms])。此外，将searchTerms 转换为set 以加快查找速度。
我会把它变成一个答案，因为这可能是你绊倒的原因。
关于多处理：p.map 将min_max 应用于combinations 的每个元素。看来您正在期待迭代器本身。
@Mad Physicist index.append([i for i,x in enumerate(document) if x in searchTerms]) 生成一个索引值列表，该列表不区分哪个 searchTerm 生成的索引与在多个地方发现的相同 searchTerm 相对。这意味着距离不会在包含尽可能接近的所有 searchTerms 的文档中当场归零。我在这里错过了什么吗？

标签： python performance multiprocessing processing-efficiency coding-efficiency

【解决方案1】：

我已经考虑这个问题一天了，我觉得它很有趣。将“文档”视为一条线，将每个“单词”视为该线上的一个点会有所帮助。那么，任何解决方案都是一个窗口/范围覆盖这条线的一部分，左侧（start）和右侧（end）。

尝试的解决方案： Mad Physicist 的解决方案不起作用的原因是它从这条线上的一个点开始，并且当它们实际上包含很多重叠时，将这个点与所有其他点之间的距离视为正交。它只选择每个匹配搜索词的最近点，这限制了搜索的解决方案空间，因此错过了一些解决方案。不难找到一个例子，比如：

document = 'a x x d x x x a j x x'
searchTerms = 'a d j'.split()

它以d 开始，然后选择最接近的a，此时更远的a 将产生更短的整体解决方案。

蛮力解决方案： 您在问题中的解决方案使用product 生成可能的解决方案并检查每个解决方案。对于像您发布的示例这样的小问题，这很好，实际上速度很快，但是随着文档长度的增加，尤其是搜索词的数量，product 的组合数量迅速增长。

新解决方案： 我们可以做的第一件事是意识到任何不包括其最小和最大索引之间的所有点的组合都是无效的。这消除了很多组合，因此您实际上只选择 (start, end) 点的组合，而不管搜索词的数量。

虽然生成这些组合可能有一些花哨的数学公式，但我采用了不同的方法......

如果您将每个搜索词的索引单独视为从最低索引到最高索引的迷你窗口，很明显解决方案的 end 索引的下限是最大 start 所有这些范围的索引。这是因为任何 end 索引较低的窗口都不会包含此搜索词。 start 索引的下限就是最低匹配索引。

这个 (start,end) 必须是一个解，所以我们将其用作初始猜测，然后按照以下过程：

它有助于创建所有匹配索引的平面列表并在此列表上工作，因为所有其他索引都不相关。在这种情况下start = 0.

将 start 索引提高到下一个匹配索引（平面列表上的start++）。这会将最左边的匹配项弹出窗口。获取不小于 start 的下一个匹配索引。如果这个索引已经在范围内，那么我们减少了冗余匹配并有另一个解决方案。如果此索引超出右侧范围，请移动 end 以扩大范围以再次包含此匹配项。如果没有更多可用的匹配项，那么我们已经用完了解决方案。

重复这个过程，直到没有更多的解决方案，跟踪哪个解决方案产生最短的 range = end - start。这是最终的解决方案。

测试：

为了在测试中获得更多变化并验证我的解决方案产生的解决方案与原始解决方案相同，我从document 中随机抓取k 搜索词：

import random
terms = random.sample(document.split(), random.randint(3,5))
print(terms)
s1 = answer_product(document, terms)
s2 = answer_window(document, terms)

assert s1 == s2

然后尝试做一个简单的基准测试：

import timeit
N = 1000
for j in xrange(2,8):
    terms = random.sample(document.split(), j)
    print(N,j)
    print('window: %s s'%timeit.timeit(lambda: answer_window(document*4, terms), number=N))
    print('product: %s s'%timeit.timeit(lambda: answer_product(document*4, terms), number=N))

在我的电脑上，对于N=1000,k=2 的小机箱，它们在t~=0.03s 附近都非常快。但是，随着k 增长到k=7，answer_product 所需的时间增长到t>20s 而answer_window 仍然是t~=0.03s。请注意，由于我没有用于测试的实际“文档”，因此我只是将示例乘以 4 以增加搜索量。

╔═════╦═══╦═══════════════════╦════════════════════╦═══════╗ ║ N ║ k ║ answer_window (s) ║ answer_product (s) ║ p/w ║ ╠═════╬═══╬═══════════════════╬════════════════════╬═══════╣ ║ 1e3 ║ 2 ║ 0.0231 ║ 0.0347 ║ 1.5 ║ ║ 1e3 ║ 3 ║ 0.0227 ║ 0.058 ║ 2.55 ║ ║ 1e3 ║ 4 ║ 0.025 ║ 0.242 ║ 9.68 ║ ║ 1e3 ║ 5 ║ 0.0326 ║ 3.044 ║ 93.4 ║ ║ 1e3 ║ 6 ║ 0.035 ║ 11.55 ║ 330 ║ ║ 1e3 ║ 7 ║ 0.0299 ║ 23.82 ║ 797 ║ ║ 1e5 ║ 2 ║ 2.2 ║ 2.524 ║ 1.15 ║ ║ 1e5 ║ 3 ║ 2.195 ║ 2.743 ║ 1.25 ║ ║ 1e5 ║ 4 ║ 3.272 ║ 46.51 ║ 14.2 ║ ║ 1e5 ║ 5 ║ 3.74 ║ 67.71 ║ 18.1 ║ ║ 1e5 ║ 6 ║ 3.52 ║ 1137 ║ 323 ║ ║ 1e5 ║ 7 ║ 3.98 ║ 4519 ║ 1135 ║ ╚═════╩═══╩═══════════════════╩════════════════════╩═══════╝

代码：

def answer_window(doc, terms):
    doc = doc.split()
    # create a grouping of indices by match and a flat array of all match
    # indices
    index = {w:[] for w in terms}
    indices = []
    j = 0
    for (i, w) in enumerate(doc):
        if w in index:
            # save real doc indices in flat array and use indices into that
            # array to simplify stepping.  both are automatically ordered
            indices.append(i)
            index[w].append(j)
            j += 1
    # find the maximum leftmost match index.  this is the lower bound on the
    # right side of the solution window
    highest_min = max(v[0] for v in index.values())

    # start with lowest minimum index (first) and highest minimum index (which
    # is the lower bound on the right side). this must be a solution.
    # then look for a shorter one by stepping the left side, replacing lost
    # matches from the right (expanding when necessary) until the left cannot
    # be advanced anymore.  this will cover all possible solution windows and the
    # one with the shortest length is saved
    start, end = 0, highest_min
    sol = start, end
    dsol = indices[sol[1]]-indices[sol[0]]
    while True:
        # pop leftmost match
        pop = doc[indices[start]]
        start += 1
        # need to make sure we still have the match we popped in the range
        for j in index[pop]:
            if j >= start:
                # another copy to the right!
                if j > end:
                    # must expand end to include the replacement
                    end = j
                    if indices[end]-indices[start] < dsol:
                        # new window is shorter than sol
                        sol = start, end
                        dsol = indices[sol[1]]-indices[sol[0]]
                elif indices[end]-indices[start] < dsol:
                    # the replacement is already inside the range, and moving
                    # the left side made the window smaller than sol
                    sol = start,end
                    dsol = indices[sol[1]]-indices[sol[0]]
                break # done with this pop
            else:
                # this match is left of our window
                pass
        else:
            # found no replacement, can't shrink left side anymore so we are
            # out of solutions
            break
    return (' '.join(doc[indices[sol[0]]:indices[sol[1]]+1]))

【讨论】：

【解决方案2】：

代码中的主要减速来自于您在不同单词之间寻找所有索引组合的事实。显然，这些组合中的大多数甚至都不符合最短运行的条件。这是一种应该运行得更快的算法：

查找所有搜索词的索引，按搜索词分隔（如您所做的那样）
使用匹配最少的术语作为基础。
对于每一次出现的基词，找出每个其他词中最接近的词（请注意，这比计算每个组合的距离要快得多）。最近邻居的最大散布是候选运行的长度。
找到最短的候选运行并将其返回。

此版本使用字典解析和 key 参数到 min。但是，它不使用__builtins__ 模块之外的任何东西。下面的代码在 Python 2.7 和 3.5 下运行：

def answer(document, searchTerms):
    #build a list of the input document
    document = document.split()

    # construct list of indices of occurrences for each term
    indices = {w: [i for i,x in enumerate(document) if x == w] for w in searchTerms}

    # find the least frequent term and isolate it
    leastFrequent = min(indices.keys(), key=lambda x: len(indices[x]))
    loopIndex = indices[leastFrequent]
    del indices[leastFrequent]

    # for each element of leastFrequent, compute the nearest distance to each other item
    candidates = [None] * len(loopIndex)
    for index, element in enumerate(loopIndex):
        neighbors = [None] * len(indices)

        # find the distance to the nearest neighbor in each other list
        for ind, term in enumerate(indices):
            neighbors[ind] = min(indices[term], key=lambda x, e=element: abs(x - e))

        # the run length is the maximum of the maximum and element minus the minimum of the minimum and element
        start = min(min(neighbors), element)
        end = max(max(neighbors), element) + 1
        length = end - start
        candidates[index] = length, start, end

    # get the shortest candidate segment
    winner = min(candidates, key=lambda x: x[0])
    return ' '.join(document[winner[1]:winner[2]])

如果有s 搜索词出现在（几何）平均k 次，则此算法将在大约O(k * s * k) = O(s * k^2) 时间内运行。 k 的因素来自element 上的循环以及其中对min 的调用。 k 的因子来自term 上的循环。通过以最不频繁的元素为基础，我们显着减少了k 术语之一。尤其是其中一个词项只出现一次的情况，保证在每一种可能的组合中都有，所以外层循环只运行一次。

为了比较，您的实现使用itertools.product，它产生s 嵌套循环，每个循环运行k 迭代。这大约需要 O(k^s) 运行时间。

【讨论】：

哇。我喜欢这里的思考过程。我仍在消化这里的每个元素，但并排比较表明你的代码比我写的要快得多！我无法思考如何避免必须测试每个组合，这在我扩大文档和搜索术语时会成为一个大问题。我真的很感激所有的工作。这绝对有助于我的教育！
@RonWellman。我很高兴脑筋急转弯，你似乎已经投入了很多工作，我很高兴能够提供帮助。
另外，我刚刚修正了一个主要错字：not->note in step #3
我真的很感激。我正在针对一些我认为会导致我出现问题的边缘情况稍微修改您的代码。显然，我无法说出正在测试的代码是什么，所以我必须发挥我的想象力。修改 (1) 仅给出一个 searchTerm（导致崩溃）(2) 多次给出相同的 searchTerm，这与您的字典不匹配 (3) 等。再次，我真的很欣赏这种洞察力。您的代码为攻击这种性质的问题提供了独特的视角。通过找到所有可能的组合，我的方法类似于蛮力！
@RonWellman 您可以随时为问题投票和/或在您的代表允许的情况下选择它。