python - splitting huge list in multiple lists; loop over each of them

【Question title】: python - splitting huge list in multiple lists; loop over each of them
【Posted】: 2017-12-10 06:44:05
【Question】:

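Following up on this question of mine, I realized that the corpus is too large and has to be split into several mini lists before running the Levenshtein calculations. The code below is my simple attempt, but I wonder whether there is a more elegant way: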

import csv#, StringIO
import itertools, Levenshtein

# open the newline-separated list of words
path = '/Users/path/'
file = path + 'wordlist.txt'
output1 = path + 'ortho1.csv'
output2 = path + 'ortho2.csv'
output3 = path + 'ortho3.csv'
output4 = path + 'ortho4.csv'
output5 = path + 'ortho5.csv'
output6 = path + 'ortho6.csv'

words = sorted(set(s.strip() for s in open(file)))

# words is a list with 16349, so I split it in to 6 mini lists
verbs1 = words[:3269]
verbs2 = words[3269:13080]
verbs3 = words[13081:9811]
verbs4 = words[9812:6542]
verbs5 = words[6543:3273]
verbs6 = words[3274:len(words)]

For each of the lists above, I run the following loop:

with open(output1, 'wb') as f:  
   writer = csv.writer(f, delimiter=",", lineterminator="\n")   
   for a, b in itertools.product(verbs1, words):        
       if (a < b and Levenshtein.distance(a,b) <= 5):
                    writer.writerow([a, b, Levenshtein.distance(a,b)])

Again, everything works, but I wonder whether there is a more elegant way than writing a separate loop for each mini list.

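【Comments】:

words[13081:9811] Did you try this? Isn't that just an empty list, since the "to" index is lower than the "from" index? Also, you should probably use a list instead of 6 separate variables.
You could split the verbs into a dictionary of lists. The first part of your code looks funny; the rest is fine.
Also, if you take the product with all words anyway, I don't see how splitting the list helps you. a*x + b*x + c*x is the same as (a+b+c) * x.
@tobias_k: You are right! I messed up the indices. My question was not entirely clear: Excel cannot open the single file, so I was trying to split the single csv into multiple csvs. But I don't seem to have succeeded at all. Any hints?
If you are only interested in distances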

Tags: python loops subset

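【Solution 1】:

There are a few problems with your code, and some other things you could improve:

Instead of six different variables each for verbs and output, use two lists; that way you can more easily adjust the "split points" or the number of sublists, and you don't have to copy-paste the code block for comparing the words six times; just use another loop
The sublist words[13081:9811] is empty, as is any other slice whose second index is smaller than the first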
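With verbs2 = words[3269:13080] and verbs3 = words[13081:9811], words[13080] ends up in neither sublist, because the second index of a slice is exclusive; the same applies to the following lists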
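In case that was your intention: splitting the list does not reduce the complexity or the running time, since you still have to compare every word with every other word; a*x + b*x + c*x is the same as (a+b+c) * x
Instead of checking a < b and throwing away half of the product, use combinations (but this only works when you do not split the list)
If you are only interested in pairs with edit distance <= 5, you could do some other checks first, such as comparing the lengths of the two words, or the set difference of the contained characters; both checks are faster than the actual edit distance check (which is O(n²) in the word length) and might rule out many combinations
For the same reason, do not compute the edit distance twice, once in the check and once for writing it to the file; compute it once and store it in a temporary variable
If you are splitting the file so that the output files do not get too large for Excel to handle (as I understand from one of your comments), your approach might not work, since the size of the output files can vary a lot, depending on how many matches there are in that sublist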
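
The cheap pre-checks from the list above could be sketched roughly as follows. This is only a minimal sketch: the name might_be_close and its signature are assumptions chosen for illustration, and the character check uses multisets (collections.Counter) instead of plain sets, which is just one possible reading of "set difference of the contained characters". Both tests are lower bounds on the edit distance, so a False result means the pair can be skipped safely, while a True result still needs the real distance check.

```python
from collections import Counter


def might_be_close(a, b, max_dist):
    """Cheap necessary conditions for edit_distance(a, b) <= max_dist."""
    # Each insertion or deletion changes the length by one, so the length
    # difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > max_dist:
        return False
    # Every character of a without a counterpart in b (and vice versa)
    # has to be deleted, inserted, or substituted: at least one edit each.
    count_a, count_b = Counter(a), Counter(b)
    only_in_a = sum((count_a - count_b).values())
    only_in_b = sum((count_b - count_a).values())
    return max(only_in_a, only_in_b) <= max_dist


print(might_be_close("walk", "walked", 5))    # True
print(might_be_close("go", "understand", 5))  # False
```

Because the function never rejects a pair that is actually within max_dist, it only saves work and does not change the result.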

Combining the above, you could try something like this (untested):

import csv
import itertools
import Levenshtein

path = '/Users/path/'
with open(path + 'wordlist.txt') as infile:
    words = set(s.strip() for s in infile)

combs = itertools.combinations(words, 2)
max_count = 10**6 # or whatever Excel can handle
for i, chunk in enumerate(chunks(combs, max_count)):
    with open("%sortho%d.csv" % (path, i), "w") as outfile:
        writer = csv.writer(outfile, delimiter=",", lineterminator="\n")   
        for a, b in chunk:
            if might_be_close(a, b, 5):
                d = Levenshtein.distance(a,b)
                if d <= 5:
                    writer.writerow([a, b, d])

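Here, chunks is a function to split an iterator into chunks, and might_be_close is a helper function that, for example, compares the lengths or the sets of contained letters, as described above. The output files may still differ in size, but none will ever have more than max_count rows.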
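
One possible chunks helper, as a sketch: this is not necessarily the function the answer refers to, just a common itertools.islice pattern, and its name and signature are assumed so that it matches the call chunks(combs, max_count). It materializes one chunk at a time as a list, so at most max_count items are held in memory at once.

```python
import itertools


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        # islice consumes up to `size` items from the shared iterator
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return  # iterator exhausted
        yield chunk


pairs = itertools.combinations("abcd", 2)  # 6 pairs in total
for i, chunk in enumerate(chunks(pairs, 4)):
    print(i, len(chunk))
```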

Or try this, to get output files with exactly max_count entries each (except possibly the last one):

max_count = 10**6 # or whatever Excel can handle

# defined before use, otherwise the loop below raises a NameError
def filter_matches(combs, max_dist):
    for a, b in combs:
        if might_be_close(a, b, max_dist):
            d = Levenshtein.distance(a,b)
            if d <= max_dist:
                yield a, b, d

matches = filter_matches(itertools.combinations(words, 2), 5)
for i, chunk in enumerate(chunks(matches, max_count)):
    with open("%sortho%d.csv" % (path, i), "w") as outfile:
        writer = csv.writer(outfile, delimiter=",", lineterminator="\n")
        for a, b, d in chunk:
            writer.writerow([a, b, d])

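Here, the filter_matches generator pre-filters the combinations, and we then chunk the matches to the right size.

【Discussion】:

I see, but I need some help. How do I use itertools.combinations(iterable, r) here?
@RobertP. Well, combinations only works if you do not split the list, since it takes just one iterable and the number of elements per combination (2 in your case)
Right, that's why I'm confused. So I guess if I need to split the list into multiple mini lists (which I think is the only way for me to view all the data in Excel), I cannot use combinations.
Thanks! I'll try this and get back to you shortly.
@RobertP. Did you read the answer? You have to provide that function yourself, doing the pre-checks I mentioned. Or just leave it out and check the edit distance for every combination.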

【Solution 2】:

Put the verbs in a list:

verbs = [words[:3269],words[3269:13080],words[13081:9811],words[9812:6542],
         words[6543:3273],words[3274:len(words)]]

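Then use the length of that list to create a loop with the same number of iterations. Using the index, we can build the output path and access the right element of verbs.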

for i in range(len(verbs)):
    output = '{}ortho{}.csv'.format(path,i+1)
    with open(output, 'wb') as f:  
        writer = csv.writer(f, delimiter=",", lineterminator="\n")   
        for a, b in itertools.product(verbs[i], words):        
            if (a < b and Levenshtein.distance(a,b) <= 5):
                writer.writerow([a, b, Levenshtein.distance(a,b)])
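
Hand-written slice boundaries are easy to get wrong. As a sketch, the sublists could instead be computed, so that no word is skipped and no slice is empty. Here split_evenly is a hypothetical helper name, not part of the original answer, and the generated word list is only a stand-in for the real data.

```python
def split_evenly(items, n_parts):
    """Split items into n_parts contiguous slices whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # The first `extra` parts get one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


words = ["w%05d" % i for i in range(16349)]  # stand-in for the real word list
verbs = split_evenly(words, 6)
print([len(v) for v in verbs])
```

The resulting verbs list can be used in the loop above in place of the manual slices.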

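【Discussion】:

Your "then" part is wrong. It writes the same results to every output file. You also have to create a list of verbs and use it in the loop.
Yes, that's what I just realized.
But as the other answer shows, there are other problems to deal with. Still, this answers the question: how do I loop over this.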