Confusing loop problem (python): answers

【Question title】: Confusing loop problem (python)
【Posted】: 2010-08-24 21:34:36
【Question】:

This is similar to the question in merge sort in python. I'm restating it because I don't think I explained the problem very well there.

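Basically, I have a series of about 1000 files, all containing domain names. Altogether the data is > 1gig, so I'm trying to avoid loading all of it into RAM. Each individual file has been sorted using .sort(get_tld), which sorts the data by TLD (not by domain name: all the .coms together, all the .orgs together, etc.).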

A typical file might look like

something.ca
somethingelse.ca
somethingnew.com
another.net
whatever.org
etc.org

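But obviously longer.

I now want to merge all the files into one, maintaining the sort so that in the end the one large file still has all the .coms together, all the .orgs together, etc.

What I want to do is basically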

open all the files
loop:
    read 1 line from each open file
    put them all in a list and sort with .sort(get_tld)
    write each item from the list to a new file

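The problem I'm having is that I can't figure out how to loop over the files. I can't use with open() as because I don't have 1 file open to loop over, I have many. They're also all of variable length, so I have to make sure to get all the way through the longest one.

Any advice is much appreciated.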

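【Question discussion】:

This is stackoverflow.com/questions/3559807/merge-sort-in-python again, right? What's different? I can't see any difference. It would help if you highlighted the actual differences between the two questions.

Tags: python sorting loops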

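【Solution 1】:

Whether you can keep 1000 files open at once is a separate issue that depends on your OS and its configuration; if you can't, you'll have to proceed in two steps: merge groups of N files into temporary files, then merge the temporary files into the final result file (two steps suffice, since they let you merge a total of N squared files; as long as N is at least 32, merging 1000 files should therefore be possible). In any case, this is a separate issue from the task of "merging N input files into one output file" (it's only a matter of whether you call that function once or repeatedly).

The general idea for the function is to keep a priority queue (module heapq is good at that;-) of small lists, each containing the "sorting key" (the current TLD, in your case), followed by the last line read from the file, and finally the open file ready for reading the next line (with something distinct in between, to ensure that normal lexicographic ordering won't accidentally end up trying to compare two open files, which would fail). I think some code is probably the best way to explain the general idea, so next I'll edit this answer to supply the code (however, I have no time to test it, so take it as pseudocode intended to communicate the idea;-).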

import heapq

def merge(inputfiles, outputfile, key):
  """inputfiles: list of input, sorted files open for reading.
     outputfile: output file open for writing.
     key: callable supplying the "key" to use for each line.
  """
  # prepare the heap: items are lists with [thekey, k, theline, thefile]
  # where k is an arbitrary int guaranteed to be different for all items,
  # theline is the last line read from thefile and not yet written out,
  # (guaranteed to be a non-empty string), thekey is key(theline), and
  # thefile is the open file
  h = [(k, i.readline(), i) for k, i in enumerate(inputfiles)]
  h = [[key(s), k, s, i] for k, s, i in h if s]
  heapq.heapify(h)

  while h:
    # get and output the lowest available item (==available item w/lowest key)
    item = heapq.heappop(h)
    outputfile.write(item[2])

    # replenish the item with the _next_ line from its file (if any)
    item[2] = item[3].readline()
    if not item[2]: continue  # don't reinsert finished files

    # compute the key, and re-insert the item appropriately
    item[0] = key(item[2])
    heapq.heappush(h, item)

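Of course, in your case, as the key function you'll want one that extracts the top-level domain from a line that is a domain name (with trailing newline). In a previous question you were already pointed to the urlparse module as preferable to string manipulation for this purpose. If you do insist on string manipulation,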

def tld(domain):
  return domain.rsplit('.', 1)[-1].strip()

or something along these lines is probably a reasonable approach under this constraint.
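As a quick illustration of what such a key does, here is a small sketch that applies the same string-based tld function to a few made-up lines (the sample data is purely illustrative, borrowed from the question's example file). Because list.sort is stable, lines that share a TLD keep their original relative order.

```python
def tld(domain):
    # Same string-based key as above: the text after the last dot,
    # with the trailing newline stripped.
    return domain.rsplit('.', 1)[-1].strip()

# Illustrative sample lines, each with the trailing newline a file would give.
lines = ["somethingnew.com\n", "whatever.org\n", "something.ca\n", "etc.org\n"]

# Sorting groups the lines by TLD; ties keep their original order.
lines.sort(key=tld)
print(lines)
```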

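If you use Python 2.6 or later, heapq.merge is the obvious alternative, but in that case you need to prepare the iterators yourself (including ensuring that "open file objects" never end up being compared by accident...) with a "decorate / undecorate" approach similar to the one I use in the more portable code above.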
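For readers on Python 3.5 or later, heapq.merge also accepts a key argument, which should make the manual decoration unnecessary. The following is only a sketch under that assumption: merge_sorted is a hypothetical helper name, and io.StringIO objects stand in for real open files.

```python
import heapq
import io

def tld(line):
    # Key function: the text after the last dot, newline stripped.
    return line.rsplit('.', 1)[-1].strip()

def merge_sorted(inputfiles, outputfile, key):
    # heapq.merge pulls lines from each input lazily, so only one pending
    # line per file is held in memory. The key= argument needs Python 3.5+.
    outputfile.writelines(heapq.merge(*inputfiles, key=key))

# In-memory stand-ins for real files, each already sorted by TLD.
a = io.StringIO("something.ca\nsomethingnew.com\nwhatever.org\n")
b = io.StringIO("other.ca\nanother.net\netc.org\n")
out = io.StringIO()
merge_sorted([a, b], out, tld)
print(out.getvalue())
```

With real data you would pass open file objects instead of the StringIO stand-ins; the sketch assumes every line, including the last one in each file, ends with a newline.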

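【Discussion】:

【Solution 2】:

You want to use a merge sort, e.g. heapq.merge. I'm not sure whether your OS allows you to open 1000 files simultaneously. If not, you may need to do it in 2 or more passes.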
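A multi-pass merge could look roughly like the sketch below. It assumes Python 3.5+ (for the key argument of heapq.merge) and that every line ends with a newline; merge_paths, merge_in_passes and batch_size are hypothetical names chosen for illustration, and 32 is just an example batch size.

```python
import contextlib
import heapq
import os
import tempfile

def tld(line):
    # Key function: the text after the last dot, newline stripped.
    return line.rsplit('.', 1)[-1].strip()

def merge_paths(paths, out_path, key):
    # Open one batch of inputs, merge them lazily, and close them all on exit.
    with contextlib.ExitStack() as stack:
        files = [stack.enter_context(open(p)) for p in paths]
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*files, key=key))

def merge_in_passes(paths, out_path, key, batch_size=32):
    # Merge groups of at most batch_size files into temporary files, and
    # repeat until few enough remain to merge in one final call.
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    temps = []
    try:
        while len(paths) > batch_size:
            merged = []
            for i in range(0, len(paths), batch_size):
                fd, tmp = tempfile.mkstemp(suffix=".merge")
                os.close(fd)
                temps.append(tmp)
                merge_paths(paths[i:i + batch_size], tmp, key)
                merged.append(tmp)
            paths = merged
        merge_paths(paths, out_path, key)
    finally:
        # Remove every temporary file, even if a merge failed part-way.
        for tmp in temps:
            os.remove(tmp)
```

A call such as merge_in_passes(list_of_paths, "merged.txt", tld) never holds more than batch_size input files open at once, at the cost of writing the data to disk once per extra pass.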

【Discussion】:

【Solution 3】:

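Why not split up the domains by first letter, so you just split the source files into 26 or more files that could be named something like: domains-a.dat, domains-b.dat. Then you can load these entirely into RAM, sort them, and write them out to a common file.

So: 3 input files split into 26+ source files; the 26+ source files can each be loaded individually, sorted in RAM, and then written to the combined file.

If 26 files are not enough, I'm sure you could split into even more files... domains-ab.dat. The point is that files are cheap and easy to work with (in Python and many other languages), and you should use them to your advantage.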

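【Discussion】:

【Solution 4】:

Your algorithm for merging sorted files is incorrect. What you should do is read one line from each file, find the lowest-ranked item among all the lines read, write it to the output file, and then read the next line from the file that item came from. Repeat this process (ignoring any files that are at EOF) until the end of all files has been reached.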
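The procedure described above can be sketched without a heap, using a plain linear scan for the lowest pending line. This is only an illustration: merge_by_scan is a hypothetical name, io.StringIO objects stand in for real files, and the scan costs time proportional to the number of files for every line written, so a heap is the better choice for around 1000 inputs.

```python
import io

def tld(line):
    # Key function: the text after the last dot, newline stripped.
    return line.rsplit('.', 1)[-1].strip()

def merge_by_scan(inputfiles, outputfile, key):
    # Hold exactly one pending line per file; skip files that start empty.
    pending = [[f.readline(), f] for f in inputfiles]
    pending = [entry for entry in pending if entry[0]]
    while pending:
        # Index of the pending line with the lowest key (first one on ties).
        i = min(range(len(pending)), key=lambda n: key(pending[n][0]))
        line, f = pending[i]
        outputfile.write(line)
        # Refill only from the file just consumed; drop it at EOF.
        nxt = f.readline()
        if nxt:
            pending[i][0] = nxt
        else:
            del pending[i]

# In-memory stand-ins for real files, each already sorted by TLD.
a = io.StringIO("something.ca\nsomethingnew.com\nwhatever.org\n")
b = io.StringIO("other.ca\nanother.net\netc.org\n")
out = io.StringIO()
merge_by_scan([a, b], out, tld)
print(out.getvalue())
```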

【Discussion】:

【Solution 5】:

#! /usr/bin/env python

"""Usage: unconfuse.py file1 file2 ... fileN

Reads a list of domain names from each file, and writes them to standard output grouped by TLD.
"""

import sys, os

spools = {}  # maps each TLD to its open spool file

for name in sys.argv[1:]:
    with open(name) as source:
        for line in source:
            line = line.rstrip("\n")
            if not line: continue
            tld = line[line.rindex(".")+1:]
            spool = spools.get(tld)
            if spool is None:
                # "w+" so the spool can be read back after writing
                spool = open(tld + ".spool", "w+")
                spools[tld] = spool
            spool.write(line + "\n")

# emit the spools in TLD order, deleting each one once it has been copied
for tld in sorted(spools):
    spool = spools[tld]
    spool.seek(0)
    for line in spool:
        sys.stdout.write(line)
    spool.close()
    os.remove(spool.name)

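【Discussion】:

Either everyone else has misunderstood the question and is trying to solve a harder problem than the one @d-c is trying to solve, or I have, and this is useless. To be precise, he isn't really interested in sorting at all, just in merging some already sorted lists, for a small value of 'sorted'.
Disregard that, I got it wrong. The rest of you understood the question correctly, and mine is just a simple but inefficient solution.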