Most efficient way to modify the last line of a large text file in Python

【Question title】: Most efficient way to modify the last line of a large text file in Python
【Posted】: 2016-02-22 01:20:17
【Question description】:

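I need to update the last line of several files larger than 2 GB. They consist of lines of text and are too big to read with readlines(). Currently it works fine by looping through the file line by line. However, I am wondering whether there is a compiled library that can do this more efficiently? Thanks!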

Current approach

    with open("large.XML") as myfile:
        for line in myfile:
            do_something()  # placeholder for the per-line work

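【Comments】:

If it's XML, why not use an XML parser? You should be able to implement something more efficient that way. I've used ElementTree and liked it.
@WarrenP Maybe the OP doesn't need to parse the XML? Also, isn't the point here to avoid reading a large chunk of the file into memory?
Related: stackoverflow.com/questions/7171140/…
Not really a duplicate, since this person wants to rewrite the end of the file rather than load it into memory.
@Two-BitAlchemist: It is as you suggested. Closed this question. Thanks!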

Tags: python io

【Solution 1】:

If this is really something line based (where a true XML parser isn't the best solution), mmap can help here.

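mmap the file, then call .rfind(b'\n') on the resulting object (possibly with adjustments to handle a file that ends with a newline, when you really want the non-empty line before it rather than the empty "line" after it). You can then slice out the final line alone. If you need to modify the file in place, you can resize the file to shave off (or add) the number of bytes corresponding to the difference between the line you sliced and the new line, then write back the new line. This avoids reading or writing any more of the file than you need.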

Example code (please comment if I've made a mistake):

import mmap

# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1

    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'\r\n')

    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed

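Apparently mm.resize won't work on some systems without mremap (e.g. OSX). To support those systems, you'd split the with (so the mmap closes before the file object) and use file-object-based seeks, writes and truncates to fix up the file. The following example also includes the Python 3.1 and earlier adjustment I mentioned previously, using contextlib.closing, for completeness: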

import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess

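The advantages of mmap over any other approach are:

No need to read any more of the file than the line itself (meaning 1-2 pages of the file; the rest never gets read or written)
Using rfind lets Python do the work of finding the newline quickly at the C layer (in CPython); explicit seeks and reads on a file object could match the "only read a page or so" behavior, but you'd have to hand-implement the search for the newline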

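Caveat: This approach will not work if you're on a 32 bit system and the file is too large to map into memory (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped). On most 32 bit systems, even in a newly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases, you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mappings, etc.). mmap doesn't require much physical RAM, but it does need contiguous address space. One of the big benefits of a 64 bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle on a 32 bit OS without added complexity. Most modern computers are 64 bit at this point, but it's something to keep in mind if you're targeting 32 bit systems (and on Windows, even if the OS is 64 bit, a 32 bit version of Python may have been installed by mistake, so the same problems apply).

Here's one more example that works on 32 bit Python even for huge files, assuming the last line isn't 100+ MB long (closing omitted for brevity):

import mmap

with open("large.XML", 'r+b') as myfile: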
    filesize = myfile.seek(0, 2)
    # Get an offset that only grabs the last 100 MB or so of the file aligned properly
    offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
    with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        # If line might be > 100 MB long, probably want to check if startofline
        # follows a newline here
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
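
The code comment above about lines that might be longer than the mapped window can be turned into an explicit check. The following is only a sketch under stated assumptions: the helper name last_line_start_in_window and its window parameter are made up for illustration, the window must be at least 1 byte, and the file is mapped read-only because only the offset is computed here.

```python
import mmap


def last_line_start_in_window(path, window=100 * 1024 ** 2):
    """Return the absolute offset where the last line of the file begins,
    mapping only (roughly) the final `window` bytes of the file.

    Raises ValueError if the last line starts before the mapped window.
    """
    with open(path, "rb") as f:
        filesize = f.seek(0, 2)
        if filesize == 0:
            return 0  # an empty file cannot be mapped; the line starts at 0
        # Round the offset down to a multiple of the allocation granularity
        offset = max(0, filesize - window) & ~(mmap.ALLOCATIONGRANULARITY - 1)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ, offset=offset) as mm:
            # -1 means no newline inside the window (ignoring the final byte)
            newline = mm.rfind(b"\n", 0, len(mm) - 1)
        if newline == -1 and offset > 0:
            # The line only starts exactly at the window edge if the byte
            # just before the window is a newline; otherwise it starts earlier
            f.seek(offset - 1)
            if f.read(1) != b"\n":
                raise ValueError("last line starts before the mapped window")
        return offset + newline + 1
```

A caller could catch the ValueError and retry with a larger window, or fall back to another method.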

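【Comments】:

Just use mm.rfind(b'\n', 0, len(mm) - 1). If the last byte is a newline, that will skip it. If it's anything else, including a one-character line or a zero-character line, the code still works.
Bummer, on OSX: "SystemError: mmap: resizing not available--no mremap()". It looks like the solution is to close the file, reopen it, seek to startofline, and then write.
It should be startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1 (note the +1) to preserve the preceding newline. It also has the accidental effect of eliminating the need for a not-found test.
@Harvey: Thanks! I've added another block of code for systems that don't support mmap.resize, and fixed the startofline calculation. Darn you, off-by-one errors!
Honestly, I don't know whether this is better than, worse than, or the same as the accepted answer on the duplicate, but I think you should consider adapting it as an answer to that question (in addition to this one). That is, assuming it's not already there. That question has 19 answers and I haven't read them all.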
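
The startofline calculation discussed in these comments can be checked on plain bytes objects, since bytes.rfind takes the same start and end arguments as mmap's rfind. This is a minimal sketch; the helper name start_of_last_line is made up for illustration and is not part of the answer's code.

```python
def start_of_last_line(data):
    """Return the index where the last line of `data` (bytes) begins."""
    # Leaving the final byte out of the search skips a trailing newline.
    # rfind returns -1 when no newline is found, so the + 1 yields 0,
    # which is the correct start for a single-line (or empty) input.
    return data.rfind(b"\n", 0, len(data) - 1) + 1


# File ends with a newline: the last line is b"three\n"
assert start_of_last_line(b"one\ntwo\nthree\n") == 8
# File does not end with a newline: the last line is b"three"
assert start_of_last_line(b"one\ntwo\nthree") == 8
# Single-line inputs need no separate not-found test
assert start_of_last_line(b"only\n") == 0
assert start_of_last_line(b"only") == 0
assert start_of_last_line(b"") == 0
```

Note that an input ending in two newlines, such as b"a\n\n", gives 2: the "last line" is then the empty line between them.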

【Solution 2】:

Update: Use ShadowRanger's answer. It's shorter and more robust.

For posterity:

Read the last N bytes of the file and search backwards for a newline.

#!/usr/bin/env python

# Create a small test file; bytes literals keep this working on Python 3,
# where files opened in binary mode only accept and return bytes
with open("test.txt", "wb") as testfile:
    testfile.write(b'\n'.join([b"one", b"two", b"three"]) + b'\n')

with open("test.txt", "r+b") as myfile:
    # Read the last 1kiB of the file
    # we could make this be dynamic, but chances are there's
    # a number like 1kiB that'll work 100% of the time for you
    myfile.seek(0, 2)
    filesize = myfile.tell()
    blocksize = min(1024, filesize)
    myfile.seek(-blocksize, 2)
    # search backwards for a newline (excluding very last byte
    # in case the file ends with a newline); rindex raises ValueError
    # if the block contains no other newline
    index = myfile.read().rindex(b'\n', 0, blocksize - 1)
    # seek to the character just after the newline
    myfile.seek(index + 1 - blocksize, 2)
    # read in the last line of the file
    lastline = myfile.read()
    # modify lastline
    lastline = b"Brand New Line!\n"
    # seek back to the start of the last line
    myfile.seek(index + 1 - blocksize, 2)
    # write out new version of the last line
    myfile.write(lastline)
    myfile.truncate()

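【Comments】:

You probably want to use rfind, not rindex, or you'll handle single-line files by throwing an exception when you could just rewrite the single line. I guess it depends on whether the file is known to contain multiple lines.
@ShadowRanger: I started to do that, but then you don't know whether you really found the beginning of the line or just the beginning of the block. I'm recommending your answer while leaving mine up for people to look at.
Ah, right. Forgot about the block reading. One big advantage of mmap is that you don't need to worry about this sort of thing. :-)
This has the problem that you need to establish an n which is guaranteed to be big enough to always contain the final newline, or arrange a fallback to some other method (repeating with increasingly larger blocks of n might not be a great fallback strategy).
@tripleee: Agreed. That's why I recommend the mmap approach. I always forget about mmap.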
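
For anyone who still wants the block-reading approach, one way around the "beginning of the line or beginning of the block" ambiguity raised above is to keep stepping backwards one fixed-size block at a time until a newline turns up or the start of the file is reached. This is only a sketch and is not taken from either answer; the names find_last_line_start and replace_last_line and the blocksize default are assumptions made for illustration.

```python
import os


def find_last_line_start(f, blocksize=1024):
    """Return the byte offset where the last line of binary file `f` begins."""
    f.seek(0, os.SEEK_END)
    end = f.tell()
    # Ignore a single trailing newline so the line before it is found
    if end:
        f.seek(end - 1)
        if f.read(1) == b"\n":
            end -= 1
    pos = end
    while pos > 0:
        # Step back one block (or whatever is left) and search it
        step = min(blocksize, pos)
        pos -= step
        f.seek(pos)
        index = f.read(step).rfind(b"\n")
        if index != -1:
            return pos + index + 1  # the line starts just after the newline
    return 0  # no newline found: the whole file is one line


def replace_last_line(path, new_line, blocksize=1024):
    """Overwrite the last line of the file at `path` with `new_line` (bytes)."""
    with open(path, "r+b") as f:
        start = find_last_line_start(f, blocksize)
        f.seek(start)
        f.write(new_line)
        f.truncate()  # drop leftover bytes if the old line was longer
```

Each block is read at most once, so a short last line still costs only a read or two, while a very long last line degrades to a backwards scan instead of failing.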