Reading a memory-mapped bzip2 compressed file: answers

【Question title】: Reading memory mapped bzip2 compressed file
【Posted】: 2012-09-30 09:02:16
【Question】:

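So I'm working with the Wikipedia dump file. It is an XML file that has been compressed with bzip2. I can write all the articles out to directories, but then when I want to do analysis I have to reread all of those files from disk. That gives me random access, but it is slow. I have enough RAM to hold the entire bzip2-compressed file in memory.

I can load the dump file just fine and read all the lines, but I cannot seek in it because it is so large. From what I can tell, the bz2 library has to read up to an offset before it can take me there (decompressing everything along the way, since the offset is in decompressed bytes).

Anyway, I'm trying to mmap the dump file (~9.5 GB) and load it into bzip2. I obviously want to test this on a bzip2 test file first.

I want to map the mmap file to a BZ2File so that I can seek through it (to get to a specific, uncompressed byte offset), but from what I can tell this is impossible without decompressing the entire mmap file (which would be well over 30 GB).

Do I have any options?

Here is some code I wrote to test.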

import bz2
import mmap

lines = '''This is my first line
This is the second
And the third
'''

with open("bz2TestFile", "wb") as f:
    f.write(bz2.compress(lines))

with open("bz2TestFile", "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    print "Part of MMAPPED"
    # This does not work until I hit a minimum length
    # due to (I believe) the checksums in the bz2 algorithm
    #
    for x in range(len(mapped)+2):
        line = mapped[0:x]
        try:
            print x
            print bz2.decompress(line)
        except:
            pass

# I can decompress the entire mmapped file
print ":entire mmap file:"
print bz2.decompress(mapped)

# I can create a bz2File object from the file path
# Is there a way to map the mmap object to this function?
print ":BZ2 File readline:"
bzF = bz2.BZ2File("bz2TestFile")

# Seek to specific offset
bzF.seek(22)
# Read the data
print bzF.readline()

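All of this makes me wonder: what is special about the bz2 file object that allows it to read a line after seeking? Does it have to read every line before it so that the checksums from the algorithm work out correctly?

【Question comments】:

This is a limitation of the BZ2 format; you cannot know the size of anything in the file until you have decompressed the whole damn thing.
If the file is static, could I decompress it once, get the data I need, and then use that information to decompress on the fly? Or should I try a different compression format?
I don't know; I would switch to gzip compression, which is better suited to streaming and flexible decompression.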
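
To make the cost of seeking concrete, here is a minimal sketch (Python 3, unlike the Python 2 code in the question) of reaching an uncompressed offset by feeding the compressed bytes through `bz2.BZ2Decompressor` and discarding output until the target is reached. The function name `read_at_uncompressed_offset` and the chunk size are made up for illustration, and it assumes the input holds a single bzip2 stream. Nothing about this avoids the decompression work; it only avoids keeping the full uncompressed output in memory.

```python
import bz2


def read_at_uncompressed_offset(compressed, offset, size, chunk=64 * 1024):
    """Return up to `size` bytes starting at uncompressed `offset`.

    `compressed` is any sliceable bytes-like object holding one bzip2
    stream (bytes, bytearray, or an mmap). Everything before `offset`
    is still decompressed, but it is thrown away as it is produced.
    """
    decomp = bz2.BZ2Decompressor()
    skipped = 0          # uncompressed bytes discarded so far
    out = bytearray()
    for start in range(0, len(compressed), chunk):
        data = decomp.decompress(compressed[start:start + chunk])
        if skipped < offset:
            # Drop the part of this output that lies before the target.
            drop = min(offset - skipped, len(data))
            skipped += drop
            data = data[drop:]
        out += data
        if len(out) >= size or decomp.eof:
            break
    return bytes(out[:size])


if __name__ == "__main__":
    text = b"This is my first line\nThis is the second\nAnd the third\n"
    blob = bz2.compress(text)
    print(read_at_uncompressed_offset(blob, 22, 18))
```

Since slicing an mmap returns bytes, the `mapped` object from the question should be usable as `compressed` here, although the time needed still grows with the offset.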

Tags: python mmap bzip2

【Solution 1】:

I found the answer! James Taylor wrote a couple of scripts for seeking in BZ2 files, and they are part of the bx-python package.

https://bitbucket.org/james_taylor/bx-python/overview

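These work well. They do not allow seeking to an arbitrary byte offset in the BZ2 file, but his scripts read the BZ2 data block by block and allow seeking on a per-block basis.

In particular, see bx-python / wiki / IO / SeekingInBzip2Files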
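
As a rough illustration of why block-based seeking is possible at all: each compressed block in a bzip2 stream starts with the 48-bit magic number 0x314159265359, and blocks are packed at arbitrary bit positions rather than on byte boundaries. The sketch below (Python 3) scans a buffer bit by bit for that magic. The name `find_block_bit_offsets` is hypothetical, this is not how bx-python or seek-bzip2 are known to do it, and a pure-Python bit loop like this would be far too slow for a multi-gigabyte dump. A match could also in principle be a coincidence inside compressed data, so treat the results as candidates.

```python
import bz2

BLOCK_MAGIC = 0x314159265359   # 48-bit marker at the start of each bzip2 block
MASK_48 = (1 << 48) - 1


def find_block_bit_offsets(data):
    """Return bit offsets (from the start of `data`) where the block magic appears."""
    offsets = []
    window = 0                      # the most recent 48 bits seen
    bits_seen = 0
    for byte in data:
        for shift in range(7, -1, -1):          # bzip2 packs bits MSB first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK_48
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_seen - 48)
    return offsets


if __name__ == "__main__":
    blob = bz2.compress(b"This is my first line\nThis is the second\n")
    print(find_block_bit_offsets(blob))
```

For any stream produced by `bz2.compress`, the first hit is at bit offset 32, right after the 4-byte `BZh` header. Later blocks generally do not land on byte boundaries, which is why a dedicated tool is needed to resume decompression from them.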

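【Comments】:

Note that to get the bzip-table command, which is responsible for mapping uncompressed offsets to compressed offsets, you also need the seek-bzip2 repository, as described in james_taylor / bx-python / issues / #14 - Getting Started: Indexing MAFs (Bitbucket)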
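
The idea of a table from uncompressed offsets to compressed offsets can also be illustrated without those tools, if recompressing the data is an option. The sketch below (Python 3) is not the bzip-table format and does not use bx-python: it assumes you control the compression, writes each chunk as its own independent bzip2 stream, and records where each one starts. The names `compress_in_chunks` and `read_range` are invented for this example.

```python
import bisect
import bz2


def compress_in_chunks(data, chunk_size):
    """Compress `data` as independent bzip2 streams, one per chunk.

    Returns (blob, index), where index[i] is the pair
    (uncompressed_start, compressed_start) of chunk i.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        index.append((start, len(blob)))
        blob += bz2.compress(data[start:start + chunk_size])
    return bytes(blob), index


def read_range(blob, index, offset, size):
    """Read `size` uncompressed bytes at `offset`, decompressing only the chunks needed."""
    if not index:
        return b""
    starts = [u for u, _ in index]
    first = bisect.bisect_right(starts, offset) - 1   # chunk containing `offset`
    skip = offset - index[first][0]                   # bytes to drop from that chunk
    out = bytearray()
    i = first
    while i < len(index) and len(out) < skip + size:
        c_start = index[i][1]
        c_end = index[i + 1][1] if i + 1 < len(index) else len(blob)
        out += bz2.decompress(blob[c_start:c_end])
        i += 1
    return bytes(out[skip:skip + size])


if __name__ == "__main__":
    text = b"This is my first line\nThis is the second\nAnd the third\n"
    blob, index = compress_in_chunks(text, 16)
    print(read_range(blob, index, 22, 18))
```

The concatenated streams are still a valid .bz2 file for readers that handle multiple streams (`bz2.decompress` does in Python 3.3 and later). The trade-off is a somewhat worse compression ratio for small chunks, in exchange for lookups that touch only the chunks covering the requested range.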