Memory-efficient way to write an uncompressed file from a gzip file: answers

【Question title】: Memory-efficient way to write an uncompressed file from a gzip file
【Posted】: 2023-03-15 03:26:02
【Question】:

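Using Python 3.5.

I am decompressing a gzip file and writing the result to another file. While looking into an out-of-memory problem, I found this example in the documentation for the gzip module: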

import gzip
import shutil
with open('/home/joe/file.txt', 'rb') as f_in:
    with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

That compresses, and I want to decompress, so I assumed I could reverse the pattern, which gives:

with open(unzipped_file, 'wb') as f_out, gzip.open(zipped_file, 'rb') as f_in:
    shutil.copyfileobj(f_in, f_out)

My question is: why did I run into trouble with the following?

with gzip.open(zipped_file, 'rb') as zin, open(unzipped_file, 'wb') as wout:
    wout.write(zin.read())

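Either this was simply the last straw for the available memory, or I was naive in assuming that the file would behave like a generator and stream the decompression, using very little memory. Should these two approaches be equivalent?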

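【Comments】:

I suggest you look at the code of the shutil.copyfileobj function.
"I was naive in assuming that the file would behave like a generator": that is exactly it. You are answering your own question. To prove it, try print(type(zin.read())) on some smaller file.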

Tags: python python-3.x gzip generator shutil
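
The check suggested in the comments can be tried on a small throwaway file. This is a minimal sketch: the file name small.txt.gz and the payload are made up for illustration and are not from the question.

```python
import gzip
import os
import tempfile

# Hypothetical sample data, small enough to read in one go.
payload = b"hello gzip\n" * 1000

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "small.txt.gz")  # hypothetical file name
    with gzip.open(path, "wb") as zout:
        zout.write(payload)

    # read() with no argument returns the entire decompressed content.
    with gzip.open(path, "rb") as zin:
        data = zin.read()

    # read(n) returns at most n decompressed bytes per call.
    with gzip.open(path, "rb") as zin:
        first = zin.read(16)

print(type(data))   # a bytes object, not a generator
print(len(data), len(first))
```

So the object returned by gzip.open can be read incrementally, but only if you ask for a bounded number of bytes per call.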

【Solution 1】:

Here is the shutil.copyfileobj function.

def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)

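It reads the source in chunks of 16*1024 bytes, so only one chunk is held in memory at a time. Your second version calls zin.read() with no size argument, so the entire decompressed content is read into memory at once, and that is what causes the memory problem.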
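
Applied to decompression, this could look like the sketch below. The helper name gunzip_streaming, the 64 KiB chunk size, and the temporary file names are assumptions for illustration. The third positional argument of shutil.copyfileobj is the chunk length shown in the function above.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_streaming(zipped_file, unzipped_file, chunk_size=64 * 1024):
    """Decompress zipped_file into unzipped_file, chunk_size bytes at a time.

    Hypothetical helper: it only wraps shutil.copyfileobj.
    """
    with gzip.open(zipped_file, "rb") as f_in, open(unzipped_file, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out, chunk_size)


# Round-trip check on made-up sample data.
sample = b"some repetitive line of text\n" * 50000

with tempfile.TemporaryDirectory() as tmp:
    zipped = os.path.join(tmp, "sample.txt.gz")
    unzipped = os.path.join(tmp, "sample.txt")
    with gzip.open(zipped, "wb") as zout:
        zout.write(sample)

    gunzip_streaming(zipped, unzipped)

    with open(unzipped, "rb") as f:
        round_trip_ok = f.read() == sample

print(round_trip_ok)
```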

【Discussion】:

I think this is the cleaner solution, rather than essentially duplicating it as I did in my answer.

【Solution 2】:

Instead of the memory-hungry (and naive)

import gzip
with gzip.open(zipped_file, 'rb') as zin, open(unzipped_file, 'wb') as wout:
    wout.write(zin.read())

I tested the following, based on the earlier answers:

import gzip

block_size = 64 * 1024  # decompressed bytes held in memory per iteration
with gzip.open(zipped_file, 'rb') as zin, open(unzipped_file, 'wb') as wout:
    while True:
        uncompressed_block = zin.read(block_size)
        if not uncompressed_block:  # an empty bytes object means end of file
            break
        wout.write(uncompressed_block)

Verified on a 4.8 GB file.
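
To get a rough feel for the difference on a much smaller input, the two approaches can be compared with tracemalloc. This is only a sketch. The function names, the 8 MiB random payload, and the file names are assumptions. tracemalloc reports memory allocated through Python's allocators, not the process's resident set size, so treat the numbers as indicative only.

```python
import gzip
import os
import tempfile
import tracemalloc

BLOCK_SIZE = 64 * 1024  # same block size as the loop above


def decompress_all_at_once(zipped_file, unzipped_file):
    with gzip.open(zipped_file, "rb") as zin, open(unzipped_file, "wb") as wout:
        wout.write(zin.read())  # whole decompressed payload in one bytes object


def decompress_in_blocks(zipped_file, unzipped_file):
    with gzip.open(zipped_file, "rb") as zin, open(unzipped_file, "wb") as wout:
        while True:
            block = zin.read(BLOCK_SIZE)
            if not block:
                break
            wout.write(block)


def peak_traced_bytes(func, *args):
    """Run func(*args) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


with tempfile.TemporaryDirectory() as tmp:
    zipped = os.path.join(tmp, "sample.gz")  # hypothetical file name
    payload_size = 0
    # Write 128 blocks of 64 KiB (8 MiB total) without keeping them around.
    with gzip.open(zipped, "wb", compresslevel=1) as zout:
        for _ in range(128):
            chunk = os.urandom(64 * 1024)
            zout.write(chunk)
            payload_size += len(chunk)

    peak_all = peak_traced_bytes(
        decompress_all_at_once, zipped, os.path.join(tmp, "all.bin"))
    peak_blocks = peak_traced_bytes(
        decompress_in_blocks, zipped, os.path.join(tmp, "blocks.bin"))

print("payload bytes:      ", payload_size)
print("peak, read():       ", peak_all)
print("peak, read(64 KiB): ", peak_blocks)
```

With read(), the peak has to be at least the size of the decompressed data, because all of it exists as a single bytes object before being written. With the block loop, the peak should stay near the block size regardless of file size.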

【Discussion】:

I feel this must have been asked and answered elsewhere. Does anyone have a link?
The solutions mentioned by @Vinit and Jean-Francois are the best.