【Question title】: Read .gz files inside .tar files without extracting
【Posted】: 2022-01-14 19:26:15
【Question】:

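I have a .tar file that contains many .gz files inside a folder. Each of these .gz files contains a single .txt file. The other Stack Overflow questions related to this one are about extracting the files.

I'm trying to iteratively read the contents of each .txt file without extracting them, because the .tar is large.

First I read the contents of the .tar file: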

import tarfile
tar = tarfile.open("FILE.tar")
tar.getmembers()

Or on Unix:

tar xvf file.tar -O

Then I tried to use the tarfile extractfile method, but I got an error: "module 'tarfile' has no attribute 'extractfile'". Besides, I'm not even sure this is the right approach.

import gzip
for member in tar.getmembers():
    m = tarfile.extractfile(member)
    file_contents = gzip.GzipFile(fileobj=m).read()

If you want to create a sample file that mimics the original one:

$ mkdir directory
$ touch directory/file1.txt.gz directory/file2.txt.gz directory/file3.txt.gz
$ tar -c -f file.tar directory
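
Note that `touch` creates empty files, so the sample above has no text to read back. As a rough alternative, here is a sketch that builds a small archive with real gzip-compressed text entirely in memory. The helper name `build_sample_tar` and the member names are made up for illustration.

```python
import gzip
import io
import tarfile


def build_sample_tar(texts):
    """Return the bytes of a .tar whose members are gzip-compressed texts.

    `texts` maps a hypothetical member name (e.g. "directory/file1.txt.gz")
    to the plain text that should be compressed into that member.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, text in texts.items():
            payload = gzip.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)  # the size must be set before addfile()
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


sample = build_sample_tar({
    "directory/file1.txt.gz": "first file\n",
    "directory/file2.txt.gz": "second file\n",
})
```

Writing `sample` to disk (for example with `open("file.tar", "wb")`) should give a file that the snippets in this question can be tried against.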

This is the final version that worked for me after applying Mark Adler's suggestion:

import gzip
import tarfile
tar = tarfile.open("file.tar")
members = tar.getmembers()

# Here I append the results in a list, because I wasn't able to
# parse the tarfile type returned by .getmembers():
tar_name = []
for elem in members:
    tar_name.append(elem.name)

# Then I changed tarfile.extractfile to tar.extractfile as suggested: 
for member in tar_name:
    # I'm using this because I have other non-gzs in the directory
    if member.endswith(".gz"):
        m = tar.extractfile(member)
        file_contents = gzip.GzipFile(fileobj=m).read()

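【Comments】:

You haven't made, or at least haven't shown, much of an attempt here. Show what you tried, what happened, and what you expected. Don't leave out code. For example, you show tar.getmembers() as reading the contents, but that won't work unless you first ran tar = tarfile.open("FILE.tar"). You left out the tar =. If you're going to put code in the question, put in exactly what you did, not something vaguely similar to what you did.
Thanks for the input. At some point while pasting and reformatting, I deleted the tar variable.
You can use tar.getnames() to get the list of names directly. I'd bet that what's actually in your working program is if member.endswith(".gz"):. Copy and paste are your friends.
A .tar file does not contain a directory of all its members (as a .zip file does), only a sequence of file headers and file data blocks. To retrieve the list of members, every header in the whole file has to be read. That can be done by seeking to specific positions in the file, but it still reads data from locations spread across the file from beginning to end. Overall performance will probably be better if you go through the archive in order: read one file header, check the file name, extract the data, move on to the next header, and so on.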
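
The in-order approach from the last comment could look roughly like the sketch below. It assumes tarfile's stream mode ("r|"), which reads the archive strictly front to back without seeking; the function name `iter_gz_members_streaming` is made up for illustration, and each member is decompressed fully into memory.

```python
import gzip
import tarfile


def iter_gz_members_streaming(fileobj):
    """Yield (name, decompressed bytes) for each .gz member, in archive order.

    `fileobj` is a binary file object positioned at the start of the archive.
    """
    # "r|" opens the archive as a forward-only stream of headers and data blocks.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:  # headers are read one at a time while iterating
            if member.isfile() and member.name.endswith(".gz"):
                # In stream mode the data must be read before moving on.
                with tar.extractfile(member) as raw:
                    yield member.name, gzip.decompress(raw.read())
```

A call such as `for name, data in iter_gz_members_streaming(open("file.tar", "rb")): ...` would then visit the members once, in the order they were archived.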

Tags: python unix gzip tar

【Solution 1】:

Here is a Unix command-line / bash approach:

Prepare the files:

$ git clone https://github.com/githubtraining/hellogitworld.git
$ cd hellogitworld
$ gzip *
$ ls
build.gradle.gz  fix.txt.gz  pom.xml.gz  README.txt.gz  resources  runme.sh.gz  src
$ cd ..
$ tar -cf hellogitworld.tar hellogitworld/

Here is how to view the README:

$ tar -Oxf hellogitworld.tar hellogitworld/README.txt.gz | zcat

Result:

This is a sample project students can use during Matthew's Git class.

Here is an addition by me

We can have a bit of fun with this repo, knowing that we can always reset it to a known good state.  We can apply labels, and branch, then add new code and merge it in to the master branch.

As a quick reminder, this came from one of three locations in either SSH, Git, or HTTPS format:

* git@github.com:matthewmccullough/hellogitworld.git
* git://github.com/matthewmccullough/hellogitworld.git
* https://matthewmccullough@github.com/matthewmccullough/hellogitworld.git

We can, as an example effort, even modify this README and change it as if it were source code for the purposes of the class.

This demo also includes an image with changes on a branch for examination of image diff on GitHub.

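Note that I'm not affiliated with those git repositories.

Explanation of the tar flags:

Flag -x = extract
Flag -O = don't write the files to the file system, write them to STDOUT instead
Flag -f = specify the archive file

The rest is just piping the result to zcat to view the uncompressed plain text on STDOUT.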
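
For comparison, a rough Python counterpart of the `tar -Oxf ... | zcat` pipeline might look like this. It is only a sketch: the function name `read_gz_member` is made up, and it assumes the member holds UTF-8 text.

```python
import gzip
import tarfile


def read_gz_member(tar_path, member_name, encoding="utf-8"):
    """Return the decompressed text of one .gz member, without writing to disk."""
    with tarfile.open(tar_path) as tar:
        raw = tar.extractfile(member_name)  # raises KeyError if the name is absent
        if raw is None:
            raise ValueError(f"{member_name!r} is not a regular file")
        # gzip.open() accepts an already open file object in place of a path.
        with gzip.open(raw, mode="rt", encoding=encoding) as text:
            return text.read()
```

With the archive prepared above, `print(read_gz_member("hellogitworld.tar", "hellogitworld/README.txt.gz"))` should print the same README text.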

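【Comments】:

This is awesome! I'll definitely use it whenever I'm not working in Python.
In Python you can also run system shell (e.g. bash/sh) commands with os.system("ls")

【Solution 2】:

You need to use tar.extractfile(member) instead of tarfile.extractfile(member). tarfile is the module, and it knows nothing about the tar file you opened. tar is the tar file object, which refers to the .tar file you opened.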
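
A quick way to see the difference, as a small sketch using only the standard library: extractfile() is defined on the TarFile class, whose instances are what tarfile.open() returns, not on the tarfile module itself.

```python
import tarfile

# tarfile is the module: it offers open(), TarFile, TarInfo, ... but no extractfile().
print(hasattr(tarfile, "extractfile"))          # False

# extractfile() is a method of the TarFile class, so it has to be called on
# the object that tarfile.open() returns.
print(hasattr(tarfile.TarFile, "extractfile"))  # True
```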

To do it properly, use next() instead of getmembers() or getnames(), so that you don't have to read the entire tar file twice:

import gzip
import sys
import tarfile

with tarfile.open(sys.argv[1]) as tar:
    while ent := tar.next():  # the := operator requires Python 3.8+
        if ent.name.endswith(".gz"):
            print(gzip.GzipFile(fileobj=tar.extractfile(ent)).read())
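
If the .txt files themselves are large, a variation on the same loop can decode them as text and hand back one line at a time instead of calling read() on the whole member. This is only a sketch: the function name `iter_text_lines` is made up, and UTF-8 is an assumed encoding.

```python
import gzip
import tarfile


def iter_text_lines(tar, encoding="utf-8"):
    """Yield (member name, line) pairs from every .gz member of an open TarFile."""
    while (entry := tar.next()) is not None:
        if not (entry.isfile() and entry.name.endswith(".gz")):
            continue  # skip directories and non-gzip members
        with gzip.open(tar.extractfile(entry), mode="rt", encoding=encoding) as text:
            for line in text:  # decompressed lazily, one line at a time
                yield entry.name, line.rstrip("\n")
```

It would be used on a freshly opened archive, for example `with tarfile.open("file.tar") as tar: for name, line in iter_text_lines(tar): ...`.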

【Comments】: