How to test a directory of files for gzip and uncompress gzipped files in Python using zcat?

【Question title】: How to test a directory of files for gzip and uncompress gzipped files in Python using zcat?
【Posted】: 2013-02-26 18:30:16
【Question】:

I'm in my second week of Python and I'm stuck on a directory of zipped/unzipped log files, which I need to parse and process.

Currently I'm doing this:

import glob
import os
import sys
import operator
import zipfile
import zlib
import gzip
import subprocess

if sys.version.startswith("3."):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

for f in glob.glob('logs/*'):
    file = open(f,'rb')        
    new_file_name = f + "_unzipped"
    last_pos = file.tell()

    # test for gzip
    if (file.read(2) == b'\x1f\x8b'):
        file.seek(last_pos)

    #unzip to new file
    out = open( new_file_name, "wb" )
    process = subprocess.Popen(["zcat", f], stdout = subprocess.PIPE, stderr=subprocess.STDOUT)

    while True:
      if process.poll() != None:
        break;

    output = io_method(process.communicate()[0])
    exitCode = process.returncode


    if (exitCode == 0):
      print "done"
      out.write( output )
      out.close()
    else:
      raise ProcessException(command, exitCode, output)

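I stitched this together from these SO answers (here) and blog posts (here).

However, it doesn't seem to work: my test file is 2.5GB, the script has been chewing on it for more than 10 minutes, and I'm not really sure whether what I'm doing is correct.

Question:
If I don't want to use the gzip module and need to decompress chunk by chunk (the actual files are larger than 10GB), how do I uncompress and save to a file using zcat and subprocess in Python?

Thanks!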

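【Comments】:

I'm not clear on what your goal is. Are you trying to decompress all of the files in a directory? The equivalent of: gunzip *.gz ? Do you have any particular objection to using the gzip module?
The directory contains both zipped and unzipped files. I need to handle both in one process, so my idea was to (1) do a first run over the directory, (2) pick out the zipped files and unzip them to new files, and (3) then do a second run for the processing. Not sure whether that's the best way.
re: objection to gzip, isn't gzip slow, as mentioned here?
Do you need to seek within the log files, or is it enough to read them once through?
I need to retrieve the first log entry (line) of every file (zipped/unzipped), extract the date, and store it together with the file path. The next pass will process the log files line by line in sorted order.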
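
The first pass described in the last comment could be sketched roughly as follows. This is only an illustration: it reuses the two-byte magic-number test from the question, it assumes that reading a single line through the gzip module is cheap enough even if full decompression is not, and the helper names is_gzipped and first_line are made up for this example. Extracting the date is left out because the log format is not given.

```python
import glob
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of a gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def first_line(path):
    """Return the first line (as bytes) of a plain or gzipped file."""
    opener = gzip.open if is_gzipped(path) else open
    with opener(path, "rb") as fh:
        return fh.readline()


if __name__ == "__main__":
    # Collect (first line, path) pairs; 'logs/*' is the pattern from the question.
    entries = [(first_line(path), path) for path in glob.glob("logs/*")]
    for line, path in entries:
        print(path, line)
```

Only the first line of each file is read, so no file is decompressed in full during this pass.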

Tags: python logging gzip compression zcat

【Solution 1】:

This should read the first line of every file in the logs subdirectory, decompressing as required:

#!/usr/bin/env python

import glob
import gzip
import subprocess

for f in glob.glob('logs/*'):
  if f.endswith('.gz'):
    # Open a compressed file. Here is the easy way:
    #   file = gzip.open(f, 'rb')
    # Or, here is the hard way:
    proc = subprocess.Popen(['zcat', f], stdout=subprocess.PIPE)
    file = proc.stdout
  else:
    # Otherwise, it must be a regular file
    file = open(f, 'rb')

  # Process file, for example:
  print(f, file.readline())
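
To save the decompressed data to a file instead, as the question asks, one option is to copy zcat's output in fixed-size chunks. The following is a minimal sketch under stated assumptions, not something tested on 10GB files: the function name zcat_to_file, the command parameter, and the 1 MiB default chunk size are all chosen for illustration.

```python
import shutil
import subprocess


def zcat_to_file(src_path, dst_path, command=("zcat",), chunk_size=1024 * 1024):
    """Run `zcat src_path` and stream its stdout into dst_path chunk by chunk."""
    argv = list(command) + [src_path]
    with open(dst_path, "wb") as out:
        proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
        try:
            # copyfileobj reads at most chunk_size bytes at a time, so memory
            # use stays bounded no matter how large the decompressed data is.
            shutil.copyfileobj(proc.stdout, out, chunk_size)
        finally:
            proc.stdout.close()
            returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, argv)
```

Inside the loop above it could be called as zcat_to_file(f, f + "_unzipped"). The script keeps draining the pipe while zcat is running, so zcat is never left blocked on a full pipe. A script that only polls the process without reading its stdout can hang for exactly that reason. Passing the output file directly as stdout to subprocess.check_call would avoid the copy loop altogether.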

【Comments】:

Ah. Thanks for the clarification :-)