How to read a large file block by block and judge by block header? Answers

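【Question title】: How to read a large file block by block and judge by block header?
【Posted】: 2019-04-29 19:37:42
【Question description】:

I have a large file that I want to read block by block by matching the headers. For example, the file looks like this: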

@header1
a b c 1 2 3
c d e 2 3 4
q w e 3 4 5


@header2
e 89 78 56
s 68 77 26
...

I wrote a script like this:

with open("filename") as f:
  line=f.readline()
  if line.split()[0]=="@header1":
     list1.append(f.readline().split()[0])
     list2.append(f.readline().split()[1])
     ...
  elif line.split()[0]=="@header2":
     list6.append(f.readline().split()[0])
     list7.append(f.readline().split()[1])
     ...

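But it seems to read only the first header and never reaches the second block. There are also some blank lines between the blocks. How can I read a block when a line matches a certain string, and skip those blank lines?

I know that in C this would be a switch. How do I do something similar in Python?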

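【Comments】:

You need to add more detail. Are these multiple space-separated file sections all in one file? Are the @header... lines guaranteed to be numbered consecutively and in order? If @header1 appears on a line by itself, why test line.split()[0]=="@header2" rather than simply line == "@header2"? Or just line.startswith('@header'), which should catch them all without even needing a regex?
Ultimately I expect you want to read the space-separated row contents (in each section, according to its header), so you need to wrap a reader object. Or write a separate generator that yields each block of lines, so you can pass it to a reader object.
"There are also some blank lines between the blocks." So, can you guarantee that blank lines only occur outside sections, never inside them?

Tags: python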

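【Solution 1】:

IMO, your misunderstanding is about how to read a csv file. At least I doubt that a C-style "switch" would help here any more than what you can already do with if clauses.

However, please understand that you have to go through the file line by line. That is, nothing can process a whole block at once if you don't know its length beforehand.

So your algorithm looks like this:

for each line in the file:
. . is it a header?
. . . then prepare for this specific header
. . is it a blank line?
. . . then skip it
. . is it data?
. . . then append it according to the preparation above

In code it could look like this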

block_ctr = -1                           # index of the block currently being filled
block_data = []                          # one list of split rows per header
with open(filename) as f:
    for line in f:
        if line.strip():                 # skip blank lines (a raw line still holds '\n')
            if line.startswith('@header'):
                block_ctr += 1
                block_data.append([])
            else:
                block_data[block_ctr].append(line.split())
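
As a variation on the loop above, the rows can be stored under the header text itself instead of a running counter, which makes it easy to look a block up by name afterwards. This is only a minimal sketch: the function name read_blocks and the in-memory sample are made up for illustration, and it assumes every header starts with '@' and that any rows before the first header can be ignored.

```python
import io
from collections import defaultdict

def read_blocks(stream):
    """Group whitespace-split data rows under the header line that precedes them."""
    blocks = defaultdict(list)
    current = None                      # no header seen yet
    for line in stream:
        line = line.strip()
        if not line:                    # blank line between blocks: skip it
            continue
        if line.startswith('@'):        # a header line selects the current block
            current = line
            blocks[current]             # register the header even if its block is empty
        elif current is not None:       # rows before the first header are ignored
            blocks[current].append(line.split())
    return dict(blocks)

# In-memory stand-in for the file from the question.
sample = io.StringIO(
    "@header1\n"
    "a b c 1 2 3\n"
    "c d e 2 3 4\n"
    "\n"
    "\n"
    "@header2\n"
    "e 89 78 56\n"
)
blocks = read_blocks(sample)
print(blocks["@header2"])
```

Passing an open file object instead of the StringIO works the same way, since both are iterated line by line.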

【Comments】:

It works with the generator approach, see my answer

【Solution 2】:

I don't know exactly what you want to achieve, but maybe something like this:

with open(filename) as f:
    for line in f:
        if line.startswith('@'):
            print('header')
            # do something with header here
        else:
            print('regular line')
            # do something with the line here
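
For the "do something" placeholders above, one common stand-in for a C switch is a dict that maps each header to a handler function. The sketch below assumes the list names from the question (list1, list2, list6, list7); the handler names and the process function are hypothetical, and unknown headers are simply skipped.

```python
import io

list1, list2 = [], []      # columns collected from @header1 rows
list6, list7 = [], []      # columns collected from @header2 rows

def handle_header1(fields):
    list1.append(fields[0])
    list2.append(fields[1])

def handle_header2(fields):
    list6.append(fields[0])
    list7.append(fields[1])

# The dict plays the role of the switch statement: header -> row handler.
handlers = {"@header1": handle_header1, "@header2": handle_header2}

def process(stream):
    handler = None
    for line in stream:
        fields = line.split()
        if not fields:                          # blank or whitespace-only line
            continue
        if fields[0].startswith("@"):
            handler = handlers.get(fields[0])   # None for unknown headers
        elif handler is not None:
            handler(fields)                     # data row for the current header

process(io.StringIO(
    "@header1\na b c 1 2 3\nc d e 2 3 4\n\n@header2\ne 89 78 56\ns 68 77 26\n"
))
print(list1, list2, list6, list7)
```

The handler stays in effect until the next header line, which is what the script in the question was missing: it read one line and tested it once instead of looping.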

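【Comments】:

【Solution 3】:

Attached at the bottom is a solution using a Python generator, split_into_chunks(f), which extracts each section (as a list of strings), suppresses blank lines, and detects missing @headers and EOF. The generator approach is neat because it lets you wrap it further, e.g. with a CSV reader object that handles space-separated values (such as pandas read_csv):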

with open('your.ssv') as f:
    for chunk in split_into_chunks(f):
        # Do stuff on chunk. Presumably, wrap a reader e.g. pandas read_csv
        print(chunk)

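The code follows. I also parameterized the value demarcator='@header' for you. Note that we have to iterate with line = inputstream.readline() and while line, instead of the usual for line in f, because if we see the next section's @header we need to push it back using seek()/tell(); see this and this for the reason. If you want to modify the generator to yield the chunk header and body separately (e.g. as a two-item list), that's trivial.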

def split_into_chunks(inputstream, demarcator='@header'):
    """Utility generator to get sections from file, demarcated by '@header'"""

    while True:
        chunk = []

        line = inputstream.readline()
        # At EOF?
        if not line: break
        # Expect that each chunk starts with one header line
        if not line.startswith(demarcator):
            raise RuntimeError(f"Bad chunk, missing {demarcator}")

        chunk.append(line.rstrip('\n'))

        # Can't use `for line in inputstream:` since we may need to pushback
        while line:
            # Remember our file-pointer position in case we need to pushback a header row
            last_pos = inputstream.tell()
            line = inputstream.readline()

            # Saw next chunk's header line? Pushback the header line, then yield the current chunk
            if line.startswith(demarcator):
                inputstream.seek(last_pos)
                break

            # Ignore blank or whitespace-only lines (a blank line still holds '\n',
            # so test the stripped text rather than the raw line)
            if line.strip():
                chunk.append(line.rstrip('\n'))

        yield chunk
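
The reason the function above uses readline() rather than a for loop can be checked directly. The sketch below relies on CPython 3 text-mode behaviour, where tell() raises OSError while the file's line iterator is active; the temporary file and variable names are made up for the demonstration.

```python
import os
import tempfile

# Write a small throwaway file so the demo is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("@header1\na b c\n@header2\nd e f\n")
    path = tmp.name

tell_failed = False
with open(path) as f:
    for line in f:
        try:
            f.tell()          # not allowed while the for-loop iterator is active
        except OSError:
            tell_failed = True
        break

positions = []
with open(path) as f:
    while True:
        pos = f.tell()        # fine: readline() does not disable tell()
        line = f.readline()
        if not line:
            break
        positions.append(pos)

os.remove(path)
print(tell_failed, positions)
```

In text mode the values returned by tell() are best treated as opaque tokens to hand back to seek(), not as character counts.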


with open('your.ssv') as f:
    for chunk in split_into_chunks(f):
        # Do stuff on chunk. Presumably, wrap it with a reader which handles space-separated values, e.g. pandas read_csv
        print(chunk)
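
If the input cannot seek (a pipe, for instance), the pushback can be avoided by holding the chunk in progress and yielding it when the next header shows up. This is a different technique from the seek()/tell() generator above, sketched under the same assumptions about the file layout; iter_chunks is a hypothetical name.

```python
import io

def iter_chunks(lines, demarcator="@header"):
    """Yield each section as a list of lines, header first, blank lines dropped."""
    chunk = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(demarcator):
            if chunk is not None:
                yield chunk             # the previous section is complete
            chunk = [line]
        elif line.strip():              # ignore blank or whitespace-only lines
            if chunk is None:
                raise RuntimeError(f"Bad chunk, missing {demarcator}")
            chunk.append(line)
    if chunk is not None:
        yield chunk                     # the last section ends at EOF

sample = io.StringIO("@header1\na b c 1 2 3\n\n\n@header2\ne 89 78 56\ns 68 77 26\n")
for header, *rows in iter_chunks(sample):
    # Split each row on whitespace; a real reader object could be used here instead.
    print(header, [row.split() for row in rows])
```

Because it only needs an iterable of lines, this version works with a plain for loop over the file and never calls tell().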

【Comments】:

【Solution 4】:

I saw another post similar to this question and copied the idea here. I agree that SpghttCd is right, although I haven't tried it.

    # First pass: find the line number of each header.
    with open(filename) as f:
        for i, line in enumerate(f, 1):
            if 'some_header' in line:
                index1 = i
            elif 'another_header' in line:
                index2 = i
            ...
    # Second pass: read the blocks.
    with open(filename) as f:
        # read the first block: skip up to and including its header
        for _ in range(index1):
            f.readline()
        for _ in range(block_size):        # block_size: number of lines in the block
            fields = f.readline().split()  # read, split and store
        f.seek(0)
        # read the second block, third and ...
        ...
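
The two-pass idea can also be sketched with byte offsets instead of line numbers, so the second pass jumps straight to a block rather than re-reading from the top. Everything named here (index_headers, read_block, the '@' marker, the temporary sample file) is an assumption for illustration, not something taken from the post this answer refers to.

```python
import os
import tempfile

def index_headers(path, marker="@"):
    """First pass: map each header line to the byte offset just after it."""
    offsets = {}
    with open(path, "rb") as f:          # binary mode: tell() is a plain byte offset
        while True:
            line = f.readline()
            if not line:
                break
            if line.startswith(marker.encode()):
                offsets[line.decode().strip()] = f.tell()
    return offsets

def read_block(path, offset, marker="@"):
    """Second pass: seek to a block and read rows until the next header or EOF."""
    rows = []
    with open(path, "rb") as f:
        f.seek(offset)
        for raw in f:
            if raw.startswith(marker.encode()):
                break
            fields = raw.decode().split()
            if fields:                   # skip blank lines
                rows.append(fields)
    return rows

# Throwaway sample file shaped like the one in the question.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("@header1\na b c 1 2 3\n\n\n@header2\ne 89 78 56\ns 68 77 26\n")
    path = tmp.name

offsets = index_headers(path)
second_block = read_block(path, offsets["@header2"])
os.remove(path)
print(second_block)
```

This only pays off when blocks are read repeatedly or out of order; for a single sequential read, the one-pass loops in the other answers are simpler.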

【Comments】: