Parsing newline delimited file: answers

【Question title】: Parsing newline delimited file
【Posted】: 2015-05-26 00:54:07
【Question】:

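I'm working on a project in which I want to parse a text file using Python. The file consists of data entries in different block formats. A new entry starts wherever there is a blank line. This is what I want to accomplish:

1. Skip the first few lines (the first 16 lines).
2. After line 16, there is a blank line that starts a new data entry.
3. Read the following lines until another blank line is reached. Each line is appended to a list called data.
4. The list is passed to a function that handles further processing.
5. Repeat steps 3 and 4 until there is no more data in the file.

Here is a sample of the file: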

Header Info
More Header Info

Line1
Line2
Line3
Line4
Line5
Line6
Line7
Line8
Line9
Line10
Line11
Line12
Line13

MoreInfo    MoreInfo    MoreInfo    MoreInfo    MoreInfo
MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2
MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3
MoreInfo4   MoreInfo4
FieldName1  0001    0001
FieldName1  0002    0002
FieldName1  0003    0003
FieldName1  0004    0004
FieldName1  0005    0005
FieldName2  0001    0001
FieldName3  0001    0001
FieldName4  0001    0001
FieldName5  0001    0001
FieldName6  0001    0001

MoreInfo    MoreInfo    MoreInfo    MoreInfo    MoreInfo
MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2
MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3
MoreInfo4   MoreInfo4
FieldName1  0001    0001
FieldName1  0002    0002
FieldName1  0003    0003
FieldName1  0004    0004
FieldName1  0005    0005
FieldName2  0001    0001
FieldName3  0001    0001
FieldName4  0001    0001
FieldName5  0001    0001
FieldName6  0001    0001

Here is some code I have been working on. It is able to read the first block and append it to a list:

with open(loc, 'r') as f:
    for i in range(16):
        f.readline()

    data = []
    line = f.readline()
    if line == "\n":
        dataLine = f.readline()
        while dataLine != "\n":
            data.append(dataLine)
            dataLine = f.readline()

    #pass data list to function
    function_call(data)
    # reset data list here?
    data = []

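How can I make this work for the complete file? My assumption was that using "with open" acts as a "while not end of file". I tried adding "while True" after skipping the first 16 lines. I know very little about Python's parsing capabilities.

Thanks for any help.

【Question comments】:

First: 'My assumption was that using "with open" acts as a "while not end of file"'. That is wrong. with open does no looping at all; it only makes sure the file you opened gets closed when you are done with it.
More importantly: 'I tried adding "while True" after skipping the first 16 lines' is a perfectly good approach. If it did not work for you, you evidently have a bug in it. If you show us the code you tried, we can show you how to fix it; if you only describe it, nobody can do much for you.
You should consider using itertools.groupby() with a key function whose value changes when it sees a \n on its own.
That is: you need to "repeat" the block of code you have already written for reading the first data block.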
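
The first comment's point about with open can be checked directly. The following is only a small sketch: the temporary file and its three lines are made up for the demonstration and have nothing to do with the asker's data. It shows that the body of the with block runs exactly once and that the file is closed afterwards.

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    path = tmp.name

with open(path, "r") as f:
    first = f.readline()        # the body runs once, so only one line is read
    open_inside = not f.closed  # the file is still open inside the block

closed_after = f.closed         # leaving the block closed the file
os.remove(path)

print(repr(first), open_inside, closed_after)
# prints: 'first\n' True True
```

Any repetition over the rest of the file therefore has to come from an explicit loop inside the with block.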

Tags: python file parsing

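【Solution 1】:

Adding while True after the initial skip definitely works. Of course, you have to get all the details right.

You could try to extend the approach you already have, with a nested while loop inside the outer loop. But it may be easier to think of it as a single loop. For each line, there are only three things you might need to do:

If there is no line, because you are at EOF, break out of the loop, first making sure to process the old data (the last block in the file) if there is one.
If it is a blank line, start a new data, first making sure to process the old data if there is one.
Otherwise, append to the existing data.

So: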

with open(loc, 'r') as f:
    for i in range(16):
        f.readline()

    data = []
    while True:
        line = f.readline()
        if not line:
            if data:
                function_call(data)
            break
        if line == "\n":
            if data:
                function_call(data)
                data = []
        else:
            data.append(line)

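There are a few ways to simplify this further:

Use for line in f: instead of a while loop that repeatedly calls f.readline() and checks the result.
Use groupby to turn the iterator of lines into an iterator of groups of lines separated by blank lines.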
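
As a rough sketch of the first suggestion, the loop could look like the following. The names process_blocks and handle_block, the skip parameter, and the in-memory sample are all made up for illustration; handle_block stands in for the asker's function_call.

```python
import io
from itertools import islice

def process_blocks(f, handle_block, skip=16):
    """Call handle_block(lines) for each blank-line-separated block in f."""
    # Consume the header lines; islice simply stops early on a short file.
    for _ in islice(f, skip):
        pass

    data = []
    for line in f:              # iteration ends by itself at EOF
        if line == "\n":
            if data:            # a blank line closes the current block
                handle_block(data)
                data = []
        else:
            data.append(line)
    if data:                    # last block when the file has no trailing blank line
        handle_block(data)

# Hypothetical miniature file: one header line, then two blocks.
sample = io.StringIO("header\n\nA1\nA2\n\nB1\n")
blocks = []
process_blocks(sample, blocks.append, skip=1)
print(blocks)
# prints: [['A1\n', 'A2\n'], ['B1\n']]
```

With for line in f: there is no explicit EOF check inside the loop, so the only thing left to remember is the final block after the loop ends.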

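【Comments】:

Thanks a lot @abarnert. This works and saved me from more headaches. I will look into refactoring the code with "for line in f:" or with groupby.

【Solution 2】:

If you are still struggling with this, here is an implementation that reads your sample data using itertools.groupby() and a key function search():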

from itertools import groupby, repeat

def search(d):
    """Key function used to group our dataset"""

    return d[0] == "\n"

def read_data(filename):
    """Read data from filename and return a nicer data structure"""

    data = []

    with open(filename, "r") as f:
        # Skip first 16 lines
        for _ in repeat(None, 16):
            f.readline()

        # groupby yields alternating runs of blank and non-blank lines
        for is_blank, records in groupby(f, search):
            if not is_blank:
                # a run of non-blank lines is one data block;
                # split each line into its whitespace-separated fields.
                # Blank runs are skipped, so a file that starts with data
                # or ends with a blank line is handled without errors.
                data.append([row.strip().split() for row in records])

    return data

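This produces a nested list of blocks. Each sublist holds the rows of one block, and the blocks are delimited by the \n-only lines in the data file.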
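
To see what groupby actually hands back for a key function of this kind, here is a small sketch on an in-memory sample. The sample text and the name is_blank are made up for illustration; is_blank plays the same role as search() above.

```python
import io
from itertools import groupby

# Hypothetical miniature input: a leading blank line, two blocks,
# and two consecutive blank lines between them.
sample = io.StringIO("\nA 1\nA 2\n\n\nB 1\n")

def is_blank(line):
    """Key function: True for a line that is only a newline."""
    return line == "\n"

# Each item is (key value, the consecutive lines sharing that key value).
runs = [(blank, list(lines)) for blank, lines in groupby(sample, is_blank)]
for run in runs:
    print(run)

# Keep only the non-blank runs and split each line into fields.
blocks = [[line.split() for line in lines] for blank, lines in runs if not blank]
print(blocks)
# prints: [[['A', '1'], ['A', '2']], [['B', '1']]]
```

Consecutive blank lines end up in a single run, so they do not produce extra blocks.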

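【Comments】:

【Solution 3】:

The pattern of the blocks in the file is that they consist of groups of lines terminated by either a blank line or the end of the file. This logic can be encapsulated in a generator function that iteratively yields blocks of lines from the file, which simplifies the rest of the script.

In the following, getlines() is the generator function. Also note that the first 17 lines of the file are skipped (the 16 header lines plus the blank line that follows them) to reach the beginning of the first block.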

from pprint import pformat

loc = 'parsing_test_file.txt'

def function(lines):
    print('function called with:\n{}'.format(pformat(lines)))

def getlines(f):
    lines = []
    while True:
        try:
            line = next(f)
            if line != '\n':  # not end of the block?
                lines.append(line)
            else:
                yield lines
                lines = []
        except StopIteration:  # end of file
            if lines:
                yield lines
            break

with open(loc, 'r') as f:
    for i in range(17):
        next(f)

    for lines in getlines(f):
        function(lines)

print('done')

Output using your test file:

function called with:
['MoreInfo    MoreInfo    MoreInfo    MoreInfo    MoreInfo\n',
 'MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2\n',
 'MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3\n',
 'MoreInfo4   MoreInfo4\n',
 'FieldName1  0001    0001\n',
 'FieldName1  0002    0002\n',
 'FieldName1  0003    0003\n',
 'FieldName1  0004    0004\n',
 'FieldName1  0005    0005\n',
 'FieldName2  0001    0001\n',
 'FieldName3  0001    0001\n',
 'FieldName4  0001    0001\n',
 'FieldName5  0001    0001\n',
 'FieldName6  0001    0001\n']
function called with:
['MoreInfo    MoreInfo    MoreInfo    MoreInfo    MoreInfo\n',
 'MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2   MoreInfo2\n',
 'MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3   MoreInfo3\n',
 'MoreInfo4   MoreInfo4\n',
 'FieldName1  0001    0001\n',
 'FieldName1  0002    0002\n',
 'FieldName1  0003    0003\n',
 'FieldName1  0004    0004\n',
 'FieldName1  0005    0005\n',
 'FieldName2  0001    0001\n',
 'FieldName3  0001    0001\n',
 'FieldName4  0001    0001\n',
 'FieldName5  0001    0001\n',
 'FieldName6  0001    0001\n']
done
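
One thing to be aware of with a generator like getlines() is that two blank lines in a row make it yield an empty list. If that matters for your data, a variant along these lines might help. This is only a sketch under a few assumptions that differ from the code above: the name iter_blocks is made up, whitespace-only lines are treated as blank, and the trailing newline is stripped from each line.

```python
import io

def iter_blocks(lines):
    """Yield lists of consecutive non-blank lines, never an empty list."""
    block = []
    for line in lines:
        if line.strip():                  # line has visible content
            block.append(line.rstrip("\n"))
        elif block:                       # blank line closes a non-empty block
            yield block
            block = []
    if block:                             # final block at end of input
        yield block

# Hypothetical input: doubled blank line, whitespace-only line,
# and no newline at the very end.
sample = io.StringIO("A1\nA2\n\n\nB1\n   \nC1")
result = list(iter_blocks(sample))
print(result)
# prints: [['A1', 'A2'], ['B1'], ['C1']]
```

Because it accepts any iterable of lines, the same function can be used on an open file after the header lines have been skipped with next(f).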

【Comments】: