Python: How to loop through blocks of lines (answers)

【Question title】: Python: How to loop through blocks of lines
【Posted】: 2011-04-24 06:55:23
【Question】:

How do I loop through blocks of lines separated by a blank line? The file looks like this:

ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13

ID: 4
Name: M
FamilyN: Z
Age: 25

I want to loop through the blocks and collect the FamilyN, Name and Age fields into a 3-column list:

Y X 20
F H 23
Y S 13
Z M 25

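【Comments on the question】:

Tags: python text-processing

【Solution 1】:

Use a dict, namedtuple, or custom class to store each attribute as you come across it, then append the object to a list when you reach a blank line or EOF.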
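
A minimal sketch of that idea, assuming every data line has the form `Key: value` and that blocks are separated by blank lines. The names `Person` and `read_people` are made up for illustration and are not part of the original answer.

```python
from collections import namedtuple

# Hypothetical record type; the field names mirror the keys in the sample file.
Person = namedtuple("Person", ["ID", "Name", "FamilyN", "Age"])

def read_people(lines):
    """Collect one Person per blank-line-separated block of 'Key: value' lines."""
    people = []
    fields = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current block, if there is one.
            if fields:
                people.append(Person(**fields))
                fields = {}
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    # End of input also closes the last block.
    if fields:
        people.append(Person(**fields))
    return people

sample = "ID: 1\nName: X\nFamilyN: Y\nAge: 20\n\nID: 2\nName: H\nFamilyN: F\nAge: 23\n"
for p in read_people(sample.splitlines()):
    print(p.FamilyN, p.Name, p.Age)
```

The same function accepts an open file object, since it only iterates over lines.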

【Comments】:

【Solution 2】:

Use a generator.

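# start_pattern() is a placeholder for whatever test marks a block boundary,
# for example a check for a blank line.
def blocks( iterable ):
    accumulator= []
    for line in iterable:
        if start_pattern( line ):
            if accumulator:
                yield accumulator
                accumulator= []
        # elif other significant patterns
        else:
            accumulator.append( line )
    # Emit the last block if the input did not end with a boundary line.
    if accumulator:
        yield accumulator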

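【Comments】:

Just to add a bit of spice: put a continue after re-initialising the accumulator and drop the else: the control flow is the same, with less indentation. It's a matter of taste. Also, the "dangling yield" should be conditional: if accumulator: yield accumulator; that avoids yielding a spurious empty list.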
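
The variant described in this comment might look like the sketch below. The answer leaves `start_pattern` undefined, so the blank-line test here is an assumption chosen to match the question's file format.

```python
def start_pattern(line):
    # Assumption: a block boundary is a blank (whitespace-only) line.
    return not line.strip()

def blocks(iterable):
    accumulator = []
    for line in iterable:
        if start_pattern(line):
            if accumulator:
                yield accumulator
                accumulator = []
            continue  # replaces the else: branch
        accumulator.append(line)
    # Conditional "dangling yield": skip it when nothing is pending.
    if accumulator:
        yield accumulator

lines = ["ID: 1\n", "Name: X\n", "\n", "\n", "ID: 2\n", "Name: H\n"]
result = list(blocks(lines))
```

Consecutive blank lines do not produce empty blocks, because a block is only yielded when the accumulator holds something.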

【Solution 3】:

import re
result = re.findall(
    r"""(?mx)           # multiline, verbose regex
    ^ID:.*\s*           # Match ID: and anything else on that line 
    Name:\s*(.*)\s*     # Match name, capture all characters on this line
    FamilyN:\s*(.*)\s*  # etc. for family name
    Age:\s*(.*)$        # and age""", 
    subject)

The result will be

[('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]

which can easily be changed into whatever string representation you want.
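
As one possible illustration of that last step, the sketch below turns the tuples into the "FamilyN Name Age" lines the question asks for. The `subject` string is a shortened sample made up for the example; the pattern is the one from the answer with its comments removed.

```python
import re

# Shortened sample input, standing in for the contents of the real file.
subject = """ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23
"""

result = re.findall(
    r"""(?mx)
    ^ID:.*\s*
    Name:\s*(.*)\s*
    FamilyN:\s*(.*)\s*
    Age:\s*(.*)$""",
    subject)

# Each tuple is (name, family name, age); reorder while formatting.
lines = ["%s %s %s" % (familyn, name, age) for name, familyn, age in result]
print("\n".join(lines))
```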

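【Comments】:

Every time I try re.findall() in my code, it gives me this error message: File "/usr/lib/python2.6/re.py", line 177, in findall return _compile(pattern, flags).findall(string) TypeError: expected string or buffer. What is the reason?
Well, the error message says that you are not passing it a string. So what are you passing to it?

【Solution 4】:

If the file is not very large, you can read the whole file: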

content = open(filename).read()

Then you can split content into blocks using:

blocks = content.split('\n\n')

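Now you can write a function to parse a block of text. I would use split('\n') to get the lines from a block and split(':') to get the key and value, possibly with the help of str.strip() or a regular expression.

Without checking whether a block has the required data, the code might look like this: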

with open('data.txt', 'r') as f:
    content = f.read()
# strip() drops a trailing newline that would otherwise leave an empty last line
for block in content.strip().split('\n\n'):
    person = {}
    for l in block.split('\n'):
        k, v = l.split(': ')
        person[k] = v
    print('%s %s %s' % (person['FamilyN'], person['Name'], person['Age']))

【Comments】:

【Solution 5】:

A simple solution:

result = []
for record in content.split('\n\n'):
    try:
        id, name, familyn, age = map(lambda rec: rec.split(' ', 1)[1], record.split('\n'))
    except ValueError:
        pass
    except IndexError:
        pass
    else:
        result.append((familyn, name, age))

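【Comments】:

【Solution 6】:

If your file is too large to read into memory all at once, you can still use a regular-expression-based solution by using a memory-mapped file, with the mmap module: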

import sys
import re
import os
import mmap

# mmap exposes the file as bytes, so the pattern must be a bytes pattern.
block_expr = re.compile(rb'ID:.*?\nAge: \d+', re.DOTALL)

filepath = sys.argv[1]
fp = open(filepath, 'rb')
contents = mmap.mmap(fp.fileno(), os.stat(filepath).st_size, access=mmap.ACCESS_READ)

for block_match in block_expr.finditer(contents):
    print(block_match.group().decode())

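The mmap trick provides a "pretend string" so that the regex can work on the file without reading it all into one large string. And the finditer() method of the regex object yields matches one at a time, without building a complete list of all matches at once (which findall() does).

I do think this solution is overkill for this use case (still: it's a nice trick to know...)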
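
To make the findall()/finditer() difference concrete, here is a small sketch that runs the same pattern on an ordinary string instead of a memory-mapped file. The `text` value is a shortened sample invented for the example.

```python
import re

block_expr = re.compile(r"ID:.*?\nAge: \d+", re.DOTALL)
text = "ID: 1\nName: X\nFamilyN: Y\nAge: 20\n\nID: 2\nName: H\nFamilyN: F\nAge: 23\n"

# finditer() returns an iterator of match objects; matches are produced on demand.
matches = block_expr.finditer(text)
first = next(matches).group()
remaining = [m.group() for m in matches]

# findall() builds the complete list in one go.
everything = block_expr.findall(text)
```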

【Comments】:

【Solution 7】:

import itertools

# Assuming input in file input.txt
data = open('input.txt').readlines()

records = (lines for valid, lines in itertools.groupby(data, lambda l : l != '\n') if valid)
output = [tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records]

# You can make output a generator instead. Use this in place of the list above,
# because records can only be consumed once:
# output = (tuple(field.split(':')[1].strip() for field in itertools.islice(record, 1, None)) for record in records)

# output = [('X', 'Y', '20'), ('H', 'F', '23'), ('S', 'Y', '13'), ('M', 'Z', '25')]
# You can iterate and change the order of elements in the way you want
# [(elem[1], elem[0], elem[2]) for elem in output] as required in your output

【Comments】:

The comprehensions could be converted to "for" loops to make this more readable.
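
One way that suggestion could look, as a sketch rather than the answerer's own code. The function name `parse_records` and the in-memory `data` list are assumptions made so the example runs without a file.

```python
import itertools

def parse_records(lines):
    """Return one (Name, FamilyN, Age) tuple per blank-line-separated block."""
    output = []
    for valid, group in itertools.groupby(lines, lambda l: l != '\n'):
        if not valid:
            continue  # this group is a run of separator lines
        values = []
        # islice(group, 1, None) skips the first line of the block (the ID).
        for field in itertools.islice(group, 1, None):
            values.append(field.split(':')[1].strip())
        output.append(tuple(values))
    return output

data = ["ID: 1\n", "Name: X\n", "FamilyN: Y\n", "Age: 20\n", "\n",
        "ID: 2\n", "Name: H\n", "FamilyN: F\n", "Age: 23\n"]
print(parse_records(data))
```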

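【Solution 8】:

Here's another way, using itertools.groupby. The function groupby iterates over the lines of the file and calls isa_group_separator(line) for each line. isa_group_separator returns True or False (called the key), and itertools.groupby then groups all consecutive lines that yielded the same True or False result.

This is a very convenient way to collect lines into groups.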

import itertools

def isa_group_separator(line):
    return line=='\n'

with open('data_file') as f:
    for key,group in itertools.groupby(f,isa_group_separator):
        # print(key,list(group))  # uncomment to see what itertools.groupby does.
        if not key:
            data={}
            for item in group:
                field,value=item.split(':')
                value=value.strip()
                data[field]=value
            print('{FamilyN} {Name} {Age}'.format(**data))

# Y X 20
# F H 23
# Y S 13
# Z M 25

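【Comments】:

When I uncomment the (commented-out) print statement, the program behaves very strangely and raises a KeyError for FamilyN. In fact, the group after if not key is empty. This is strange. Can you explain what is happening? Thanks.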
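
A likely explanation: each group returned by itertools.groupby is an iterator, so calling list(group) inside the print consumes it. The inner for loop then finds nothing, data stays empty, and format(**data) raises the KeyError. The sketch below shows the effect on a small made-up list of lines; the names `is_separator`, `leftovers` and `blocks` are illustrative only.

```python
import itertools

lines = ["ID: 1\n", "Name: X\n", "\n", "ID: 2\n", "Name: H\n"]

def is_separator(line):
    return line == "\n"

# What the uncommented print does: list(group) drains the group iterator.
leftovers = []
for key, group in itertools.groupby(lines, is_separator):
    printed = list(group)            # first pass consumes every line
    leftovers.append(list(group))    # second pass finds nothing

# One way around it: materialise the group once and reuse the list.
blocks = []
for key, group in itertools.groupby(lines, is_separator):
    items = list(group)
    if not key:
        blocks.append(items)
```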

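【Solution 9】:

Alongside the half-dozen other solutions I already see here, I'm a bit surprised that nobody has suggested something as simple as, for example,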

fp = open(fn)  # fn is the path to the data file

def get_one_value():
    # Return None at end of file, '' for a line without exactly one colon
    # (such as the blank separator), otherwise the stripped value.
    line = fp.readline()
    if not line:
        return None
    parts = line.split(':')
    if 2 != len(parts):
        return ''
    return parts[1].strip()

# The result is supposed to be a list.
result = []
while 1:
    # We don't care about the ID.
    if get_one_value() is None:
        break
    name = get_one_value()
    familyn = get_one_value()
    age = get_one_value()
    result.append((name, familyn, age))
    # We don't care about the block separator.
    if get_one_value() is None:
        break

for item in result:
    print(item)

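Reformat to taste.

【Comments】:

Hi, Cameron. This is the Oneliner Saloon; check your surprise with the barkeep on entry. You may also notice that few, if any, of the answers check that the file being read actually looks exactly like the asker's example.
You're not the John Machin who calculated pi to 100 digits in the early 18th century, are you? Thanks for the welcome. I see what you mean; at least, I think I do ... Within the constraints of comments without paragraph breaks, I'd summarise it this way: "simple" depends on where one stands, and which way one is facing.

【Solution 10】:

This answer isn't necessarily better than what has already been posted, but it may be useful as an illustration of how I approach this kind of problem, especially if you're not used to working with Python's interactive interpreter.

I started out knowing two things about this problem. First, I would use itertools.groupby to group the input into lists of data lines, one list for each individual data record. Second, I wanted to represent those records as dictionaries so that I could easily format the output.

One other thing this shows is how using generators makes it easy to break a problem like this down into small parts.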

>>> # first let's create some useful test data and put it into something 
>>> # we can easily iterate over:
>>> data = """ID: 1
Name: X
FamilyN: Y
Age: 20

ID: 2
Name: H
FamilyN: F
Age: 23

ID: 3
Name: S
FamilyN: Y
Age: 13

ID: 4
Name: M
FamilyN: Z
Age: 25"""
>>> data = data.split("\n")
>>> # now we need a key function for itertools.groupby.
>>> # the key we'll be grouping by is, essentially, whether or not
>>> # the line is empty.
>>> # this will make groupby return groups whose key is True if we
>>> # care about them.
>>> def is_data(line):
        return True if line.strip() else False

>>> # make sure this really works
>>> "\n".join([line for line in data if is_data(line)])
'ID: 1\nName: X\nFamilyN: Y\nAge: 20\nID: 2\nName: H\nFamilyN: F\nAge: 23\nID: 3\nName: S\nFamilyN: Y\nAge: 13\nID: 4\nName: M\nFamilyN: Z\nAge: 25'

>>> # does groupby return what we expect?
>>> import itertools
>>> [list(value) for (key, value) in itertools.groupby(data, is_data) if key]
[['ID: 1', 'Name: X', 'FamilyN: Y', 'Age: 20'], ['ID: 2', 'Name: H', 'FamilyN: F', 'Age: 23'], ['ID: 3', 'Name: S', 'FamilyN: Y', 'Age: 13'], ['ID: 4', 'Name: M', 'FamilyN: Z', 'Age: 25']]
>>> # what we really want is for each item in the group to be a tuple
>>> # that's a key/value pair, so that we can easily create a dictionary
>>> # from each item.
>>> def make_key_value_pair(item):
        items = item.split(":")
        return (items[0].strip(), items[1].strip())

>>> make_key_value_pair("a: b")
('a', 'b')
>>> # let's test this:
>>> dict(make_key_value_pair(item) for item in ["a:1", "b:2", "c:3"])
{'a': '1', 'c': '3', 'b': '2'}
>>> # we could conceivably do all this in one line of code, but this 
>>> # will be much more readable as a function:
>>> def get_data_as_dicts(data):
        for (key, value) in itertools.groupby(data, is_data):
            if key:
                yield dict(make_key_value_pair(item) for item in value)

>>> list(get_data_as_dicts(data))
[{'FamilyN': 'Y', 'Age': '20', 'ID': '1', 'Name': 'X'}, {'FamilyN': 'F', 'Age': '23', 'ID': '2', 'Name': 'H'}, {'FamilyN': 'Y', 'Age': '13', 'ID': '3', 'Name': 'S'}, {'FamilyN': 'Z', 'Age': '25', 'ID': '4', 'Name': 'M'}]
>>> # now for an old trick:  using a list of column names to drive the output.
>>> columns = ["Name", "FamilyN", "Age"]
>>> print("\n".join(" ".join(d[c] for c in columns) for d in get_data_as_dicts(data)))
X Y 20
H F 23
S Y 13
M Z 25
>>> # okay, let's package this all into one function that takes a filename
>>> def get_formatted_data(filename):
        with open(filename, "r") as f:
            columns = ["Name", "FamilyN", "Age"]
            for d in get_data_as_dicts(f):
                yield " ".join(d[c] for c in columns)

>>> print("\n".join(get_formatted_data("c:\\temp\\test_data.txt")))
X Y 20
H F 23
S Y 13
M Z 25

【Comments】:

Thanks for the great answer :)