Repeatedly extract a line between two delimiters in a text file, Python: answers

【Question title】: Repeatedly extract a line between two delimiters in a text file, Python
【Posted】: 2011-08-17 19:47:03
【Question description】:

I have a text file in the following format:

DELIMITER1
extract me
extract me
extract me
DELIMITER2

I want to extract every block of "extract me" lines that sits between DELIMITER1 and DELIMITER2 in the .txt file.

Here is my current, non-working code:

import re
def GetTheSentences(file):
     fileContents =  open(file)
     start_rx = re.compile('DELIMITER')
     end_rx = re.compile('DELIMITER2')

     line_iterator = iter(fileContents)
     start = False
     for line in line_iterator:
           if re.findall(start_rx, line):

                start = True
                break
      while start:
           next_line = next(line_iterator)
           if re.findall(end_rx, next_line):
                break

           print next_line

           continue
      line_iterator.next()

Any ideas?

【Question discussion】:

Tags: python regex

【Solution 1】:

You can reduce this to a single regular expression by using re.S, the DOTALL flag.

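import re
def GetTheSentences(infile):
    with open(infile) as fp:
        # re.S lets '.' match newlines, so one match can span several lines
        for result in re.findall(r'DELIMITER1(.*?)DELIMITER2', fp.read(), re.S):
            # strip() drops the newlines next to the delimiters
            print(result.strip())
# extract me
# extract me
# extract me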

This also makes use of the non-greedy operator .*?, so multiple non-overlapping DELIMITER1-DELIMITER2 blocks will each be found separately.
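To make the non-greedy point concrete, here is a minimal sketch that runs the same pattern against an in-memory string holding two blocks. The sample text and the helper name find_blocks are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical sample text containing two delimited blocks.
sample = (
    "DELIMITER1\nfirst a\nfirst b\nDELIMITER2\n"
    "ignored line\n"
    "DELIMITER1\nsecond a\nDELIMITER2\n"
)

def find_blocks(text):
    """Return the lines of every DELIMITER1...DELIMITER2 block in text."""
    pattern = re.compile(r"DELIMITER1(.*?)DELIMITER2", re.S)
    # strip() drops the newlines adjacent to the delimiters before splitting.
    return [match.strip().splitlines() for match in pattern.findall(text)]

blocks = find_blocks(sample)
print(blocks)

# For comparison, the greedy form .* runs from the first DELIMITER1
# to the last DELIMITER2 and therefore yields a single match.
greedy = re.findall(r"DELIMITER1(.*)DELIMITER2", sample, re.S)
print(len(greedy))
```

With the greedy form, the "ignored line" and the inner delimiters end up inside the one big match, which is why the answer uses .*? instead.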

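【Discussion】:

Tip: if your file is too large to read all at once, use this together with a memory-mapped file object (via the mmap module).
@Brent Tried it and it works great... thanks!
Glad I could help. If an answer turns out to be the best one for your question, don't forget to mark it as accepted.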
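The mmap tip in the comments above could look roughly like the sketch below. It is only an illustration: the function name extract_blocks_mmap, the UTF-8 decoding, and the temporary demo file are assumptions, not something the commenter posted. The regular expression has to be a bytes pattern because a memory-mapped file exposes bytes.

```python
import mmap
import os
import re
import tempfile

def extract_blocks_mmap(path):
    """Return the text between DELIMITER1 and DELIMITER2 using a memory-mapped file."""
    # A pattern applied to an mmap object must be a bytes pattern.
    pattern = re.compile(rb"DELIMITER1(.*?)DELIMITER2", re.S)
    with open(path, "rb") as fp:
        # Length 0 maps the whole file; an empty file raises ValueError here.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Decoding as UTF-8 is an assumption about the file's encoding.
            return [m.group(1).decode("utf-8").strip()
                    for m in pattern.finditer(mapped)]

# Demo on a throwaway file (hypothetical content).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"DELIMITER1\nfirst a\nfirst b\nDELIMITER2\n"
              b"ignored\nDELIMITER1\nsecond a\nDELIMITER2\n")
try:
    blocks_mmap = extract_blocks_mmap(tmp.name)
    print(blocks_mmap)
finally:
    os.remove(tmp.name)
```

The operating system pages the file in as the regex engine scans it, so the whole file does not have to be read into a Python string first.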

【Solution 2】:

If the delimiters are within a single line:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ',' # just example delimiters
        for line in file_contents:
            i1, i2 = line.find(d1), line.find(d2)
            if -1 < i1 < i2:
                yield line[i1+1:i2]


sentences = list(get_sentences('path/to/my/file'))

If they are on their own lines:

def get_sentences(filename):
    with open(filename) as file_contents:
        d1, d2 = '.', ',' # just example delimiters
        results = []
        for line in file_contents:
            if d1 in line:
                results = []
            elif d2 in line:
                yield results
            else:
                results.append(line)

sentences = list(get_sentences('path/to/my/file'))

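【Discussion】:

Traceback (most recent call last): File "", line 1, in File "", line 10, in get_sentences UnboundLocalError: local variable 'results' referenced before assignment
@amadain I added a line to initialize results, but looking at this I'm not sure it is correct.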
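One way to address the doubt raised above is to track explicitly whether the loop is currently inside a block, so that lines outside any block are never collected. The following is a sketch under some assumptions: the name iter_blocks is hypothetical, the delimiters are assumed to occupy a whole line each, and lines are compared by exact match rather than with the in operator used in the answer.

```python
def iter_blocks(lines, start="DELIMITER1", end="DELIMITER2"):
    """Yield each block of lines found between a start line and an end line."""
    block = None  # None means we are currently outside any block
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped == start:
            block = []            # open a fresh block
        elif stripped == end:
            if block is not None:
                yield block       # close and emit the current block
            block = None
        elif block is not None:
            block.append(stripped)

# Hypothetical input with stray lines before and between the blocks.
demo_lines = ["junk\n", "DELIMITER1\n", "a\n", "b\n", "DELIMITER2\n",
              "junk\n", "DELIMITER1\n", "c\n", "DELIMITER2\n"]
result = list(iter_blocks(demo_lines))
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example iter_blocks(open_file).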

【Solution 3】:

This should do what you want:

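import re
def GetTheSentences(file):
    # 'DELIMITER' alone would also match the DELIMITER2 line, so name both in full
    start_rx = re.compile('DELIMITER1')
    end_rx = re.compile('DELIMITER2')

    start = False
    output = []
    # text mode, so that lines are str and match the str patterns
    with open(file) as datafile:
        for line in datafile.readlines():
            if start_rx.match(line):
                start = True
            elif end_rx.match(line):
                start = False
            elif start:
                # only lines strictly between the delimiters are collected
                output.append(line)
    return output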

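Your previous version looks like it was meant to be an iterator function. Do you want your output returned one item at a time? That would be slightly different.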
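If one item at a time is what is wanted, a generator variant might look like the sketch below. The name iter_sentences and its parameters are hypothetical; it takes any iterable of lines (such as an open file) and yields each line between the delimiters as soon as it is read.

```python
import io
import re

def iter_sentences(lines, start_pattern=r"DELIMITER1", end_pattern=r"DELIMITER2"):
    """Lazily yield every line that lies between a start line and an end line."""
    start_rx = re.compile(start_pattern)
    end_rx = re.compile(end_pattern)
    inside = False
    for line in lines:
        if start_rx.match(line):
            inside = True
        elif end_rx.match(line):
            inside = False
        elif inside:
            yield line.rstrip("\n")

# io.StringIO stands in for an open text file in this demo.
demo_file = io.StringIO("DELIMITER1\nextract me\nextract me\nDELIMITER2\nskip me\n")
sentences = list(iter_sentences(demo_file))
print(sentences)
```

Since the lines are consumed lazily, only the current line needs to be held in memory at any time.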

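【Discussion】:

There is no need to read the whole file into memory. And if it is as simple as looking for a particular substring in a line, you don't need a regex either.
@agf Of course not, but his simplified example may not correspond exactly to his data. I have done something very similar with PostScript files, and I absolutely had to use regular expressions for my start and end points.
@Renklauf No problem, that's what we're here for. You may want to pick one as the accepted answer, though...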

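【Solution 4】:

This is a good job for list comprehensions; no regex is needed. The first list comprehension scrubs the typical \n from the list of text lines obtained when opening the txt file. The second list comprehension just uses the in operator to identify the sequence patterns to filter out.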

def extract_lines(file):
    scrubbed = [x.strip('\n') for x in open(file, 'r')]
    return [x for x in scrubbed if x not in ('DELIMITER1','DELIMITER2')]

【Discussion】:

This returns the whole file except those two lines, not the lines between the delimiters.