Python Regex - Extract text between (multiple) expressions in a text file

【Question title】: Python Regex - Extract text between (multiple) expressions in a textfile
【Posted】: 2018-11-06 09:55:14
【Question description】:

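I am a Python beginner and would be very grateful if you could help me with my text extraction problem.

I want to extract all the text between two expressions in a text file (the beginning and the end of a letter). For both the beginning and the end of the letter there are several possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to analyze a bunch of files; below is an example of what such a text file might look like -> I want to extract all the text from "Dear" to "Douglas". If "letter_end" has no match, i.e. no letter_end expression is found, the output should start at letter_begin and end at the very end of the text file being analyzed.

Edit: The end of the "recorded text" must come after the "letter_end" match and before the first line with 20 or more characters (as is the case for "Random text here as well" -> len = 24).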

"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""

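This is my code so far, but it is not flexible enough to capture the text between the expressions (there can be anything (lines, text, numbers, symbols, etc.) before "letter_begin" and after "letter_end").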

import re

letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"


with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()
    text = str(text)
    output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE) # record all text between Regex (Beginning and End Expressions)
    print(output)

I really appreciate any help!

【Question discussion】:

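You say you want to extract all the text from "Dear" to "Douglas", but your regex contains no Douglas. The ,\n\S+ part will prevent the regex from matching even if you add it to letter_end. Perhaps all you want is regex = r"(?:" + openings + r").*?" + r"(?:" + closings + r")"?
@WiktorStribiżew: Thank you very much for your help, this already looks very good! Do you know how to get the next 5 words after the defined "letter_end"? (So that I can capture whatever name follows the closing expression?)
How do you define a "word"? What characters can appear between them? Look here: if you match 5 words, you may get more than just Douglas.
OK, I see. Is there a way to tell the regex to take the "next two lines" after "letter_end", since the "other random text" only starts at least 3 lines after letter_end? -> r"(?:" + openings + r").*?" + r"(?:" + closings + [\Line+\Line+){0,2} r")" ?
Remove re.DOTALL and use regex101.com/r/PmU3Ti/2, i.e. regex = r"(?:" + openings + r")[\s\S]*?" + r"(?:" + closings + r").*(?:\n.*){0,2}". You do not need re.MULTILINE either, by the way.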

Tags: python regex text-mining text-extraction

【Solution 1】:

You can use

regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)

This produces a regex like

(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}

See the regex demo. Note that you should not use re.DOTALL with this pattern, and the re.MULTILINE option is redundant as well.

Details

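(?:dear|to our|estimated) - any of the three values
[\s\S]*? - any 0+ characters, as few as possible
(?:sincerely|yours|best regards) - any of the three values
.* - any 0+ characters other than line break characters
(?:\n.*){0,2} - zero, one or two repetitions of a newline followed by any 0+ characters other than line break characters.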
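
To make the remark about re.DOTALL concrete, here is a minimal sketch on a made-up sample string (the string and the variable names are assumptions for illustration, not taken from the question). Without re.DOTALL the trailing .* stops at the end of the closing line; with re.DOTALL it also matches newlines and swallows the rest of the text.

```python
import re

# Hypothetical sample text, shortened for illustration.
text = "Dear all\nbody line\nBest regards\nDouglas\n\nLine 5\nLine 6"
pattern = r"(?:dear)[\s\S]*?(?:best regards).*(?:\n.*){0,2}"

# Without re.DOTALL: ".*" stays on the closing line, then up to two more lines follow.
without_dotall = re.search(pattern, text, re.IGNORECASE).group()

# With re.DOTALL: ".*" also matches newlines, so the match runs to the end of the text.
with_dotall = re.search(pattern, text, re.IGNORECASE | re.DOTALL).group()

print(repr(without_dotall))
print(repr(with_dotall))
```

The first result ends after the (empty) second line following the closing; the second result is the whole string.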

Python demo code:

import re
text="""Some random text here

Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards 
Douglas

Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter 
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter 
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))

Output:

['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']
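
The question also asks that, when no letter_end expression is found, the output should run from the opening to the end of the text. The pattern above returns no match in that case. One possible way to cover it, offered here as a sketch and not as part of the original answer, is to add \Z (end of string) as an alternative to the closing part. The helper name extract_letter is hypothetical, and re.escape is an added precaution in case a list entry contains regex metacharacters.

```python
import re

def extract_letter(text, letter_begin, letter_end):
    """Return the first letter found in text, or None if no opening matches.

    Hypothetical helper: falls back to the end of the text when no
    closing expression is found after the opening.
    """
    openings = "|".join(re.escape(s) for s in letter_begin)
    closings = "|".join(re.escape(s) for s in letter_end)
    # Either: closing line plus up to two more lines, or: end of string (\Z).
    pattern = r"(?:{})[\s\S]*?(?:(?:{}).*(?:\n.*){{0,2}}|\Z)".format(openings, closings)
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group() if match else None

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]

with_end = "intro\nDear A\nbody\nSincerely\nBob\n\nlong trailing"
without_end = "intro\nDear A\nbody only"

print(repr(extract_letter(with_end, letter_begin, letter_end)))
print(repr(extract_letter(without_end, letter_begin, letter_end)))
```

Because [\s\S]*? is lazy, the closing alternative is tried at every position before the match is extended, so \Z only takes effect when no closing expression occurs after the opening.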

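【Discussion】:

Thank you very much, Wiktor! I need one last edit to the regex: the output text should stop before the first line after the "letter_end" match that has more than 20 characters. In the example above, it would produce the same output, since len("Random text here as well") = 24. The condition to be met at the end of the regex: stop at the line after the "letter_end" match that contains > 20 characters.
@DominikScheld r"(?:{})[\s\S]*?(?:{}).*(?:\n.{{0,19}}$)*", but you need to use the re.M flag with it. Here is a demo.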
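
As a rough illustration of the pattern from the last comment, the sketch below applies it to a made-up sample text (the text and the variable names are assumptions, not from the thread). Each extra line is accepted only if it has at most 19 characters, so the match stops before the first line with 20 or more. The $ anchor needs re.M to match at every line end; without the flag it only matches at the end of the string, and no extra lines are collected here.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
openings = "|".join(letter_begin)
closings = "|".join(letter_end)

# Pattern from the comment: keep following lines while they have 0-19 characters.
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.{{0,19}}$)*".format(openings, closings)

# Hypothetical sample: three short signature lines, a blank line, then a 24-character line.
text = (
    "Some random text here\n\n"
    "Dear Shareholders\n"
    "We are pleased to report.\n"
    "Best regards\n"
    "Douglas\nCEO\nFund Inc.\n\n"
    "Random text here as well\n"
    "short"
)

with_m = re.search(regex, text, re.IGNORECASE | re.M).group()
without_m = re.search(regex, text, re.IGNORECASE).group()

print(repr(with_m))
print(repr(without_m))
```

Unlike the {0,2} version, this variant is not limited to two lines after the closing: it keeps all consecutive short lines and stops only at the first long one.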