Newbie in regex patterns. How to capture multiple lines? Answers

【Question title】: Newbie in regex patterns. How to capture multiple lines?
【Posted】: 2020-08-25 12:08:35
【Question description】:

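I'm very new to regex patterns. I'm having a hard time parsing a text file and returning one match per paragraph. So basically every paragraph is unique.

Here is my sample text file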

A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141

I want match[0] to be: #A quick brown fox jumps over the lazy dog; 1234;

match[1] to be: #Here is the second paragraph 123141

I tried

import re

regex = re.compile(r"(.*\n)\n", re.MULTILINE)
# file_dir is the path to the sample text file shown above
with open(file_dir, "r") as file:
    matches = regex.findall(file.read())
print(matches)

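But the result is ['1234;\n']. It doesn't capture the whole paragraph, and it doesn't capture the second paragraph either. What is the most efficient way to do this?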

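【Question discussion】:

See stackoverflow.com/questions/41620093/…
What exactly do you want? Is "So basically every paragraph is unique" the delimiter? Please provide more general input file details, without the comments.

Tags: python regex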

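【Solution 1】:

Try (\S[\s\S]*?)(?:\n\n|$):

\S matches a non-whitespace character.
[\s\S]*? matches 0 or more characters that are either whitespace or non-whitespace, i.e. it lazily (non-greedily) matches any character, including newlines. Items 1 and 2 together form capture group 1.
(?:\n\n|$) is a non-capturing group that matches either two consecutive newlines or $ (which matches at the end of the string, or just before a newline at the end of the string).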

Regex Demo

Code:

import re

s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""

matches = re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', s)
print(matches)

Prints:

['A quick brown\nfox jumps over\nthe lazy dog;\n1234;', 'Here is\nthe second paragraph\n123141']
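
The question asked for each paragraph as one line of text, while the matches above still contain the inner newlines. Here is a minimal sketch of one way to flatten them, assuming it is acceptable to collapse every run of whitespace into a single space. The helper name flatten is made up for this illustration and is not part of the original answer.

```python
import re

s = """A quick brown
fox jumps over
the lazy dog;
1234;

Here is
the second paragraph
123141"""


def flatten(paragraph):
    # Hypothetical helper: str.split() with no arguments splits on any run of
    # whitespace (spaces, tabs, newlines), so joining with " " yields one line.
    return " ".join(paragraph.split())


# Same pattern as above; findall returns the text of capture group 1.
paragraphs = [flatten(m) for m in re.findall(r'(\S[\s\S]*?)(?:\n\n|$)', s)]
print(paragraphs)
```

With the sample text this should give 'A quick brown fox jumps over the lazy dog; 1234;' as paragraphs[0] and 'Here is the second paragraph 123141' as paragraphs[1].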

Alternatively, you can use:

\S(?:(?!\n\n)[\s\S])*

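It uses a negative lookahead assertion and costs roughly the same as the previous regex. This regex first looks for a non-whitespace character and then keeps consuming one more character at a time, as long as the input at the current position does not start with two consecutive newlines.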

Regex Demo
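
The answer gives no code for the second pattern, so here is a small sketch that runs it against the same sample text as before. Because this pattern has no capturing group, re.findall returns each whole match.

```python
import re

s = ("A quick brown\nfox jumps over\nthe lazy dog;\n1234;\n\n"
     "Here is\nthe second paragraph\n123141")

# \S starts a paragraph; (?!\n\n) stops the scan right before a blank line.
matches = re.findall(r'\S(?:(?!\n\n)[\s\S])*', s)
print(matches)
```

On this input the result should be the same two paragraphs as with the first pattern. Note that both patterns treat only a truly empty line as a separator; a line that contains just spaces would not split the paragraphs.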

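【Discussion】:

Thanks for sharing. I think the reason I struggled to build the regex was that I was using the MULTILINE flag. Your second answer works with MULTILINE as well, though.
You can use re.MULTILINE with the second regex, but it makes no difference, because the pattern uses neither ^ nor $, and those are the only things re.MULTILINE affects. With the first regex, re.MULTILINE would definitely go wrong, because $ would then also match at the end of every line.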
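
To make the point about re.MULTILINE concrete, here is a short sketch that runs both patterns with and without the flag on the sample text. The variable names are chosen only for this illustration.

```python
import re

s = ("A quick brown\nfox jumps over\nthe lazy dog;\n1234;\n\n"
     "Here is\nthe second paragraph\n123141")

first = r'(\S[\s\S]*?)(?:\n\n|$)'   # pattern that uses $
second = r'\S(?:(?!\n\n)[\s\S])*'   # pattern without ^ or $

plain_first = re.findall(first, s)
# With re.MULTILINE, $ also matches before each newline, so the lazy
# quantifier stops at the end of every line instead of every paragraph.
multi_first = re.findall(first, s, re.MULTILINE)

plain_second = re.findall(second, s)
multi_second = re.findall(second, s, re.MULTILINE)

print(len(plain_first), len(multi_first))
print(plain_second == multi_second)
```

On this input the first pattern should return 2 paragraphs without the flag and 7 separate lines with it, while the second pattern returns the same list either way.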

【Solution 2】:

Here is a good start:

(?:.+\s)+

Test it here

Test code:

import re

regex = r"(?:.+\s)+"

test_str = ("A quick brown\n"
    "fox jumps over\n"
    "the lazy dog;\n"
    "1234;\n\n"
    "Here is\n"
    "the second paragraph\n"
    "123141")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

Output:

Match 1 was found at 0-49: A quick brown
fox jumps over
the lazy dog;
1234;

Match 2 was found at 50-79: Here is
the second paragraph

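As you can see, the last line of the last paragraph is cut off. To avoid this, append a \n to the end of the string before matching, so that the regex can detect the end of the paragraph: test_str += '\n'

You can try it here without the \n, and here with it.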
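
The fix described above can be sketched as follows. This is only an illustration: the endswith check and the rstrip call are additions that are not in the original answer. They avoid appending a second newline to text that already ends with one, and they drop the trailing newline that each match includes.

```python
import re

test_str = ("A quick brown\n"
            "fox jumps over\n"
            "the lazy dog;\n"
            "1234;\n\n"
            "Here is\n"
            "the second paragraph\n"
            "123141")

# Make sure the last line ends with a newline so that .+\s can match it.
if not test_str.endswith("\n"):
    test_str += "\n"

# Each match ends with the newline consumed by \s, so strip it afterwards.
paragraphs = [m.group().rstrip("\n")
              for m in re.finditer(r"(?:.+\s)+", test_str)]
print(paragraphs)
```

With the appended newline, the second paragraph should now include its last line, 123141.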

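【Discussion】:

No, the code was generated by regex101.com, which I use to create the regex examples and the Python code.
Go to the "generated code" section and you will have access to many programming languages, which you can then test on tio.run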