【Question title】:How to extract text between matching strings including match strings and lines
【Posted】:2017-03-30 17:44:38
【Question description】:

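I am working in Python on extracting certain strings that lie between matching strings. The strings come from a list which is itself generated dynamically by a separate Python function. The list I am working with looks like this:-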

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]

The output I want looks like this:-

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output

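As you can see, I want to extract the text/lines that start with line1 and end with line3 (through to the end of that line). The final output should include the matched words themselves (i.e. line1 and line3).

The code I tried is:-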

import re

# Convert list to string first
list_to_str = '\n'.join(sample_list)
# Get desired output
print(re.findall('\nline1(.*?)\nline2(.*?)\nline3($)', list_to_str, re.DOTALL))

This is the output I get:-

[]

Any help is appreciated.

Edit 1:- I did some more work and found this, the closest solution so far:-

matches = (re.findall(r"^line1(.*)\nline2(.*)\nline3(.*)$", list_to_str, re.MULTILINE))

for match in matches:
    print('\n'.join(match))

It gives me this output:-

 this line is the first line
 this line is second line to be included in output
 this is the third and it should also be included in output
 this may contain other strings as well
 this line is second line to be included in output...
 this is the third should also be included in output

The output is almost correct, but it does not include the matched text.

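【Comments】:

  • You should just loop over the list and check whether each value .startswith('line1'), 'line2', and so on.
  • Right. But then you can't capture 'line1', 'line2' and 'line3' in one go.
  • By 'matched text', if you mean that findall() does not include group 0 in its output array, just add a capture group around the whole regex (<your regex>). Example: (^line1(.*)\nline2(.*)\nline3(.*)$)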
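
The last comment can be sketched as follows. This is only an illustration of the suggestion, not code from the thread: the names pattern and blocks are made up here, and the outer group is placed inside the ^...$ anchors, which should behave the same as the comment's version because both anchors are zero-width.

```python
import re

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]
list_to_str = '\n'.join(sample_list)

# The outer parentheses become group 1, so findall() returns 4-tuples:
# (whole block, rest of line1, rest of line2, rest of line3).
pattern = re.compile(r"^(line1(.*)\nline2(.*)\nline3(.*))$", re.MULTILINE)

# Keep only the first element of each tuple, i.e. the full three-line block.
blocks = [groups[0] for groups in pattern.findall(list_to_str)]

for block in blocks:
    print(block)
```

An alternative that needs no extra group is re.finditer(), reading match.group(0) from each match object.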

Tags: python regex


【Solution 1】:

If you are looking for a non-repeating sequence of lines 1, 2 and 3,
this is it

line1.*\s*(?!\s|line[13])line2.*\s*(?!\s|line[12])line3.*

Explanation

 line1 .* \s*             # line 1 plus newline(s)
 (?! \s | line [13] )     # Next cannot be line 1 or 3 (or whitespace)
 line2 .* \s*             # line 2 plus newline(s)
 (?! \s | line [12] )     # Next cannot be line 1 or 2 (or whitespace)
 line3 .*                 # line 3 

If you want to capture the line content, just put capture groups around each .*, i.e. (.*)
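
A minimal sketch of running the pattern above against the question's list, assuming the list is first joined with newlines as in the question. The variable names text, pattern and matches are illustrative. No flags are passed, so "." stops at each newline; because the pattern has no capture groups, findall() returns each whole three-line match.

```python
import re

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]
text = '\n'.join(sample_list)

# Pattern from the answer, unchanged and without flags.
pattern = re.compile(r"line1.*\s*(?!\s|line[13])line2.*\s*(?!\s|line[12])line3.*")

# With no capture groups, each result is the full matched text.
matches = pattern.findall(text)

print('\n'.join(matches))
```

If re.DOTALL were passed, ".*" would run across newlines and a single match would stretch from the first line1 to the end of the text, which is presumably a result like the "matches all the lines" report in the comments. Note also that the pattern is not anchored with ^, so it would also match a "line1" that appears in the middle of a line.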

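【Comments】:

  • Your example doesn't seem to work. It matches all of the lines. The closest I have got is posted in the edit section of the original post.
  • Read the last line: "If you want to capture the line content, just put capture groups around (.*)". To me it was more important to show the assertions without the clutter of capture groups.
  • You are right. I added your regex to the edited code in the OP and it works now. Thanks.
【Solution 2】:

This may not be the neatest way (you may prefer to use a regex), but it does output what you want: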

sample_list = ['line1 this line a first line',
        'line1 this line is also considered as line one...',
        'line1 this line is the first line',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 this contain other strings',
        'line1 this may contain other strings as well',
        'line2 this line is second line to be included in output',
        'line3 this should also be included in output',
        'line1 what the heck is it...'
        ]
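output = []
# Most recent candidate for each line of the sequence ("" means none yet)
line1 = ""
line2 = ""
line3 = ""
# Prefix of the last line accepted into the current sequence
prevStart = ""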
for text in sample_list:
    if prevStart == "":
        if text.startswith("line1"):
            prevStart = "line1"
            line1 = text
    elif prevStart == "line1":
        if text.startswith("line2"):
            prevStart ="line2"
            line2 = text
        elif text.startswith("line1"):
            line1 = text
            prevStart = "line1"
        else:
            prevStart = ""
    elif prevStart == "line2":
        if text.startswith("line3"):
            prevStart = ""
            line3 = text
        else:
            prevStart = ""
    if line1 != "" and line2 != "" and line3 != "":
        output.append(line1)
        output.append(line2)
        output.append(line3)
        line1 = ""
        line2 = ""
        line3 = ""

for line in output:
    print(line)

The output of this code is:

line1 this line is the first line
line2 this line is second line to be included in output
line3 this should also be included in output
line1 this may contain other strings as well
line2 this line is second line to be included in output
line3 this should also be included in output
