【Question title】: Extract multiline text between two strings using Python
【Posted】: 2023-01-15 15:23:20
【Question description】:

I have a text file that looks like the dummy file below:

Lorem Ipsum is simply dummy text of the printing and typesetting industry.
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, 
when an unknown printer took a galley of type and
some random characters and then start of my data
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
end of my data
scrambled it to make a type specimen book. 
It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised
in the 1960s with the release of Letraset 
when an unknown printer took a galley of type and
some random characters and then start of my data
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
end of my data
sheets containing Lorem Ipsum passages,
and more recently with desktop publishing
when an unknown printer took a galley of type and
some random characters and then start of my data
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
some characters in between
end of my data
software like Aldus PageMaker including
versions of Lorem Ipsum.

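I want to extract the data between "start of my data" and "end of my data" and save it in a list variable. This data occurs multiple times in the text file. I tried the code below: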

import re
import sys
s=[]
with open('mytextfile.txt','r') as file:
    mystring = file.read()
    myre = re.compile(r"start of my data(.*?)end of my data", re.DOTALL)
    parts = myre.findall(mystring)
    s.append(parts)

This code saves all of the found strings together at the first index of the list, but I need each piece of data at its own index. How can I do that?

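【Question discussion】:

  • Split the data on newlines?
  • Yes, there are newlines from the start of the data to the end of the data.
  • OK, then go ahead and do that.

Tags: python text split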


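【Solution 1】:

With s.append(parts), you append the whole list parts to the list s as a single element, which is why s ends up with only one element (itself a list of 3 elements). If you instead want to add the 3 elements of parts to s individually, you need s.extend(parts).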
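To make the difference concrete, here is a minimal sketch. The list `parts` below is a hypothetical stand-in for what `findall()` returns, not the asker's real data:

```python
# Hypothetical stand-in for the list returned by myre.findall(mystring)
parts = ["block 1", "block 2", "block 3"]

appended = []
appended.append(parts)   # adds the whole list as ONE element

extended = []
extended.extend(parts)   # adds each element of parts separately

print(len(appended))  # 1
print(len(extended))  # 3
```

Since `findall()` already returns a list, assigning it directly (`s = parts`) would presumably work just as well when `s` starts out empty.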

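【Discussion】:

  • After reading the comments, it looks like you may want to split each part further on newlines, in which case @Thomas Weller's answer seems to do the trick. (Also, if you want to avoid the empty elements at the start and end of each part, you may want to use part.strip().split("\n").)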
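A rough sketch of the strip-then-split idea from the comment above. The sample text is made up for illustration and only mimics the shape of the asker's file:

```python
import re

# Hypothetical sample text standing in for the contents of mytextfile.txt
text = (
    "intro start of my data\n"
    "line a\n"
    "line b\n"
    "end of my data\n"
    "filler start of my data\n"
    "line c\n"
    "end of my data\n"
)

pattern = re.compile(r"start of my data(.*?)end of my data", re.DOTALL)

lines = []
for part in pattern.findall(text):
    # strip() removes the newline right after the start marker
    # and the one right before the end marker
    lines.extend(part.strip().split("\n"))

print(lines)  # ['line a', 'line b', 'line c']
```

Note that strip() also removes any other leading or trailing whitespace in the block, which may or may not be what you want.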
【Solution 2】:

Split each captured group's data into lines on \n:

import re
s=[]
mystring = """
paste your string here
"""
myre = re.compile(r"start of my data(.*?)end of my data", re.DOTALL)
parts = myre.findall(mystring)
for part in parts:
    s.extend(part.split("\n"))
print(len(s))

For the sample data provided, the result is 24.
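The count of 24 may look surprising given that each block has only 6 data lines. A small sketch of where the extra items come from, using a reconstructed stand-in for the question's sample (the variable names here are illustrative):

```python
import re

# Stand-in for one block of the question's sample data
block = (
    "some random characters and then start of my data\n"
    + "some characters in between\n" * 6
    + "end of my data\n"
)
sample = block * 3  # the sample contains three such blocks

pattern = re.compile(r"start of my data(.*?)end of my data", re.DOTALL)
parts = pattern.findall(sample)

# Each captured group begins and ends with "\n", so split("\n")
# yields an empty string at both ends: 6 data lines + 2 empties = 8
first = parts[0].split("\n")
print(len(first))            # 8
print(first[0], first[-1])   # two empty strings

total = sum(len(p.split("\n")) for p in parts)
print(total)                 # 24

# Filtering out the empty strings leaves only the data lines
non_empty = [line for p in parts for line in p.split("\n") if line]
print(len(non_empty))        # 18
```

If the empty entries are unwanted, filtering as in the last step (or stripping each part before splitting) is one way to drop them.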
