Extract a paragraph between headings that contains a specific set of words [duplicate]

【Question title】: Extract a paragraph between headings that contains a specific set of words [duplicate]
【Posted】: 2017-09-18 18:20:19
【Question】:

I have a text file containing data like this:

History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms

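Now I want to extract the paragraph, or the specific section, that contains a particular set of words, for example {"software", "open source"}.

I tried a regexp and an if/loop approach but could not get the required output. Can anyone help?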

【Comments】:

Tags: python grep information-extraction

【Solution 1】:

Using a regular expression:

import re
my_string = """History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms
"""
# Raw string so that \s reaches the regex engine without an invalid-escape warning
pattern = r'\n.+(?:software|open\s?source).+\n'
paragraph_list = re.findall(pattern, my_string)
print(paragraph_list)

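You end up with paragraph_list, a list of every paragraph that contains one of the keywords you specified.

Edit

If you want the keywords to be dynamic, or supplied as a list/tuple: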
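The pattern above is case-sensitive and each match keeps its surrounding newlines. A minimal sketch of a variant, assuming case-insensitive matching is wanted (the question writes "Software" with a capital S) and that each paragraph sits on a single line; the function name `find_paragraphs` and the shortened sample text are made up for illustration:

```python
import re

def find_paragraphs(text, flags=re.IGNORECASE):
    """Return the lines (paragraphs) that mention 'software' or 'open source'."""
    # With re.MULTILINE, ^ and $ anchor at line boundaries, so no newline is consumed
    pattern = re.compile(r'^.*(?:software|open\s?source).*$', flags | re.MULTILINE)
    return [m.group(0).strip() for m in pattern.finditer(text)]

# Shortened, hypothetical stand-in for the text in the question
sample = "History\n\nNaur coined the term.\n\nApplication\n\nOpen source Software took over.\n"
print(find_paragraphs(sample))
```

Passing `flags=0` restores case-sensitive matching.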

import re
keywords = ('software', 'open source')

my_string = """History

The term "data science" (originally used interchangeably with "datalogy") has existed for over thirty years and was used initially as a substitute for computer science by Peter Naur in 1960. In 1974, Naur published Concise Survey of Computer Methods, which freely used the term data science in its survey of the contemporary data processing methods that are used in a wide range of application

Application 

In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms
"""
# re.escape keeps keywords containing regex metacharacters (e.g. "c++") literal
pattern = r'\n.+(?:' + '|'.join(map(re.escape, keywords)) + r').+\n'
paragraph_list = re.findall(pattern, my_string)
print(paragraph_list)
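Joining the keywords with `|` matches a paragraph as soon as any one keyword appears. If "a specific set of words" means every keyword must be present, a plain substring check may be simpler than a regex. A minimal sketch, assuming blocks are separated by blank lines and matching is case-insensitive; `paragraphs_with_keywords` and `require_all` are hypothetical names, and the sample text is invented:

```python
import re

def paragraphs_with_keywords(text, keywords, require_all=False):
    """Return blank-line-separated blocks that contain any (or all) keywords."""
    # Split on blank lines (optionally containing whitespace) and drop empty blocks
    blocks = [b.strip() for b in re.split(r'\n\s*\n', text) if b.strip()]
    check = all if require_all else any
    return [b for b in blocks
            if check(k.lower() in b.lower() for k in keywords)]

# Invented sample in the same heading/paragraph layout as the question
text = ("History\n\nData science is old.\n\n"
        "Application\n\nOpen source software spread.\n\n"
        "Tools\n\nProprietary software remained.\n")
keywords = ('software', 'open source')
print(paragraphs_with_keywords(text, keywords))                    # any keyword
print(paragraphs_with_keywords(text, keywords, require_all=True))  # every keyword
```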

【Comments】:

【Solution 2】:

You can easily check whether a substring is part of a larger string:

>>> text = 'In the 2010–2011 time frame, data science software reached an inflection point where open source software started supplanting proprietary software.[30] The use of open source software enables modifying and extending the software, and it allows sharing of the resulting algorithms'
>>> "software" in text
True

You can extract the lines of a file that contain a specific word:

>>> with open('yourfile.txt', 'r') as f:
...     result = [line for line in f if 'software' in line]
...
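Filtering lines returns the paragraph but loses the heading it belongs to. A minimal sketch that pairs each heading with the paragraph below it, assuming the text strictly alternates heading and paragraph blocks separated by empty lines, as in the question's sample; `split_sections` and `sections_containing` are hypothetical helper names:

```python
def split_sections(text):
    """Pair each heading with the paragraph that follows it.

    Assumes blocks separated by empty lines that strictly alternate
    heading, paragraph, heading, paragraph, ...
    """
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    # Even-indexed blocks are headings, odd-indexed blocks are paragraphs
    return dict(zip(blocks[0::2], blocks[1::2]))

def sections_containing(text, word):
    """Return {heading: paragraph} for paragraphs containing `word` (case-insensitive)."""
    return {heading: para
            for heading, para in split_sections(text).items()
            if word.lower() in para.lower()}

# Invented sample in the same layout as the question's file
sample = "History\n\nThe term is decades old.\n\nApplication \n\nOpen source software spread.\n"
print(sections_containing(sample, "software"))
```

To apply this to a file, read its contents into a string first and pass that string to `sections_containing`.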

【Comments】: