【Question Title】:Python extract sentence containing word
【Posted】:2013-04-08 14:37:31
【Question】:

I'm trying to extract from a text all sentences that contain a given word.

txt="I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

But it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help?

【Comments】:

    Tags: python regex text-segmentation


    【Solution 1】:

    You can use str.split:

    >>> txt="I like to eat apple. Me too. Let's go buy some apples."
    >>> txt.split('. ')
    ['I like to eat apple', 'Me too', "Let's go buy some apples."]
    
    >>> [ t for t in txt.split('. ') if 'apple' in t]
    ['I like to eat apple', "Let's go buy some apples."]
    

    【Comments】:

      【Solution 2】:
      In [7]: import re
      
      In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples."
      
      In [9]: re.findall(r'([^.]*apple[^.]*)', txt)
      Out[9]: ['I like to eat apple', " Let's go buy some apples"]
      

      But note that @jamylak's split-based solution is faster:

      In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
      1000000 loops, best of 3: 1.96 us per loop
      
      In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s]
      1000000 loops, best of 3: 819 ns per loop
      

      For larger strings the speed difference is smaller, but still significant:

      In [24]: txt = txt*10000
      
      In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
      100 loops, best of 3: 8.49 ms per loop
      
      In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s]
      100 loops, best of 3: 6.35 ms per loop
      

      【Comments】:

        【Solution 3】:

        No regular expression needed:

        >>> txt = "I like to eat apple. Me too. Let's go buy some apples."
        >>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence]
        ['I like to eat apple.', " Let's go buy some apples."]
        

        【Comments】:

        • @user2187202 You can accept my answer if you like, or the regex solution if that's really what you need, since you did tag this as a regex question; I'm not sure whether regex is actually necessary here, though.
        【Solution 4】:
        In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)                                                                                                                             
        Out[3]: ['I like to eat apple.', " Let's go buy some apples."]
        

        【Comments】:

        • How can I get only apple by adding a boundary, i.e. ['I like to eat apple.']?
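A possible answer to the comment above (a sketch; the `\b` word-boundary anchor is not part of the original answer):

```python
import re

txt = "I like to eat apple. Me too. Let's go buy some apples."

# \b only matches at a word boundary, so "apples" no longer qualifies.
matches = re.findall(r"[^.]*?\bapple\b[^.]*\.", txt)
print(matches)
```

This keeps the structure of the answer's regex and only adds the boundary anchors around the word.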
        【Solution 5】:
        r"\."+".+"+"apple"+".+"+"\."
        

        This line is a bit odd; why concatenate so many separate strings? You could just write r'\..+apple.+\.'.

        Anyway, the problem with your regex is that it is greedy. By default, x+ matches x as many times as possible. So your .+ matches as many characters (any characters) as possible, including periods and apples.

        What you want is a non-greedy expression; you usually get one by appending a ? at the end: .+?

        That gives you this result:

        ['.I like to eat apple. Me too.']
        

        As you can see, you no longer get both apple sentences, but you still get Me too. That's because you still match a . after apple, so it is impossible not to also capture the following sentence.

        A regex that does work is this one: r'\.[^.]*?apple[^.]*?\.'

        Here you no longer match any character, but only characters that are not themselves periods. We also allow matching no characters at all (since there are no non-period characters after apple in the first sentence). Using that expression yields:

        ['.I like to eat apple.', ". Let's go buy some apples."]
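The progression described in this answer can be checked end to end (a sketch reproducing the three regexes on the same input):

```python
import re

txt = ".I like to eat apple. Me too. Let's go buy some apples."

# Greedy: .+ expands as far as possible, swallowing the whole string.
greedy = re.findall(r"\..+apple.+\.", txt)

# Non-greedy: .+? stops as early as possible, but still crosses one period.
lazy = re.findall(r"\..+?apple.+?\.", txt)

# Excluding periods with [^.] keeps each match inside a single sentence.
fixed = re.findall(r"\.[^.]*?apple[^.]*?\.", txt)
```

Each variant reproduces one of the outputs quoted above.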
        

        【Comments】:

          【Solution 6】:

          Apparently, the sample in the question is extract sentence containing substring, not
          extract sentence containing word. Here is how to solve the extract sentence containing word problem in Python:

          A word can appear at the beginning, middle, or end of a sentence. Not limited to the example in the question, here is a general function for searching for a word in a sentence:

          import re

          def searchWordinSentence(word, sentence):
              # word surrounded by spaces, or at the start/end of the sentence
              pattern = re.compile(' ' + word + ' |^' + word + ' | ' + word + '$')
              return re.search(pattern, sentence) is not None
          

          Limited to the example in the question, we can solve it like this:

          txt="I like to eat apple. Me too. Let's go buy some apples."
          word = "apple"
          print([t for t in txt.split('. ') if searchWordinSentence(word, t)])
          

          The corresponding output is:

          ['I like to eat apple']
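The same word-level check can be written more compactly with a regex word boundary (a sketch; `\b` and `re.escape` are not used in the answer above):

```python
import re

def contains_word(word, sentence):
    # \b anchors at word boundaries; re.escape guards regex metacharacters.
    return re.search(r'\b' + re.escape(word) + r'\b', sentence) is not None

txt = "I like to eat apple. Me too. Let's go buy some apples."
print([t for t in txt.split('. ') if contains_word('apple', t)])
```

Unlike the hand-built space-based pattern, `\b` also handles punctuation adjacent to the word.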
          

          【Comments】:

            【Solution 7】:
            import nltk
            search = "test"
            text = "This is a test text! Best text ever. Cool"
            contains = [s for s in nltk.sent_tokenize(text) if search in s]
            

            【Comments】:
