Print words between two particular words in a given string

【Question title】: print words between two particular words in a given string
【Posted】: 2016-12-06 21:26:06
【Question】:

If a particular word is not terminated by one of another set of words, leave it out. Here is my string:

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'

I want to print and count all the words between john and dead, death, or died. If a john is not terminated by any of died, dead, or death, ignore it and start again from the next john.

My code:

import re

x = re.sub(r'[^\w]', ' ', x)  # replace dots, commas and other special symbols with spaces

for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print i
    print len([word for word in i.split()])

My output:

 got shot 
2
 with his          john got killed or 
6
 with his wife 
3

Desired output:

got shot
2
got killed or
3
with his wife
3

I don't know where I am going wrong. This is just a sample input; I have to check 20,000 inputs at a time.

【Question comments】:

Your point is not clear. Since with his john got killed or comes after john, does it count as 6?
@MarlonAbeykoon In john with his .... ? , john got killed or died, the first john is not terminated by dead, death, or died, so matching should start from the second john. The output I want is got killed or, not with his john got killed or.

Tags: python regex python-2.7

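【Solution 1】:

I assume you want to start over whenever another john appears in the string before dead|died|death occurs.

In that case you can split the string on the word john and then match against each resulting part: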

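import re

x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
# Drop punctuation, then collapse runs of whitespace into single spaces.
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    # Capture everything up to the first stop word in this segment.
    m = re.match('(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))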

This yields:

 got shot 
2
 got killed or 
3
 with his wife 
3

Also note that after the substitutions I propose here (before splitting and matching), the string looks like this:

john got shot dead john with his john got killed or died in 1990 john with his wife dead or died

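That is, it contains no runs of multiple spaces. You handle those later anyway by splitting on whitespace, but I find this a bit cleaner.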

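【Comments】:

Nice solution, but it also processes the part before the first john. Adding a [1:] slice should take care of that :)
Right, if a sentence starts with ... dead john (i.e. the text before the first john contains one of the three stop words), it will be treated as a match as well. I'll fix that.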
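The [1:] slice suggested in the comments could look roughly like the sketch below. It is written for Python 3, the function name phrases_after_john is made up for illustration, and the regex and clean-up are the ones from the answer above; treat it as one possible reading of the suggested fix rather than the answerer's final code.

```python
import re

STOP = re.compile('(.+?)(dead|died|death)')

def phrases_after_john(text):
    """Return (phrase, word_count) pairs for each 'john ... stop word' span."""
    # Same clean-up as in the answer: drop punctuation, collapse whitespace.
    cleaned = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', text)).strip()
    results = []
    # [1:] skips whatever precedes the first 'john', so a stop word that
    # appears before any 'john' can no longer produce a match.
    for segment in cleaned.split('john')[1:]:
        m = STOP.match(segment)
        if m:
            phrase = m.group(1).strip()
            results.append((phrase, len(phrase.split())))
    return results
```

With the slice in place, an input that begins with "he is dead. john got shot dead." only reports the span after john; without it, "he is" would be reported too.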

【Solution 2】:

You can use this regex with a negative lookahead:

>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print i.strip()
...     print len([word for word in i.split()])
...

got shot
2
got killed or
3
with his wife
3

Instead of your .*?, this regex uses (?:(?!john).)*?, which lazily matches zero or more of any character, but only as long as john does not occur inside the match.
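To see the difference in isolation, here is a small Python 3 comparison on a shortened input. The variable names and the sample string are chosen for illustration only; the two patterns are the one from the question and the one from this answer.

```python
import re

text = 'john with his john got killed or died'

# Plain lazy match: starts after the first 'john' and runs straight
# through the second one until a stop word is found.
plain = re.findall(r'(?<=john)(.*?)(?=dead|died|death)', text)

# Tempered match: a character is only consumed if 'john' does not start
# at that position, so the attempt after the first 'john' fails and the
# engine retries after the second 'john'.
tempered = re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', text)

print(plain)     # [' with his john got killed or ']
print(tempered)  # [' got killed or ']
```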

I also recommend adding word boundaries so that only whole words are matched:

re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
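Since the question mentions checking 20,000 inputs, one way to package the word-boundary pattern is to compile it once and reuse it for every line. The sketch below is Python 3; PATTERN, count_words_between, and the second sample input are assumptions made for illustration, and the clean-up step is the one from the question.

```python
import re

# Compiled once and reused; this is the word-boundary regex shown above.
PATTERN = re.compile(
    r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)'
)

def count_words_between(text):
    """Return (phrase, word_count) pairs for one input string."""
    cleaned = re.sub(r'[^\w]', ' ', text)  # same clean-up as in the question
    pairs = []
    for span in PATTERN.findall(cleaned):
        words = span.split()
        pairs.append((' '.join(words), len(words)))
    return pairs

inputs = [
    'john got shot dead. john with his .... ? , john got killed or died '
    'in 1990. john with his wife dead or died',
    'johnny was deadly serious',  # no whole-word john or stop word
]
for line in inputs:
    print(count_words_between(line))
```

The second input shows why the boundaries matter: without \b, the pattern would treat the john in johnny and the dead in deadly as delimiters and report "ny was" as a match.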


【Comments】:

More elegant than my solution; go with this one.