RegEx: Find lists of words containing specific string - Python

【Question title】: RegEx: Find lists of words containing specific string - Python
【Posted】: 2021-08-28 16:21:13
【Question】:

I am looking for words that contain any of the following sequences: "tion", "ex", "ph", "ost", "ist", "ast".

So far, this is my function:

def latin_ish_words(text):
    return re.findall('tion|ex|ph|ost|ist|ast+\b', text, re.I)

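However, this only returns the specific sequences rather than the complete words.

Example 1: latin_ish_words("This functions as expected") returns ['tion', 'ex'], whereas I am looking for ["functions", "expected"].

Example 2: text = 'Philosophy ex nihilo existed in the past' returns ['Ph', 'ph', 'ex', 'ex'], whereas I am looking for ['Philosophy', 'ex', 'existed', 'past'].

Having read the official re documentation, I thought '\b' returned the complete word?

Any suggestions?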

【Question comments】:

Just to confirm, why should example 2 match "past"? Should the regex read ast+ rather than ost+?
Good spot, thanks. I've updated my question.

Tags: python python-re findall

【Solution 1】:

You can try using [a-z]* on both sides of the sequence to capture the rest of the word before and after it.

import re

def latin_ish_words(text):
    return re.findall(r'\b([a-z]*(tion|ex|ph|ost)[a-z]*)\b', text, re.I)

In [1]: latin_ish_words("Philosophy ex nihilo existed in the past")
Out[1]: [('Philosophy', 'ph'), ('ex', 'ex'), ('existed', 'ex')]

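The word you are trying to capture is the first element of each tuple in the result list. Note that this pattern lists only "tion", "ex", "ph" and "ost", which is why "past" is not in the output.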
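If the tuples are inconvenient, one possible variation is to iterate over match objects and take group(0), which is always the full match no matter how many groups the pattern contains. The sketch below is an assumption-laden variant rather than the answerer's code: it extends the alternation to all six sequences from the question, and the name LATIN_ISH is made up for illustration. It also relies on the r prefix, because in a plain string literal '\b' is a backspace character, not the regex word-boundary assertion, and a word boundary is a zero-width check that never widens the match on its own.

```python
import re

# Assumption: all six sequences from the question are included here.
# The r prefix keeps \b as the regex word boundary; without it,
# Python would turn '\b' into a backspace character.
LATIN_ISH = re.compile(r'\b[a-z]*(tion|ex|ph|ost|ist|ast)[a-z]*\b', re.I)

def latin_ish_words(text):
    # group(0) is the whole match, independent of the inner capturing group
    return [m.group(0) for m in LATIN_ISH.finditer(text)]

print(latin_ish_words("Philosophy ex nihilo existed in the past"))
print(latin_ish_words("This functions as expected"))
```

Because the surrounding [a-z]* parts do the work of extending the match to the whole word, the \b anchors only make sure the match starts and ends at word edges.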

【Comments】:

【Solution 2】:

Maybe split the sentence and check each word individually?

import re

def latin_ish_words(text):
    words = text.split(' ')
    matched_words = []

    for word in words:
        if re.findall('tion|ex|ph|ost', word, re.I):
            matched_words.append(word)

    return matched_words
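
One thing to watch with splitting on spaces is that punctuation stays attached to the words, so 'past.' would be returned with its trailing full stop. A minimal sketch of one way around that follows; the function name latin_ish_words_split and the constant PATTERN are hypothetical, and the alternation is assumed to cover all six sequences from the question.

```python
import re
import string

# Assumption: all six sequences from the question are wanted.
PATTERN = re.compile(r'tion|ex|ph|ost|ist|ast', re.I)

def latin_ish_words_split(text):
    matched_words = []
    # split() with no argument splits on any run of whitespace
    for token in text.split():
        # drop leading and trailing punctuation such as '.', ',' or '!'
        word = token.strip(string.punctuation)
        # search() is enough here: we only need to know whether a match exists
        if word and PATTERN.search(word):
            matched_words.append(word)
    return matched_words

print(latin_ish_words_split('Philosophy ex nihilo existed in the past.'))
```

This keeps the split-and-check structure of the answer and only changes how each token is cleaned before testing it.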

【Comments】:

【Solution 3】:

You can try this:

import re

def latin_ish_words(text):
    return re.findall(r'(\w*(?:tion|ex|ph|ost|ast)\w*)', text, re.I)

text = 'Philosophy ex nihilo existed in the past.'
latin_ish_words(text)

It gives:

['Philosophy', 'ex', 'existed', 'past']

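(?: ) denotes a non-capturing group, i.e. a pattern that should be matched but should not be returned on its own as one of the results of re.findall().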
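
To make the difference concrete, here is a small sketch comparing what re.findall() returns for a capturing group and for a non-capturing group. The variable names and the shortened two-item alternation are chosen only for illustration.

```python
import re

text = 'existed in the past'

# One capturing group: findall returns only what the group captured.
with_group = re.findall(r'\w*(ex|ast)\w*', text, re.I)

# Non-capturing group: there are no groups to report,
# so findall returns the full matches.
without_group = re.findall(r'\w*(?:ex|ast)\w*', text, re.I)

print(with_group)     # ['ex', 'ast']
print(without_group)  # ['existed', 'past']
```

This is the same behaviour the question ran into from the other direction: once a pattern contains capturing groups, findall reports the groups instead of the whole match.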

【Comments】: