Regex pattern counting with repetitive words [duplicate]: answers

【Question title】: Regex pattern counting with repetitive words [duplicate]
【Posted】: 2020-07-01 23:37:00
【Question】:

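I am trying to write a Python function that counts occurrences of a specific word in a string.

My regex pattern does not work when the word I want to count is repeated several times in a row. Otherwise, the pattern seems to work well.

Here is my function: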

import re

def word_count(word, text):
    return len(re.findall('(^|\s|\b)'+re.escape(word)+'(\,|\s|\b|\.|$)', text, re.IGNORECASE))

When I test it with an arbitrary string:

>>> word_count('Linux', "Linux, Word, Linux")
2

When the word I want to count is adjacent to itself:

>>> word_count('Linux', "Linux Linux")
1

【Comments】:

Note that '\b' is the backspace character, '\x08', not '\\b'.
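To illustrate the comment above, here is a minimal sketch comparing a plain string literal with a raw one. The variable names `plain` and `raw` are only for illustration and do not appear in the question.

```python
import re

plain = '\b'   # ordinary string literal: a single backspace character
raw = r'\b'    # raw string literal: a backslash followed by 'b'

print(plain == '\x08')        # True
print(len(plain), len(raw))   # 1 2

# Only the raw form reaches the regex engine as a word-boundary assertion.
print(re.findall(raw + 'Linux' + raw, 'Linux Linux'))      # ['Linux', 'Linux']
# The plain form asks for literal backspace characters around the word.
print(re.findall(plain + 'Linux' + plain, 'Linux Linux'))  # []
```

So in the question's pattern, the '\b' alternatives never behave as word boundaries; only the '^', '\s', ',', '.' and '$' alternatives can actually match ordinary text.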

Tags: python regex

【Solution 1】:

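The problem is in your regex. Your regex uses two capturing groups, and re.findall returns the contents of any capturing groups present rather than the whole match. These need to be changed to non-capturing groups using (?:...).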
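As a rough illustration of how re.findall treats groups, the sketch below uses simplified versions of the question's pattern (the simplification and the variable names are assumptions made for the example):

```python
import re

text = "Linux, Word, Linux"

# With capturing groups, findall returns one tuple of group contents per match.
with_groups = re.findall(r'(^|\s)Linux(,|\s|$)', text)
print(with_groups)      # [('', ','), (' ', '')]

# With non-capturing groups, findall returns the matched text itself.
without_groups = re.findall(r'(?:^|\s)Linux(?:,|\s|$)', text)
print(without_groups)   # ['Linux,', ' Linux']
```

Both lists have the same length here, so switching to non-capturing groups mainly changes what is returned; the miscount on adjacent words appears to come from the separators being consumed by the match.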

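Besides, there is no reason to use (^|\s|\b): \b, the word boundary, is sufficient on its own. It covers all of those cases and, in addition, is zero-width.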

In the same way, (\,|\s|\b|\.|$) can be replaced with \b.
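The zero-width point can be seen in a small sketch. The consuming pattern below is a simplified stand-in for the question's pattern, not the original itself:

```python
import re

text = "Linux Linux"

# The trailing \s consumes the space, so the second "Linux" has no
# separator left in front of it and is skipped.
consuming = re.findall(r'(?:^|\s)Linux(?:\s|$)', text)
print(consuming)    # ['Linux ']

# \b is a zero-width assertion: it matches a position, not a character,
# so the space stays available and both words are found.
zero_width = re.findall(r'\bLinux\b', text)
print(zero_width)   # ['Linux', 'Linux']
```

Because re.findall only returns non-overlapping matches, any separator swallowed by one match cannot also serve as the leading separator of the next one.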

So you can use:

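import re

def word_count(word, text):
    # \b matches at word boundaries without consuming any characters;
    # re.escape guards against regex metacharacters in the word itself.
    return len(re.findall(r'\b' + re.escape(word) + r'\b', text, re.I))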

This gives:

>>> word_count('Linux', "Linux, Word, Linux")
2
>>> word_count('Linux', "Linux Linux")
2

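【Comments】:

Thanks for the reply! I just edited my question to be more precise. I want to count a word that may be repeated several times in a row, so word_count('Linux', 'Linux Linux Linux') should return 3.
My typo is fixed. This will now return 3 for word_count('Linux', 'Linux Linux Linux').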
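A quick check of the answer's function against the case from the comments, plus two extra inputs that are assumptions added only to show case-insensitivity and boundary behaviour:

```python
import re

def word_count(word, text):
    return len(re.findall(r'\b' + re.escape(word) + r'\b', text, re.I))

print(word_count('Linux', 'Linux Linux Linux'))    # 3
print(word_count('linux', 'Linux LINUX linux'))    # 3
# A hyphen is a non-word character, so "Linux-based" still counts,
# while "LinuxOS" does not because no boundary follows "Linux".
print(word_count('Linux', 'Linux-based LinuxOS'))  # 1
```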

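【Solution 2】:

I'm not sure this is 100% what you want, because I don't understand the part about passing the word to search for into the function when you are just looking for words that repeat in a string. So maybe consider...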

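import re

# Matches a word followed by one or more space-separated repeats of itself.
pattern = r'\b(\w+)( \1\b)+'

def word_count(text):
    split_words = text.split(' ')
    count = 0
    # Note: findall scans the whole text on every iteration, so each run of
    # repeated words is counted once per space-separated token in the text.
    for split_word in split_words:
        count = count + len(re.findall(pattern, text, re.IGNORECASE))
    return count

word_count('Linux Linux Linux Linux')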

Output:

Maybe that helps.

Update: based on the comments below...

def word_count(word, text):
    count = text.count(word)
    return count

word_count('Linux', "Linux, Word, Linux")

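Output:

【Comments】:

The OP is looking to "count a specific word in a string". For example, "Linux" appears twice in "Linux, Word, Linux", so the function should return 2.
Updated the answer. Maybe this is useful?
This counts substrings, not words. E.g. word_count('race', 'racer') is 1, but it should be 0.
If you really want to use the .count method, you can split the string into a list, e.g. re.split(r'\W+', text), but that makes case-insensitive searching harder.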
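A minimal sketch of the split-based idea from the last comment, assuming it is acceptable to lowercase both the text and the word to approximate a case-insensitive search. The helper name `word_count_split` is hypothetical and not from the thread:

```python
import re

def word_count_split(word, text):
    # Hypothetical helper: split on runs of non-word characters,
    # then count exact (lowercased) token matches with list.count.
    tokens = re.split(r'\W+', text.lower())
    return tokens.count(word.lower())

print("racer".count("race"))                            # 1 (substring match)
print(word_count_split('race', 'racer'))                # 0 (whole-word match)
print(word_count_split('Linux', 'linux, Word, LINUX'))  # 2
```

Lowercasing is a simplification; for text outside ASCII, str.casefold may be a better fit than str.lower.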