How can I count same words from sentences? Answers

【Question title】: How can I count same words from sentences?
【Posted】: 2020-06-06 04:35:11
【Question description】:

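I would like to ask how to count identical words in a sentence (in Python).

For example, given a sentence like this: "What a wonderful day. Birds are singing, children are laughing."

What I want to extract is: ['what':1, 'a':1, 'wonderful':1, 'day':1, 'birds':1, 'are':2, 'singing':1, 'children':1, 'laughing':1]

Here is what I have done so far: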

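sent = "What a wonderful day. Birds are singing, children are laughing."
a = sent.split()  # split the sentence on whitespace
b = set([word.lower() for word in a])  # unique lowercased tokens; the counts are lost
c = list(b)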

If this code is not suitable for the job, please let me know. Thank you.

【Question comments】:

Tags: python string count word

【Solution 1】:

You can use Counter and re for this:

import re
from collections import Counter

sent = "What a wonderful day. Birds are singing, children are laughing."
# Keep only runs of ASCII letters, which drops the punctuation
words = re.findall("[A-Za-z]+", sent)
print(dict(Counter(words)))
# {'What': 1, 'a': 1, 'wonderful': 1, 'day': 1, 'Birds': 1, 'are': 2, 'singing': 1, 'children': 1, 'laughing': 1}
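The output above keeps "What" and "Birds" capitalized, while the question asks for lowercase keys. A minimal variation, assuming plain ASCII English text, lowercases the sentence before matching. The helper name `count_words` is hypothetical and not part of the answer.

```python
import re
from collections import Counter

def count_words(text):
    """Count case-insensitive occurrences of ASCII-letter words."""
    # Lowercase first so "Birds" and "birds" share one key.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

sent = "What a wonderful day. Birds are singing, children are laughing."
counts = count_words(sent)
print(dict(counts))
```

Note that this pattern splits contractions such as "it's" into "it" and "s"; whether that matters depends on the input.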

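【Comments】:

【Solution 2】:

collections.Counter can be used to count the occurrences of anything in a list, so it is a good start. It does mean, however, that we first have to turn the sentence into a list of words and remove the punctuation.

To get a list of words, there is a method called .split() that splits the sentence on whitespace. And to remove the punctuation, the .strip() method is a good choice.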
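To see what these two methods do on their own, here is a small sketch. The sample sentence is an assumption chosen for illustration. It shows that `.strip(string.punctuation)` only trims characters from both ends of each token, so punctuation inside a word, such as the apostrophe in a contraction, is kept.

```python
import string

text = "Well, it's a wonderful day!"
tokens = text.split()  # split on runs of whitespace
# Trim punctuation from both ends of every token; inner characters stay.
cleaned = [t.strip(string.punctuation) for t in tokens]
print(tokens)   # ['Well,', "it's", 'a', 'wonderful', 'day!']
print(cleaned)  # ['Well', "it's", 'a', 'wonderful', 'day']
```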

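As you have already hinted, we should also normalize the case. For this, it is better to use .casefold() than .lower(). For some languages the two differ: .casefold() also folds characters such as the German "ß", which .lower() leaves unchanged.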
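A short illustration of the difference, using the German word "Straße" as an assumed example (it does not appear in the question):

```python
word = "Straße"
upper = "STRASSE"

print(word.lower())     # straße
print(word.casefold())  # strasse

# With casefold() the two spellings compare equal; with lower() they do not.
same_with_casefold = word.casefold() == upper.casefold()
same_with_lower = word.lower() == upper.lower()
print(same_with_casefold, same_with_lower)  # True False
```

For plain ASCII English text, both methods give the same result.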

All in all, that leads to code that looks something like this:

import string
from collections import Counter

sent = "What a wonderful day. Birds are singing, children are laughing."
words = [word.strip(string.punctuation).casefold() for word in sent.split()]
freq = Counter(words)
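The snippet above does not print anything. As a usage sketch, the resulting Counter can be inspected like a dictionary, and `most_common()` lists the words by descending count. The code is repeated here so that the example runs on its own.

```python
import string
from collections import Counter

sent = "What a wonderful day. Birds are singing, children are laughing."
words = [word.strip(string.punctuation).casefold() for word in sent.split()]
freq = Counter(words)

print(freq["are"])          # 2
print(freq.most_common(1))  # [('are', 2)]
print(dict(freq))           # plain dict, keys in order of first appearance
```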

【Comments】:

【Solution 3】:

Use collections.Counter + str.strip to remove the punctuation:

from collections import Counter
import string

sent = "What a wonderful day. Birds are singing, children are laughing."

c = Counter([x.strip(string.punctuation) for x in sent.split()])
print(c)

# Counter({'are': 2, 'What': 1, 'a': 1, 'wonderful': 1, 'day': 1, 'Birds': 1, 'singing': 1, 'children': 1, 'laughing': 1})

If you want it to be case-insensitive, convert to lowercase before counting, like this:

s = sent.lower().translate(str.maketrans('', '', string.punctuation))
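The line above only produces the cleaned string. A sketch of how it might be combined with Counter follows; the variable name `table` is introduced here for readability and is not in the answer.

```python
import string
from collections import Counter

sent = "What a wonderful day. Birds are singing, children are laughing."
# Translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)
s = sent.lower().translate(table)
c = Counter(s.split())
print(c)
```

Unlike `.strip()`, `.translate()` removes punctuation everywhere in the string, so a contraction such as "it's" would become "its".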

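【Comments】:

For case-insensitivity, using .casefold() is better than using .lower(), because it handles internationalization better.
@JohanL While that is true, it depends a lot on what the data actually looks like. For plain English (no non-ASCII characters), lower() is the cleanest and avoids a lookup.
I disagree about "cleanest". The .casefold() function was added for exactly this reason, so I think it should be the default unless there is a strong reason not to use it (e.g. speed). Avoiding a lookup is an implementation detail that programmers usually should not need to care about.
@JohanL I don't disagree; you are right. But in a scenario where either one gives the same result, and one has a slight edge in speed and is possibly more readable (subjective), it takes a leap. And yes, that is not as common as what you described in your first comment. By the way, I just realized there is a flaw in my code.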