Count Words in Python

【Question Title】: Count Words in Python
【Posted】: 2015-06-14 01:22:53
【Question Description】:

I have a list of strings in Python.

list = [ "Sentence1. Sentence2...", "Sentence1. Sentence2...",...]

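I want to remove stop words and count the occurrences of each word across all of the strings combined. Is there a simple way to do this?

I am currently considering CountVectorizer() from scikit-learn rather than iterating over every word and combining the results myself.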

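【Question Discussion】:

What are stop words? So you want to concatenate everything into one long string and then count the occurrences, right?
An example of the desired output would be helpful.
Take a look at stackoverflow.com/questions/19560498/…
@wouter Basically, you can think of it as a pile of documents, and I want to count how many times each word occurs in those documents.
If you use tf-idf, you do not need to remove stop words.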

Tags: python list scikit-learn

【Solution 1】:

If you don't mind installing a new Python library, I suggest using gensim. Its first tutorial does exactly what you are asking for:

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

Next, build a dictionary for your document corpus and create the bag-of-words representation.

from gensim import corpora

dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
print(dictionary)

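You can then weight the results with tf-idf or other schemes, and from there it is easy to move on to LDA.

Take a look at Tutorial 1 here.

【Discussion】:

This looks very helpful, thank you!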

【Solution 2】:

You did not fully explain what you have in mind, but this may be what you are looking for:

import collections
counts = collections.Counter(' '.join(your_list).split())
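The one-liner counts every whitespace-separated token, stop words and trailing punctuation included. A rough sketch of how it might be extended to drop stop words, using only the standard library: the sample `your_list` and the `stop_words` set are made up for illustration, and stripping punctuation with `str.strip` is a simplification, not a real tokenizer.

```python
import collections
import string

# Hypothetical input and stop-word set, for illustration only.
your_list = ["The cat sat. The dog barked.", "A cat and a dog."]
stop_words = {"a", "and", "the"}

# Join all strings, lowercase, split on whitespace, and strip
# leading/trailing punctuation so "dog." and "dog" count as one word.
words = (
    word.strip(string.punctuation)
    for word in " ".join(your_list).lower().split()
)

# Skip empty tokens (pure punctuation) and stop words while counting.
counts = collections.Counter(
    word for word in words if word and word not in stop_words
)

print(counts.most_common())
```

For a realistic stop-word list, the set could be swapped for one shipped with a library such as NLTK or scikit-learn.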

【Discussion】:

Does your code join the different strings together one by one?
Yes, all the strings are joined together and then split on whitespace.