How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

【Question title】: How do I get word frequency in a corpus using Scikit Learn CountVectorizer?
【Posted】: 2015-02-13 19:24:42
【Question description】:

I am trying to compute a simple word frequency using scikit-learn's CountVectorizer.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird","bird"]
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

print(cv.vocabulary_)
# {'bird': 0, 'cat': 1, 'dog': 2, 'fish': 3}  (key order may vary)

I was expecting it to return {'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}.

【Question discussion】:

CountVectorizer creates "a mapping of terms to feature indices". If you only want the frequencies, why not use collections.Counter?
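
A minimal sketch of the collections.Counter approach suggested in this comment, assuming plain whitespace tokenization. That is not exactly how CountVectorizer tokenizes (by default it also lowercases and drops single-character tokens), so results can differ on messier text:

```python
from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Split each document on whitespace and count every token across the corpus.
word_counts = Counter(word for text in texts for word in text.split())

print(dict(word_counts))
# {'dog': 2, 'cat': 3, 'fish': 2, 'bird': 2}
```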

Tags: python scikit-learn

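【Solution 1】:

In this case cv.vocabulary_ is a dict whose keys are the words (features) that were found and whose values are their indices, which is why they are 0, 1, 2, 3. It is just bad luck that they looked similar to your counts :)

You need to work with the cv_fit object to get the counts: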

from sklearn.feature_extraction.text import CountVectorizer

texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(cv.get_feature_names_out().tolist())
print(cv_fit.toarray())
#['bird', 'cat', 'dog', 'fish']
#[[0 1 1 1]
# [0 2 1 0]
# [1 0 0 1]
# [1 0 0 0]]

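Each row of the array is one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. You can see that summing each column gives the correct numbers: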

print(cv_fit.toarray().sum(axis=0))
#[2 3 2 2]
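
To make the link between cv.vocabulary_ and the count matrix concrete, here is a small sketch that uses the index stored in vocabulary_ to pick out a single column. The names col and doc_index are only illustrative and do not appear in the answer above:

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# vocabulary_ maps each word to its column index in the count matrix.
col = cv.vocabulary_["cat"]

# Count of "cat" in the second document (row index 1).
doc_index = 1
cat_in_doc = int(cv_fit[doc_index, col])

# Total count of "cat" across the corpus: sum that one column.
cat_total = int(cv_fit[:, col].sum())

print(col, cat_in_doc, cat_total)
# 1 2 3
```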

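Honestly, though, I would suggest using collections.Counter or something from NLTK, unless you have a specific reason to use scikit-learn, since that will be simpler.

【Discussion】:

【Solution 2】:

cv_fit.toarray().sum(axis=0) certainly gives the correct result, but it is much faster to sum on the sparse matrix first and then convert the result to an array: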

np.asarray(cv_fit.sum(axis=0))
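
A sketch of how that one-liner fits into a full example. Summing the sparse matrix typically yields a 2-D result with a single row, so this version flattens it with ravel(); that flattening step is an addition here, not part of the answer above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)  # sparse document-term matrix

# Sum the columns while the data is still sparse, so no dense
# (n_documents x n_words) array is ever created.
counts = np.asarray(cv_fit.sum(axis=0)).ravel()

print(counts.tolist())
# [2, 3, 2, 2]
```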

【Discussion】:

【Solution 3】:

We will use zip to build a dict from the list of words and the list of their counts:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
word_list = cv.get_feature_names_out().tolist()
count_list = cv_fit.toarray().sum(axis=0)

print(word_list)
# ['bird', 'cat', 'dog', 'fish']
print(count_list)
# [2 3 2 2]
# tolist() converts the NumPy integers to plain Python ints for a clean dict.
print(dict(zip(word_list, count_list.tolist())))
# {'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}

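【Discussion】:

cv_fit.toarray().sum(axis=0) blows up RAM because it has to densify the sparse matrix. See @pieterbons's answer for a better approach.

【Solution 4】:

Combining everyone else's views with some of my own :) here is what I have for you: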

from collections import Counter
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Raw string so that \w below is not treated as an escape sequence.
text = r'''Note that if you use RegexpTokenizer option, you lose
natural language features special to word_tokenize 
like splitting apart contractions. You can naively 
split on the regex \w+ without any need for the NLTK.
'''

# tokenize
raw = ' '.join(word_tokenize(text.lower()))

tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
words = tokenizer.tokenize(raw)

# remove stopwords
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

# count word frequency, sort and return just 20
counter = Counter()
counter.update(words)
most_common = counter.most_common(20)
print(most_common)

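# Output (all)

[('note', 1), ('use', 1), ('regexptokenizer', 1), ('option', 1), ('lose', 1), ('natural', 1), ('language', 1), ('features', 1), ('special', 1), ('word', 1), ('tokenize', 1), ('like', 1), ('splitting', 1), ('apart', 1), ('contractions', 1), ('naively', 1), ('split', 1), ('regex', 1), ('without', 1), ('need', 1)]

In terms of efficiency one can do better than this, but if you are not too worried about that, this code works well.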
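
As the sample text in the code above hints, a rougher version is possible without NLTK. The following is a sketch using only the standard library. The example sentence and the tiny stop_words set are hypothetical stand-ins (NLTK's English stopword list is much longer), and the regex mirrors the [A-Za-z]{2,} pattern used above:

```python
import re
from collections import Counter

text = "The dog chased the cat, and the cat chased the bird, but the cat escaped."

# Hypothetical, deliberately tiny stopword set (not NLTK's list).
stop_words = {"the", "and", "but"}

# Keep runs of two or more ASCII letters, lowercased.
words = re.findall(r"[a-z]{2,}", text.lower())
words = [word for word in words if word not in stop_words]

# Counter can be built directly from the token list.
most_common = Counter(words).most_common(2)
print(most_common)
# [('cat', 3), ('chased', 2)]
```

This loses the language-aware behaviour of word_tokenize, such as splitting contractions, so it is only a reasonable trade when the input is simple.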

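【Discussion】:

The older version of the code raised the following error: "NameError: name 'word_tokenize' is not defined". I just added the import on line 4. Nice work, great solution @Pradeep-Singh.

【Solution 5】:

This combines @YASH-GUPTA's answer for readable results with @pieterbons's answer for RAM efficiency, but it needed a tweak and a few extra brackets. Working code: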

import numpy as np    
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]    

cv = CountVectorizer()   
cv_fit = cv.fit_transform(texts)    
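# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
word_list = cv.get_feature_names_out().tolist()

# Added [0] here to get a 1d-array for iteration by the zip function;
# tolist() then converts the NumPy integers to plain Python ints.
count_list = np.asarray(cv_fit.sum(axis=0))[0].tolist()

print(dict(zip(word_list, count_list)))
# Output: {'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}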

【Discussion】: