Why is this CountVectorizer output different from my word counts? Answers

【Question title】: Why is this CountVectorizer output different from my word counts?
【Posted】: 2019-12-18 16:53:50
【Question】:

I have a DataFrame with a column named "Phrase". I found the 20 most common words in this column with the following code:

print(pd.Series(' '.join(film['Phrase']).lower().split()).value_counts()[:20])

This gave me the following output:

s             16981
film           6689
movie          5905
nt             3970
one            3609
like           3071
story          2520
rrb            2438
lrb            2098
good           2043
characters     1882
much           1862
time           1747
comedy         1721
even           1597
little         1575
funny          1522
way            1511
life           1484
make           1396

Later I needed to create count vectors for each word. I tried to do so with the following code:

vectorizer = CountVectorizer()
vectorizer.fit(film['Phrase'])
print(vectorizer.vocabulary_)

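I won't show the whole output, but the numbers differ from the output above. For example, the word "movie" is 9308, the word "good" is 6131, and the word "make" is 8655. Why is this happening? Does the value_counts method only count each column that uses the word rather than every occurrence of the word? Am I misunderstanding what the CountVectorizer object is doing?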

【Comments】:

This question comes down to a simple misunderstanding: CountVectorizer.vocabulary_ is not the word counts, it is the term-to-feature-index mapping, and the documentation page says so.

Tags: python pandas scikit-learn countvectorizer

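【Solution 1】:

vectorizer.vocabulary_ indeed does not return word frequencies. According to the documentation, it is:

A mapping of terms to feature indices.

This means that each word in your data is mapped to an index, and that mapping is what vectorizer.vocabulary_ stores.

Here is an example to illustrate what is happening: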

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

df = pd.DataFrame({"a":["we love music","we love piano"]})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['a'])
print(vectorizer.vocabulary_)

>>> {'we': 3, 'love': 0, 'music': 1, 'piano': 2}

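This vectorization identifies 4 words in the data and assigns each one an index from 0 to 3. Now you may ask: "But why should I care about these indices?" Because once vectorization is done, you need to keep track of the order of the words in the vectorized object. For example,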

X.toarray()
>>> array([[1, 1, 0, 1],
           [1, 0, 1, 1]], dtype=int64)

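Using the vocabulary dictionary, you can tell that the first column corresponds to "love", the second to "music", the third to "piano", and the fourth to "we".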
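To connect an index back to an actual count, one option is to use the index from vocabulary_ to select a column of X and sum it over all documents. This is a minimal sketch that reuses the two-sentence toy data from the example above; the variable names `docs`, `col`, and `love_count` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["we love music", "we love piano"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix, shape (2, 4)

# vocabulary_ gives the column index of a term, not its frequency
col = vectorizer.vocabulary_["love"]

# Summing that column over all documents gives the term's total count
love_count = int(X[:, col].sum())

print(col, love_count)  # prints: 0 2
```

Here 0 is the column index of "love" and 2 is how often it occurs across both sentences, which is the distinction the question ran into.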

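Note that this column order also matches the order of the words returned by vectorizer.get_feature_names() (in scikit-learn 1.0 and later the method is called get_feature_names_out() and returns a NumPy array; the old name was removed in 1.2):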

vectorizer.get_feature_names()
>>> ['love', 'music', 'piano', 'we']

【Comments】:

Thanks! This makes much more sense now!

【Solution 2】:

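As @MaximeKan mentioned, CountVectorizer() does not directly give you the overall frequency of each term, but we can compute it from the sparse matrix returned by fit_transform() and the vectorizer's get_feature_names() method.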

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(film['Phrase'])
{x:y for x,y in zip(vectorizer.get_feature_names(), X.sum(0).getA1())}

Working example:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)

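Avoid calling .toarray() until it is really necessary, because the dense array takes more memory and computation time. We can sum the sparse matrix directly.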

>>> list(zip(vectorizer.get_feature_names(), X.sum(0).getA1()))

[('and', 1),
 ('document', 4),
 ('first', 2),
 ('is', 4),
 ('one', 1),
 ('second', 1),
 ('the', 4),
 ('third', 1),
 ('this', 4)]
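If the goal is a top-N list like the one value_counts() produced in the question, the summed counts can be paired with the terms and sorted. The following is a sketch under a few assumptions: the column order is recovered from vocabulary_ (so it does not depend on which get_feature_names variant the installed scikit-learn provides), and the helper name `top_terms` is made up for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def top_terms(texts, n=20):
    """Return the n most frequent terms as (term, count) pairs."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    # Order the terms by their column index so they line up with X's columns
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    # Column sums of the sparse matrix; the full matrix is never densified
    counts = np.asarray(X.sum(axis=0)).ravel()
    totals = Counter({term: int(c) for term, c in zip(terms, counts)})
    return totals.most_common(n)


print(top_terms(corpus, n=5))
```

These totals can still differ somewhat from a plain str.split() tally, because CountVectorizer's default settings lowercase the text and keep only tokens of two or more word characters, so a single-character token such as "s" is dropped.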

【Comments】: