CountVectorizer with Pandas dataframe: answers

【Question title】: CountVectorizer with Pandas dataframe
【Posted】: 2017-10-20 09:19:51
【Question】:

I am using scikit-learn for text processing, but my CountVectorizer is not giving the output I expect.

My CSV file looks like this:

"Text";"label"
"Here is sentence 1";"label1"
"I am sentence two";"label2"
...

and so on.

I want to start with a Bag-of-Words representation to understand how SVM works in Python:

import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv(open('myfile.csv'),sep=';')

target = data["label"]
del data["label"]

# Creating Bag of Words
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)
X_train_counts.shape 
count_vect.vocabulary_.get(u'algorithm')

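But when I run print(X_train_counts.shape), the output is only (1, 1), even though I have 1048 rows of sentences.

What am I doing wrong? I am following this tutorial.

(Likewise, the output of count_vect.vocabulary_.get(u'algorithm') is None.)

【Comments】:

Why would count_vect.vocabulary_.get(u'algorithm') not be None? That term does not appear in your example.
@aryamccarthy Okay, that makes sense for algorithm. But what is going on with the shape?
Your dataframe has two columns, Text and label, but you clearly want to run CountVectorizer only on the data['Text'] column, not on the data['label'] column.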

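Tags: python python-3.x scikit-learn

【Solution 1】:

The problem is in count_vect.fit_transform(data). The function expects an iterable that yields strings. A DataFrame is iterable, but unfortunately it yields the wrong strings, which a simple example confirms: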

for x in data:
    print(x)
# Text

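Only the column name is printed: iterating over a DataFrame yields its column labels, not the values of data['Text']. CountVectorizer therefore sees a single document, the string "Text", which is why the shape is (1, 1). Pass the column itself instead: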

X_train_counts = count_vect.fit_transform(data.Text)
X_train_counts.shape 
# (2, 5)
count_vect.vocabulary_
# {'am': 0, 'here': 1, 'is': 2, 'sentence': 3, 'two': 4}
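
For anyone who wants to reproduce both results side by side, here is a minimal self-contained sketch. It assumes the two sample rows from the question stand in for myfile.csv (read from an in-memory string rather than from disk), and the names csv_text, texts, wrong and right are made up for illustration. The SVM step at the end is only a guess at what the asker would do next, not something taken from the tutorial.

```python
import io

import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: these two rows stand in for the 1048-row myfile.csv.
csv_text = (
    '"Text";"label"\n'
    '"Here is sentence 1";"label1"\n'
    '"I am sentence two";"label2"\n'
)
data = pd.read_csv(io.StringIO(csv_text), sep=";")

target = data["label"]
texts = data["Text"]  # a Series: iterating it yields the sentences

# Iterating over a DataFrame yields column labels, not rows.
columns_seen = list(data[["Text"]])

# Reproduces the problem: the only "document" is the label "Text".
wrong = CountVectorizer()
wrong_shape = wrong.fit_transform(data[["Text"]]).shape

# The fix: vectorize the text column itself.
right = CountVectorizer()
X_train_counts = right.fit_transform(texts)

print(columns_seen)          # ['Text']
print(wrong_shape)           # (1, 1)
print(X_train_counts.shape)  # (2, 5)

# Hypothetical next step: fit a linear SVM on the counts.
clf = svm.SVC(kernel="linear")
clf.fit(X_train_counts, target)
predictions = clf.predict(right.transform(["Here is another sentence"]))
```

Keeping the labels in a separate variable (target) instead of deleting the column also avoids having to remember which columns are left in data when the vectorizer is called.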
