CountVectorizer does not work on training data in Python (answers)

【Question title】: CountVectorizer does not work on training data in Python
【Posted】: 2019-07-23 23:14:43
【Question description】:

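I am using scikit-learn to classify text, and I have used CountVectorizer. I think CountVectorizer should be applied only to the training data, not to all of the data (features).

When I apply it to all of the data (features), the code works, but when I apply it only to the training split, I get this error:

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Here is my code (deliberately simple, just an example):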

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import tree
from sklearn.metrics import accuracy_score


df = pd.DataFrame({"second":["yes ofc", "not a chance", " hell no", "yes yes yes", "yes",'yes maybe', 'yes ofc', 'no not'],
                  "third":["true","false", "false", "true", "false", "true","false", "false"]})

##CHANGE HERE
results = df['third']
features = df['second']

cv = CountVectorizer()  
#features = cv.fit_transform(features) #it worked

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

#features_train = cv.fit_transform(features_train).toarray() #it does not work
#result_train = cv.fit_transform(result_train).toarray() #it does not work

cls = tree.DecisionTreeClassifier()
model = cls.fit(features_train, result_train)

acc_prediction  = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)

print(accuracy_test)

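【Comments on the question】:

You may want to convert the strings to integers or booleans. It will not solve your problem, but it would be cleaner.
This is only an example. The real values are much longer strings, and there are many distinct values.
You still want them to be categorical.
I guess the results should be df['third'], because results holds the labels here (true, false).
Shouldn't result_train be kept out of CountVectorizer?

Tags: python scikit-learn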
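
As a small illustration of the comments above, the labels can be turned into integer class codes without ever touching CountVectorizer. This is only a sketch: `LabelEncoder` is one possible choice rather than something the asker used, and the variable names `encoder` and `encoded` are made up for the example. scikit-learn classifiers also accept string labels directly, so this step is optional.

```python
from sklearn.preprocessing import LabelEncoder

# Same toy labels as the "third" column in the question.
results = ["true", "false", "false", "true", "false", "true", "false", "false"]

# LabelEncoder maps each distinct label to an integer code (classes are sorted).
encoder = LabelEncoder()
encoded = encoder.fit_transform(results)

print(list(encoder.classes_))
print(list(encoded))

# The original strings can be recovered from the codes when needed.
print(list(encoder.inverse_transform(encoded[:2])))
```

The labels stay a dense one-dimensional array this way, which is the form the classifier's `fit` method expects for its second argument.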

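【Solution 1】:

You should fit the count vectorizer on the training data only, but apply (transform) it to both the training and the test data.

After creating the CountVectorizer: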

cv = CountVectorizer()

and splitting the data into training and test sets:

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

call fit_transform on features_train before going any further, because you want to train the actual classifier on the data transformed by the count vectorizer:

features_train = cv.fit_transform(features_train)

After this step, cv has been fitted on the training data and has also transformed it. Now train the actual classifier on the transformed data:

cls = tree.DecisionTreeClassifier()
model = cls.fit(features_train, result_train)

Your classifier is now trained on the count-vectorized training data. To measure accuracy on the test data, first transform the test data with the same count vectorizer:

features_test = cv.transform(features_test)

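Note that the vectorizer is not fitted again here; the already-fitted count vectorizer is only used to transform the test data. Now predict with the trained decision tree classifier: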

acc_prediction = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)
print(accuracy_test)

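【Comments】:

So I should run CountVectorizer on features_train and features_test, not on features_train and result_train, right? If I do it the way you describe, will that prevent overfitting?
Running CountVectorizer on features_train and features_test instead of result_train has nothing to do with overfitting. It is simply that you do not run the count vectorizer on the labels (result_train); you run it only on the data the model is trained on.
@AhmadKhan Since you split the data into training and test sets, both will have the same dimensions. But how do you predict on test data that is kept separate from the training data? Running cv.transform would then fail with a dimension mismatch.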
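
On the dimension question raised in the last comment, a short sketch may help. It assumes the default CountVectorizer settings and uses made-up `train_texts` and `test_texts` lists. Because `transform` reuses the vocabulary learned by `fit_transform`, the test matrix should end up with the same number of columns as the training matrix, and words that were never seen during fitting (here "maybe") are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy split, loosely based on the strings in the question.
train_texts = ["yes ofc", "not a chance", "hell no", "yes yes yes"]
test_texts = ["yes maybe", "no not"]  # "maybe" never appears in train_texts

cv = CountVectorizer()
train_matrix = cv.fit_transform(train_texts)  # learns the vocabulary, then counts
test_matrix = cv.transform(test_texts)        # counts against the learned vocabulary

print(sorted(cv.vocabulary_))                 # words learned from the training texts
print(train_matrix.shape, test_matrix.shape)  # column counts should match
print("maybe" in cv.vocabulary_)              # unseen test word is not added
```

The row counts differ because the two splits hold different numbers of documents, but the classifier only needs the column layout to agree.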

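【Solution 2】:

To apply the fitted vectorizer to the test data, use .transform(). The code below is my suggestion.

Also, .toarray() is an expensive operation that converts a sparse matrix into a dense one, so avoid it unless it is strictly necessary. Decision trees can work with the sparse matrix directly.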

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import tree
from sklearn.metrics import accuracy_score


df = pd.DataFrame({"second":["yes ofc", "not a chance", " hell no", "yes yes yes", "yes",'yes maybe', 'yes ofc', 'no not'],
                  "third":["true","false", "false", "true", "false", "true","false", "false"]})

##CHANGE HERE
results = df['third']
features = df['second']

cv = CountVectorizer()  

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

features_train = cv.fit_transform(features_train) 
features_test = cv.transform(features_test) 

cls = tree.DecisionTreeClassifier()
model = cls.fit(features_train, result_train)

acc_prediction  = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)

print(accuracy_test)
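
The remark about sparse input can be checked with a few lines. This is a minimal sketch with made-up data, not part of the answer's code; `scipy.sparse.issparse` is used only to show what kind of object the vectorizer returns, and `random_state=0` is an arbitrary choice for reproducibility.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, loosely based on the strings in the question.
texts = ["yes ofc", "not a chance", "hell no", "yes yes yes", "yes", "yes maybe"]
labels = ["true", "false", "false", "true", "false", "true"]

cv = CountVectorizer()
X = cv.fit_transform(texts)

print(sparse.issparse(X))  # the vectorizer returns a SciPy sparse matrix

# The tree is fitted on the sparse matrix as-is; no .toarray() call is needed.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, labels)

predictions = clf.predict(cv.transform(["yes ofc"]))
print(predictions)
```

Keeping the matrix sparse matters more with a realistic vocabulary, where the dense array would consist almost entirely of zeros.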

【Comments】:

【Solution 3】:

Try this:

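from sklearn.model_selection import train_test_split  # cross_validation was removed in scikit-learn 0.20

features = cv.fit_transform(features)  # note: this fits the vocabulary on all rows, test rows included
# Split features and labels in one call so the rows stay aligned.
X_train, X_test, Y_train, Y_test = train_test_split(features, results, test_size=0.3, random_state=0)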

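【Comments】:

But what do I gain from this? I thought I should transform X_train and Y_train.

【Solution 4】:

The following code works. I guess you made a mistake when assigning the results and the features.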

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from sklearn import tree
from sklearn.metrics import accuracy_score


df = pd.DataFrame({"second":["yes ofc", "not a chance", " hell no", "yes yes yes", "yes",'yes maybe', 'yes ofc', 'no not'],
                  "third":["true","false", "false", "true", "false", "true","false", "false"]})

##CHANGE HERE
results = df['third']
features = df['second']

cv = CountVectorizer()  
features = cv.fit_transform(features) #it worked

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

#features_train = cv.fit_transform(features_train) #it does not work
#result_train = cv.fit_transform(result_train) #it does not work

cls = tree.DecisionTreeClassifier()
model = cls.fit(features_train, result_train)

acc_prediction  = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)

print(accuracy_test)

If you want to run CountVectorizer separately on the training and test sets, this is how to do it:

{SAME AS ABOVE TILL HERE}

results = df['third']
features = df['second']

cv = CountVectorizer()  

features_train, features_test, result_train, result_test = train_test_split(features, results, test_size = 0.3, random_state = 42)

features_train = cv.fit_transform(features_train)  # fit the vocabulary on the training split only

cls = tree.DecisionTreeClassifier()
model = cls.fit(features_train, result_train)

acc_prediction  = model.predict(cv.transform(features_test))
accuracy_test = accuracy_score(result_test, acc_prediction)

print(accuracy_test)

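【Comments】:

Yes, I did that part wrong. But this does not answer my question. Should I run CountVectorizer only on the training data, or on all of the data? And how do I run CountVectorizer on the training data only?
You need to run CountVectorizer on all of the data. The training data and the test data must be transformed in the same way. You cannot transform only the training data and then use the test data in its raw format.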
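
One way to get "fit on the training data, transform everything the same way" without tracking it by hand is a scikit-learn pipeline. None of the answers above use this, so treat it as an alternative sketch rather than the thread's method; `make_pipeline` is a real scikit-learn helper, while `random_state=0` on the tree is an arbitrary choice for reproducibility.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question.
df = pd.DataFrame({
    "second": ["yes ofc", "not a chance", " hell no", "yes yes yes",
               "yes", "yes maybe", "yes ofc", "no not"],
    "third": ["true", "false", "false", "true", "false", "true", "false", "false"],
})

features_train, features_test, result_train, result_test = train_test_split(
    df["second"], df["third"], test_size=0.3, random_state=42
)

# The pipeline chains the vectorizer and the classifier into one estimator.
model = make_pipeline(CountVectorizer(), DecisionTreeClassifier(random_state=0))

# fit() calls fit_transform on the vectorizer using the training text only.
model.fit(features_train, result_train)

# predict() calls transform (no refitting) on the raw test text before classifying.
predictions = model.predict(features_test)
accuracy_test = accuracy_score(result_test, predictions)
print(accuracy_test)
```

The raw text goes in at both ends, so the labels never pass through the vectorizer and the test split cannot influence the learned vocabulary.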