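【Posted】:2019-07-23 23:14:43
【Question】:
I am using scikit-learn to classify text, with CountVectorizer to vectorize it. As I understand it, CountVectorizer should be fitted on the training data only, not on all of the data (features).
When I apply it to all of the data (features), the code works, but when I apply it to the training data only, I get this error:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
Here is my code (it is very simple and only meant as an example):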
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree
from sklearn.metrics import accuracy_score
df = pd.DataFrame({
    "second": ["yes ofc", "not a chance", " hell no", "yes yes yes", "yes", "yes maybe", "yes ofc", "no not"],
    "third": ["true", "false", "false", "true", "false", "true", "false", "false"],
})
## CHANGE HERE
results = df['third']
features = df['second']
cv = CountVectorizer()
# features = cv.fit_transform(features)  # vectorizing all data before the split: it worked
features_train, features_test, result_train, result_test = train_test_split(features, results, test_size=0.3, random_state=42)
# features_train = cv.fit_transform(features_train).toarray()  # vectorizing only the training features: it does not work
# result_train = cv.fit_transform(result_train).toarray()  # vectorizing the training labels: it does not work
cls = tree.DecisionTreeClassifier()
model = cls.fit(features_train, result_train)
acc_prediction = model.predict(features_test)
accuracy_test = accuracy_score(result_test, acc_prediction)
print(accuracy_test)
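【Discussion】:
-
You may want to convert the label strings to integers or booleans. It won't solve your problem, but it would be cleaner.
-
This is just an example. The real values are much longer strings, and there are many distinct values.
-
You still want to keep them categorical.
-
I guess results should be df['third'], since results holds the labels here (true, false).
-
result_train shouldn't be fed to CountVectorizer, should it?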
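The last two comments suggest a likely fix: fit CountVectorizer on the training text only, reuse the fitted vocabulary to transform the test text, and leave the labels out of the vectorizer entirely. The following is a minimal sketch of that idea using the toy data from the question. The names X_train, X_test and predictions are introduced here for illustration, and random_state=0 on the classifier is an assumption added only to make the run repeatable.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data copied from the question.
df = pd.DataFrame({
    "second": ["yes ofc", "not a chance", " hell no", "yes yes yes",
               "yes", "yes maybe", "yes ofc", "no not"],
    "third": ["true", "false", "false", "true",
              "false", "true", "false", "false"],
})

# Split the raw text and labels first, before any vectorizing.
features_train, features_test, result_train, result_test = train_test_split(
    df["second"], df["third"], test_size=0.3, random_state=42
)

cv = CountVectorizer()
# Learn the vocabulary from the training text only.
X_train = cv.fit_transform(features_train)
# Reuse that vocabulary for the test text: transform, not fit_transform.
X_test = cv.transform(features_test)

# The labels stay as plain strings; they are never passed to CountVectorizer.
cls = DecisionTreeClassifier(random_state=0)  # random_state is an assumption for repeatability
cls.fit(X_train, result_train)

predictions = cls.predict(X_test)
accuracy = accuracy_score(result_test, predictions)
print(accuracy)
```

Calling transform (rather than fit_transform) on the test text keeps the feature columns identical between training and test; words that appear only in the test set are ignored. If an estimator does require dense input, X_train.toarray() and X_test.toarray() can be passed instead of the sparse matrices.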
Tags: python scikit-learn