【Posted】: 2020-11-29 21:23:05
【Question】:
I have a retail dataset with the columns product_description, price, supplier, and category.
I use product_description as the feature to predict category:
from sklearn import metrics, model_selection, naive_bayes, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['product_description'], df['category'])
# label encode the target variable: fit once on all categories, then reuse the same mapping for both splits
encoder = preprocessing.LabelEncoder()
encoder.fit(df['category'])
train_y = encoder.transform(train_y)
valid_y = encoder.transform(valid_y)
# build TF-IDF features; fit on the training split only so no validation text leaks into the vocabulary
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(train_x)
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
# train a multinomial Naive Bayes classifier on the TF-IDF features
classifier = naive_bayes.MultinomialNB().fit(xtrain_tfidf, train_y)
# predict the labels on validation dataset
predictions = classifier.predict(xvalid_tfidf)
metrics.accuracy_score(valid_y, predictions) # ~20%, very low
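Since the accuracy is very low, I would also like to add supplier and price as features. How can I incorporate them into the code?
I have tried other classifiers such as LR, SVM, and Random Forest, but their results are (almost) the same.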
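One possible way to combine the text column with supplier and price is scikit-learn's ColumnTransformer inside a Pipeline. The following is only a minimal sketch under stated assumptions: the toy DataFrame is hypothetical and stands in for the real data, supplier is treated as a categorical column, price as a numeric one, and LogisticRegression is swapped in for MultinomialNB because a standardized price can be negative, which MultinomialNB does not accept.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real retail DataFrame.
df = pd.DataFrame({
    "product_description": [
        "red cotton t-shirt", "blue denim jeans", "wool winter sweater",
        "black leather belt", "stainless steel pan", "ceramic coffee mug",
        "wooden cutting board", "glass water bottle",
    ],
    "price": [9.99, 39.50, 45.00, 19.99, 29.99, 7.50, 15.00, 12.25],
    "supplier": ["acme", "acme", "northwear", "northwear",
                 "homeco", "homeco", "kitchenly", "kitchenly"],
    "category": ["clothing", "clothing", "clothing", "clothing",
                 "kitchen", "kitchen", "kitchen", "kitchen"],
})

feature_columns = ["product_description", "price", "supplier"]
train_x, valid_x, train_y, valid_y = train_test_split(
    df[feature_columns], df["category"],
    test_size=0.25, random_state=0, stratify=df["category"],
)

# Each column gets its own transformer. The text column is passed as a plain
# string (TfidfVectorizer expects 1-D input); the other columns are passed as
# lists (these transformers expect 2-D input).
preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}",
                             max_features=5000), "product_description"),
    ("supplier", OneHotEncoder(handle_unknown="ignore"), ["supplier"]),
    ("price", StandardScaler(), ["price"]),
])

# The pipeline fits all transformers on the training split only and then
# applies the same fitted transformers to the validation split.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_x, train_y)
predictions = model.predict(valid_x)
```

Because the classifier accepts string labels directly, no LabelEncoder is needed in this sketch. To keep MultinomialNB instead, the price would have to be mapped to non-negative values, for example with MinMaxScaler in place of StandardScaler.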
【Discussion】:
Tags: python-3.x scikit-learn text-classification supervised-learning