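[Posted]: 2018-07-14 22:22:57
[Question]:
I want my classification algorithm to assign natural-language raw text to one of a set of categories only if the prediction for that category meets a certain confidence threshold (say, 80%). Otherwise, I want the classifier to label that raw text as "unclassified". How can I do this?
My sample dataset: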
+----------------------+------------+
| Details | Category |
+----------------------+------------+
| Any raw text1 | cat1 |
+----------------------+------------+
| any raw text2 | cat1 |
+----------------------+------------+
| any raw text5 | cat2 |
+----------------------+------------+
| any raw text7 | cat1 |
+----------------------+------------+
| any raw text8 | cat2 |
+----------------------+------------+
| Any raw text4 | cat4 |
+----------------------+------------+
| any raw text5 | cat4 |
+----------------------+------------+
| any raw text6 | cat3 |
+----------------------+------------+
This will be my training data; I will split the same data into a training set and a test set.
import time

import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the tab-separated file, keeping only the text and label columns
data = pd.read_csv('mydata.xls.gold', delimiter='\t',
                   usecols=['Details', 'Category'], encoding='utf-8')
target_one = data['Category']
target_list = data['Category'].unique()

# Split on 'Category'; the original used data.NUM_CATEGORY, a column that is never loaded
x_train, x_test, y_train, y_test = train_test_split(
    data['Details'], data['Category'], random_state=42)

vect = CountVectorizer(ngram_range=(1, 2))
# Convert the training text into a numeric feature matrix
X_train = vect.fit_transform(x_train.values.astype('U'))
# Convert the test text using the vocabulary learned from the training set
X_test = vect.transform(x_test.values.astype('U'))

start = time.perf_counter()  # time.clock() was removed in Python 3.8
mnb = MultinomialNB(alpha=0.13)
mnb.fit(X_train, y_train)
result = mnb.predict(X_test)
print(time.perf_counter() - start)

# mnb.predict_proba(X_test)[0:10, 1]
print(accuracy_score(y_test, result))
How should I proceed? Does the classifier need any parameters to be set? Thanks in advance.
[Comments]:
-
What is the output of this code? Are there any errors?
-
@AkshayNevrekar It's just rough code that only prints the accuracy. I want to know how to proceed if I have to classify raw text based on some threshold.
-
Check the predict_proba() function. You can apply a threshold by writing your own function that uses the output of predict_proba(): scikit-learn.org/stable/modules/generated/…
-
Thanks, @AkshayNevrekar
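The predict_proba() suggestion above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the helper name predict_with_threshold, the "unclassified" fallback label, and the toy training texts are all made up for the example. It also assumes that "80% accuracy" in the question can be read as "the highest predicted class probability is at least 0.8", which is not the same thing as measured accuracy, and Naive Bayes probabilities are often poorly calibrated, so the threshold may need tuning on held-out data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def predict_with_threshold(model, X, threshold=0.8, fallback="unclassified"):
    """Hypothetical helper: return the most probable class per row, or
    `fallback` when the highest predicted probability is below `threshold`."""
    proba = model.predict_proba(X)        # shape: (n_samples, n_classes)
    best = proba.argmax(axis=1)           # column index of the top class
    confidence = proba.max(axis=1)        # probability of that top class
    # Cast to object so the longer fallback string is not truncated
    labels = model.classes_[best].astype(object)
    labels[confidence < threshold] = fallback
    return labels, confidence


# Toy data, invented purely for illustration
train_texts = [
    "invoice payment overdue", "payment received invoice",
    "server crashed overnight", "server down again",
]
train_labels = ["cat1", "cat1", "cat2", "cat2"]

vect = CountVectorizer(ngram_range=(1, 2))
X_train = vect.fit_transform(train_texts)
mnb = MultinomialNB(alpha=0.13).fit(X_train, train_labels)

new_texts = ["invoice payment", "completely unrelated words"]
labels, confidence = predict_with_threshold(mnb, vect.transform(new_texts))
for text, label, conf in zip(new_texts, labels, confidence):
    print(f"{text!r}: {label} (max probability {conf:.2f})")
```

In this toy run the second text shares no vocabulary with the training data, so the model falls back to the class priors (0.5 each) and the text ends up as "unclassified". No extra classifier parameter is needed for this; the threshold is applied after prediction.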
Tags: python-3.x machine-learning scikit-learn text-classification