Output the top 2 classes from a multiclass classification algorithm

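【Question title】: Output top 2 classes from a multiclass classification algorithm
【Posted】: 2020-09-10 22:51:14
【Question description】:

I am working on a multiclass text classification problem with many different classes (more than 15). I have trained a LinearSVC SVM model (the method is just an example), but it only outputs the single class with the highest probability. Is there an algorithm that can output the top two classes at the same time?

Sample code I am using: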

from sklearn.svm import LinearSVC
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Count 1- to 3-grams, dropping very common and very rare terms
count_vect = CountVectorizer(max_df=.9, min_df=.002,
                             encoding='latin-1',
                             ngram_range=(1, 3))
X_train_counts = count_vect.fit_transform(df_upsampled['text'])
# Re-weight the raw counts with TF-IDF
tfidf_transformer = TfidfTransformer(sublinear_tf=True, norm='l2')
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = LinearSVC().fit(X_train_tfidf, df_upsampled['reason'])
# X_test is assumed to have been transformed with the same count_vect and tfidf_transformer
y_pred = clf.predict(X_test)

Current output:

    source  user   time    text         reason
0   hi      neha    0      0:neha:hi       1
1   there   ram     1      1:ram:there     1
2   ball    neha    2      2:neha:ball     3
3   item    neha    3      3:neha:item     6
4   go there ram    4      4:ram:go there  7
5   kk       ram    5      5:ram:kk        1
6   hshs    neha    6      6:neha:hshs     2
7   ggsgs   neha    7      7:neha:ggsgs    15

Desired output:

    source  user   time    text         reason  reason2
0   hi      neha    0      0:neha:hi       1      2
1   there   ram     1      1:ram:there     1      6
2   ball    neha    2      2:neha:ball     3      7
3   item    neha    3      3:neha:item     6      4
4   go there ram    4      4:ram:go there  7      9
5   kk       ram    5      5:ram:kk        1      2
6   hshs    neha    6      6:neha:hshs     2      3
7   ggsgs   neha    7      7:neha:ggsgs    15     1

It is also fine if I get the output in a single column, because I can split it into two columns myself.

【Question comments】:

Tags: python-3.x scikit-learn text-classification multiclass-classification

【Solution 1】:

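LinearSVC does not provide predict_proba, but it does provide decision_function, which returns the signed distance of each sample to the hyperplane (one score per class in the multiclass case).

From the documentation:

decision_function(self, X):

Predict confidence scores for samples.

The confidence score for a sample is the signed distance of that sample to the hyperplane.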
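If actual probabilities are needed rather than signed distances, one option that is not part of this answer is to wrap the classifier in scikit-learn's CalibratedClassifierCV, which adds predict_proba on top of decision_function. The sketch below assumes a small synthetic dataset and arbitrary settings (cv=3, max_iter=10000); treat it as an illustration rather than a recommendation for the asker's data.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 3-class data, used only for illustration.
X, y = make_classification(n_samples=600, n_classes=3, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

# Calibrate LinearSVC's decision scores into probabilities via cross-validation.
calibrated = CalibratedClassifierCV(LinearSVC(random_state=0, max_iter=10000), cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])            # shape (5, 3), rows sum to 1
# Column indices of the two largest probabilities, best first.
top2_idx = np.argsort(proba, axis=1)[:, ::-1][:, :2]
top2_labels = calibrated.classes_[top2_idx]        # map indices to class labels
print(top2_labels)
```

Calibration fits extra models, so it costs more training time than ranking the raw decision_function scores.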

Based on @warped's comments,

we can use the decision_function output to find the top n predicted classes from the model.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=1000,
                           n_clusters_per_class=1,
                           n_informative=10,
                           n_classes=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)
clf = make_pipeline(StandardScaler(),
                    LinearSVC(random_state=0, tol=1e-5))
# Fit on the training split only, so the test rows stay unseen
clf.fit(X_train, y_train)
top_n_classes = 2
# argsort is ascending: keep the last top_n_classes columns, then reverse so the best comes first.
# The values are column indices of the per-class scores; with labels 0..4 they coincide with the class labels.
predictions = clf.decision_function(
                    X_test).argsort()[:,-top_n_classes:][:,::-1]
pred_df = pd.DataFrame(predictions,
                       columns= [f'{i+1}_pred' for i in range(top_n_classes)])

df = pd.DataFrame({'true_class': y_test})
df = df.assign(**pred_df)

df
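One caveat with ranking decision_function scores: argsort returns column indices, not class labels. The two only coincide when the labels happen to be 0, 1, 2, and so on. With labels such as 1, 3, 6, 7, 15, the indices have to be mapped through clf.classes_. This may be why a first column would fail to match clf.predict(), though the thread does not confirm that. A minimal sketch, with made-up label codes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data; reason_codes is a hypothetical set of non-contiguous labels.
X, y = make_classification(n_samples=1000, n_clusters_per_class=1,
                           n_informative=10, n_classes=5, random_state=42)
reason_codes = np.array([1, 3, 6, 7, 15])
y = reason_codes[y]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LinearSVC(random_state=0, max_iter=10000).fit(X_train, y_train)

top_n_classes = 2
scores = clf.decision_function(X_test)        # shape (n_samples, n_classes)
# Column indices of the highest scores, best first.
top_idx = np.argsort(scores, axis=1)[:, ::-1][:, :top_n_classes]
# Translate column indices into the actual class labels.
top_labels = clf.classes_[top_idx]

# Barring exact ties in the scores, the first column agrees with clf.predict.
print((top_labels[:, 0] == clf.predict(X_test)).all())
```

The columns of top_labels can then be assigned to DataFrame columns such as reason and reason2.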

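【Comments】:

Thanks for the answer. I am looking for the last class that gets the maximum probability. I could work around this by sorting and taking the position index, which would give the two reasons, but if you have a ready-made solution, please help me.
Just set top_n_classes=2 and you will get the top two reasons.
No, Venkat, it gives random results. I have checked: the first column itself does not match clf.predict(). Could you please try it once on your end?
In the example I posted, it works fine. Perhaps your data does not have enough information for the model to predict the correct values. What is the test accuracy of your model?
I have cross-validated the linear SVC and it reaches 91% accuracy, and after a manual check the trained model is doing well.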

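【Solution 2】:

LinearSVC has a method called decision_function, which gives a confidence score for each class:

The confidence score for a sample is the signed distance of that sample to the hyperplane.

Example with a 3-class dataset: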

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
import numpy as np

# dummy dataset
X, y = make_classification(n_classes=3, n_clusters_per_class=1)

# train the classifier and get the per-class decision scores
clf = LinearSVC().fit(X, y)
decision = clf.decision_function(X)
decision = np.round(decision, 2)

prediction = clf.predict(X)

# look at the decision scores next to the predicted class

for a, b in zip(decision, prediction):
    print(a, b)

[...]
[ 3.04 -0.61 -7.1 ] 0
[-4.99  1.85 -1.62] 1
[ 3.01 -0.98 -5.93] 0
[-2.61 -1.12  2.64] 2
[-3.43 -0.65  1.32] 2
[-1.78 -1.67  4.15] 2
[...]

You can see that the classifier takes the class with the maximum score as its prediction.
To get the best two, take the two highest scores.
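Taking the two highest scores can be sketched with np.argsort. The array below copies the six rows printed above, and classes stands in for clf.classes_, which is [0, 1, 2] for this dummy dataset.

```python
import numpy as np

# Decision scores copied from the printed output above (one row per sample).
decision = np.array([
    [ 3.04, -0.61, -7.10],
    [-4.99,  1.85, -1.62],
    [ 3.01, -0.98, -5.93],
    [-2.61, -1.12,  2.64],
    [-3.43, -0.65,  1.32],
    [-1.78, -1.67,  4.15],
])
classes = np.array([0, 1, 2])  # stands in for clf.classes_

# argsort is ascending, so reverse each row and keep the first two columns.
top2_idx = np.argsort(decision, axis=1)[:, ::-1][:, :2]
top2 = classes[top2_idx]

for row, best in zip(decision, top2):
    print(row, best)
```

The first column of top2 reproduces the predictions shown above (0, 1, 0, 2, 2, 2), and the second column holds the runner-up class for each row.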

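Edit:

Note what the signed distance means:

Sign of the decision function:

+: yes (the data point belongs to the class)

-: no (the data point does not belong to the class)

Absolute value of the decision function:

Indicates the confidence in that decision.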

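Example using the first row of the output above:

[ 3.04 -0.61 -7.1 ] 0

Decision for class 1 (label 0): 3.04 => the classifier thinks the data point belongs to class 1, with a certainty score of 3.04.

Decision for class 2 (label 1): -0.61 => the classifier thinks the data point does not belong to class 2, with a certainty score of 0.61.

Decision for class 3 (label 2): -7.1 => the classifier thinks the data point does not belong to class 3, with a certainty score of 7.1.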
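One consequence of this reading: under LinearSVC's default one-vs-rest scheme, a runner-up picked purely by ranking can still have a negative score, meaning its own binary classifier voted "no". A small sketch using the six rows printed earlier; flagging such cases is a suggestion here, not something the answer prescribes.

```python
import numpy as np

# Decision scores copied from the printed output earlier in this answer.
decision = np.array([
    [ 3.04, -0.61, -7.10],
    [-4.99,  1.85, -1.62],
    [ 3.01, -0.98, -5.93],
    [-2.61, -1.12,  2.64],
    [-3.43, -0.65,  1.32],
    [-1.78, -1.67,  4.15],
])

# Rank the columns of each row from best to worst.
order = np.argsort(decision, axis=1)[:, ::-1]
runner_up_idx = order[:, 1]
runner_up_score = decision[np.arange(len(decision)), runner_up_idx]

# True where the runner-up's own classifier also says "yes" (positive score).
runner_up_accepted = runner_up_score > 0

for idx, score, ok in zip(runner_up_idx, runner_up_score, runner_up_accepted):
    print(idx, score, ok)
```

In all six rows the second-best score is negative, so the second class is only the least rejected alternative, not a class the model actually accepted.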

【Comments】: