SkLearn SVM - How to get multiple predictions ordered by probability? (Answers)

【Question title】: SkLearn SVM - How to get multiple predictions ordered by probability?
【Posted】: 2019-10-19 00:39:21
【Question】:

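I am doing some text classification. Say I have 10 categories and 100 "samples", where each sample is a sentence of text. I have split the samples 80:20 (train, test) and trained an SVM classifier: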

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english', ngram_range=(1,2))), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', random_state=42, learning_rate='adaptive', eta0=0.9))])

# Fit the SVM classifier on the training data
text_clf_svm = text_clf_svm.fit(training_data, training_sub_categories)

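Now, when it comes to prediction, I do not want just a single predicted category. For example, I would like to see a list of the "top 5" categories for a given unseen sample, together with their associated probabilities: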

top_5_category_predictions = text_clf_svm.predict(a_single_unseen_sample)

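Since text_clf_svm.predict returns a single value representing the index of one of the available categories, I would like to see output like this instead: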

[(4,0.70),(1,0.20),(7,0.04),(9,0.06)]

Does anyone know how to achieve this?

【Comments on the question】:

predict_proba will do part of the job (i.e. not the sorting part), but it is only available with the log and modified_huber losses, not with hinge (i.e. SVM); see the docs
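
To illustrate the comment above, here is a minimal sketch of one way to rank classes while keeping loss='hinge': sort the per-class scores from decision_function instead of probabilities. This is a swapped-in technique, not what the asker's code does, and the scores are signed distances to each class's hyperplane, not probabilities. The tiny corpus, the helper name top_n_by_score, and the simplified TfidfVectorizer pipeline are all made up for illustration. If real probabilities are needed, switching the loss to modified_huber (or the logistic loss, named 'log' in older scikit-learn releases and 'log_loss' in newer ones) enables predict_proba.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus with three classes (illustrative data only).
texts = [
    "the cat sat on the mat", "dogs and cats are pets", "my cat chases mice",
    "stocks fell sharply today", "the market rallied on earnings", "bond yields rose",
    "the team won the match", "a late goal decided the game", "the coach praised the players",
]
labels = ["pets", "pets", "pets", "finance", "finance", "finance",
          "sport", "sport", "sport"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SGDClassifier(loss="hinge", random_state=42)),
])
clf.fit(texts, labels)


def top_n_by_score(model, samples, n):
    """Hypothetical helper: rank classes by decision_function score.

    Assumes more than two classes, so decision_function returns an
    array of shape (n_samples, n_classes). The scores are NOT probabilities.
    """
    scores = model.decision_function(samples)
    # Column indices of the n largest scores per row, best first.
    order = np.argsort(scores, axis=1)[:, ::-1][:, :n]
    return [
        [(model.classes_[j], float(scores[i, j])) for j in row]
        for i, row in enumerate(order)
    ]


ranked = top_n_by_score(clf, ["the cat and the dog"], n=2)
print(ranked)
```

The top-ranked class always matches what predict returns, because predict picks the class with the largest decision score.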

Tags: python-3.x machine-learning scikit-learn svm

【Solution 1】:

Here is something I used a while ago to solve a similar problem:

import numpy as np

probs = clf.predict_proba(X_test)
# Sort descending and keep only the top-n column indices per sample
top_n_category_predictions = np.argsort(probs)[:,:-n-1:-1]

This gives you the top n categories for each sample, as column indices into clf.classes_.

If you also want to see the probabilities that go with those categories, you can do:

top_n_probs = np.sort(probs)[:,:-n-1:-1]

Note: here X_test has shape (n_samples, n_features), so make sure you pass your single_unseen_sample in the same format.
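
The two snippets above can be combined to produce the list of (category, probability) pairs the question asks for. The following is only a sketch under stated assumptions: the helper name top_n_with_probs is hypothetical, and the probs and classes arrays are hand-written stand-ins for clf.predict_proba(X_test) and clf.classes_.

```python
import numpy as np


def top_n_with_probs(probs, classes, n):
    """Hypothetical helper: pair the top-n class labels with their probabilities.

    probs   -- array of shape (n_samples, n_classes), e.g. from predict_proba
    classes -- class labels in column order, e.g. clf.classes_
    """
    probs = np.atleast_2d(probs)  # also accept a single 1-D row
    # Column indices of the n largest probabilities per row, best first
    # (same slicing trick as in the answer above).
    top_idx = np.argsort(probs, axis=1)[:, :-n-1:-1]
    # Pull out the probabilities at exactly those indices.
    top_probs = np.take_along_axis(probs, top_idx, axis=1)
    return [
        [(classes[j], float(p)) for j, p in zip(idx_row, prob_row)]
        for idx_row, prob_row in zip(top_idx, top_probs)
    ]


# Hand-written stand-ins for clf.predict_proba(...) and clf.classes_.
probs = np.array([[0.06, 0.20, 0.70, 0.04]])
classes = np.array([1, 4, 7, 9])

result = top_n_with_probs(probs, classes, n=2)
print(result)
```

Looking up the probabilities by index, rather than sorting them separately with np.sort, keeps each label and its probability explicitly paired.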

【Comments】: