How to handle multiclass with Stratified Cross Validation in sklearn

[Question title]: How to handle multiclass with Stratified Cross Validation in sklearn
[Posted]: 2018-10-06 16:18:14
[Question]:

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
import time

params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
    }



xgb = XGBClassifier(learning_rate=0.02, n_estimators=600,
                silent=True, nthread=1)

folds = 5
param_comb = 5

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring=['f1_macro','precision_macro'], n_jobs=4, cv=skf.split(X_train,y_train), verbose=3, random_state=1001)

start_time = time.clock() # timing starts from this point for "start_time" variable
random_search.fit(X_train, y_train)
elapsed = (time.clock() - start) # timing ends here for "start_time" variable

My code is above. My y_train is a multiclass pandas Series whose values are integers from 0 to 9.

y_train.head()
1041    8
1177    7
2966    0
1690    2
2115    1
Name: Industry, dtype: object

After running the setup code above, I get the following error message:

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'unknown' instead.

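I searched for similar questions, tried cross_validate from sklearn.model_selection, and tried other metrics that are compatible with multiclass targets, but I still get the same error message.

Is there a way to grid search parameters with stratified cross-validation, based on performance metrics?

Update: after fixing the dtype issue, I want to pass multiple metrics to scoring=. I tried it this way because I read this documentation (http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter):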

random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb, scoring=['f1_macro','precision_macro'], n_jobs=4, cv=skf.split(X_train,y_train), verbose=3, random_state=1001)

It then failed with the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-67-dd57cd97c89c> in <module>()
 36 # Here we go
 37 start_time = time.clock() # timing starts from this point for "start_time" variable
---> 38 random_search.fit(X_train, y_train)
 39 elapsed = (time.clock() - start) # timing ends here for "start_time" variable

/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
609                                  "available for that metric. If this is not "
610                                  "needed, refit should be set to False "
--> 611                                  "explicitly. %r was passed." % self.refit)
612             else:
613                 refit_metric = self.refit

ValueError: For multi-metric scoring, the parameter refit must be set to a scorer key to refit an estimator with the best parameter setting on the whole data and make the best_* attributes available for that metric. If this is not needed, refit should be set to False explicitly. True was passed.

How can I fix this?

[Comments]:

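Please show the dtype of y_train and a few samples from it.
@VivekKumar I have updated the question.
Why is the dtype object here? Does it contain strings? Can you try converting it to int with y_train = y_train.astype(int)?
@VivekKumar I fixed it; y_train is now an integer Series. However, I want to pass multiple metrics to scoring, and that fails with some errors.
@VivekKumar It looks like I need to predefine a scorer that includes all the metrics I want and then pass it to scoring. I tried that as well, but I still get errors.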
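
The dtype issue discussed in the comments above can be reproduced in isolation. The following is a minimal sketch, assuming the object-dtype Series holds Python integers (as the sample output in the question suggests). It uses sklearn.utils.multiclass.type_of_target, which is the helper that reports the target type named in the error message. The Series values are copied from the y_train.head() output; everything else is illustrative.

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Integer labels stored with dtype=object, mirroring y_train.head() from the question
y_obj = pd.Series([8, 7, 0, 2, 1], index=[1041, 1177, 2966, 1690, 2115],
                  dtype=object, name="Industry")

# An object array of non-string values is not recognised as a valid target
target_before = type_of_target(y_obj)

# Casting to int, as suggested in the comments, makes the target type detectable
y_int = y_obj.astype(int)
target_after = type_of_target(y_int)

print(target_before, "->", target_after)
```

If the Series actually contained mixed types or missing values, astype(int) would raise, so it may be worth inspecting y_train.unique() first.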

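Tags: python python-3.x machine-learning scikit-learn cross-validation

[Solution 1]:

As written here in the user guide:

When specifying multiple metrics, the refit parameter must be set to the metric (string) for which the best_params_ will be found and used to build the best_estimator_ on the whole dataset. If the search should not be refit, set refit=False. Leaving refit to the default value None will result in an error when using multiple metrics.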
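
The refit=False option mentioned in the quoted passage could look roughly like the sketch below. To keep it self-contained, it swaps in scikit-learn's DecisionTreeClassifier and the bundled digits dataset (10 classes, 0 to 9) in place of the question's XGBClassifier and X_train/y_train. The parameter grid is a made-up stand-in, and passing cv=skf directly rather than skf.split(...) is a choice made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 10-class digits dataset instead of the question's X_train / y_train
X, y = load_digits(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1001)

# Hypothetical grid for the stand-in estimator (not the XGBoost grid from the question)
params = {"max_depth": [3, 4, 5], "min_samples_leaf": [1, 5, 10]}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=params,
    n_iter=5,
    scoring=["f1_macro", "precision_macro"],
    refit=False,  # no final refit, so no best_estimator_ is built
    cv=skf,
    random_state=1001,
)
search.fit(X, y)

# Without a refit metric, pick a candidate by hand from cv_results_
results = search.cv_results_
best_idx = int(np.argmax(results["mean_test_f1_macro"]))
print(results["params"][best_idx])
print(results["mean_test_precision_macro"][best_idx])
```

With refit=False the search only records the scores. Choosing a candidate and training a final model on the full data is then left to the caller.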

Since you are using multiple metrics here:

random_search = RandomizedSearchCV(xgb, param_distributions=params,
                                   n_iter=param_comb, 
                                   scoring=['f1_macro','precision_macro'], 
                                   n_jobs=4, 
                                   cv=skf.split(X_train,y_train), 
                                   verbose=3, random_state=1001)

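RandomizedSearchCV does not know how to pick the best parameters: it cannot choose the best score across two different scoring strategies. You therefore need to specify which scoring type it should use to find the best parameters.

To do that, set the refit parameter to one of the options you passed in scoring, like this: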

random_search = RandomizedSearchCV(xgb, param_distributions=params,
                                   ...
                                   scoring=['f1_macro','precision_macro'], 
                                   ...
                                   refit = 'f1_macro')
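
For reference, a runnable version of the pattern above might look like the following sketch. It is not the asker's exact setup: DecisionTreeClassifier and the digits dataset stand in for XGBClassifier and X_train/y_train so the example runs with scikit-learn alone, the parameter grid is hypothetical, and cv=skf is passed directly instead of skf.split(...).

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Stand-in multiclass data (labels 0 to 9), replacing the question's X_train / y_train
X, y = load_digits(return_X_y=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1001)

# Hypothetical grid for the stand-in estimator
params = {"max_depth": [3, 4, 5], "min_samples_leaf": [1, 5, 10]}

random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=params,
    n_iter=5,
    scoring=["f1_macro", "precision_macro"],  # both metrics are recorded
    refit="f1_macro",                         # this one decides best_params_
    cv=skf,
    random_state=1001,
)
random_search.fit(X, y)

# best_* attributes refer to the refit metric; the other metric stays in cv_results_
print(random_search.best_params_)
print(random_search.best_score_)
print(random_search.cv_results_["mean_test_precision_macro"])
```

Here best_score_ is the mean cross-validated f1_macro of the selected candidate, while the precision_macro scores remain available under the mean_test_precision_macro key for comparison.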

[Comments]: