Model-evaluation error using cross-validation - average_precision_score (answers)

[Question title]: Model-evaluation error using cross-validation - average_precision_score
[Posted]: 2021-06-13 03:29:08
[Question description]:

So I ran the following random forest grid search using balanced_accuracy as my scoring:

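from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# define the parameter grid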
param_grid = [
        {'criterion': ['gini', 'entropy'],   # try different purity metrics in building the trees
         'max_depth': [2, 5, 8, 10, 15, 20],    # vary the max_depth of the trees in the ensemble
        'n_estimators': [10, 50, 100, 200],   # vary the number of trees in the ensemble
        'max_samples': [0.4, 0.7, 0.9]}     # vary how many samples each tree is built with
]

# setup the Random Forest model with all arguments as default
model = RandomForestClassifier()

# pass the model and the param_grid to the grid search, and use 5 folds with 'balanced_accuracy' as the scoring measure
grid_search = GridSearchCV(model, param_grid, cv = 5, scoring = 'balanced_accuracy')

# fit the grid search to the training set
grid_search.fit(X_smote, y_smote)

# return best model
rf_best = grid_search.best_estimator_

# return the hyperparameter values of the best model
print(grid_search.best_params_)

# use the best model to make predictions on the test set
y_pred = rf_best.predict(X_test)

# compute accuracy, F1, precision and recall of the best model on the test set
print("accuracy: ", accuracy_score(y_test,y_pred))
print("f1: ", f1_score(y_test, y_pred, pos_label='Listed'))
print("precision: ", precision_score(y_test, y_pred, pos_label='Listed'))
print("recall: ", recall_score(y_test, y_pred, pos_label='Listed'))

This produces the following scores:


{'criterion': 'gini', 'max_depth': 20, 'max_samples': 0.7, 'n_estimators': 100}
accuracy:  0.6547231270358306
f1:  0.7612612612612613
precision:  0.9260273972602739
recall:  0.6462715105162524

I would like to use the average_precision scoring parameter instead, since it suits my use case better, so I updated the code to the following:

from sklearn.metrics import average_precision_score
# define the parameter grid
param_grid = [
        {'criterion': ['gini', 'entropy'],   # try different purity metrics in building the trees
         'max_depth': [2, 5, 8, 10, 15, 20],    # vary the max_depth of the trees in the ensemble
        'n_estimators': [10, 50, 100, 200],   # vary the number of trees in the ensemble
        'max_samples': [0.4, 0.7, 0.9]}     # vary how many samples each tree is built with
]

# setup the Random Forest model with all arguments as default
model = RandomForestClassifier()

# pass the model and the param_grid to the grid search, and use 5 folds with 'average_precision' as the scoring measure
grid_search = GridSearchCV(model, param_grid, cv = 5, scoring = 'average_precision')

# fit the grid search to the training set
grid_search.fit(X_smote, y_smote)

# return best model
rf_best = grid_search.best_estimator_

# return the hyperparameter values of the best model
print(grid_search.best_params_)

# use the best model to make predictions on the test set
y_pred = rf_best.predict(X_test)

# compute accuracy, F1, precision and recall of the best model on the test set
print("accuracy: ", accuracy_score(y_test,y_pred))
print("f1: ", f1_score(y_test, y_pred, pos_label='Listed'))
print("precision: ", precision_score(y_test, y_pred, pos_label='Listed'))
print("recall: ", recall_score(y_test, y_pred, pos_label='Listed'))

However, I get the following error:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\_ranking.py in average_precision_score(y_true, y_score, average, pos_label, sample_weight)
    211         if len(present_labels) == 2 and pos_label not in present_labels:
    212             raise ValueError("pos_label=%r is invalid. Set it to a label in "
--> 213                              "y_true." % pos_label)
    214     average_precision = partial(_binary_uninterpolated_average_precision,
    215                                 pos_label=pos_label)

ValueError: pos_label=1 is invalid. Set it to a label in y_true.

Why can't I use average_precision in my code the same way I used balanced_accuracy? Is there something I should be doing differently?

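[Question comments]:

Is there a typo in the second part of your code? You should be using average_precision_score instead of precision_score.
@StupidWolf, I tried that first, but got the following error message: 'ValueError: 'average_precision_score' is not a valid scoring value. Use sorted(sklearn.metrics.SCORERS.keys()) to get valid options.'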

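Tags: python scikit-learn classification

[Solution 1]:

I don't know what your dataset looks like, or where exactly the mistake in your code is. There are too many redundant parts.

If the goal is to use the average precision score as stated, you can use make_scorer, assuming your labels are binary (0/1), as in the example below: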

from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
        {'criterion': ['gini', 'entropy'],   
         'max_depth': [2,5],    
        'n_estimators': [200],   
        'max_samples': [0.8]}]


X, y = make_blobs(n_samples=[80,20], centers=None, n_features=5,
cluster_std = 3.5,random_state=0)     

model = RandomForestClassifier(random_state=42)
grid_search_acc = GridSearchCV(model, param_grid, cv = 5, scoring = 'balanced_accuracy')

grid_search_acc.fit(X, y)

grid_search_acc.best_score_
0.75625

Balanced accuracy works; to make it work with average precision:

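from sklearn.metrics import average_precision_score, precision_score, make_scorer
# note: this wraps precision_score (plain precision on the labels returned by predict());
# pass average_precision_score to make_scorer instead to score average precision
ap_score = make_scorer(precision_score, greater_is_better=True, pos_label=1)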

grid_search_prec = GridSearchCV(model, param_grid, cv = 5, scoring = ap_score)
grid_search_prec.fit(X, y)

grid_search_prec.best_score_
0.9333333333333332
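
The traceback in the question says that pos_label=1 is not among the labels in y_true, and the question's metric calls pass pos_label='Listed', which suggests the targets are strings rather than 0/1. One possible workaround, sketched below on synthetic data, is to map the positive class to 1 before fitting so that the built-in 'average_precision' scorer can be used unchanged. The label name 'Unlisted', the variable names y_str and y_bin, and the reduced parameter grid are assumptions made for illustration; they do not come from the question.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the asker's data: 80 negatives and 20 positives.
X, y_int = make_blobs(n_samples=[80, 20], centers=None, n_features=5,
                      cluster_std=3.5, random_state=0)

# Hypothetical string labels; 'Unlisted' is an assumed name for the other class.
y_str = np.where(y_int == 1, 'Listed', 'Unlisted')

# Map the positive class to 1 so the scorer's default pos_label=1 is a valid label.
y_bin = (y_str == 'Listed').astype(int)

# Small grid to keep the example fast (assumed values, not the asker's grid).
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [2, 5],
              'n_estimators': [50],
              'max_samples': [0.8]}

# The built-in 'average_precision' scorer ranks samples by predicted probability.
grid_search_ap = GridSearchCV(RandomForestClassifier(random_state=42),
                              param_grid, cv=5, scoring='average_precision')
grid_search_ap.fit(X, y_bin)

print(grid_search_ap.best_params_)
print(grid_search_ap.best_score_)
```

A model fitted on y_bin predicts 0/1, so the test labels would need the same mapping before calling f1_score, precision_score and recall_score, with pos_label=1 in place of 'Listed'.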

[Comments]: