【Question Title】: Performance metric when using XGBoost regressor with sklearn learning_curve
【Posted】: 2021-09-13 05:41:44
【Question Description】:

I have built an XGBoost regression model and want to see how the training and test performance change as the size of the training set grows.

import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import learning_curve

# score the model on 5 increasing training-set sizes with 5-fold CV
xgbm_reg = XGBRegressor()
tr_sizes, tr_scs, test_scs = learning_curve(estimator=xgbm_reg,
                                            X=ori_X, y=y,
                                            train_sizes=np.linspace(0.1, 1, 5),
                                            cv=5)

What performance metric do tr_scs and test_scs represent?

The sklearn docs tell me:

scoring : str or callable, default=None

    A str (see model evaluation documentation) or a scorer callable
    object / function with signature scorer(estimator, X, y)
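
For illustration, a scorer callable with that signature could look like this (a minimal sketch; mse_scorer is a hypothetical name, negated so that greater means better, as sklearn scorers expect):

from sklearn.metrics import mean_squared_error

def mse_scorer(estimator, X, y):
    # callable matching the documented signature scorer(estimator, X, y)
    return -mean_squared_error(y, estimator.predict(X))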

So I checked the XGBoost documentation, which says the objective defaults to reg:squarederror. Does that mean the values in tr_scs and test_scs are squared errors?

I wanted to verify this with cross_val_score:

scoring = "neg_mean_squared_error"
cv_results = cross_val_score(xgbm_reg, ori_X, y, cv=5, scoring=scoring)

but it's not clear to me how to get the squared error out of cross_val_score.
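
For reference, the "neg_mean_squared_error" scorer returns the negated MSE (so that greater is always better); flipping the sign recovers the plain squared error. A minimal sketch, assuming ori_X and y are defined as above:

from sklearn.model_selection import cross_val_score

# scores come back negated, so values closer to zero are better
cv_results = cross_val_score(xgbm_reg, ori_X, y, cv=5,
                             scoring="neg_mean_squared_error")
mse_per_fold = -cv_results  # flip the sign to get the MSE of each fold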

【Question Discussion】:

    Tags: python scikit-learn xgboost


    【Solution 1】:

    The built-in scorer of XGBRegressor is R-squared, and it is the default scorer used by both learning_curve and cross_val_score; see the code below.

    from xgboost import XGBRegressor
    from sklearn.datasets import make_regression
    from sklearn.model_selection import learning_curve, cross_val_score, KFold
    from sklearn.metrics import r2_score
    
    # generate the data
    X, y = make_regression(n_features=10, random_state=100)
    
    # generate 5 CV splits
    kf = KFold(n_splits=5, shuffle=False)
    
    # calculate the CV scores using `learning_curve`, use 100% train size for comparison purposes
    _, _, lc_scores = learning_curve(estimator=XGBRegressor(), X=X, y=y, train_sizes=[1.0], cv=kf)
    print(lc_scores)
    # [[0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]]
    
    # calculate the CV scores using `cross_val_score`
    cv_scores = cross_val_score(estimator=XGBRegressor(), X=X, y=y, cv=kf)
    print(cv_scores)
    # [0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]
    
    # calculate the CV scores manually
    xgb_scores = []
    r2_scores = []
    
    # iterate across the CV splits
    for train_index, test_index in kf.split(X):
    
        # extract the training and test data
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
    
        # fit the model to the training data
        estimator = XGBRegressor()
        estimator.fit(X_train, y_train)
    
        # score the test data using the XGBRegressor built-in scorer
        xgb_scores.append(estimator.score(X_test, y_test))
    
        # score the test data using the R-squared
        y_pred = estimator.predict(X_test)
        r2_scores.append(r2_score(y_test, y_pred))
    
    print(xgb_scores)
    # [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]
    
    print(r2_scores)
    # [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]
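
    If you instead want the learning curve scored with squared error, learning_curve (and cross_val_score) accept a scoring argument; a minimal sketch reusing X, y, and kf from above (sklearn returns the negated MSE, so flip the sign to report it):

    # build the same curve, but scored with MSE instead of the default R-squared
    _, _, lc_mse_scores = learning_curve(estimator=XGBRegressor(), X=X, y=y,
                                         train_sizes=[1.0], cv=kf,
                                         scoring="neg_mean_squared_error")
    print(-lc_mse_scores)  # sign-flipped back to plain per-fold MSE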
    

    【Discussion】:

    • Love how you set train_sizes=[1.0] in learning_curve, I didn't expect that. This is awesome! Thanks :)