【Question Title】: Performance metric when using XGBoost regressor with sklearn learning_curve
【Posted】: 2021-09-13 05:41:44
【Question Description】:

I have built an XGBoost regression model and want to see how the training and test performance change as the size of the training set grows.

import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import learning_curve

# score the model on 5 increasing training-set sizes with 5-fold CV
xgbm_reg = XGBRegressor()
tr_sizes, tr_scs, test_scs = learning_curve(estimator=xgbm_reg,
                                            X=ori_X, y=y,
                                            train_sizes=np.linspace(0.1, 1, 5),
                                            cv=5)

What performance metric do tr_scs and test_scs represent?

The sklearn docs tell me:

scoring : str or callable, default=None

    A str (see model evaluation documentation) or a scorer callable
    object / function with signature scorer(estimator, X, y)
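
For illustration, a scorer callable with that signature could look like this (a minimal sketch; mse_scorer is a hypothetical name, negated so that greater means better, as sklearn scorers expect):

from sklearn.metrics import mean_squared_error

def mse_scorer(estimator, X, y):
    # callable matching the documented signature scorer(estimator, X, y)
    return -mean_squared_error(y, estimator.predict(X))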

So I checked the XGBoost documentation, which says the objective defaults to reg:squarederror. Does that mean the values in tr_scs and test_scs are squared errors?

I wanted to verify this with cross_val_score:

scoring = "neg_mean_squared_error"
cv_results = cross_val_score(xgbm_reg, ori_X, y, cv=5, scoring=scoring)

but it's not clear to me how to get the squared error out of cross_val_score.
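
For reference, the "neg_mean_squared_error" scorer returns the negated MSE (so that greater is always better); flipping the sign recovers the plain squared error. A minimal sketch, assuming ori_X and y are defined as above:

from sklearn.model_selection import cross_val_score

# scores come back negated, so values closer to zero are better
cv_results = cross_val_score(xgbm_reg, ori_X, y, cv=5,
                             scoring="neg_mean_squared_error")
mse_per_fold = -cv_results  # flip the sign to get the MSE of each fold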

【Question Discussion】:

    Tags: python scikit-learn xgboost


    【Solution 1】:

    The built-in scorer of XGBRegressor is R-squared, and it is the default scorer used by both learning_curve and cross_val_score; see the code below.

    from xgboost import XGBRegressor
    from sklearn.datasets import make_regression
    from sklearn.model_selection import learning_curve, cross_val_score, KFold
    from sklearn.metrics import r2_score
    
    # generate the data
    X, y = make_regression(n_features=10, random_state=100)
    
    # generate 5 CV splits
    kf = KFold(n_splits=5, shuffle=False)
    
    # calculate the CV scores using `learning_curve`, use 100% train size for comparison purposes
    _, _, lc_scores = learning_curve(estimator=XGBRegressor(), X=X, y=y, train_sizes=[1.0], cv=kf)
    print(lc_scores)
    # [[0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]]
    
    # calculate the CV scores using `cross_val_score`
    cv_scores = cross_val_score(estimator=XGBRegressor(), X=X, y=y, cv=kf)
    print(cv_scores)
    # [0.51444244 0.70020972 0.64521668 0.36608259 0.81670165]
    
    # calculate the CV scores manually
    xgb_scores = []
    r2_scores = []
    
    # iterate across the CV splits
    for train_index, test_index in kf.split(X):
    
        # extract the training and test data
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
    
        # fit the model to the training data
        estimator = XGBRegressor()
        estimator.fit(X_train, y_train)
    
        # score the test data using the XGBRegressor built-in scorer
        xgb_scores.append(estimator.score(X_test, y_test))
    
        # score the test data using the R-squared
        y_pred = estimator.predict(X_test)
        r2_scores.append(r2_score(y_test, y_pred))
    
    print(xgb_scores)
    # [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]
    
    print(r2_scores)
    # [0.5144424362721487, 0.7002097211679331, 0.645216683969211, 0.3660825936288453, 0.8167016490227281]
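
    If you instead want the learning curve scored with squared error, learning_curve (and cross_val_score) accept a scoring argument; a minimal sketch reusing X, y, and kf from above (sklearn returns the negated MSE, so flip the sign to report it):

    # build the same curve, but scored with MSE instead of the default R-squared
    _, _, lc_mse_scores = learning_curve(estimator=XGBRegressor(), X=X, y=y,
                                         train_sizes=[1.0], cv=kf,
                                         scoring="neg_mean_squared_error")
    print(-lc_mse_scores)  # sign-flipped back to plain per-fold MSE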
    

    【Discussion】:

    • Love how you set train_sizes=[1.0] in learning_curve, I didn't expect that. This is awesome! Thanks :)