Title: XGBoost randomly giving a static prediction of "0.5"
Posted: 2020-04-07 11:32:15
Question:

I am using a scikit-learn pipeline with an XGBRegressor. The pipeline fits without any errors. However, when I predict on the same data multiple times with this fitted pipeline, the predicted value occasionally comes back as 0.5, even though the normal prediction range is roughly 1,000-10,000.

For example: (1258.2, 1258.2, 1258.2, 1258.2, 1258.2, 1258.2, 0.5, 1258.2, 1258.2, 1258.2, 1258.2)

  • The input data is exactly the same each time
  • The environment is the same

    import numpy as np
    import xgboost
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())])
    categorical_transformer = Pipeline(steps=[
        ('imputer',
         SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    numeric_features = X.select_dtypes(
        include=['int64', 'float64']).columns
    categorical_features = X.select_dtypes(
        include=['object']).columns
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])
    
    # Number of trees
    n_estimators = [int(x) for x in
                    np.linspace(start=50, stop=1000, num=10)]
    # Maximum number of levels in tree
    max_depth = [int(x) for x in np.linspace(1, 32, 32, endpoint=True)]
    # Booster
    booster = ['gbtree', 'gblinear', 'dart']
    # selecting gamma
    gamma = [i / 10.0 for i in range(0, 5)]
    # Learning rate
    learning_rate = np.linspace(0.01, 0.2, 15)
    # Evaluation metric
    #         eval_metric = ['rmse','mae']
    # regularization
    reg_alpha = [1e-5, 1e-2, 0.1, 1, 100]
    reg_lambda = [1e-5, 1e-2, 0.1, 1, 100]
    # Min child weight
    min_child_weight = list(range(1, 6, 2))
    # Samples
    subsample = [i / 10.0 for i in range(6, 10)]
    colsample_bytree = [i / 10.0 for i in range(6, 10)]
    
    # Create the random grid
    random_grid = {'n_estimators': n_estimators,
                   'max_depth': max_depth,
                   'booster': booster,
                   'gamma': gamma,
                   'learning_rate': learning_rate,
                   #                        'eval_metric' : eval_metric,
                   'reg_alpha': reg_alpha,
                   'reg_lambda': reg_lambda,
                   'min_child_weight': min_child_weight,
                   'subsample': subsample,
                   'colsample_bytree': colsample_bytree
                   }
    
    # Use the random grid to search for best hyperparameters
    # First create the base model to tune
    rf = xgboost.XGBRegressor(objective='reg:squarederror', n_jobs=4)
    # Random search of parameters, using 3 fold cross validation,
    # search across 100 different combinations, and use all available cores
    rf_random = RandomizedSearchCV(estimator=rf,
                                   param_distributions=random_grid,
                                   n_iter=100,
                                   cv=3,
                                   verbose=0,
                                   random_state=42,
                                   n_jobs=4)
    
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', rf_random)])
    
    pipe.fit(X, y)
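To make the symptom concrete, here is a minimal stdlib-only sketch that flags any prediction deviating sharply from the rest of a batch; the helper name and tolerance are illustrative and not part of the original code:

```python
from statistics import median

def find_anomalous_predictions(preds, rel_tol=0.5):
    """Return indices of predictions that deviate from the batch
    median by more than rel_tol (as a fraction of the median)."""
    m = median(preds)
    return [i for i, p in enumerate(preds)
            if abs(p - m) > rel_tol * abs(m)]

# The batch from the question: one call came back as 0.5
preds = [1258.2, 1258.2, 1258.2, 1258.2, 1258.2, 1258.2,
         0.5, 1258.2, 1258.2, 1258.2, 1258.2]
print(find_anomalous_predictions(preds))  # → [6]
```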
    

What could be the problem?

Discussion:

    Tags: python machine-learning scikit-learn xgboost


    Answer 1:

    If you are getting some abnormally low predictions, it may indicate outliers in the dependent variable. I suggest you read up on this topic and on the different strategies suggested for handling it.

    It is generally not a good idea to feed all data samples into the model without removing outliers first; doing so leads to worse and less representative metrics.
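The outlier-removal strategy the answer refers to can be sketched with Tukey's IQR fences. This is a simplified stdlib version (quartiles are approximated by index positions; real code would typically use numpy.quantile or pandas), shown only to illustrate the idea:

```python
def remove_outliers_iqr(values, k=1.5):
    """Keep values inside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR].
    Quartiles are approximated by simple index positions here."""
    s = sorted(values)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]

# A target mostly in the 1000-10000 range with one implausibly low value
y = [1000, 1002, 1005, 1010, 1100, 1250, 1400, 0.5]
print(remove_outliers_iqr(y))  # the 0.5 is dropped
```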

    Comments:

    • Thanks for the reply. That is true, and the outliers have already been removed. The example I gave belongs to a single observation: if I predict the same observation multiple times, say out of 50 attempts I get this "0.5" result a few times, and I get exactly "0.5" for other observations as well. This is a perfectly valid scenario. Please let me know if you have seen other cases like this.