xgboost load model in C++ (Python -> C++ prediction scores mismatch): answers

【Question title】: xgboost load model in c++ (python -> c++ prediction scores mismatch)
【Posted】: 2017-01-13 02:06:08
【Question】:

I'm reaching out to all the C++ geniuses on SO.

I have trained (and successfully tested) an xgboost model in Python as follows:

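import numpy as np
import xgboost as xgb

# np.int was only an alias for the builtin int and is gone from recent NumPy releases
dtrain = xgb.DMatrix(np.asmatrix(X_train), label=np.asarray(y_train, dtype=int), feature_names=feat_names)

optimal_model = xgb.train(plst, dtrain)

dtest = xgb.DMatrix(np.asmatrix(X_test), feature_names=feat_names)

optimal_model.save_model('sigdet.model')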

I followed a post on XGBoost (see link) that explains the correct way to load a model and apply prediction in C++:

// Load Model
g_learner = std::make_unique<Learner>(Learner::Create({}));
std::unique_ptr<dmlc::Stream> fi(dmlc::Stream::Create(filename, "r"));
g_learner->Load(fi.get());

// Predict
DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *)features, 1, numFeatures, -999.9f, &h_test);
xgboost::bst_ulong out_len;

std::vector<float> preds;
g_learner->Predict((DMatrix*)h_test, true, &preds);

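My question (1): I need to create a DMatrix*, but I only have a DMatrixHandle. How do I properly create a DMatrix from my data?

My question (2): when I tried the following prediction method: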

DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *)features, 1, numFeatures , -999.9f, &h_test);
xgboost::bst_ulong out_len;


int res = XGBoosterPredict(g_modelHandle, h_test, 1, 0, &out_len, (const float**)&scores);

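I get completely different scores compared with loading the exact same model and using it for prediction in Python.

Whoever helps me get consistent results across C++ and Python will probably go to heaven. BTW, I need to apply prediction in C++ for a real-time application, otherwise I would use a different language.

【Question comments】:

Tags: python c++ xgboost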

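【Answer 1】:

To get the DMatrix, you can do this:

g_learner->Predict(static_cast<std::shared_ptr<xgboost::DMatrix>*>(h_test)->get(), true, &preds);

The cast relies on the handle pointing to a std::shared_ptr<xgboost::DMatrix>, which is an internal detail of the C API.

For question (2), I don't have an answer. It is actually the same problem I have: I trained an XGBoost regression model in Python, and I get different results in C++ with the same features.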
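
One detail of the call in question (2) may be worth checking. The third argument of XGBoosterPredict is option_mask, and in the C API the value 1 requests the raw margin instead of the transformed prediction, whereas Python's predict() returns the transformed value unless output_margin=True is passed. If the model was trained with a logistic objective (an assumption; the question does not show the contents of plst), the two outputs would differ by a sigmoid. A minimal sketch of that relationship in plain Python, without xgboost itself; the function names are made up for illustration:

```python
import math


def margin_to_probability(margin):
    """Logistic sigmoid, branched so that math.exp never overflows."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def probability_to_margin(p):
    """Inverse of the sigmoid (the logit); p must lie strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# Hypothetical raw score, as a C++ call with option_mask = 1 might return it.
raw_score = 1.25
print(margin_to_probability(raw_score))
```

If this is the cause, passing 0 as option_mask on the C++ side, or output_margin=True on the Python side, should put both outputs on the same scale.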

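【Comments】:

【Answer 2】:

So the method you are using to serialize your model is:

    optimal_model.save_model('sigdet.model')

This method strips the model of all of its feature names (see https://github.com/dmlc/xgboost/issues/3089).

When you load the model into C++ for prediction, the feature column ordering is not necessarily maintained. You can verify this by calling the .dump_model() method.

Additionally, calling .dump_model() on both the Python and the C++ model objects will yield the same decision trees, but the Python dump will have all the feature names while the C++ dump will likely have f0, f1, f2, ... You can compare the two to recover the actual column ordering, and then your predictions will match across languages (though not exactly).

I do not know how the columns get ordered, but it seems to be a stable process that keeps the same ordering even when the same model is retrained on a sliding data window. I am not 100% confident here, and would also appreciate clarification.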

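This problem exists for a lot of XGBoost models that are trained in Python and used for prediction in another language. I have faced it with Java, and it does not seem that there is a way to persist the feature column ordering across the different XGBoost bindings.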
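
As a rough illustration of the bookkeeping this answer describes, the sketch below keeps the training-time name list next to the model and uses it to lay out each input row before prediction. It is plain Python without xgboost; the helper names, feature names, and values are hypothetical, and it assumes the columns at training time were in the order of feat_names.

```python
from array import array


def generic_names(training_order):
    """Map the default names f0, f1, ... to the names used at training time."""
    return {f"f{i}": name for i, name in enumerate(training_order)}


def order_features(named_values, training_order, missing=float("nan")):
    """Build one float32 row in training column order; absent features get `missing`."""
    unknown = set(named_values) - set(training_order)
    if unknown:
        raise KeyError(f"not a training feature: {sorted(unknown)}")
    return array("f", (named_values.get(name, missing) for name in training_order))


# Hypothetical names and values, chosen to be exactly representable as float32.
feat_names = ["age", "height", "weight"]
sample = {"weight": 70.5, "age": 31.0, "height": 1.75}

print(generic_names(feat_names))
print(list(order_features(sample, feat_names)))
```

The mapping from generic_names can be used to read a dump that only shows f0, f1, ..., and the same ordering would then have to be reproduced when filling the float array on the C++ side.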

【Comments】:

【Answer 3】:

Here is an example, but the program's predictions are the same:

#include <iostream>
#include <xgboost/c_api.h>

int main() {
    // build a small synthetic training set
    const int cols = 3, rows = 100;
    float train[rows][cols];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            train[i][j] = (i + 1) * (j + 1);

    // first half of the rows is class 0, second half is class 1
    float train_labels[rows];
    for (int i = 0; i < 50; i++)
        train_labels[i] = 0;
    for (int i = 50; i < rows; i++)
        train_labels[i] = 1;

    // convert to DMatrix (row-major floats, -1 marks a missing value)
    DMatrixHandle h_train[1];
    XGDMatrixCreateFromMat((float *) train, rows, cols, -1, &h_train[0]);

    // load the labels
    XGDMatrixSetFloatInfo(h_train[0], "label", train_labels, rows);

    // read back the labels, just a sanity check
    bst_ulong bst_result;
    const float *out_floats;
    XGDMatrixGetFloatInfo(h_train[0], "label", &bst_result, &out_floats);
    for (unsigned int i = 0; i < bst_result; i++)
        std::cout << "label[" << i << "]=" << out_floats[i] << std::endl;

    // create the booster and load some parameters
    BoosterHandle h_booster;
    XGBoosterCreate(h_train, 1, &h_booster);
    XGBoosterSetParam(h_booster, "objective", "binary:logistic");
    XGBoosterSetParam(h_booster, "eval_metric", "error");
    XGBoosterSetParam(h_booster, "silent", "0");
    XGBoosterSetParam(h_booster, "max_depth", "9");
    XGBoosterSetParam(h_booster, "eta", "0.1");
    XGBoosterSetParam(h_booster, "min_child_weight", "3");
    XGBoosterSetParam(h_booster, "gamma", "0.6");
    XGBoosterSetParam(h_booster, "colsample_bytree", "1");
    XGBoosterSetParam(h_booster, "subsample", "1");
    XGBoosterSetParam(h_booster, "reg_alpha", "10");

    // perform 10 learning iterations
    for (int iter = 0; iter < 10; iter++)
        XGBoosterUpdateOneIter(h_booster, iter, h_train[0]);

    // predict on data built the same way as the training set
    const int sample_rows = 100;
    float test[sample_rows][cols];
    for (int i = 0; i < sample_rows; i++)
        for (int j = 0; j < cols; j++)
            test[i][j] = (i + 1) * (j + 1);
    DMatrixHandle h_test;
    XGDMatrixCreateFromMat((float *) test, sample_rows, cols, -1, &h_test);
    bst_ulong out_len;
    const float *f;
    // option_mask = 0 (transformed prediction), ntree_limit = 0 (all trees)
    XGBoosterPredict(h_booster, h_test, 0, 0, &out_len, &f);

    for (unsigned int i = 0; i < out_len; i++)
        std::cout << "prediction[" << i << "]=" << f[i] << std::endl;

    // free xgboost internal structures
    XGDMatrixFree(h_train[0]);
    XGDMatrixFree(h_test);
    XGBoosterFree(h_booster);
    return 0;
}

【Comments】:

【Answer 4】:

In question (2), the model is trained in Python and used for prediction in C++. The feature vector is a float* array.

DMatrixHandle h_test;
XGDMatrixCreateFromMat((float *)features, 1, numFeatures , -999.9f, &h_test);
xgboost::bst_ulong out_len;
int res = XGBoosterPredict(g_modelHandle, h_test, 1, 0, &out_len, (const float**)&scores);

Therefore, your model needs to be trained using the dense matrix format (a numpy array). Below is a Python snippet from the official documentation.

data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)
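
To make the dense layout concrete: XGDMatrixCreateFromMat reads one contiguous float buffer in row-major order, so the value for row i and column j sits at index i * ncol + j. The sketch below builds such a buffer in plain Python with the standard array module. The helper name to_row_major and the use of None for absent values are assumptions made for this illustration, not part of xgboost.

```python
from array import array


def to_row_major(rows, missing):
    """Flatten equal-length rows into one float32 buffer, row after row."""
    if not rows:
        raise ValueError("need at least one row")
    n_cols = len(rows[0])
    flat = array("f")
    for row in rows:
        if len(row) != n_cols:
            raise ValueError("all rows must have the same number of features")
        # None stands for an absent value and is replaced by the sentinel.
        flat.extend(missing if v is None else v for v in row)
    return flat, len(rows), n_cols


# Hypothetical data: two samples with two features, one value absent.
flat, n_rows, n_cols = to_row_major([[1.0, None], [3.0, 4.0]], missing=-1.0)
print(list(flat), n_rows, n_cols)
```

Whatever sentinel is used here has to be the same value the model saw as missing at training time; otherwise the trees would treat it as an ordinary feature value.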

【Comments】: