h2o vs scikit-learn confusion matrix (answers)

【Question title】: h2o vs scikit-learn confusion matrix
【Posted】: 2019-01-22 23:06:59
【Question】:

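Has anyone managed to match the sklearn confusion matrix with the one from h2o?

They never match...

Doing something similar with Keras produces a perfect match.

But with h2o they are always off. I have tried every approach I could think of...

Some of the code is borrowed from: Any difference between H2O and Scikit-Learn metrics scoring?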

# In[30]:
import pandas as pd
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()

# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)

# In[31]:
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897

# In[32]:

# Generate predictions on a test set
pred = model.predict(test)

# In[33]:

from sklearn.metrics import roc_auc_score, confusion_matrix

pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()

roc_auc_score(y_true, pred_df['p1'].tolist())
#pred_df.head()

# In[36]:

y_true = test[y].as_data_frame().values
cm = pd.DataFrame(confusion_matrix(y_true, pred_df['predict'].values))

# In[37]:

print(cm)
    0     1
0  1354   961
1   540  2145

# In[38]:
model.model_performance(test).confusion_matrix()

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.353664307031828: 

    0         1     Error   Rate
0   964.0   1351.0  0.5836  (1351.0/2315.0)
1   274.0   2411.0  0.102   (274.0/2685.0)
Total   1238.0  3762.0  0.325   (1625.0/5000.0)

# In[39]:
h2o.cluster().shutdown()

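【Comments】:

The values you pass to the scikit-learn confusion matrix are based on a different threshold (the max-F1 threshold found on the training data). model_performance(test).confusion_matrix(), however, uses the threshold 0.35366.., so the results differ.
Just print model to see the details.
Hi @VivekKumar, I did almost exactly what you suggested but still do not get the same results. Please see my answer below and check whether I made a mistake somewhere.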
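
To make the point in the comments concrete, here is a minimal sketch in plain Python showing that the same class-1 probabilities yield different confusion matrices at different cutoffs. The helper `confusion_at` and the toy data are hypothetical and exist only for illustration; they are not part of H2O or scikit-learn.

```python
def confusion_at(y_true, p1, threshold):
    """Return [[TN, FP], [FN, TP]] after labelling p1 >= threshold as class 1."""
    cm = [[0, 0], [0, 0]]
    for actual, prob in zip(y_true, p1):
        predicted = 1 if prob >= threshold else 0
        cm[actual][predicted] += 1  # rows: actual class, columns: predicted class
    return cm

# Hypothetical labels and class-1 probabilities, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
p1 = [0.10, 0.30, 0.40, 0.60, 0.35, 0.45, 0.70, 0.90]

cm_low = confusion_at(y_true, p1, 0.33)   # a lower cutoff labels more rows as 1
cm_high = confusion_at(y_true, p1, 0.50)  # a higher cutoff labels fewer rows as 1
print(cm_low)
print(cm_high)
```

Two matrices built from the same model can therefore only be compared if both were produced with the same cutoff.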

Tags: python-3.x scikit-learn classification h2o confusion-matrix

【Solution 1】:

This does the trick, thanks to Vivek for the hunch. It is still not an exact match, but it is very close.

perf = model.model_performance(train)
threshold = perf.find_threshold_by_max_metric('f1')
model.model_performance(test).confusion_matrix(thresholds=threshold)
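
The snippet above takes the max-F1 threshold from the training data, whereas the default test confusion matrix reports the max-F1 threshold of the test data. The following sketch, in plain Python, shows how those two thresholds can differ. It uses an exhaustive scan over the distinct probabilities, which is an assumption made for illustration and not necessarily how H2O computes the value; `f1_at`, `max_f1_threshold`, and the toy data are hypothetical.

```python
def f1_at(y_true, p1, threshold):
    """F1 score when rows with p1 >= threshold are labelled as class 1."""
    tp = sum(1 for a, p in zip(y_true, p1) if a == 1 and p >= threshold)
    fp = sum(1 for a, p in zip(y_true, p1) if a == 0 and p >= threshold)
    fn = sum(1 for a, p in zip(y_true, p1) if a == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def max_f1_threshold(y_true, p1):
    """Scan every distinct probability and return the one with the best F1."""
    candidates = sorted(set(p1))
    return max(candidates, key=lambda t: f1_at(y_true, p1, t))

# Hypothetical scores: same probabilities, different label arrangements.
probs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
train_y = [0, 0, 0, 1, 0, 1, 1, 1]
test_y = [0, 0, 1, 0, 1, 0, 1, 1]

train_t = max_f1_threshold(train_y, probs)
test_t = max_f1_threshold(test_y, probs)
print(train_t, test_t)
```

Because the best cutoff depends on the data it is computed from, a matrix built with the training threshold and one built with the test threshold will usually disagree.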

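【Comments】:

Yes. That is why I did not post it as an answer: with the training threshold I could get close, but not identical. I think you should post this to the H2O issue tracker so that you can get an answer confirmed by the developers.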

【Solution 2】:

I ran into the same problem. Here is what I would do to make the comparison fair:

model.train(x=x, y=y, training_frame=train, validation_frame=test)
cm1 = model.confusion_matrix(metrics=['F1'], valid=True)

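Since we train the model with both training and validation data, pred['predict'] will use the threshold which maximizes the F1 score of the validation data. To be sure, you can apply that threshold explicitly with these lines: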

# Look up the max-F1 threshold computed on the validation frame
threshold = model.find_threshold_by_max_metric(metric='F1', valid=True)
pred_df['predict'] = pred_df['p1'].apply(lambda x: 0 if x < threshold else 1)

Then get the other confusion matrix from scikit-learn:

from sklearn.metrics import confusion_matrix

cm2 = confusion_matrix(y_true, pred_df['predict'])

In my case, I do not understand why the results still differ slightly. For example:

print(cm1)
>> [[3063  176]
    [  94  146]]

print(cm2)
>> [[3063  176]
    [  95  145]]

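【Comments】:

Maybe some rounding is happening here. Please print the model's threshold with print(model) and compare it with the threshold found by perf.find_threshold_by_max_metric
As you can see in the comments on the other answer, even we could not get exactly the same results. So posting this to the H2O GitHub issues might help.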
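
The rounding hypothesis from the comments can be sketched in plain Python. This is only one possible explanation for a one-row difference, not a confirmed cause; the probabilities and threshold values below are hypothetical.

```python
# Hypothetical class-1 probabilities; one of them sits exactly on the threshold.
p1 = [0.2, 0.3536643, 0.5, 0.9]

exact_threshold = 0.3536643
rounded_threshold = round(exact_threshold, 4)  # e.g. a value copied from printed output

# Same rule as in the answer: label 0 when the probability is below the threshold.
labels_exact = [0 if p < exact_threshold else 1 for p in p1]
labels_rounded = [0 if p < rounded_threshold else 1 for p in p1]

# Count the rows whose label changes purely because of the rounded threshold.
flipped = sum(a != b for a, b in zip(labels_exact, labels_rounded))
print(labels_exact, labels_rounded, flipped)
```

A row whose probability lies on or very near the cutoff can move between classes when the threshold is rounded, or when one side compares with `<` and the other with `<=`, which would shift a single count in the confusion matrix.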