Merging predict_proba with X_test results in NaN

【Question title】: Merge predict_proba with X_test results in NaN
【Posted】: 2021-04-24 19:51:03
【Question】:

In my classification problem, y = 'late_or_ahead'. A value of 1 means ahead; a value of 0 means late.

Result of log_reg.predict_proba(X_test):

array([[0.92537486, 0.07462514],
       [0.24936417, 0.75063583],
       [0.6222988 , 0.3777012 ],
       [0.29020199, 0.70979801],
       ...,
       [0.93961168, 0.06038832]])

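log_reg.classes_ returns array([0, 1]). If I understand correctly, this means the left column of the array is the probability of Y=0 and the right column is the probability of Y=1. Please correct me if I am wrong.

Under this assumption: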
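
For reference, scikit-learn orders the columns of predict_proba the same way as classes_, so column i holds the probability of classes_[i]. A minimal sketch of deriving the column names from classes_ instead of hard-coding them; the toy data and the names X_toy and y_toy are assumptions made up for illustration, not part of the original problem:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only: one feature, binary target.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression().fit(X_toy, y_toy)

# Column i of predict_proba corresponds to log_reg.classes_[i].
proba = pd.DataFrame(
    log_reg.predict_proba(X_toy),
    columns=[f"probability_class_{c}" for c in log_reg.classes_],
)
print(proba.columns.tolist())
```

Naming the columns this way keeps the labels correct even if the class order ever differs from what was expected.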

proba = pd.DataFrame(log_reg.predict_proba(X_test)) #convert array to dataframe
proba.columns = ['probability_late','probability_ahead']
proba


   probability_late probability_ahead
0   0.925375           0.074625
1   0.249364           0.750636
2   0.622299           0.377701
3   0.290202           0.709798
4   0.939612           0.060388
... ... ...

Now, when I combine these two columns (probability_late and probability_ahead) with X_test using the following code:

proba.reset_index(drop=True)
test_with_proba=X_test
test_with_proba.reset_index(drop=True)
test_with_proba['probability_late']=proba['probability_late']
test_with_proba['probability_ahead']=proba['probability_ahead']
test_with_proba[['probability_late','probability_ahead']]

The result is as follows:

367 NaN            NaN
219 NaN            NaN
72  0.167852    0.832148
55  0.338693    0.661307
371 NaN            NaN
... ... ...

What is wrong here?

【Comments】:

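It is because of the index. The assignment only fills rows of test_with_proba whose index labels also exist in proba; it happens purely on matching index labels. That is why you get NaN.
Sorry, I don't quite follow what you mean. How do I fix it?
Check the answer, where I have explained it.
You did not assign the result back. It should be test_with_proba = test_with_proba.reset_index(drop=True), or use test_with_proba.reset_index(drop=True, inplace=True).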
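
The point of the last comment can be shown with a short sketch. The frame here is made up for illustration (the name X_test_demo and its values are assumptions); it shows that reset_index returns a new DataFrame and leaves the original untouched unless the result is assigned back:

```python
import pandas as pd

# Hypothetical stand-in for X_test, with the shuffled index a train/test split leaves behind.
X_test_demo = pd.DataFrame({"feature": [10, 20, 30]}, index=[367, 219, 72])

# The returned frame is discarded, so X_test_demo keeps its old index.
X_test_demo.reset_index(drop=True)
index_after_discard = X_test_demo.index.tolist()

# Assigning the result back is what actually replaces the index.
X_test_demo = X_test_demo.reset_index(drop=True)
index_after_assign = X_test_demo.index.tolist()

print(index_after_discard)  # [367, 219, 72]
print(index_after_assign)   # [0, 1, 2]
```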

Tags: python python-3.x machine-learning scikit-learn

【Solution 1】:

Suppose you have:

df1:

    a   b
1   2   3
4   2   5

df2:

    a   b
1   6   3
5   2   8

df1['c'] = df2['a']

df1:

    a   b   c
1   2   3   6.0
4   2   5   NaN

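As you can see, the assignment behaves like a left join on the index.

df1 has index [1, 4], but df2 has index [1, 5].

Only index 1 of df1 matches an index in df2, so only that row receives a value.

Index 4 therefore gets NaN.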
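
The example above can be reproduced with a runnable sketch. The frames are rebuilt from the values shown in the answer; nothing else is assumed:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [2, 2], "b": [3, 5]}, index=[1, 4])
df2 = pd.DataFrame({"a": [6, 2], "b": [3, 8]}, index=[1, 5])

# Assignment aligns df2["a"] to df1's index labels.
# Label 1 exists in both frames; label 4 has no match in df2, so it becomes NaN.
df1["c"] = df2["a"]
print(df1)
```

The NaN also forces column c to a float dtype, which is why the matched value prints as 6.0.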

So how do you fix this?

Just reset the index on both frames with .reset_index(drop=True), assigning the result back:

df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

df1['c'] = df2['a']

df1:

    a   b   c
0   2   3   6
1   2   5   2
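
As an alternative that is not part of the answer above, the alignment can be sidestepped without resetting any index: either build the probability frame on the test frame's own index, or assign plain NumPy arrays, which pandas places by position. A minimal sketch with made-up data, where X_test_demo and proba_array are hypothetical stand-ins for X_test and log_reg.predict_proba(X_test):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_test and the predict_proba output.
X_test_demo = pd.DataFrame({"feature": [10, 20, 30]}, index=[367, 219, 72])
proba_array = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

# Option 1: give the probabilities the same index as the test frame, then join.
proba = pd.DataFrame(
    proba_array,
    index=X_test_demo.index,
    columns=["probability_late", "probability_ahead"],
)
joined = X_test_demo.join(proba)

# Option 2: assign raw arrays; these carry no index, so values are placed by position.
assigned = X_test_demo.copy()
assigned["probability_late"] = proba_array[:, 0]
assigned["probability_ahead"] = proba_array[:, 1]

print(joined)
```

Both options keep the original row labels of the test frame, which can be useful when the rows need to be traced back to the source data.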

【Comments】:

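Thanks, but I still get the same error after implementing the change. Please see the edit I made.
Can you check the shapes of the two data frames, test_with_proba and proba?
test_with_proba.shape ==> (89, 15). proba.shape ==> (89, 2)
It works now. Thank you very much, and have a nice day.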