【Question title】: How to prepare a one-hot encoding in scikit-learn for a multiclass logistic regression?
【Posted】: 2020-07-14 11:36:26
【Question description】:

I am trying to classify 4 classes from the following DataFrame using one-hot encoding in scikit-learn:

          K   T_STAR                 REGIME
15   90.929  0.95524  BoilingInducedBreakup
9   117.483  0.89386                 Splash
16   97.764  1.17972  BoilingInducedBreakup
13   76.917  0.91399  BoilingInducedBreakup
6    44.889  0.95725  BoilingInducedBreakup
20  151.662  0.56287                 Splash
12   67.155  1.22842     ReboundWithBreakup
7   114.747  0.47618                 Splash
17  121.731  0.52956                 Splash
12   29.397  0.88702             Deposition
14   31.733  0.69154             Deposition
13  119.433  0.39422                 Splash
21   97.913  1.21309     ReboundWithBreakup
10  117.544  0.18538                 Splash
27   76.957  0.52879             Deposition
22  155.842  0.17559                 Splash
3    25.620  0.18680             Deposition
30  151.773  1.23027     ReboundWithBreakup
34   91.146  0.90138             Deposition
19   58.095  0.46110             Deposition
14   85.596  0.97520  BoilingInducedBreakup
41   97.783  0.16985             Deposition
0    16.683  0.99355             Deposition
28  122.022  1.22977     ReboundWithBreakup
0    25.570  1.24686     ReboundWithBreakup
3   113.315  0.48886                 Splash
7    31.873  1.30497     ReboundWithBreakup
0   108.488  0.73423                 Splash
2    25.725  1.29953     ReboundWithBreakup
37   97.695  0.50930             Deposition

Here is the sample in CSV format:

,K,T_STAR,REGIME
15,90.929,0.95524,BoilingInducedBreakup
9,117.483,0.89386,Splash
16,97.764,1.17972,BoilingInducedBreakup
13,76.917,0.91399,BoilingInducedBreakup
6,44.889,0.95725,BoilingInducedBreakup
20,151.662,0.56287,Splash
12,67.155,1.22842,ReboundWithBreakup
7,114.747,0.47618,Splash
17,121.731,0.52956,Splash
12,29.397,0.88702,Deposition
14,31.733,0.69154,Deposition
13,119.433,0.39422,Splash
21,97.913,1.21309,ReboundWithBreakup
10,117.544,0.18538,Splash
27,76.957,0.52879,Deposition
22,155.842,0.17559,Splash
3,25.62,0.1868,Deposition
30,151.773,1.23027,ReboundWithBreakup
34,91.146,0.90138,Deposition
19,58.095,0.4611,Deposition
14,85.596,0.9752,BoilingInducedBreakup
41,97.783,0.16985,Deposition
0,16.683,0.99355,Deposition
28,122.022,1.22977,ReboundWithBreakup
0,25.57,1.24686,ReboundWithBreakup
3,113.315,0.48886,Splash
7,31.873,1.30497,ReboundWithBreakup
0,108.488,0.73423,Splash
2,25.725,1.29953,ReboundWithBreakup
37,97.695,0.5093,Deposition

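The feature vector is two-dimensional (K, T_STAR), and the REGIME values are the classes, which are not ordered in any way.

This is what I have done so far for one-hot encoding and scaling: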

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler 
from sklearn.preprocessing import OneHotEncoder 
num_attribs = ["K", "T_STAR"] 
cat_attribs = ["REGIME"]
preproc_pipeline = ColumnTransformer([("num", MinMaxScaler(), num_attribs),
                                      ("cat", OneHotEncoder(),  cat_attribs)])
regimes_df_prepared = preproc_pipeline.fit_transform(regimes_df)

However, when I print the first few rows of regimes_df_prepared, I get

array([[0.73836403, 0.19766192, 0.        , 0.        , 0.        ,
        1.        ],
       [0.43284301, 0.65556065, 1.        , 0.        , 0.        ,
        0.        ],
       [0.97076007, 0.93419198, 0.        , 0.        , 1.        ,
        0.        ],
       [0.96996242, 0.34623652, 0.        , 0.        , 0.        ,
        1.        ],
       [0.10915571, 1.        , 0.        , 0.        , 1.        ,
        0.        ]])

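So the one-hot encoding seems to work, but the problem is that the feature vectors are packed together with the encoding in this array.

If I try to train the model like this: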

from sklearn.linear_model import LogisticRegression

logreg_ovr = LogisticRegression(solver='lbfgs', max_iter=10000, multi_class='ovr')
logreg_ovr.fit(regimes_df_prepared, regimes_df["REGIME"])
print("Model training score : %.3f" % logreg_ovr.score(regimes_df_prepared, regimes_df["REGIME"]))

The score is 1.0, which cannot be right (overfitting?).

Now I want the model to predict a class for a (K, T_STAR) pair

logreg_ovr.predict([[40,0.6]])

I get an error

ValueError: X has 2 features per sample; expecting 6

As suspected, the model treats the entire row of regimes_df_prepared as the feature vector. How can I avoid this?

【Question discussion】:

Tags: scikit-learn one-hot-encoding multiclass-classification

【Solution 1】:

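The target labels should not be one-hot encoded; sklearn has LabelEncoder for that. In your case, working code for the data preprocessing would look something like:

from sklearn.preprocessing import LabelEncoder
# Only K and T_STAR are features; REGIME is the target
X, y = regimes_df[num_attribs].values, regimes_df['REGIME'].values
y = LabelEncoder().fit_transform(y)  # class names -> integer labels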
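
To tie this back to the question, here is a minimal end-to-end sketch: the scaler only ever sees K and T_STAR, the target is label-encoded separately, and the prediction for a new (K, T_STAR) pair is mapped back to a regime name. It assumes a dozen rows copied from the question's CSV are enough for illustration, and it leaves LogisticRegression at its defaults rather than using multi_class='ovr' as in the question. The names CSV, label_encoder and model are just for this example.

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# A few rows copied from the question's CSV so the sketch is self-contained.
CSV = """K,T_STAR,REGIME
90.929,0.95524,BoilingInducedBreakup
117.483,0.89386,Splash
97.764,1.17972,BoilingInducedBreakup
151.662,0.56287,Splash
67.155,1.22842,ReboundWithBreakup
29.397,0.88702,Deposition
31.733,0.69154,Deposition
97.913,1.21309,ReboundWithBreakup
76.957,0.52879,Deposition
44.889,0.95725,BoilingInducedBreakup
151.773,1.23027,ReboundWithBreakup
114.747,0.47618,Splash
"""
regimes_df = pd.read_csv(io.StringIO(CSV))

# Features: only the two numeric columns. The REGIME column is never part of X.
X = regimes_df[["K", "T_STAR"]].values

# Target: class names mapped to integers; keep the encoder to decode later.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(regimes_df["REGIME"].values)

# Putting the scaler in a pipeline means new samples are scaled the same way.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))
model.fit(X, y)

# A raw (K, T_STAR) pair now has the 2 features the model expects.
pred = model.predict([[40, 0.6]])
regime = label_encoder.inverse_transform(pred)
print(regime)
```

Because the scaler lives inside the pipeline, predict accepts unscaled values, so the "X has 2 features per sample; expecting 6" error from the question should not come up.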

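I also noticed that you are computing the score on the same data that was used to train the model, which naturally gives an over-optimistic result and cannot reveal overfitting. Please use something like train_test_split or cross_val_score to evaluate the model's performance properly.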
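
As a rough sketch of what that evaluation could look like, assuming 16 rows copied from the question's CSV (4 per class) and a scaler-plus-classifier pipeline that is my own choice rather than something from the question:

```python
import io

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# 16 rows copied from the question's CSV, 4 per class.
CSV = """K,T_STAR,REGIME
90.929,0.95524,BoilingInducedBreakup
97.764,1.17972,BoilingInducedBreakup
76.917,0.91399,BoilingInducedBreakup
44.889,0.95725,BoilingInducedBreakup
117.483,0.89386,Splash
151.662,0.56287,Splash
114.747,0.47618,Splash
121.731,0.52956,Splash
67.155,1.22842,ReboundWithBreakup
97.913,1.21309,ReboundWithBreakup
151.773,1.23027,ReboundWithBreakup
122.022,1.22977,ReboundWithBreakup
29.397,0.88702,Deposition
31.733,0.69154,Deposition
76.957,0.52879,Deposition
25.62,0.1868,Deposition
"""
regimes_df = pd.read_csv(io.StringIO(CSV))
X = regimes_df[["K", "T_STAR"]].values
y = regimes_df["REGIME"].values  # sklearn classifiers also accept string labels

# Hold out 25% of the rows; stratify so every class appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=10000))
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)  # accuracy on rows not seen in training

# Alternative: 4-fold cross-validation; the pipeline is refit on each fold,
# so the scaler never sees the fold it is evaluated on.
cv_scores = cross_val_score(model, X, y, cv=4)

print("Held-out accuracy: %.3f" % test_score)
print("Cross-validated accuracy: %.3f" % cv_scores.mean())
```

With this little data the numbers will be noisy, so treat them as a sanity check rather than a real performance estimate.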

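【Discussion】:

OK, but if the order of the classes does not matter at all (non-ordinal categories), would using LabelEncoder like this be a problem? Will the model pick up on the order of the integer labels?
LabelEncoder does not capture any ordering, even though it produces integer labels. It is meant only for cases where an ordering between labels does not (or should not) exist
Thanks. I know that LabelEncoder does not capture ordering, but my question is whether the model captures the ordering in the labels.
No, no classifier does that, not even the ones that use regression under the hood. Models that capture label ordering are studied under Ordinal Regression, which sklearn does not support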
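
One way to convince yourself of this is to fit the same classifier under two different integer codings of the same classes and compare the decoded predictions. The sketch below uses made-up, well-separated toy points (not the question's data), and the helper fit_and_decode is a hypothetical name for this example only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, well-separated points (made up for illustration, not from the question).
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],     # class "a"
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],     # class "b"
    [10.0, 0.0], [9.8, 0.2], [10.1, -0.1],  # class "c"
])
names = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 3)

# Two different, equally arbitrary integer codings of the same classes.
coding_1 = {"a": 0, "b": 1, "c": 2}
coding_2 = {"a": 2, "b": 0, "c": 1}


def fit_and_decode(coding):
    """Fit on integer-coded labels, then map predictions back to class names."""
    y = np.array([coding[name] for name in names])
    clf = LogisticRegression(max_iter=10000).fit(X, y)
    decode = {code: name for name, code in coding.items()}
    return np.array([decode[int(code)] for code in clf.predict(X)])


pred_1 = fit_and_decode(coding_1)
pred_2 = fit_and_decode(coding_2)

# The integers only act as class identifiers, so the decoded predictions agree.
print((pred_1 == pred_2).all())
```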