Inconsistent labeling in sklearn LabelEncoder? Answers

【Question title】: Inconsistent labeling in sklearn LabelEncoder?
【Posted】: 2016-11-20 19:40:31
【Question】:

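I applied LabelEncoder() to a DataFrame, and it returns the following:

order/new_carts gets different encoded numbers in different columns, e.g. 70, 64, 71, etc.

Is this inconsistent labeling, or did I do something wrong somewhere?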

【Comments】:

Tags: python pandas scikit-learn

【Solution 1】:

LabelEncoder works on one-dimensional arrays. If you apply it to several columns, the codes are consistent within each column but not across columns.
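To make the symptom concrete, here is a minimal sketch on a made-up two-column frame (the column names and values are assumptions for illustration, not the asker's data). The string "b" appears in both columns but ends up with a different code in each:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame: the string "b" appears in both columns.
df = pd.DataFrame({"x": ["a", "b", "c"], "y": ["b", "c", "d"]})

# fit_transform runs once per column, so each column gets its own codes.
per_column = df.apply(LabelEncoder().fit_transform)

code_b_in_x = int(per_column.loc[1, "x"])  # "b" is the 2nd sorted class in x
code_b_in_y = int(per_column.loc[0, "y"])  # "b" is the 1st sorted class in y
print(code_b_in_x, code_b_in_y)  # 1 0
```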

As a workaround, you can flatten the DataFrame into a one-dimensional array and call LabelEncoder on that array.

Suppose this is the DataFrame:

df
Out[372]: 
   0  1  2
0  d  d  a
1  c  a  c
2  c  c  b
3  e  e  d
4  d  d  e
5  d  b  e
6  e  e  b
7  a  e  b
8  b  c  c
9  e  a  b

Ravel it first, then reshape the result back:

pd.DataFrame(LabelEncoder().fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)
Out[373]: 
   0  1  2
0  3  3  0
1  2  0  2
2  2  2  1
3  4  4  3
4  3  3  4
5  3  1  4
6  4  4  1
7  0  4  1
8  1  2  2
9  4  0  1

Edit:

If you want to keep the labels, you need to hold on to the LabelEncoder object.

le = LabelEncoder()
df2 = pd.DataFrame(le.fit_transform(df.values.ravel()).reshape(df.shape), columns = df.columns)

Now le.classes_ gives you the classes; the position of each class in the array (starting from 0) is its integer code.

le.classes_
Out[390]: array(['a', 'b', 'c', 'd', 'e'], dtype=object)

If you want to look up the integer by label, you can build a dict:

dict(zip(le.classes_, np.arange(len(le.classes_))))
Out[388]: {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

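You can get the same result with the transform method, without building a dict. Recent scikit-learn releases expect a 1-D array-like here, so pass the label inside a list:

le.transform(['c'])
Out[395]: array([2])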
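For the opposite direction (integer code back to the original string), LabelEncoder also provides inverse_transform. The following is a small self-contained sketch; the five labels are taken from the example frame above, and fitting on a plain list is only a shortcut for illustration:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["a", "b", "c", "d", "e"])  # same classes as in the example frame

codes = le.transform(["c", "a"])      # labels -> integer codes
labels = le.inverse_transform(codes)  # integer codes -> labels

# Full label -> code mapping, built from the sorted classes_ array.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(list(codes), list(labels), mapping)
```

Because classes_ is sorted, the position of a label in that array is exactly the code that transform returns for it.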

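【Comments】:

Thanks for your answer. Is there any way to get a mapping between the strings and their encoded labels? Like order/new_cart <-- 103?
Thanks a lot. However, le.classes_ raises the following error: AttributeError: 'LabelEncoder' object has no attribute 'classes_'
That was my mistake, sorry. After assigning le = LabelEncoder(), you need to call fit_transform on that object. Please take another look at the part starting with "Edit".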

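【Solution 2】:

Your LabelEncoder object is being re-fit on every column of the DataFrame.

Because of the way apply and fit_transform work, you are inadvertently calling fit on each column of the frame. Let's look at what happens in this line: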

labeled_df = String_df.apply(LabelEncoder().fit_transform)

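1. A new LabelEncoder object is created.
2. apply is called with the fit_transform method. For each column of the DataFrame, it calls fit_transform on your encoder, passing the column as the argument. This does two things:
   A. It re-fits your encoder (modifying its state).
   B. It returns the codes for the column's elements according to the encoder's new fit.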

Because the LabelEncoder object is free to pick new codes on every call to fit_transform, the codes are not consistent across columns.
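The re-fitting can be observed directly by keeping a reference to the encoder. This is a sketch on a made-up frame (names and values are assumptions): after apply finishes, the encoder only remembers the classes of the last column it saw.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame with different value sets per column.
df = pd.DataFrame({"x": ["a", "b", "c"], "y": ["b", "c", "d"]})

encoder = LabelEncoder()
labeled = df.apply(encoder.fit_transform)  # re-fits the encoder on every column

# The state left behind belongs to the last column ("y"), not to the whole frame.
last_classes = [str(c) for c in encoder.classes_]
print(last_classes)  # ['b', 'c', 'd']
```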

If you want your codes to be consistent across columns, fit the LabelEncoder on the entire dataset.

Then pass the transform method to apply instead of fit_transform. You can try the following:

encoder = LabelEncoder()
all_values = String_df.values.ravel() #convert the dataframe to one long array
encoder.fit(all_values)
labeled_df = String_df.apply(encoder.transform)
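A self-contained version of the same idea, with a quick consistency check, might look like the sketch below. The frame string_df and its contents are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for String_df.
string_df = pd.DataFrame({"x": ["a", "b", "c"], "y": ["b", "c", "d"]})

encoder = LabelEncoder()
encoder.fit(string_df.values.ravel())            # fit once, on all values
labeled_df = string_df.apply(encoder.transform)  # transform only, no re-fit

# Classes are a, b, c, d -> 0, 1, 2, 3, shared by both columns.
shared_codes = labeled_df.values.tolist()
print(shared_codes)  # [[0, 1], [1, 2], [2, 3]]
```

Here "b" is encoded as 1 in both columns, because the encoder was fitted a single time on the union of all values.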

【Comments】: