Decode pandas dataframe: answers

【Question title】: Decode pandas dataframe
【Posted】: 2018-04-23 09:21:35
【Question】:

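I have an encoded DataFrame. I encoded it with LabelEncoder from scikit-learn, built a machine learning model, and made some predictions. Now I cannot decode the output values in the pandas DataFrame. I tried inverse_transform from the documentation several times, but every time I get an error like this: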

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This is what my DataFrame looks like:

    0   147 14931   9   0   0   1   0   0   0   4   ... 0   0   242 677 0   94  192 27  169 20
    1   146 14955   15  1   0   0   0   0   0   0   ... 0   1   63  42  0   94  192 27  169 20
    2   145 15161   25  1   0   0   0   1   0   5   ... 0   0   242 677 0   94  192 27  169 20

And this is the code I used to encode it, in case it is needed:

from sklearn import preprocessing

labelEncoder = preprocessing.LabelEncoder()
for col in b.columns:
    b[col] = labelEncoder.fit_transform(b[col])

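The column names are not important. I also tried a lambda function shown in another question here, but it still does not work. What am I doing wrong? Thanks for your help!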

Edit: After implementing Vivek Kumar's code, I get the following error:

KeyError: 'Predicted_Values'

That is a column I added to the DataFrame just to hold the predicted values. I did it like this:

b = pd.concat([X_test, y_test], axis=1)  # features and actual target values
b['Predicted_Values'] = y_predict

And this is how I drop the target column (the y values) from the DataFrame and fit the estimator:

# sklearn.cross_validation was removed in later releases; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn import tree

X = b.drop(['Activity_Profile'], axis=1)
y = b['Activity_Profile']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = tree.DecisionTreeClassifier()
model = model.fit(X_train, y_train)

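【Comments】:

LabelEncoder is meant for encoding a single column (mostly strings) to integers. Try inverse_transform on that column only, not on the whole DataFrame. The code you used to encode the data would also help.
Just added it :)
That is the mistake. Each call to fit() or fit_transform() only remembers that column's information and forgets all previous fits. So with that encoder you can now decode only the last column of the DataFrame, not all of them.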
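
To illustrate the comment above, here is a minimal sketch of what happens when one encoder is re-fitted on a second column. The two toy lists are made up for this example and are not the asker's data. Note that decoding does not necessarily raise an error; it can silently return labels from the wrong column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns, not taken from the question
colors = ["red", "green", "blue", "green"]
sizes = ["small", "large", "small", "medium"]

encoder = LabelEncoder()

# First fit: the encoder learns the color labels
encoded_colors = encoder.fit_transform(colors)
classes_after_colors = encoder.classes_.tolist()

# Second fit on the same object: the color labels are discarded
encoded_sizes = encoder.fit_transform(sizes)
classes_after_sizes = encoder.classes_.tolist()

# Decoding the color codes now yields size labels instead of colors
decoded_colors = encoder.inverse_transform(encoded_colors).tolist()

print(classes_after_colors)
print(classes_after_sizes)
print(decoded_colors)
```
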

Tags: python pandas scikit-learn decode

【Solution 1】:

You can look at my answer here for the correct way to use LabelEncoder on multiple columns:

Why does sklearn preprocessing LabelEncoder inverse_transform apply from only one column?

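The explanation is that LabelEncoder only supports one-dimensional input. So you need a separate LabelEncoder object for each column, and each object can then be used to inverse-transform only that particular column.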
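
As a quick check of the one-dimensional restriction, the sketch below passes a two-column array to a single encoder. The array contents are invented for illustration; the exact error message may differ between scikit-learn versions, so only the exception type is captured.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical two-column input; LabelEncoder expects a 1-D sequence
two_columns = np.array([["a", "x"], ["b", "y"], ["a", "y"]])

encoder = LabelEncoder()
rejected = False
try:
    encoder.fit_transform(two_columns)
except ValueError as err:
    # scikit-learn refuses input with more than one column
    rejected = True
    print("ValueError:", err)

# A single column of the same array is accepted
first_column_codes = encoder.fit_transform(two_columns[:, 0]).tolist()
print(first_column_codes)
```
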

You can use a dictionary of LabelEncoder objects to transform multiple columns. Something like this:

labelencoder_dict = {}
for col in b.columns:
    labelEncoder = preprocessing.LabelEncoder()
    b[col] = labelEncoder.fit_transform(b[col])
    labelencoder_dict[col]=labelEncoder

To decode, you can use:

for col in b.columns:
    b[col] = labelencoder_dict[col].inverse_transform(b[col])
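
Put together, the encode and decode loops above can be sketched as a full round trip. The DataFrame and its column names here are hypothetical stand-ins, and the work is done on copies so the original frame stays available for comparison.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the question's real data is not shown in full
original = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin", "Rome"],
    "device": ["phone", "laptop", "phone", "phone"],
})

# Encode: one LabelEncoder per column, stored under the column name
encoded = original.copy()
encoders = {}
for col in encoded.columns:
    encoder = LabelEncoder()
    encoded[col] = encoder.fit_transform(encoded[col])
    encoders[col] = encoder

# Decode: look up the encoder that was fitted on each column
decoded = encoded.copy()
for col in decoded.columns:
    decoded[col] = encoders[col].inverse_transform(decoded[col])

print(encoded)
print(decoded)
```

Keeping the encoders in a dictionary keyed by column name means the decode step does not depend on column order.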

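Update:

Now that you have added the column you use as y, here is how you can decode it (assuming you have already added the 'Predicted_Values' column to the DataFrame):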

for col in b.columns:
    # Skip the predicted column here; it has no encoder of its own
    if col != 'Predicted_Values':
        b[col] = labelencoder_dict[col].inverse_transform(b[col])

# Use the original `y (Activity_Profile)` encoder on predicted data
b['Predicted_Values'] = labelencoder_dict['Activity_Profile'].inverse_transform(
    b['Predicted_Values'])
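
For a self-contained picture of the whole flow, here is a rough sketch under stated assumptions: the feature columns and their values are invented, and only the names 'Activity_Profile' and 'Predicted_Values' come from the question. To keep it short, the model is fitted and evaluated on the same rows, whereas the question uses train_test_split.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; feature names and values are illustrative only
b = pd.DataFrame({
    "weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
    "shift": ["day", "day", "night", "night", "day", "night"],
    "Activity_Profile": ["low", "low", "high", "high", "low", "high"],
})
raw_labels = b["Activity_Profile"].tolist()

# Encode every column with its own encoder
encoders = {}
for col in b.columns:
    encoder = LabelEncoder()
    b[col] = encoder.fit_transform(b[col])
    encoders[col] = encoder

X = b.drop(columns=["Activity_Profile"])
y = b["Activity_Profile"]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predictions are integer codes in the target column's encoding
b["Predicted_Values"] = model.predict(X)

# Decode the original columns; the predicted column has no encoder entry
for col in b.columns:
    if col != "Predicted_Values":
        b[col] = encoders[col].inverse_transform(b[col])

# Decode the predictions with the encoder fitted on the target column
b["Predicted_Values"] = encoders["Activity_Profile"].inverse_transform(
    b["Predicted_Values"])

print(b)
```
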

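【Comments】:

First of all, thanks for your answer. But now I get the error: TypeError: '>' not supported between instances of 'int' and 'str'. This confuses me :D
@sataide Please edit the question to include the full stack trace of the error and the code that throws it. It would help if you added some of the original data that causes this error.
OK, I added the column that holds the predicted values. It is strange, because now the error is a KeyError on that column. I will edit the question.
Am I just adding a copy of the predicted values to the DataFrame? I mean, those values are still encoded.
@sataide Yes, because our dictionary of label encoders has no key for 'Predicted_Values'. Skip that key when decoding.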