LabelEncoder: encoding a dataframe without encoding NaN missing values (answers)

【Question title】: label-encoder encoding a dataframe without encoding NaN missing values
【Posted】: 2019-01-30 18:19:36
【Question description】:

I have a dataframe that contains numeric, categorical, and NaN values.

    customer_class  B    C
0   OM1             1    2.0
1   NaN             6    1.0
2   OM1             9    NaN
....

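I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards.

So I would like to use this code to encode my dataframe while keeping the NaN values.

Here is the code: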

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            #Store an ndarray of the current column
            b = x[el].get_values()
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        #return the transformed D


col = data1['customer_class']
LabelEncoderByCol(col)
LabelEncoderByCol.fit(x=col,y=None)

But I get this error:

--> 847     raise ValueError('%s not contained in the index' % str(key[mask]))

ValueError: ['OM1' 'OM1' 'OM1' ... 'other' 'EU' 'EUB'] not contained in the index

Is there any way to fix this error?

Thanks

【Question discussion】:

Tags: python pandas class dataframe

【Solution 1】:

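Two things stood out when I tried to reproduce this:

1. Your code seems to expect a DataFrame to be passed to your class, but in your example you pass a Series. I fixed this by wrapping the Series in a DataFrame before passing it to your class: col = pd.DataFrame(data1['customer_class']).
2. In your class's __init__ method, you apparently intend to iterate over a list of column names, but you are actually iterating over the values of the Series you passed in. I fixed this by changing the corresponding line to: self.col = col.columns.values.

Below I have pasted my modifications to your class's __init__ and fit methods (my only change to the transform method was to make it return the modified dataframe):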
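The difference between the two cases can be seen in a small sketch. The frame below is a made-up stand-in for data1, and the variable names are only illustrative. Iterating over a Series yields its values, which is presumably why values such as 'OM1' appear in the error message, whereas the columns of a DataFrame are the names the class is meant to loop over.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data1 from the question
data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1, 6, 9]})

series = data1['customer_class']    # a Series
frame = data1[['customer_class']]   # a one-column DataFrame

# Iterating over a Series yields its values, not column names
series_items = list(series)

# The columns of a DataFrame are the names the class should loop over
column_names = list(frame.columns)

print(series_items)
print(column_names)
```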

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1,6,9],
                      'C': [2.0, 1.0, np.nan]})

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col.columns.values
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x = x.fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Select the values that are not 'NaN' so they can be transformed
            a = x[el][x[el]!='NaN']
            #Store a copy of the current column as an ndarray
            #(Series.get_values() was removed in pandas 1.0)
            b = x[el].values.copy()
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        return x

I was able to run the following lines without errors (again with a slight modification to your original implementation):

col = pd.DataFrame(data1['customer_class'])
lenc = LabelEncoderByCol(col)
lenc.fit(x=col,y=None)

I could then access the classes for the customer_class column from your example:

lenc.fit(x=col,y=None).le_dic['customer_class'].classes_

Which outputs:

array(['OM1'], dtype=object)

Finally, I could transform the column using your class's transform method:

lenc.transform(x=col,y=None)

The output is as follows:

    customer_class
0   0
1   NaN
2   0
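Note that the class above fills the gaps with the string 'NaN' and modifies the dataframe that is passed in. If a real missing value is needed for a later imputation step, one possible alternative is sketched below. It is not part of the original class: the helper name encode_keep_nan and the sample frame (with an extra 'EU' category) are assumptions made for illustration. It encodes only the non-missing entries through a boolean mask and leaves np.nan in place.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_keep_nan(df, columns):
    """Label-encode the given columns, leaving missing values as NaN.

    Hypothetical helper for illustration; returns a new frame and the
    fitted encoders, and does not modify the input frame.
    """
    out = df.copy()
    encoders = {}
    for name in columns:
        # True where the column actually holds a value
        mask = out[name].notna()
        le = LabelEncoder().fit(out.loc[mask, name])
        # Start from an all-NaN float column and fill in the codes
        codes = pd.Series(np.nan, index=out.index, dtype=float)
        codes[mask] = le.transform(out.loc[mask, name])
        out[name] = codes
        encoders[name] = le
    return out, encoders

# Made-up sample data modelled on the question
data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'EU'],
                      'B': [1, 6, 9],
                      'C': [2.0, 1.0, np.nan]})

encoded, encoders = encode_keep_nan(data1, ['customer_class'])
print(encoded)
print(encoders['customer_class'].classes_)
```

Because the encoded column is a float column with genuine NaN entries, an imputer such as sklearn's SimpleImputer should be able to treat them as missing without any extra conversion.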

【Discussion】: