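【Question title】:Issues with One Hot Encoding for model with values not in training data
【Posted】:2022-01-01 14:41:51
【Question description】:

I want to use One Hot Encoding for my simple model. However, no matter how I set it up, it seems to trigger errors. First, even though I have sklearn version 1.0.2, One Hot Encoding does not convert the strings to floats. The current problem is that the set of values in my training data differs in length from the one in the test data: training has only 2 values, while test has all three. How do I solve this? The exact error is that the truth value of a Series is ambiguous. Other errors I have run into ask me to reshape the data.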

import lightgbm as lgbm 
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
X = [[ 'apple',5],['banana',1],['apple',6],['banana',2]]
X=pd.DataFrame(X).to_numpy()
test = [[ 'pineapple',0],['banana',1],['apple',7],['banana,2']]
y = [1,0,1,0]
y=pd.DataFrame(y).to_numpy()
    
labels = ['apples','bananas','pineapple']
ohc = OneHotEncoder(categories=labels)
pp = ColumnTransformer(
                        transformers=[('ohc', ohc, [0])]
                        ,remainder = 'passthrough')
model=lgbm.LGBMClassifier()
mymodel = Pipeline(steps = [('preprocessor', pp),
                                ('model', model)
                                ])

params = {'model__learning_rate':[0.1]
          ,'model__n_estimators':[2]}
lgbm_gs=GridSearchCV(
    estimator = mymodel, param_grid=params, n_jobs = -1,
    cv=2, scoring='accuracy'
    ,verbose=-1)
lgbm_gs.fit(X,y)

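【Question comments】:

    Tags: scikit-learn one-hot-encoding


    【Solution 1】:

    The problem is most likely that you are passing categories as a flat list rather than as a list of array-likes (e.g. a list of lists), as the doc states. The following adjustment should therefore fix it.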

    import lightgbm as lgbm 
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV
    
    X = [['apple',5],['banana',1],['apple',6],['banana',2]]
    X = pd.DataFrame(X).to_numpy()
    test = [['pineapple',0],['banana',1],['apple',7],['banana',2]]
    y = [1,0,1,0]
    y = pd.DataFrame(y).to_numpy()
    labels = [['apple', 'banana', 'pineapple']]   # note that you were also misspelling the categories ('apples' --> 'apple'; 'bananas' --> 'banana')
    ohc = OneHotEncoder(categories=labels)
    pp = ColumnTransformer(transformers=[('ohc', ohc, [0])], remainder='passthrough')
    model=lgbm.LGBMClassifier()
    mymodel = Pipeline(steps = [('preprocessor', pp),
                                ('model', model)])
    
    params = {'model__learning_rate':[0.1], 'model__n_estimators':[2]}
    lgbm_gs=GridSearchCV(
        estimator = mymodel, param_grid=params, n_jobs = -1,
        cv=2, scoring='accuracy', verbose=-1)
    lgbm_gs.fit(X, y.ravel())
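
To see the effect of the nesting in isolation, here is a minimal sketch that fits OneHotEncoder on the fruit column alone, once with a flat list and once with a list of lists. The variable names (fruit, flat_error, nested, encoded) are made up for this illustration, and the sketch assumes a scikit-learn version in which a categories list whose length does not match the number of encoded columns raises a ValueError.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Single categorical column, shaped (n_samples, 1) as OneHotEncoder expects.
fruit = np.array([["apple"], ["banana"], ["apple"], ["banana"]], dtype=object)

# Flat list: read as one entry per feature, i.e. 3 entries for 1 column.
flat_error = None
try:
    OneHotEncoder(categories=["apple", "banana", "pineapple"]).fit(fruit)
except ValueError as exc:
    flat_error = exc
print("flat list failed:", flat_error)

# List of lists: one inner list of categories per encoded column.
nested = OneHotEncoder(categories=[["apple", "banana", "pineapple"]]).fit(fruit)
encoded = nested.transform(fruit).toarray()
print(encoded.shape)  # one output column per listed category
```

The nested form produces a column for 'pineapple' even though that value never occurs in the training data.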
    

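    As a further note, when the test data contains categories that are not found in the training set, keep in mind the advice given in the guide:

    If there is a possibility that the training data might have missing categorical features, it can often be better to specify handle_unknown='ignore' instead of setting the categories manually as above. When handle_unknown='ignore' is specified and unknown categories are encountered during transform, no error will be raised, but the resulting one-hot encoded columns for this feature will be all zeros (handle_unknown='ignore' is only supported for one-hot encoding).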
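
A small sketch of that behaviour on the fruit column from the question may help. The names train_fruit, test_fruit and test_encoded are made up for this illustration; the encoder is fitted on the two training categories only and then applied to test data that also contains 'pineapple'.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Training data only contains two of the three fruits.
train_fruit = np.array([["apple"], ["banana"], ["apple"], ["banana"]], dtype=object)
# Test data contains a category that was never seen during fit.
test_fruit = np.array([["pineapple"], ["banana"], ["apple"], ["banana"]], dtype=object)

# Categories are learned from the training data; unknown values are ignored.
ohc = OneHotEncoder(handle_unknown="ignore")
ohc.fit(train_fruit)

# The 'pineapple' row becomes all zeros instead of raising an error.
test_encoded = ohc.transform(test_fruit).toarray()
print(test_encoded)
```

With this setting the model cannot distinguish 'pineapple' from any other unseen value, which is the trade-off compared with listing the categories manually.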

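    Finally, you can observe that the attribute categories_ (which holds the categories of each feature as determined during fitting) is itself a list of arrays (here a single array, since you are one-hot encoding only one column). Example with categories='auto':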

    ohc = OneHotEncoder(handle_unknown='ignore')
    ohc.fit(X[:, 0].reshape(-1, 1)).categories_
    # Output: [array(['apple', 'banana'], dtype=object)]
    

    Example with your custom categories:

    ohc = OneHotEncoder(categories=labels)
    ohc.fit(X[:, 0].reshape(-1, 1)).categories_
    # Output: [array(['apple', 'banana', 'pineapple'], dtype=object)]
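
As a rough check of what the model would receive, the sketch below applies the same ColumnTransformer with the custom categories to the test rows. The names X_train, X_test and test_features are made up for this illustration, and sparse_threshold=0 is an addition that was not part of the code above; it is used here only to get a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame([["apple", 5], ["banana", 1], ["apple", 6], ["banana", 2]]).to_numpy()
X_test = pd.DataFrame([["pineapple", 0], ["banana", 1], ["apple", 7], ["banana", 2]]).to_numpy()

labels = [["apple", "banana", "pineapple"]]  # one inner list per encoded column
pp = ColumnTransformer(
    transformers=[("ohc", OneHotEncoder(categories=labels), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)
pp.fit(X_train)

# Three one-hot columns (apple, banana, pineapple) followed by the numeric column.
test_features = pp.transform(X_test).astype(float)
print(test_features)
```

Because 'pineapple' is listed explicitly, it gets its own column in the test features even though the training rows never set it.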
    

    【Comments】:
