【Posted】:2021-04-12 09:47:03
【Question】:
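I would like to know how categorical feature encoding with pandas' get_dummies() differs from sklearn's OneHotEncoder().
The answers I have seen (answers here) mention that get_dummies() cannot produce an encoding for categories that were not seen in the training dataset. However, that is a consequence of running get_dummies() separately on the test and training datasets (which may give inconsistent shapes). If we instead apply get_dummies() to the original dataset before splitting it, I think the two methods should give the same result. Am I wrong? Would this cause problems?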
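To make the shape concern concrete, here is a minimal sketch on hypothetical toy data (the `color` column and its values are invented for illustration and are not part of the census dataset). It compares encoding each split separately with encoding once before splitting.

```python
import pandas as pd

# Hypothetical toy data: "green" appears only in the test split.
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})

# Encoding each split on its own yields different column sets.
train_sep = pd.get_dummies(train["color"])
test_sep = pd.get_dummies(test["color"])
separate_match = list(train_sep.columns) == list(test_sep.columns)

# Encoding once before splitting yields one shared column set.
full = pd.concat([train, test], ignore_index=True)
full_enc = pd.get_dummies(full["color"])
train_joint = full_enc.iloc[: len(train)]
test_joint = full_enc.iloc[len(train):]
joint_match = list(train_joint.columns) == list(test_joint.columns)

print(separate_match, joint_match)  # False True
```

Encoding before the split does keep the columns aligned, but the column set is then determined by the test rows as well, and a category that first shows up in later, unseen data would still have no column.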
My code currently works as follows:
import pandas as pd

def one_hot_encode(ds, feature):
    # Get a DataFrame of dummy variables; the prefix keeps column
    # names unique when two features share a category label
    dummies = pd.get_dummies(ds[feature], prefix=feature)
    # One dummy variable to drop (dummy variable trap)
    dummyDrop = dummies.columns[0]
    # Create a DataFrame from the original and the dummies' DataFrame,
    # then drop the original categorical variable and the one dummy
    final = pd.concat([ds, dummies], axis='columns').drop([feature, dummyDrop], axis='columns')
    return final

# Get the data as a DataFrame
dataset = pd.read_csv("census_income_dataset.csv")
columns = dataset.columns
# One-hot encode the categorical features (see function above)
features = ["workclass", "marital_status", "occupation", "relationship", "race", "sex", "native_country"]
for f in features:
    dataset = one_hot_encode(dataset, f)
# Re-order to get the output feature in the last column
dataset = dataset[[c for c in dataset.columns if c != "income_level"] + ["income_level"]]
dataset.head()
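For comparison, a minimal sketch of the OneHotEncoder route, assuming scikit-learn is installed. It uses a hypothetical toy `color` column rather than the census data. The encoder is fitted on the training split only, and `handle_unknown="ignore"` is one possible setting for categories that appear later.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "green" appears only in the test split.
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})

# Learn the categories from the training split only.
# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["color"]])

# The output is sparse by default; convert to a dense array for display.
train_enc = enc.transform(train[["color"]]).toarray()
test_enc = enc.transform(test[["color"]]).toarray()

print(list(enc.categories_[0]))  # ['blue', 'red']
print(test_enc.tolist())         # [[1.0, 0.0], [0.0, 0.0]]
```

Because the column layout is stored in the fitted encoder, the same object can be applied to any later data and always yields the same number of columns, without the test rows influencing the layout.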
【Discussion】:
Tags: python training-data sklearn-pandas one-hot-encoding