What is the difference between One Hot Encoding and pandas.categorical.code: Answers

【Question title】: What is the difference between One Hot Encoding and pandas.categorical.code
【Posted】: 2021-04-15 21:10:09
【Question description】:

I am working through a problem and have the following doubt.

The dataset has a text column with these unique values:

array(['1 bath', 'na', '1 shared bath', '1.5 baths', '1 private bath',
       '2 baths', '1.5 shared baths', '3 baths', 'Half-bath',
       '2 shared baths', '2.5 baths', '0 shared baths', '0 baths',
       '5 baths', 'Private half-bath', 'Shared half-bath', '4.5 baths',
       '5.5 baths', '2.5 shared baths', '3.5 baths', '15.5 baths',
       '6 baths', '4 baths', '3 shared baths', '4 shared baths',
       '3.5 shared baths', '6 shared baths', '6.5 shared baths',
       '6.5 baths', '4.5 shared baths', '7.5 baths', '5.5 shared baths',
       '7 baths', '8 shared baths', '5 shared baths', '8 baths',
       '10 baths', '7 shared baths'], dtype=object)

If I use CountVectorizer to convert them to a one-hot encoding,

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer.fit(X_train[colname].values)

I get the following error:

AttributeError: 'float' object has no attribute 'lower'

Please tell me the cause of the error.

Can I use this instead:

pd.Categorical(_DF_LISTING_EDA.bathrooms_text).codes

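What is the difference between one-hot encoding and pd.Categorical(...).codes?

Thanks, Amit Modi

【Comments】:

Your current code using CountVectorizer has nothing to do with one-hot encoding. One-hot encoding is not count vectorization either. What are you trying to do?
I am trying to convert this categorical data to a one-hot encoding.

Tags: python pandas scikit-learn categorical-data one-hot-encoding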

【Solution 1】:

CountVectorizer is not one-hot encoding.
pandas Categorical is not one-hot encoding.
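As for the error in the question: it most likely means the column contains missing values. pandas stores a missing value in a text column as a float NaN, and CountVectorizer lowercases each document before tokenizing it. The sketch below reproduces this with a small hypothetical list (not the asker's data) and then shows that, once the gap is filled, the result is a bag of word counts rather than one column per category.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample; float("nan") stands in for a missing cell.
docs = ["1 bath", "1 shared bath", float("nan"), "2 baths"]

# CountVectorizer expects strings. A float document fails when the
# vectorizer tries to lowercase it. Depending on the scikit-learn
# version and how the NaN is stored, a ValueError may be raised instead.
try:
    CountVectorizer().fit(docs)
    error_message = None
except (AttributeError, ValueError) as exc:
    error_message = str(exc)

# Replace the missing entry with a placeholder string and fit again.
cleaned = [d if isinstance(d, str) else "na" for d in docs]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(cleaned)

# The learned features are individual words, not whole categories.
vocabulary = sorted(vectorizer.vocabulary_)
print(error_message)
print(vocabulary)
```

With a DataFrame, the equivalent cleanup would be something like X_train[colname].fillna("na") before fitting, assuming "na" is an acceptable placeholder for your data.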

If you want to do one-hot encoding with pandas, you can do this:

pandas.get_dummies(X_train[colname])
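To make the difference concrete, here is a minimal sketch on a hypothetical four-row sample (not the asker's data). pd.Categorical(...).codes yields a single integer per row, which implies an arbitrary ordering between categories, while get_dummies yields one indicator column per category.

```python
import pandas as pd

# Hypothetical sample standing in for the bathrooms_text column.
s = pd.Series(["1 bath", "2 baths", "1 bath", "Half-bath"])

# Label (integer) encoding: one column, one integer per row.
# Categories are sorted, so '1 bath' -> 0, '2 baths' -> 1, 'Half-bath' -> 2.
codes = pd.Categorical(s).codes

# One-hot encoding: one indicator column per distinct category,
# with exactly one active entry in each row.
one_hot = pd.get_dummies(s)

print(list(codes))
print(one_hot)
```

The integer codes suggest that 'Half-bath' (2) is "greater than" '1 bath' (0), which linear and distance-based models will take literally; the one-hot columns avoid that at the cost of a wider table.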

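【Discussion】:

Thanks for the reply. I have read the scikit-learn documentation for OneHotEncoder. With OneHotEncoder, we split the dataset into train and test sets, fit the encoder on the training data, and then transform the test data. That way unseen data can be handled. Can pandas.get_dummies handle unseen data as well?
No, you need to do some extra work for that, for example keeping a dictionary of the values encoded from the training set and mapping any value that is not in it to "other".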
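One way the extra work described above might look is sketched below. The helper name encode, the sample values, and the "other" bucket label are all assumptions for illustration; the sketch also assumes "other" does not collide with a real category.

```python
import pandas as pd

# Hypothetical train/test samples; "10 baths" never appears in train.
train = pd.Series(["1 bath", "2 baths", "1 bath"])
test = pd.Series(["2 baths", "10 baths"])

# Remember the categories seen during training, plus a catch-all bucket.
known = sorted(train.unique())
columns = known + ["other"]

def encode(values, known, columns):
    """One-hot encode values, sending unseen categories to 'other'."""
    # Keep values seen in training; replace everything else with "other".
    mapped = values.where(values.isin(known), "other")
    # reindex forces the same column set and order for every input.
    dummies = pd.get_dummies(mapped).reindex(columns=columns, fill_value=0)
    return dummies.astype(int)

train_encoded = encode(train, known, columns)
test_encoded = encode(test, known, columns)
print(test_encoded)
```

If scikit-learn is already in use, OneHotEncoder(handle_unknown="ignore") covers a similar case by encoding unseen categories as an all-zero row instead of a separate "other" column.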