Converting a numeric feature into a categorical feature: answers

【Question title】: Converting numeric feature into categorical feature
【Posted】: 2020-01-03 21:19:09
【Question description】:

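I am working on a problem of predicting future sales for an e-shop from historical data. One of the features I am using is item price (a float). I found experimentally that adding it to the existing feature list degrades both the fitting and validation accuracy of my xgboost model (the prediction RMSE increases). I suspect that the effect of price may be highly non-linear, with peaks at the typical prices of memory sticks, laptops, mobile phones, and so on.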

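Anyway, I had the following idea to work around this: what if I convert the float item price into a categorical variable, with the ability to specify the mapping, e.g. value ranges or deciles? Then I could mean-encode that categorical variable using the training target values.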

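Does this make sense? Could you give me a pointer to a Python "linear/decile histogram" routine that takes a list of floats and returns a parallel list saying which bin/decile each float belongs to?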

【Question comments】:

Tags: python python-3.x pandas numpy histogram

【Solution 1】:

IMHO, you can use qcut, KBinsDiscretizer, or cut.

Some examples:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10), columns=['a'])
>>> df
          a
0  0.060278
1 -0.618677
2 -0.472467
3  1.539958
4 -0.181974
5  1.563588
6 -1.693140
7  1.868881
8  1.072179
9  0.575978

With qcut (bins of equal size, based on quantiles):

>>> df['cluster'] = pd.qcut(df.a, 5, labels=range(1, 6))
>>> df
          a cluster
0  0.060278       3
1 -0.618677       1
2 -0.472467       2
3  1.539958       4
4 -0.181974       2
5  1.563588       5
6 -1.693140       1
7  1.868881       5
8  1.072179       4
9  0.575978       3
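
To reuse the same deciles on validation or test data, the bin edges can be learned once on the training data and then applied elsewhere. The following is only a sketch under assumptions: the price series are made-up random data, and the names `train_price` and `valid_price` are hypothetical. It relies on the `retbins` and `labels` parameters of `pd.qcut` and the `bins` parameter of `pd.cut`.

```python
import numpy as np
import pandas as pd

# Made-up training and validation prices (illustrative data only).
rng = np.random.default_rng(0)
train_price = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))
valid_price = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=200))

# Learn decile edges on the training data only.
# labels=False returns integer bin codes (0..9) instead of interval labels.
train_bin, edges = pd.qcut(train_price, q=10, labels=False, retbins=True)

# Copy the edges and widen the outer ones so that validation prices
# outside the training range still fall into the first or last bin.
edges = np.array(edges, dtype=float)
edges[0], edges[-1] = -np.inf, np.inf

# Apply the training edges to the validation data.
valid_bin = pd.cut(valid_price, bins=edges, labels=False)
```

`train_bin` and `valid_bin` are then parallel to the input series: one bin code per price.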

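With KBinsDiscretizer (from scikit-learn):

>>> from sklearn.preprocessing import KBinsDiscretizer
>>> df['cluster'] = (
...     KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
...     .fit_transform(df.a.values.reshape(-1, 1))
...     .ravel()  # flatten the (n, 1) result into a 1-D column
... )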
>>> df
          a  cluster
0  0.060278      1.0
1 -0.618677      0.0
2 -0.472467      0.0
3  1.539958      2.0
4 -0.181974      1.0
5  1.563588      2.0
6 -1.693140      0.0
7  1.868881      2.0
8  1.072179      2.0
9  0.575978      1.0

With cut (bins of equal width over the value range):

>>> df['cluster'] = pd.cut(df.a, 5, labels=range(1, 6))
>>> df
          a cluster
0  0.060278       3
1 -0.618677       2
2 -0.472467       2
3  1.539958       5
4 -0.181974       3
5  1.563588       5
6 -1.693140       1
7  1.868881       5
8  1.072179       4
9  0.575978       4
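
The question also asks about specifying the mapping as explicit value ranges. `pd.cut` accepts a list of bin edges for that. A minimal sketch, assuming illustrative prices, edges, and labels (none of them come from the question):

```python
import pandas as pd

# Hypothetical item prices; the edges and labels below are illustrative.
prices = pd.Series([4.99, 19.0, 250.0, 999.0, 1500.0])

edges = [0, 10, 100, 500, 1000, float("inf")]
labels = ["<10", "10-100", "100-500", "500-1000", ">1000"]

# right=False makes each interval closed on the left: [0, 10), [10, 100), ...
price_band = pd.cut(prices, bins=edges, labels=labels, right=False)

print(price_band.tolist())
# ['<10', '10-100', '100-500', '500-1000', '>1000']
```

Values outside the outermost edges would become NaN, which is why the last edge here is infinity.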

Finally, this may be useful: What is the difference between pandas.qcut and pandas.cut?
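
For the mean-encoding step mentioned in the question, one possible approach is to group the training rows by bin and map each bin to the mean of the target. This is a sketch only: the frame, the column names `price` and `sales`, and the choice of quartiles are assumptions made for illustration.

```python
import pandas as pd

# Made-up training frame: 'price' is the numeric feature, 'sales' the target.
train = pd.DataFrame({
    "price": [5.0, 7.0, 20.0, 25.0, 300.0, 320.0, 900.0, 950.0],
    "sales": [10, 12, 6, 8, 3, 5, 1, 1],
})

# Quartile bins on price; labels=False yields integer codes 0..3.
train["price_bin"] = pd.qcut(train["price"], q=4, labels=False)

# Mean of the target per bin, computed on the training rows only.
bin_means = train.groupby("price_bin")["sales"].mean()

# Replace each bin code with the mean target of its bin.
train["price_bin_enc"] = train["price_bin"].map(bin_means)

print(train["price_bin_enc"].tolist())
# [11.0, 11.0, 7.0, 7.0, 4.0, 4.0, 1.0, 1.0]
```

Note that encoding each row with a mean that includes its own target can leak target information into the feature. Computing the means out-of-fold is a common way to reduce that.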

【Comments】: