Feature engineering using Python: answers

【Question title】: Feature engineering using Python
【Posted】: 2019-10-11 02:25:26
【Question description】:

I have a pandas dataset in which one column looks like this:

         Genre
        ------------
         Documentary
         Documentary
         Comedy|Mystery|Thriller
         Animation|Comedy|Family
         Documentary
         Documentary|Family
         Action|Adventure|Fantasy|Sci-Fi
         Crime|Drama|Mystery
         Action|Crime|Mystery|Thriller

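How can I create one column per genre name, filled with 1 if the row contains that genre and 0 otherwise?

Expected output: a pandas DataFrame

  Documentary  Comedy  Mystery  Thriller  Animation  Family  ......
            1       0        0         0          0       0
            1       0        0         0          0       0
            0       1        1         1          0       0

and so on.

I tried converting the column to a list first and then splitting it, but that is not a Pythonic way to do it.

Can this be done efficiently with the apply function or some other technique?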

【Comments】:

get_dummies? pandas.pydata.org/pandas-docs/stable/reference/api/…

Tags: python-3.x pandas feature-extraction

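【Solution 1】:

Use Series.explode + pd.get_dummies:

import pandas as pd

# Split each string on '|' and give every genre its own row (the original index is repeated)
s_explode = df['Genre'].str.split('|').explode()
# One indicator column per genre, then collapse back to one row per original index
dfc = pd.get_dummies(s_explode).groupby(level=0).sum()
# Put the indicator columns next to the original Genre column
new_df = pd.concat([df['Genre'], dfc], axis=1)
print(new_df)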

                              Genre  Action  Adventure  Animation  Comedy  \
0                      Documentary       0          0          0       0   
1                      Documentary       0          0          0       0   
2          Comedy|Mystery|Thriller       0          0          0       1   
3          Animation|Comedy|Family       0          0          1       1   
4                      Documentary       0          0          0       0   
5               Documentary|Family       0          0          0       0   
6  Action|Adventure|Fantasy|Sci-Fi       1          1          0       0   
7              Crime|Drama|Mystery       0          0          0       0   
8    Action|Crime|Mystery|Thriller       1          0          0       0   

   Crime  Documentary  Drama  Family  Fantasy  Mystery  Sci-Fi  Thriller  
0      0            1      0       0        0        0       0         0  
1      0            1      0       0        0        0       0         0  
2      0            0      0       0        0        1       0         1  
3      0            0      0       1        0        0       0         0  
4      0            1      0       0        0        0       0         0  
5      0            1      0       1        0        0       0         0  
6      0            0      0       0        1        0       1         0  
7      1            0      1       0        0        1       0         0  
8      1            0      0       0        0        1       0         1
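
One caveat with the approach above: if a row lists the same genre twice, sum() yields 2 rather than 1. The following is a minimal sketch of a variant that stays binary by using max() instead. The sample data is made up for illustration and is not from the question.

```python
import pandas as pd

# Hypothetical sample data; the second row repeats "Comedy" on purpose.
df = pd.DataFrame({"Genre": ["Documentary", "Comedy|Mystery|Comedy", "Action|Crime"]})

# One row per (original index, genre) pair.
exploded = df["Genre"].str.split("|").explode()

# Indicator columns per genre, collapsed back to one row per original index.
# max() keeps every value at 0/1 even when a genre is repeated within a row.
dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

result = pd.concat([df["Genre"], dummies], axis=1)
print(result)
```

If the data is known to contain no repeated genres within a row, sum() and max() give the same result.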

【Comments】:

Awesome. Are there any books, references, or websites that cover pandas and numpy for DataFrames in depth? Thanks in advance.
data['genres'].str.split('|').explode() gives an error: AttributeError: 'Series' object has no attribute 'explode'
You need pandas 0.25.0 or later (Series.explode was added in that release); upgrading to 0.25.1 works.
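
If upgrading is not an option, a similar long-format Series can be built without explode. This is a sketch under the assumption that str.split(expand=True) followed by stack() is an acceptable substitute; the sample data is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's Genre column.
df = pd.DataFrame({"Genre": ["Documentary", "Comedy|Mystery|Thriller", "Documentary|Family"]})

# expand=True spreads the split parts across columns; stack() reshapes them
# into one long Series indexed by (row, position). dropna() makes sure the
# padding left by rows with fewer genres is gone.
stacked = df["Genre"].str.split("|", expand=True).stack().dropna()

# Same final step as with explode: indicators per genre, summed per original row.
dummies = pd.get_dummies(stacked).groupby(level=0).sum().astype(int)
print(dummies)
```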

【Solution 2】:

Using str.get_dummies is simple and direct:

df1 = df.Genre.str.get_dummies('|')

Out[385]:
   Action  Adventure  Animation  Comedy  Crime  Documentary  Drama  Family  \
0       0          0          0       0      0            1      0       0
1       0          0          0       0      0            1      0       0
2       0          0          0       1      0            0      0       0
3       0          0          1       1      0            0      0       1
4       0          0          0       0      0            1      0       0
5       0          0          0       0      0            1      0       1
6       1          1          0       0      0            0      0       0
7       0          0          0       0      1            0      1       0
8       1          0          0       0      1            0      0       0

   Fantasy  Mystery  Sci-Fi  Thriller
0        0        0       0         0
1        0        0       0         0
2        0        1       0         1
3        0        0       0         0
4        0        0       0         0
5        0        0       0         0
6        1        0       1         0
7        0        1       0         0
8        0        1       0         1
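
To keep the original Genre column next to the indicator columns, the result can be joined back onto the frame. A minimal self-contained sketch, with sample data made up for illustration:

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's Genre column.
df = pd.DataFrame({"Genre": ["Documentary", "Comedy|Mystery|Thriller", "Documentary|Family"]})

# str.get_dummies splits on the separator and builds 0/1 columns in one step.
indicators = df["Genre"].str.get_dummies(sep="|")

# join aligns on the index, so each row keeps its original Genre string.
result = df.join(indicators)
print(result)
```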

【Comments】: