How to one hot encode variant length features? Answers

【Question title】: How to one hot encode variant length features?
【Posted】: 2017-07-12 11:35:49
【Question description】:

Given a list of variable-length features:

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

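Each sample has a different number of features; the feature dtype is str, and the features are already one-hot (each listed name means that feature is present for the sample).

To use sklearn's feature selection utilities, I have to convert features into a 2D array like this: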

    f1  f2  f3  f4  f5  f6
s1   1   1   1   0   0   0
s2   0   1   0   1   1   1
s3   1   1   0   0   0   0

How can I achieve this with sklearn or numpy?

Tags: python pandas numpy scikit-learn

【Solution 1】:

Here is an approach that uses NumPy methods and outputs a pandas DataFrame:

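import numpy as np
import pandas as pd

# Number of features in each sample, and number of samples
lens = list(map(len, features))
N = len(lens)

# unq: sorted unique feature names; col: column index of every flattened feature
unq, col = np.unique(np.concatenate(features), return_inverse=True)

# row: sample index of every flattened feature
row = np.repeat(np.arange(N), lens)

# Set a 1 wherever a sample contains a feature
out = np.zeros((N, len(unq)), dtype=int)
out[row, col] = 1

# Label rows s1, s2, ... and columns with the feature names
indx = ['s' + str(i + 1) for i in range(N)]
df_out = pd.DataFrame(out, columns=unq, index=indx)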

Sample input and output:

In [80]: features
Out[80]: [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]

In [81]: df_out
Out[81]: 
    f1  f2  f3  f4  f5  f6
s1   1   1   1   0   0   0
s2   0   1   0   1   1   1
s3   1   1   0   0   0   0
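
To make the indexing step easier to follow, here is a small sketch that prints the intermediate arrays for the example input from the question. The variable name `flat` is introduced here for illustration only; the rest follows the code above.

```python
import numpy as np

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

lens = [len(f) for f in features]
flat = np.concatenate(features)  # all feature names in one 1-D array

# unq holds the sorted unique names; col maps each entry of flat to its column
unq, col = np.unique(flat, return_inverse=True)

# row repeats each sample index once per feature in that sample
row = np.repeat(np.arange(len(features)), lens)

print(lens)          # [3, 4, 2]
print(unq.tolist())  # ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']
print(row.tolist())  # [0, 0, 0, 1, 1, 1, 1, 2, 2]
print(col.tolist())  # [0, 1, 2, 1, 3, 4, 5, 0, 1]
```

Each (row, col) pair marks one cell to set to 1, so a feature repeated within a sample would simply set the same cell twice.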

【Solution 2】:

You can use MultiLabelBinarizer from scikit-learn, which is designed for exactly this.

Sample code:

from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2']
]

mlb = MultiLabelBinarizer()
new_features = mlb.fit_transform(features)

Output:

array([[1, 1, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 0]])
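
The array above has no labels. As a minimal sketch, the column names can be read from the fitted binarizer's `classes_` attribute and combined with pandas to get the table asked for in the question. The row labels `s1`, `s2`, `s3` are an assumption taken from the question's example and are not produced by scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(features)

# Assumed row labels s1..sN, matching the layout shown in the question
sample_ids = [f"s{i + 1}" for i in range(len(features))]

# mlb.classes_ holds the feature names in column order
df = pd.DataFrame(encoded, columns=mlb.classes_, index=sample_ids)
print(df)
```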

The array returned by fit_transform can also be fed to the feature_selection utilities, including as input to a pipeline.
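
As one possible illustration, the sketch below passes the binarized array to `VarianceThreshold`, chosen here only because it needs no target labels; the answer does not name a specific selector. The binarizer is applied before the selector rather than as a Pipeline step, since MultiLabelBinarizer is written as a label transformer whose fit and transform take a single argument and may not work directly inside a Pipeline.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

# Step 1: binarize the variable-length feature lists
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(features)

# Step 2: drop constant columns (the default threshold is 0.0)
selector = VarianceThreshold()
selected = selector.fit_transform(encoded)

# Names of the columns that survived selection
kept = mlb.classes_[selector.get_support()]
print(kept.tolist())   # ['f1', 'f3', 'f4', 'f5', 'f6']
print(selected.shape)  # (3, 5)
```

In this example f2 appears in every sample, so its column is constant and is the only one removed.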
