How to one-hot-encode from a pandas column composed of a list of space containing strings? Answers

【Question title】: How to one-hot-encode from a pandas column composed of a list of space containing strings?
【Posted】: 2020-09-30 16:55:23
【Question description】:

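I am trying to convert a single pandas column made up of lists of strings (the strings contain spaces) into one-hot-encoded columns. Because the data is read from Excel, each list arrives as a string rather than as a real list. I have tried the methods discussed in the answers to "How to one-hot-encode from a pandas column containing a list?", but they do not work in this case: individual letters get split into separate columns. A further condition is that the list can sometimes be empty.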
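To illustrate why the list-based approaches end up with one column per letter, here is a minimal sketch (the variable names are illustrative and not from the question). A cell read from Excel is a plain string, so iterating over it yields characters, whereas `ast.literal_eval` parses it into a real list.

```python
import ast

# A cell as it arrives from Excel: a string that merely looks like a list.
cell = "['Chocolate cake', 'Salmon']"

# Iterating over a string yields single characters, which is why
# list-based one-hot approaches produce one column per letter.
chars = list(cell)[:4]

# ast.literal_eval safely parses the string into an actual Python list.
parsed = ast.literal_eval(cell)

# Empty lists parse without any special handling.
empty = ast.literal_eval("[]")

print(chars)
print(parsed)
print(empty)
```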

More specifically, for the dataframe df

Col1   Col2         Col3
 C      4     "['Chocolate cake', 'White wine', 'Peanuts']"
 A      1.7   "[]"
 B      1     "['Chocolate cake', 'Salmon']"

I would like to create:

new_df

Col1   Col2   Chocolate cake White wine Peanuts Salmon
 C      4             1           1        1      0
 A      1.7           0           0        0      0
 B      1             1           0        0      1

What is the Pythonic way to create this dataframe?

Edit:

Code to generate the pandas dataframe.

import pandas as pd
df = pd.DataFrame.from_dict({'Col1':['C', 'A', 'B'], 'Col2':[4, 1.7, 1], 'Col3':["['Chocolate cake', 'White wine', 'Peanuts']",
                                                          "[]",  "['Chocolate cake', 'Salmon']" ]})

【Comments】:

Can you reproduce the sample data with code?
Sure, one moment.
@Divakar Added the code.

Tags: python numpy scikit-learn

【Solution 1】:

Here is one approach -

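import ast
import numpy as np
import pandas as pd

dfC3 = [ast.literal_eval(i) for i in df.Col3]  # parse each string cell into a real list
ids,U = pd.factorize(np.concatenate(dfC3))  # ids: integer code per label, U: unique labels
df_out = pd.DataFrame([np.isin(U,i) for i in dfC3], columns=U).astype(int)  # one row per record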

Sample output -

In [50]: df_out
Out[50]: 
   Chocolate cake  Peanuts  Salmon  White wine
0               1        1       0           1
1               0        0       0           0
2               1        0       1           0

If you need to concatenate it with the input df, use pd.concat([df,df_out],axis=1).
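One thing to watch when concatenating: `pd.concat(..., axis=1)` aligns rows on the index, not on position. The sketch below uses a hand-written stand-in for the encoded frame and a hypothetical non-default index (neither is from the answer) to show that passing `index=df.index` keeps the rows lined up.

```python
import pandas as pd

# Input frame with a hypothetical non-default index (an assumption for illustration).
df = pd.DataFrame(
    {"Col1": ["C", "A", "B"], "Col2": [4, 1.7, 1]},
    index=[10, 11, 12],
)

# Stand-in for the encoded frame; index=df.index keeps its rows aligned with df.
df_out = pd.DataFrame(
    [[1, 1, 0, 1], [0, 0, 0, 0], [1, 0, 1, 0]],
    columns=["Chocolate cake", "Peanuts", "Salmon", "White wine"],
    index=df.index,
)

# Column-wise concatenation; rows are matched by index label.
new_df = pd.concat([df, df_out], axis=1)
print(new_df)
```

Without `index=df.index`, the encoded frame would get the default 0, 1, 2 index and the concatenation would produce extra rows filled with NaN.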

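Using array assignment for better performance

If needed for large datasets, we can use array assignment for better performance (reusing ids,U from the earlier approach) -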

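lens = list(map(len,dfC3))  # number of labels in each row
mask = np.zeros((len(lens),len(U)), dtype=int)  # one row per record, one column per unique label
mask[np.repeat(range(len(lens)),lens), ids] = 1  # set each (row, label code) pair to 1
df_out = pd.DataFrame(mask, columns=U)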
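The array-assignment idea can be traced step by step on the sample data. The following is a minimal sketch, assuming the lists have already been parsed. The `rows` variable and the `itertools.chain` flattening are illustrative choices, not part of the answer.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Parsed lists from the sample data (already passed through ast.literal_eval).
dfC3 = [["Chocolate cake", "White wine", "Peanuts"], [], ["Chocolate cake", "Salmon"]]

# Flatten all labels into one array, then assign an integer code to each label.
flat = np.array(list(chain.from_iterable(dfC3)))
ids, U = pd.factorize(flat)  # U holds unique labels in order of first appearance

# Repeat each row number once per label in that row; empty rows contribute nothing.
lens = [len(x) for x in dfC3]
rows = np.repeat(np.arange(len(lens)), lens)

# Fancy-index assignment: set every (row, label code) pair to 1 in one step.
mask = np.zeros((len(lens), len(U)), dtype=int)
mask[rows, ids] = 1

df_out = pd.DataFrame(mask, columns=U)
print(df_out)
```

With the default `pd.factorize` settings the columns follow the order in which labels first appear. Passing `sort=True` to `pd.factorize` should give alphabetically ordered columns instead.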

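【Discussion】:

Thanks, Divakar. I made a key change to the question that stops this from working for my use case; apologies for that. Unfortunately, because of how pd.read_excel works, the input is encoded as strings.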

【Solution 2】:

You just need to use ast.literal_eval to convert Col3 into a series of lists

import ast
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

s = df['Col3'].map(ast.literal_eval)
mlb = MultiLabelBinarizer()

df_mlb = pd.DataFrame(mlb.fit_transform(s), columns=mlb.classes_, index=df.index)

Out[29]:
   Chocolate cake  Peanuts  Salmon  White wine
0               1        1       0           1
1               0        0       0           0
2               1        0       1           0
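If scikit-learn is not available, a similar result can probably be reached with pandas alone. The sketch below is a swapped-in alternative, not the answer's method: it parses the strings, uses `Series.explode` to get one row per label, one-hot-encodes with `pd.get_dummies`, and collapses back to one row per record.

```python
import ast

import pandas as pd

df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [4, 1.7, 1],
    "Col3": ["['Chocolate cake', 'White wine', 'Peanuts']", "[]",
             "['Chocolate cake', 'Salmon']"],
})

# Parse each string cell into a real list.
s = df["Col3"].map(ast.literal_eval)

# explode() gives one row per label; an empty list becomes NaN,
# which get_dummies encodes as an all-zero row.
dummies = pd.get_dummies(s.explode(), dtype=int)

# Collapse back to one row per original index label.
encoded = dummies.groupby(level=0).max()

# Attach the encoded columns to the remaining input columns.
new_df = df.drop(columns="Col3").join(encoded)
print(new_df)
```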

【Discussion】:

【Solution 3】:

One way using pd.Series.str.findall, join and get_dummies:

new_df = df.pop("Col3").str.findall("'(.+?)'").str.join("|").str.get_dummies()
new_df = pd.concat([df, new_df], axis=1)
print(new_df)

Output:

  Col1  Col2  Chocolate cake  Peanuts  Salmon  White wine
0    C   4.0               1        1       0           1
1    A   1.7               0        0       0           0
2    B   1.0               1        0       1           0
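To see what the regular expression does to a single cell, here is a minimal sketch using the standard `re` module on one string (the variable names are illustrative). The non-greedy group `(.+?)` captures the text between each pair of single quotes, and the labels are then joined with `|`, the default separator that `str.get_dummies` splits on.

```python
import re

cell = "['Chocolate cake', 'White wine', 'Peanuts']"

# Non-greedy match: capture the shortest run of text between two single quotes.
labels = re.findall(r"'(.+?)'", cell)

# str.get_dummies splits on "|" by default, so the labels are joined with it.
joined = "|".join(labels)

# An empty list contains no quoted text, so nothing is captured.
empty = re.findall(r"'(.+?)'", "[]")

print(labels)
print(joined)
print(empty)
```

This route assumes that no label contains a single quote or a `|` character. A label with either would likely be mis-parsed or split, which the `ast.literal_eval` approaches avoid.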

【Discussion】: