For better performance, use MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

# Fit on the lists of labels and build one 0/1 indicator column per label
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)
   a  b  c  d  e  f
0  1  1  1  0  0  0
1  0  1  1  0  0  0
2  0  0  1  1  1  1
Edit:
data = [['a', 'about', 'absurd', 'again', 'an', 'associates',
         'writes', 'wrote', 'x', 'york', 'you', 'your'],
        ['a', 'abiding', 'age', 'aggravated', 'aggressively', 'all',
         'almost', 'alone', 'already', 'also', 'although']]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
print(df)
   a  abiding  about  absurd  again  age  aggravated  aggressively  all  \
0  1        0      1       1      1    0           0             0    0
1  1        1      0       0      0    1           1             1    1

   almost  ...  also  although  an  associates  writes  wrote  x  york  you  \
0       0  ...     0         0   1           1       1      1  1     1    1
1       1  ...     1         1   0           0       0      0  0     0    0

   your
0     1
1     0

[2 rows x 22 columns]
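With a much larger vocabulary than the 22 columns here, most cells are 0, so a sparse result may save memory. The following is a minimal sketch, assuming scikit-learn's `sparse_output=True` option and pandas' `DataFrame.sparse.from_spmatrix`; the names `sparse_matrix` and `df_sparse` are only illustrative, and the small three-row data is reused to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

# sparse_output=True makes fit_transform return a SciPy sparse matrix
# instead of a dense NumPy array.
mlb = MultiLabelBinarizer(sparse_output=True)
sparse_matrix = mlb.fit_transform(data)

# Wrap the sparse matrix in a DataFrame whose columns use sparse storage.
df_sparse = pd.DataFrame.sparse.from_spmatrix(sparse_matrix, columns=mlb.classes_)

# Convert back to a dense frame only for display.
print(df_sparse.sparse.to_dense())
```

Whether this is actually faster or smaller depends on how sparse the data is, so it is worth measuring on the real input.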
A pure pandas solution is possible, but I expect it to be slower:
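# Requires pandas < 2.0: the level= argument of max() was removed in pandas 2.0.
# (data here is the first, three-row example.)
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)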
   a  b  d  c  e  f
0  1  1  0  1  0  0
1  0  1  0  1  0  0
2  0  0  1  1  1  1
# The same call applied to the longer word lists from the edit:
df = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='').max(level=0, axis=1)
print(df)
   a  abiding  about  absurd  age  again  aggravated  aggressively  an  all  \
0  1        0      1       1    0      1           0             0   1    0
1  1        1      0       0    1      0           1             1   0    1

   ...  writes  alone  wrote  already  x  also  york  although  you  your
0  ...       1      0      1        0  1     0     1         0    1     1
1  ...       0      1      0        1  0     1     0         1    0     0

[2 rows x 22 columns]
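On pandas 2.0 and later, `max()` no longer accepts `level=`, so the pure pandas approach needs a different way to merge the duplicate label columns. The following is a sketch of one possible equivalent, not the original method: it transposes, groups by label, and transposes back. The `dtype=int` argument and the names `dummies` and `df_pandas` are my own additions for illustration.

```python
import pandas as pd

data = [["a", "b", "c"],
        ["b", "c"],
        ["d", "e", "f", "c"]]

# One indicator column per (position, label) pair. dtype=int keeps 0/1
# values instead of the boolean columns newer pandas returns by default.
dummies = pd.get_dummies(pd.DataFrame(data), prefix='', prefix_sep='', dtype=int)

# Merge columns that share a label: transpose so labels form the row index,
# group rows by label, take the max, then transpose back.
# sort=False keeps labels in order of first appearance.
df_pandas = dummies.T.groupby(level=0, sort=False).max().T

print(df_pandas)
```

The two transposes add overhead, which is one more reason to expect this route to be slower than MultiLabelBinarizer on large inputs.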