Your example
I have no data to test this with, so it is untested, but the following should work (replace 'email_column_name' with the name of your email column):
dates = pd.get_dummies(
roulette_data.set_index('email_column_name')['date']\
.str.split(';',expand=True)\
.stack().reset_index(level=1, drop=True)
)\
.reset_index().groupby('email_column_name').sum()
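Since the real data isn't available, here is a minimal sketch of the same expression run on made-up data. The frame contents, the column name `email`, and the dates are all assumptions for illustration, not taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for roulette_data: one row per email, with
# ';'-separated dates in the 'date' column (assumed layout).
roulette_data = pd.DataFrame({
    'email': ['a@example.com', 'b@example.com'],
    'date': ['2020-01-01;2020-01-02', '2020-01-02'],
})

dates = pd.get_dummies(
    roulette_data.set_index('email')['date']
    .str.split(';', expand=True)               # one column per date
    .stack().reset_index(level=1, drop=True)   # one row per (email, date)
).reset_index().groupby('email').sum()         # collapse back to one row per email

print(dates)
```

Each row of `dates` should then hold one count per distinct date for that email.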
Here is a toy example:
import pandas as pd

df = pd.DataFrame({'col1': ['record1', 'record2'],
                   'col2': ["this is good text", "but this is even better"]})
df
#      col1                     col2
#0  record1        this is good text
#1  record2  but this is even better
We first set the index to col1, then select col2 so that we can use its .str.split method to split each row into individual words.
df.set_index('col1')['col2'].str.split(expand=True)
#            0     1     2     3       4
#col1
#record1  this    is  good  text    None
#record2   but  this    is  even  better
Then we reshape with stack and use reset_index to drop the index level we no longer need.
df.set_index('col1')['col2'].str.split(expand=True)\
.stack().reset_index(level=1, drop=True)
#col1
#record1 this
#record1 is
#record1 good
#record1 text
#record2 but
#record2 this
#record2 is
#record2 even
#record2 better
#dtype: object
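On pandas 0.25 or later, the split/stack/reset_index step can likely be shortened with Series.explode. This is an alternative sketch, not what the expression above uses; the variable name `words` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['record1', 'record2'],
                   'col2': ["this is good text", "but this is even better"]})

# Split each row into a list of words, then give every word its own row.
# The col1 index value is repeated once per word.
words = df.set_index('col1')['col2'].str.split().explode()

print(words)
```

The resulting Series can be passed to pd.get_dummies() in the same way as the stacked one.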
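We wrap the whole expression in pd.get_dummies(). (In pandas 2.0 and later the indicator columns are boolean by default; pass dtype=int to get the 0/1 output shown here.)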
pd.get_dummies(df.set_index('col1')['col2'].str.split(expand=True).stack().reset_index(level=1, drop=True))
#         better  but  even  good  is  text  this
#col1
#record1       0    0     0     0   0     0     1
#record1       0    0     0     0   1     0     0
#record1       0    0     0     1   0     0     0
#record1       0    0     0     0   0     1     0
#record2       0    1     0     0   0     0     0
#record2       0    0     0     0   0     0     1
#record2       0    0     0     0   1     0     0
#record2       0    0     1     0   0     0     0
#record2       1    0     0     0   0     0     0
Final result
Finally we call reset_index to turn the index (col1, or in your case the email column) back into a column, group by col1, and sum.
pd.get_dummies(
df.set_index('col1')['col2']\
.str.split(expand=True)\
.stack().reset_index(level=1, drop=True)
)\
.reset_index().groupby('col1').sum()
#         better  but  even  good  is  text  this
#col1
#record1       0    0     0     1   1     1     1
#record2       1    1     1     0   1     0     1
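For what it's worth, pandas also has Series.str.get_dummies, which splits on a separator and builds the indicator columns in one call. This is a different technique from the one above, sketched here on the same toy data; for the email case you would presumably pass sep=';' instead.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['record1', 'record2'],
                   'col2': ["this is good text", "but this is even better"]})

# Split col2 on single spaces and build one indicator column per word.
indicators = df.set_index('col1')['col2'].str.get_dummies(sep=' ')

print(indicators)
```

One difference to keep in mind: this yields 0/1 indicators per row, whereas the groupby/sum version counts a word twice if it appears twice in the same record.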