How to match values and assign them to new columns based on another column's string: answers

【Question title】: How can I match the value and assign to them a new column based on other column string
【Posted】: 2021-05-22 03:42:31
【Question description】:

DataFrame:

MovieID	movieCater	rating
1	Action, Comedy, Adventure	4
2	Action, Crime	3
3	Crime	2

What I want:

MovieID	movieCater	Action	Comedy	Adventure	Crime
1	Action, Comedy, Adventure	4	4	4	0
2	Action, Crime	3	0	0	3
3	Crime	0	0	0	2

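【Question discussion】:

First, you need to split the string to get the individual values from the movieCater field, then normalize them, and finally pivot. You can read the pivot documentation here: pandas.pydata.org/pandas-docs/stable/reference/api/…

Tags: python pandas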

【Solution 1】:

You can also call get_dummies on the movieCater column and multiply the result by the rating:

out = df.join(df['movieCater'].str.get_dummies(",").mul(df['rating'],axis=0))

print(out)

   MovieID               movieCater  rating  Action  Adventure  Comedy  Crime
0        1  Action,Comedy,Adventure       4       4          4       4      0
1        2             Action,Crime       3       3          0       0      3
2        3                    Crime       2       0          0       0      2

To match the expected output, drop the rating column before joining:

out = df.drop(columns="rating").join(
      df['movieCater'].str.get_dummies(",").mul(df['rating'],axis=0))
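Note that the sample data in the question has a space after each comma, whereas the output above was produced from strings without spaces. With sep="," the spaced strings would yield separate columns such as " Crime" and "Crime". The following is a minimal sketch, assuming the spacing around the commas may be inconsistent, that normalizes the separators first. The names genres and dummies are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample frame from the question; note the space after each comma.
df = pd.DataFrame({
    "MovieID": [1, 2, 3],
    "movieCater": ["Action, Comedy, Adventure", "Action, Crime", "Crime"],
    "rating": [4, 3, 2],
})

# Collapse any whitespace around the commas so each genre yields one clean label.
genres = df["movieCater"].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

# One indicator column per genre, scaled row-wise by the rating.
dummies = genres.str.get_dummies(",").mul(df["rating"], axis=0)

# Drop the original rating column and attach the per-genre columns.
out = df.drop(columns="rating").join(dummies)
print(out)
```

If the separator is known to always be a comma followed by one space, passing sep=", " to str.get_dummies directly should give the same columns without the replace step.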

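If the Series is very large, you could also consider splitting the strings with sep="," and then using the following solution to get the dummies: https://stackoverflow.com/a/51420716/9840637. Finally, multiply by the rating column.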
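The linked answer is not reproduced here. As one possible reading of "split, get the dummies, then multiply", the sketch below uses explode plus pd.get_dummies. This is an assumption about the general shape of such an approach, not the linked author's code, and it makes no claim about relative speed.

```python
import pandas as pd

df = pd.DataFrame({
    "MovieID": [1, 2, 3],
    "movieCater": ["Action, Comedy, Adventure", "Action, Crime", "Crime"],
    "rating": [4, 3, 2],
})

# Split on the comma, give each genre its own row (the original index is
# repeated once per genre), and trim stray spaces.
genres = df["movieCater"].str.split(",").explode().str.strip()

# One-hot encode the long Series, then collapse back to one row per movie.
dummies = pd.get_dummies(genres, dtype=int).groupby(level=0).max()

# Scale the 0/1 indicators by each row's rating and attach them.
result = df.drop(columns="rating").join(dummies.mul(df["rating"], axis=0))
print(result)
```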

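【Discussion】:

Hi, I updated the question and added an image to it, but I don't know how to show the description. Maybe you can check my question through the edit history.
@Leo - do you need out = df.join(df['movieCater'].str.join(',').str.get_dummies(",").mul(df['rating'],axis=0))?
@anky - for the second solution, df.join(df['movieCater'].str.get_dummies(",").mul(df.pop('rating'),axis=0)) should work.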

【Solution 2】:

Here is an approach that uses the .pivot_table() method.

First, convert the movieCater column into lists with the .apply() method.

>>> df
   MovieID                 movieCater  rating
0        1  Action, Comedy, Adventure       4
1        2              Action, Crime       3
2        3                      Crime       2
>>> df.assign(movieCater_list = df['movieCater'].apply(lambda x: x.split(', ')))
   MovieID                 movieCater  rating              movieCater_list
0        1  Action, Comedy, Adventure       4  [Action, Comedy, Adventure]
1        2              Action, Crime       3              [Action, Crime]
2        3                      Crime       2                      [Crime]

Next, call the .explode() method on the movieCater_list column.

>>> df.assign(movieCater_list = df['movieCater'].apply(lambda x: x.split(', '))).explode('movieCater_list')
   MovieID                 movieCater  rating movieCater_list
0        1  Action, Comedy, Adventure       4          Action
0        1  Action, Comedy, Adventure       4          Comedy
0        1  Action, Comedy, Adventure       4       Adventure
1        2              Action, Crime       3          Action
1        2              Action, Crime       3           Crime
2        3                      Crime       2           Crime

Finally, use .pivot_table():

>>> df.assign(movieCater_list = df['movieCater'].apply(lambda x: x.split(', '))).explode('movieCater_list').pivot_table(values='rating', index=['MovieID', 'movieCater'], columns='movieCater_list', fill_value=0)
movieCater_list                    Action  Adventure  Comedy  Crime
MovieID movieCater
1       Action, Comedy, Adventure       4          4       4      0
2       Action, Crime                   3          0       0      3
3       Crime                           0          0       0      2

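【Discussion】:

Hi, I updated the question and added an image to it, but I don't know how to show the description. Maybe you can check my question through the edit history.
Just checked your edit. Using pd.get_dummies() from anky's solution is a better way to solve your problem. .pivot_table() relies on aggregation (mean by default), which causes problems when there may be duplicates in the index argument.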
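To illustrate the aggregation caveat in the comment above, here is a small sketch on hypothetical data (the frame dup is invented for illustration and is not from the question). MovieID 1 appears twice with different ratings, and pivot_table averages them into a single cell, whereas the get_dummies approach would keep both rows.

```python
import pandas as pd

# Hypothetical data: MovieID 1 appears twice with different ratings.
dup = pd.DataFrame({
    "MovieID": [1, 1, 2],
    "movieCater": ["Action", "Action", "Crime"],
    "rating": [4, 2, 3],
})

# With the default aggfunc (mean), the two ratings for (1, Action) are averaged.
averaged = dup.pivot_table(index="MovieID", columns="movieCater",
                           values="rating", fill_value=0)
print(averaged)

# Passing aggfunc explicitly makes the choice visible, e.g. keep the highest rating.
highest = dup.pivot_table(index="MovieID", columns="movieCater",
                          values="rating", aggfunc="max", fill_value=0)
print(highest)
```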

【Solution 3】:

Here is a solution you can try: first split on the delimiter, then explode, and finally pivot_table.

print(
    df.assign(movieCater=df['movieCater'].str.split(","))
        .explode(column='movieCater')
        .pivot_table(index='MovieID', columns='movieCater', values='rating', fill_value=0)
)

movieCater  Action  Adventure  Comedy  Crime
MovieID                                     
1                4          4       4      0
2                3          0       0      3
3                0          0       0      2

【Discussion】:

【Solution 4】:

Assume the input DataFrame is

import pandas as pd

df = pd.DataFrame({
    'MovieID': ['001','002','003'], 
    'movieCat': ['Action, Comedy, Adventure', 'Action, Crime', 'Crime'], 
    'rating': [4,3,2]
    })

#output
    MovieID movieCat                    rating
0   001     Action, Comedy, Adventure   4
1   002     Action, Crime               3
2   003     Crime                       2

I reused code similar to @sushanth's to get the final table:

df['temp'] = df['movieCat'].str.split(", ")
df = df.explode(column='temp').pivot_table(index=['MovieID', 'movieCat'], columns='temp', values='rating', fill_value=0)
df.columns.name=None
df.reset_index(inplace=True)

#output
    MovieID movieCat                    Adventure   Comedy  Crime   Action
0   001     Action, Comedy, Adventure   4           4       0       4
1   002     Action, Crime               0           0       3       3
2   003     Crime                       0           0       2       0

After that, you can export the table to Excel:

df.to_excel('my_file.xlsx', index=False)

【Discussion】: