Grouping columns in Python: answers

【Question title】: grouping columns in python
【Posted】: 2021-09-13 22:58:59
【Question】:

Hello, I am using Python to work with a dataset in a .csv file, and I am getting an error when grouping columns. The code I am using is:

import pandas as pd 
df=pd.read_excel('filepath')
df['Items'].str.split(',', expand=True)
df=df.groupby(['Items0', 'Items1','Items2', 'Items3', 'Items4', 'Items5' ]).size()
print(df)

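When I run print(df), I get values such as Items0-1, Items1-1, Items2-1, and so on.

Here is the sample data I am working with, followed by how I am trying to organize it.

Can anyone guide me on how to solve this?

Sample data: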

Name	Date	Items
johnny smith	09/1/2021	bread, oranges, peanut butter, apples, celery, peanuts
granny smith	08/31/2021	oranges, peanut butter, apples, bread
jane doe	09/01/2021	oranges, apples, celery, peanut butter
jack frost	08/01/2021	bread, oranges, apples
cinderella	08/16/2021	apples, peanuts, bread

What I am trying to achieve:

Name	Date	Items0	Items1	Items2	Items3	Items4	Items5
johnny smith	09/1/2021	bread	oranges	peanut butter	apples	celery	peanuts
granny smith	08/31/2021	bread	oranges	peanut butter	apples
jane doe	09/01/2021		oranges	peanut butter	apples
jack frost	08/01/2021	bread	oranges		apples
cinderella	08/16/2021	bread			apples		peanuts

【Question comments】:

Tags: python pandas pandas-groupby

【Solution 1】:

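One approach is to derive indicator (dummy) values from the strings in Items, then reshape the resulting DataFrame and join it back onto the remaining columns: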

x = df.pop('Items').str.get_dummies(', ')
df = df.join(
    x.mul(x.columns).set_axis(
        range(len(x.columns)), axis=1
    ).add_prefix('Item')
)

           Name        Date   Item0  Item1   Item2    Item3          Item4    Item5
0  johnny smith   09/1/2021  apples  bread  celery  oranges  peanut butter  peanuts
1  granny smith  08/31/2021  apples  bread          oranges  peanut butter         
2      jane doe  09/01/2021  apples         celery  oranges  peanut butter         
3    jack frost  08/01/2021  apples  bread          oranges                        
4    cinderella  08/16/2021  apples  bread                                  peanuts

Explanation: use str.get_dummies to convert the Items column into indicator values, one 0/1 column per distinct item:

x = df.pop('Items').str.get_dummies(', ')

   apples  bread  celery  oranges  peanut butter  peanuts
0       1      1       1        1              1        1
1       1      1       0        1              1        0
2       1      0       1        1              1        0
3       1      1       0        1              0        0
4       1      1       0        0              0        1

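This can then be multiplied by the column names with mul (each 1 becomes its column name and each 0 becomes an empty string), renumbered with set_axis (0 up to the number of items minus one), and given the prefix 'Item' on the newly numbered columns with add_prefix: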

x.mul(x.columns).set_axis(
    range(len(x.columns)), axis=1
).add_prefix('Item')

Result:

    Item0  Item1   Item2    Item3          Item4    Item5
0  apples  bread  celery  oranges  peanut butter  peanuts
1  apples  bread          oranges  peanut butter         
2  apples         celery  oranges  peanut butter         
3  apples  bread          oranges                        
4  apples  bread                                  peanuts
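
The mul step relies on the fact that multiplying a string by 1 leaves it unchanged and multiplying it by 0 yields an empty string. As a hedged alternative that does not depend on integer-by-string multiplication, the same replacement can be sketched with numpy.where. This swaps in a different technique from the answer's mul; the sample Series and the names `mask`, `names`, and `labels` are illustrative assumptions, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Sample data from the question (Name and Date columns omitted for brevity).
items = pd.Series([
    'bread, oranges, peanut butter, apples, celery, peanuts',
    'oranges, peanut butter, apples, bread',
    'oranges, apples, celery, peanut butter',
    'bread, oranges, apples',
    'apples, peanuts, bread',
])

x = items.str.get_dummies(', ')              # one 0/1 indicator column per item
mask = x.to_numpy().astype(bool)             # True where the row contains the item
names = np.asarray(x.columns, dtype=object)  # item names, one per column

# Keep the column name where the indicator is set, otherwise an empty string.
labels = pd.DataFrame(
    np.where(mask, names, ''),
    index=x.index,
    columns=[f'Item{i}' for i in range(x.shape[1])],
)
print(labels)
```

The resulting `labels` frame plays the same role as the mul/set_axis/add_prefix chain above and can be joined back to df in the same way.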

Then join the result back onto the remaining columns in df.
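
Note that the output above orders the item columns alphabetically, whereas the layout requested in the question follows the order in which items first appear (bread, oranges, peanut butter, ...) and uses the prefix Items. If that ordering matters, one possible sketch is to build the columns by hand instead of using str.get_dummies. This is a different technique from the answer above; the names `rows`, `order`, `wide`, and `result` are hypothetical, and the frame is typed in from the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['johnny smith', 'granny smith', 'jane doe', 'jack frost', 'cinderella'],
    'Date': ['09/1/2021', '08/31/2021', '09/01/2021', '08/01/2021', '08/16/2021'],
    'Items': [
        'bread, oranges, peanut butter, apples, celery, peanuts',
        'oranges, peanut butter, apples, bread',
        'oranges, apples, celery, peanut butter',
        'bread, oranges, apples',
        'apples, peanuts, bread',
    ],
})

# Split each row into a list of item names, stripping stray whitespace.
rows = [[s.strip() for s in text.split(',')] for text in df.pop('Items')]

# Unique items in order of first appearance (dicts preserve insertion order).
order = list(dict.fromkeys(item for row in rows for item in row))

# One column per item: the item name if the row has it, else an empty string.
wide = pd.DataFrame(
    [[item if item in row else '' for item in order] for row in rows],
    index=df.index,
    columns=[f'Items{i}' for i in range(len(order))],
)
result = df.join(wide)
print(result)
```

Because every item keeps a fixed column, rows that lack an item simply show an empty cell in that position, as in the table the question asks for.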

【Comments】: