Pandas dataframe groupby using unique combinations

【Question title】: Pandas dataframe groupby using unique combinations
【Posted】: 2021-09-23 21:01:26
【Question】:

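I am trying to group by and take only the unique combinations, but the groupby returns duplicate values, and that throws off my calculation.

Problem: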

child	parent	Year	Month	Val	desc
GC1	p1	2021	1	100	group1desc
GC1	p1	2021	1	100	group1desc
GC2	p1	2021	1	200	group1desc
GC2	p2	2021	2	200	group2desc
GC2	p2	2021	2	200	group2desc
GC3	p2	2021	2	300	group2desc
GC3	p2	2021	2	300	group2desc

When I use DF.groupby(['parent', 'Year', 'Month'], as_index=False).agg({'Val':'sum','desc':'first', 'child':list})

it gives:

parent	Year	Month	Val	desc	child
p1	2021	1	400	group1desc	GC1,GC2
p2	2021	2	1000	group2desc	GC2,GC3

What I want is for each unique Val to be counted only once, i.e. for p1, GC1 is added once, and for p2 the result is GC2 + GC3 (each added once):

parent	Year	Month	Val	desc	child
p1	2021	1	300	group1desc	GC1,GC2
p2	2021	2	500	group2desc	GC2,GC3

【Question comments】:

Tags: python pandas dataframe group-by pandas-groupby

【Solution 1】:

Let's try unique + sum for Val and plain unique for child:

g = (
    df.groupby(['parent', 'Year', 'Month'], as_index=False)
        .agg({'Val': lambda s: s.unique().sum(),
              'desc': 'first',
              'child': 'unique'})
)

g:

  parent  Year  Month  Val        desc       child
0     p1  2021      1  300  group1desc  [GC1, GC2]
1     p2  2021      2  500  group2desc  [GC2, GC3]

DataFrame constructor (df):

import pandas as pd

df = pd.DataFrame({
    'child': ['GC1', 'GC1', 'GC2', 'GC2', 'GC2', 'GC3', 'GC3'],
    'parent': ['p1', 'p1', 'p1', 'p2', 'p2', 'p2', 'p2'],
    'Year': [2021, 2021, 2021, 2021, 2021, 2021, 2021],
    'Month': [1, 1, 1, 2, 2, 2, 2],
    'Val': [100, 100, 200, 200, 200, 300, 300],
    'desc': ['group1desc', 'group1desc', 'group1desc', 'group2desc',
             'group2desc', 'group2desc', 'group2desc']
})

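【Comments】:

Thanks for your answer. This works fine and gives the expected result. I just wanted to check what the performance impact is when I have to apply this calculation to at least 6 more fields and 100k+ records in the same dataframe.
Hi Henry, this solution worked until I hit a case with GC1 = 100, GC2 = 100 and GC2 = 100, where it ends up taking only 100 as the value. In those cases I need GC1 + GC2 = 200. Could you suggest how to handle such cases? Thanks for your help.
I'd say dm2's answer is what you are looking for, but don't pass a subset to drop_duplicates. df.drop_duplicates().groupby(['parent', 'Year', 'Month'], as_index=False).agg({'Val': 'sum', 'desc': 'first', 'child': list})
If you end up using that solution, would you mind unaccepting this answer? I misunderstood which subset the values needed to be unique over.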
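
The failure case described in the comments above can be reproduced with a small sketch. The three-row frame is hypothetical (it is not the asker's real data) and only assumes that two different children may carry the same Val. Deduplicating on Val alone merges them, while deduplicating whole rows before summing keeps one row per child:

import pandas as pd

# Hypothetical data: GC1 and GC2 share the same Val, and GC2 is repeated.
df = pd.DataFrame({
    'child':  ['GC1', 'GC2', 'GC2'],
    'parent': ['p1', 'p1', 'p1'],
    'Year':   [2021, 2021, 2021],
    'Month':  [1, 1, 1],
    'Val':    [100, 100, 100],
    'desc':   ['group1desc'] * 3,
})

keys = ['parent', 'Year', 'Month']

# Unique on Val only: GC1's 100 and GC2's 100 collapse into a single 100.
by_value = df.groupby(keys)['Val'].agg(lambda s: s.unique().sum())

# Drop fully duplicated rows first, then sum: one row per child survives.
by_row = df.drop_duplicates().groupby(keys)['Val'].sum()

print(by_value.iloc[0])  # 100
print(by_row.iloc[0])    # 200

The second form also avoids a Python-level lambda in agg, which may help on larger frames, though that has not been measured here.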

【Solution 2】:

Since your 'desc' aggregation is 'first', you can drop the duplicates before the groupby:

df.drop_duplicates(subset = ['child','parent','Year','Month'])\
.groupby(['parent','Year','Month'], as_index = False)\
.agg({'Val':'sum','desc':'first','child':list})

Output:

    parent  Year    Month   Val desc        child
0   p1      2021    1       300 group1desc  ['GC1', 'GC2']
1   p2      2021    2       500 group2desc  ['GC2', 'GC3']
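
Whether to pass subset to drop_duplicates depends on what counts as a duplicate. The sketch below uses hypothetical data, in which the same child appears twice in one group with two different Val entries, to show where the two variants diverge. With subset, only the first row per child/parent/Year/Month is kept; without it, rows are dropped only when every column matches:

import pandas as pd

# Hypothetical data: GC1 appears twice in the same group with different Val.
df = pd.DataFrame({
    'child':  ['GC1', 'GC1', 'GC2'],
    'parent': ['p1', 'p1', 'p1'],
    'Year':   [2021, 2021, 2021],
    'Month':  [1, 1, 1],
    'Val':    [100, 150, 200],
})

keys = ['parent', 'Year', 'Month']

# Subset dedupe: the second GC1 row (Val 150) is discarded.
with_subset = (
    df.drop_duplicates(subset=['child'] + keys)
      .groupby(keys)['Val'].sum()
)

# Full-row dedupe: both GC1 rows differ in Val, so both are kept.
full_row = df.drop_duplicates().groupby(keys)['Val'].sum()

print(with_subset.iloc[0])  # 300
print(full_row.iloc[0])     # 450

On the sample data in the question both variants give the same totals, because every repeated row is an exact copy.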

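【Comments】:

Good eye. I think I would have borrowed this improvement from your answer, but I didn't quite understand it :)
Thanks for your answer. I tried it on a larger dataset, and it eliminated values that happened to be the same, so the final result did not match what I expected.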
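
The last comment does not say which rows were lost, so the following is only a guess at a diagnostic step, not a fix. Assuming a row's identity is child/parent/Year/Month (the id_cols name is introduced here for illustration), it lists every row that takes part in a duplicate and flags identities that carry more than one distinct Val:

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'child': ['GC1', 'GC1', 'GC2', 'GC2', 'GC2', 'GC3', 'GC3'],
    'parent': ['p1', 'p1', 'p1', 'p2', 'p2', 'p2', 'p2'],
    'Year': [2021] * 7,
    'Month': [1, 1, 1, 2, 2, 2, 2],
    'Val': [100, 100, 200, 200, 200, 300, 300],
})

id_cols = ['child', 'parent', 'Year', 'Month']  # assumed row identity

# keep=False marks every member of a duplicate set, not just the repeats.
dupes = df[df.duplicated(subset=id_cols, keep=False)]

# Identities with more than one distinct Val would lose data under a
# subset-based drop_duplicates.
val_variants = df.groupby(id_cols)['Val'].nunique()
conflicts = val_variants[val_variants > 1]

print(len(dupes))       # 6
print(conflicts.empty)  # True

An empty conflicts result, as on the sample data, suggests the subset and full-row variants of drop_duplicates will agree on the totals.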