[Question title]: How to aggregate row values if other row values are the same in a DataFrame?
[Posted]: 2021-07-02 00:06:23
[Question description]:

Given the DataFrame:

>>> from pandas import DataFrame
>>> df = DataFrame([
...     ['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.10'],
...     ['2021-03-31', 'A1019', '990CT', 'MARK', '0.10'],
...     ['2021-03-31', 'A0019', '990RT', 'MARK', '100'],
...     ['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.70'],
...     ['2021-03-31', 'A0029', '990CT', 'OFFSET', '1.10'],
...     ['2021-03-31', 'A0029', '990CT', 'MARK', '9.10'],
...     ['2021-03-31', 'A0019', '990QT', 'MARK', '99.10'],
...     ['2021-03-31', 'C0019', '990QT', 'OFFSET', '1'],
...     ['2021-03-31', 'C0019', '990QT', 'GHTC', '5'],
...     ['2021-03-31', 'C0019', '990QT', 'OFFSET', '15'],
... ], columns=['DATE', 'A_ID', 'R_ID', 'TYPE', 'I_VAL'])
>>> df
     DATE       A_ID   R_ID    TYPE  I_VAL
0  2021-03-31  A0019  990RT  OFFSET   0.10
1  2021-03-31  A1019  990CT    MARK   0.10
2  2021-03-31  A0019  990RT    MARK    100
3  2021-03-31  A0019  990RT  OFFSET   0.70
4  2021-03-31  A0029  990CT  OFFSET   1.10
5  2021-03-31  A0029  990CT    MARK   9.10
6  2021-03-31  A0019  990QT    MARK  99.10
7  2021-03-31  C0019  990QT  OFFSET      1
8  2021-03-31  C0019  990QT    GHTC      5
9  2021-03-31  C0019  990QT  OFFSET     15

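Each non-OFFSET row (e.g. MARK, GHTC) is uniquely matched to zero or more OFFSET rows by the combination of DATE, A_ID, R_ID. In other words, there is a one-to-many relationship between a non-OFFSET row (e.g. MARK) and the OFFSET rows.

I need to perform an operation in two steps:

  1. If the values of DATE, A_ID, R_ID are the same, aggregate the rows' values. Put the aggregated value into the non-OFFSET row as its I_VAL.
  2. Remove the rows with TYPE OFFSET.

The resulting DataFrame is: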

# The rows with TYPE OFFSET are removed from resulting df.
# Keeping the OFFSET rows for explaining aggregation
# 0, 1, 2, 3, etc. are the indexes (row number) of the rows

    DATE       A_ID   R_ID    TYPE    I_VAL
0  2021-03-31  A0019  990RT  OFFSET   0.10   
1  2021-03-31  A1019  990CT    MARK   0.10   # no update, condition not met
2  2021-03-31  A0019  990RT    MARK   100.80 # updated with sum of 0, self, and 3
3  2021-03-31  A0019  990RT  OFFSET   0.70
4  2021-03-31  A0029  990CT  OFFSET   1.10
5  2021-03-31  A0029  990CT    MARK   10.20  # updated with sum of own value and 4
6  2021-03-31  A0019  990QT    MARK   99.10  # no update, condition not met
7  2021-03-31  C0019  990QT  OFFSET      1   
8  2021-03-31  C0019  990QT    GHTC      21  # updated with sum of self, 7, and 9
9  2021-03-31  C0019  990QT  OFFSET     15

For step 2, I can do:

filtered_df = df[df.TYPE != 'OFFSET']

However, I don't know how to aggregate the values. This post discusses a similar problem, but I couldn't adapt it to my requirements.

[Comments]:

    Tags: python pandas dataframe


    [Solution 1]:

    Step 1:

    First, use the astype() method to change the dtype of the 'I_VAL' column from string to float:

    df['I_VAL']=df['I_VAL'].astype(float)
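
If the column might contain entries that cannot be parsed as numbers, `astype(float)` raises a `ValueError`. A minimal sketch of a more tolerant conversion with `pd.to_numeric`, assuming it is acceptable to turn unparseable entries into NaN (the sample values below are made up for illustration and are not from the question):

```python
import pandas as pd

# Hypothetical sample: the last entry is deliberately not a number.
s = pd.Series(['0.10', '100', 'n/a'], name='I_VAL')

# errors='coerce' replaces unparseable strings with NaN instead of raising.
converted = pd.to_numeric(s, errors='coerce')

print(converted.dtype)         # numeric dtype after conversion
print(converted.isna().sum())  # number of entries that could not be parsed
```

Note that NaN values are skipped by a later group sum, so it may be worth checking for them before aggregating.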
    

    Then you can use the groupby() method together with a boolean mask:

    # Sum I_VAL within each (DATE, A_ID, R_ID) group and write the totals into the non-OFFSET rows only
    df.loc[df['TYPE']!='OFFSET','I_VAL']=df.groupby(['DATE','A_ID','R_ID'],sort=False)['I_VAL'].transform('sum')[df['TYPE']!='OFFSET']
    

    Now, if you print df, you get the desired output:

    #output
    
          DATE       A_ID   R_ID    TYPE    I_VAL
    0   2021-03-31  A0019   990RT   OFFSET  0.1
    1   2021-03-31  A1019   990CT   MARK    0.1
    2   2021-03-31  A0019   990RT   MARK    100.8
    3   2021-03-31  A0019   990RT   OFFSET  0.7
    4   2021-03-31  A0029   990CT   OFFSET  1.1
    5   2021-03-31  A0029   990CT   MARK    10.2
    6   2021-03-31  A0019   990QT   MARK    99.1
    7   2021-03-31  C0019   990QT   OFFSET  1.0
    8   2021-03-31  C0019   990QT   GHTC    21.0
    9   2021-03-31  C0019   990QT   OFFSET  15.0
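
The group sum above includes every row in the group, which matches the question because each key has at most one non-OFFSET row. If that assumption did not hold, each non-OFFSET row would also absorb the values of the other non-OFFSET rows. A possible variant, sketched here on made-up data with a hypothetical second non-OFFSET row, sums only the OFFSET values per key and adds that total to each non-OFFSET row:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['2021-03-31', 'A0019', '990RT', 'OFFSET', 0.10],
        ['2021-03-31', 'A0019', '990RT', 'MARK', 100.0],
        ['2021-03-31', 'A0019', '990RT', 'GHTC', 5.0],    # hypothetical extra non-OFFSET row
        ['2021-03-31', 'A0019', '990RT', 'OFFSET', 0.70],
        ['2021-03-31', 'A1019', '990CT', 'MARK', 0.10],   # key with no OFFSET rows
    ],
    columns=['DATE', 'A_ID', 'R_ID', 'TYPE', 'I_VAL'],
)

keys = ['DATE', 'A_ID', 'R_ID']
is_offset = df['TYPE'] == 'OFFSET'

# Keep I_VAL only on OFFSET rows (0 elsewhere), then broadcast the per-key total to every row.
offset_total = (
    df.assign(OFFSET_VAL=df['I_VAL'].where(is_offset, 0.0))
      .groupby(keys, sort=False)['OFFSET_VAL']
      .transform('sum')
)

# Drop the OFFSET rows and add each key's OFFSET total to the remaining rows.
result = df.loc[~is_offset].copy()
result['I_VAL'] = result['I_VAL'] + offset_total[~is_offset]
print(result)
```

With this variant the MARK and GHTC rows of the same key each receive only the OFFSET total, not each other's values.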
    

    For your step 2:

    Use a boolean mask:

    filtered_df = df[df.TYPE != 'OFFSET']
    

    Now, if you print filtered_df, you get the desired output:

    #output
          DATE      A_ID    R_ID    TYPE    I_VAL
    1   2021-03-31  A1019   990CT   MARK    0.1
    2   2021-03-31  A0019   990RT   MARK    100.8
    5   2021-03-31  A0029   990CT   MARK    10.2
    6   2021-03-31  A0019   990QT   MARK    99.1
    8   2021-03-31  C0019   990QT   GHTC    21.0
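
Both steps can be wrapped into one function. The sketch below is one way to do that on the question's data; `fold_offsets` and its parameter names are hypothetical and not part of pandas, and it works on a copy so the original frame is left untouched:

```python
import pandas as pd

def fold_offsets(df, keys=('DATE', 'A_ID', 'R_ID'), value='I_VAL'):
    """Hypothetical helper: put each group's total on its non-OFFSET rows, then drop OFFSET rows."""
    out = df.copy()
    out[value] = out[value].astype(float)            # step 1a: strings -> floats
    totals = out.groupby(list(keys), sort=False)[value].transform('sum')
    keep = out['TYPE'] != 'OFFSET'
    out.loc[keep, value] = totals[keep]              # step 1b: write the group totals
    return out[keep]                                 # step 2: remove OFFSET rows

df = pd.DataFrame(
    [
        ['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.10'],
        ['2021-03-31', 'A1019', '990CT', 'MARK', '0.10'],
        ['2021-03-31', 'A0019', '990RT', 'MARK', '100'],
        ['2021-03-31', 'A0019', '990RT', 'OFFSET', '0.70'],
        ['2021-03-31', 'A0029', '990CT', 'OFFSET', '1.10'],
        ['2021-03-31', 'A0029', '990CT', 'MARK', '9.10'],
        ['2021-03-31', 'A0019', '990QT', 'MARK', '99.10'],
        ['2021-03-31', 'C0019', '990QT', 'OFFSET', '1'],
        ['2021-03-31', 'C0019', '990QT', 'GHTC', '5'],
        ['2021-03-31', 'C0019', '990QT', 'OFFSET', '15'],
    ],
    columns=['DATE', 'A_ID', 'R_ID', 'TYPE', 'I_VAL'],
)

filtered_df = fold_offsets(df)
print(filtered_df)
```

The original row labels (1, 2, 5, 6, 8) are preserved; calling `reset_index(drop=True)` on the result would renumber them if needed.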
    

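    [Discussion]:

    • Hey @Anurag Dabas, thanks. It helped. I changed the groupby construct to include multiple columns in the 'sum' logic: group_by_obj=df.groupby(['DATE','A_ID','R_ID'],as_index=False,sort=False)[['I_VAL', 'P_VAL']].transform('sum'). I then use this group_by_obj for the remaining operations.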
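
The multi-column variant mentioned in the comment could look roughly like the sketch below. The `P_VAL` column does not appear in the question's data, so its values here are invented purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'DATE': ['2021-03-31'] * 3,
    'A_ID': ['A0019', 'A0019', 'A1019'],
    'R_ID': ['990RT', '990RT', '990CT'],
    'TYPE': ['OFFSET', 'MARK', 'MARK'],
    'I_VAL': [0.5, 100.0, 0.25],
    'P_VAL': [1.0, 2.0, 3.0],   # made-up values for the hypothetical second column
})

cols = ['I_VAL', 'P_VAL']

# transform('sum') returns a frame with the same index as df, one column per summed column.
sums = df.groupby(['DATE', 'A_ID', 'R_ID'], sort=False)[cols].transform('sum')

# Overwrite both columns on the non-OFFSET rows only.
mask = df['TYPE'] != 'OFFSET'
df.loc[mask, cols] = sums.loc[mask, cols]
print(df)
```

Selecting the value columns before calling `transform` keeps non-numeric columns such as TYPE out of the sum.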